
  • Keeping consistent values across variables with -encode-

    Hey all, I'm back to see if there's a clever solution to my goal. Given the length of the strings and the potentially sensitive information, I'm omitting -dataex- and making a brief example to explain.

    I have 3 string variables I'd like to encode. The strings are all drawn from the same list of values; however, not every variable contains every possible value, leading to a dataset like this:

    Code:
    clear
    input str1 (var1 var2 var3)
    "a" "a" "a"
    "b" "a" "b"
    "c" "c" "b"
    "a" "d" "b"
    end
    Now, I'd like to use -encode- so that a = 1, b = 2, c = 3, and so on for all variables. However, because each variable contains a different set of values, repeated calls to -encode- will not produce consistent codes across variables. Is there a way to ensure consistency? I'm going to experiment with -reshape- to solve this, but I'm curious whether anyone else has had this issue.
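
    To make the inconsistency concrete, here is a minimal sketch using the example data above. The n_ prefix for the new variables is an arbitrary choice for illustration.

    ```stata
    clear
    input str1 (var1 var2 var3)
    "a" "a" "a"
    "b" "a" "b"
    "c" "c" "b"
    "a" "d" "b"
    end

    * Encode each variable separately; each gets its own value label
    foreach v of varlist var1-var3 {
        encode `v', generate(n_`v')
    }

    * Compare the mappings: "c" is coded 3 in n_var1 but 2 in n_var2
    label list
    ```

    Each call to -encode- numbers only the strings it finds in that one variable, in alphabetical order, so the same string can end up with different codes.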

    Thanks!
    Last edited by Jacob Levine; 22 Feb 2018, 09:33.

  • #2
    I believe I've written my own solution using -reshape-. Sharing in case anyone else runs into this issue.

    Code:
    generate id = _n    // unique observation identifier, if none exists yet
    reshape long var, i(id) j(number)
    encode var, generate(newvar)
    drop var
    reshape wide newvar, i(id) j(number)
    "id" is a unique observation identifier, number is an arbitrary variable created in -reshape-

    Hope this helps someone in the future, and if you have a different way, feel free to share!
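
    One different way, sketched here under the assumption that the data are the var1-var3 example from #1, is to have every -encode- call share a single value label through the label() option. When the named label already exists, -encode- uses it and adds any new strings to it. The label name codes and the n_ prefix are made up for illustration.

    ```stata
    * Reuse one value label across all calls so the codes stay consistent
    foreach v of varlist var1-var3 {
        encode `v', generate(n_`v') label(codes)
    }

    * Inspect the shared mapping
    label list codes
    ```

    Note that the numeric codes then depend on the order in which the variables are encoded, so they need not be alphabetical overall.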



    • #3
      Fine to omit dataex, for the reasons that you give, but please provide examples that work next time you post. You

      (1) cannot use string as a variable type with input
      (2) cannot omit double quotes around string values with input when strings contain embedded spaces; see #5 below
      (3) cannot omit the end statement after input

      That said, see multencode (Cox, SSC; install it with ssc install multencode) to solve the problem:

      Code:
      clear
      input str1 (var1 var2 var3)
      "a" "a" "a"
      "b" "a" "b"
      "c" "c" "b"
      "a" "d" "b"
      end
      
      multencode var1-var3 , generate(enc_var1-enc_var3)
      
      list
      label list
      gives

      Code:
      . input str1 (var1 var2 var3)
      
                var1       var2       var3
        1. "a" "a" "a"
        2. "b" "a" "b"
        3. "c" "c" "b"
        4. "a" "d" "b"
        5. end
      
      .
      . multencode var1-var3 , generate(enc_var1-enc_var3)
      
      .
      . list
      
           +-----------------------------------------------------+
           | var1   var2   var3   enc_var1   enc_var2   enc_var3 |
           |-----------------------------------------------------|
        1. |    a      a      a          a          a          a |
        2. |    b      a      b          b          a          b |
        3. |    c      c      b          c          c          b |
        4. |    a      d      b          a          d          b |
           +-----------------------------------------------------+
      
      . label list
      var1:
                 1 a
                 2 b
                 3 c
                 4 d
      Best
      Daniel
      Last edited by daniel klein; 22 Feb 2018, 09:54.



      • #4
        Example edited. Thanks for the one-command solution.



        • #5
          Albeit reluctantly, I want to point out, in regard to post #3, that double quotes around string values are optional with input for those strings that contain no embedded blanks or "special characters" (a term the documentation for input in the Stata Data Management Reference Manual PDF apparently does not clarify further). Note the second example in the output of help input, where the three values for name are surrounded by double quotes, while two of the values for sex are not and one is, even though it contains no embedded space.
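
          A minimal sketch of that point, reusing the example data from post #1: because none of these strings contains an embedded blank, the double quotes can be dropped.

          ```stata
          clear
          input str1 (var1 var2 var3)
          a a a
          b a b
          c c b
          a d b
          end

          * The three variables are read as strings, just as with quoted values
          list
          ```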

          The best way to prepare made-up data for sharing on Statalist is to use Stata's Data Editor window to create the made-up data in Stata's memory, and then in Stata's Command window apply dataex to create the listing to be copied and pasted into the Statalist post.



          • #6
            Originally posted by William Lisowski
            I want to point out [...] that double quotes around string values are optional with input for those strings that contain no embedded blanks or "special characters"
            I was not aware of this. Thanks for pointing it out. I have edited my earlier post.

            Best
            Daniel
