  • How to generate dummy variable in stata

    Hi Statalist,

    I have a data set containing many variables. One of them is a multiple-response variable (called rank1) containing ranks from 0 to 11 separated by spaces, and another is a multiple-response variable (called rank2) containing ranks from 0 to 11 run together (without spaces, commas, or other delimiters), like

    rank1
    0 1 10 2 11 ' ' '

    rank2

    0110211 ' ' '

    I want to generate a dummy variable for each rank in both multiple-response variables. I tried regexm but failed, and I don't know how to proceed. Any help will be highly appreciated.
    Ashish


    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str11 rank1 float rank2
    "0 1 10 2 11" 110211
    "1 11 0 9" 11109
    "2 9 11 10" 291110
    end

  • #2
    Here are two approaches that may start you in a useful direction.
    Code:
    generate id = _n
    split rank1, generate(resp) destring
    reshape long resp, i(id) j(seq)
    drop if resp==.
    drop seq
    generate res = 1
    reshape wide res, i(id) j(resp)
    mvencode res*, mv(.=0)
    list, clean noobs
    Code:
    . generate id = _n
    
    . split rank1, generate(resp) destring
    variables born as string:
    resp1  resp2  resp3  resp4  resp5
    resp1: all characters numeric; replaced as byte
    resp2: all characters numeric; replaced as byte
    resp3: all characters numeric; replaced as byte
    resp4: all characters numeric; replaced as byte
    resp5: all characters numeric; replaced as byte
    (2 missing values generated)
    
    . reshape long resp, i(id) j(seq)
    (j = 1 2 3 4 5)
    
    Data                               Wide   ->   Long
    -----------------------------------------------------------------------------
    Number of observations                3   ->   15          
    Number of variables                   8   ->   5          
    j variable (5 values)                     ->   seq
    xij variables:
                      resp1 resp2 ... resp5   ->   resp
    -----------------------------------------------------------------------------
    
    . drop if resp==.
    (2 observations deleted)
    
    . drop seq
    
    . generate res = 1
    
    . reshape wide res, i(id) j(resp)
    (j = 0 1 2 9 10 11)
    
    Data                               Long   ->   Wide
    -----------------------------------------------------------------------------
    Number of observations               13   ->   3          
    Number of variables                   5   ->   9          
    j variable (6 values)              resp   ->   (dropped)
    xij variables:
                                        res   ->   res0 res1 ... res11
    -----------------------------------------------------------------------------
    
    . mvencode res*, mv(.=0)
            res0: 1 missing value recoded
            res1: 1 missing value recoded
            res2: 1 missing value recoded
            res9: 1 missing value recoded
           res10: 1 missing value recoded
    
    . list, clean noobs
    
        id   res0   res1   res2   res9   res10   res11         rank1    rank2  
         1      1      1      1      0       1       1   0 1 10 2 11   110211  
         2      1      1      0      1       0       1      1 11 0 9    11109  
         3      0      0      1      1       1       1     2 9 11 10   291110
    Code:
    forvalues r=0/11 {
        generate res`r' = ustrregexm(rank1,"\b`r'\b")
    }
    list, clean noobs
    Code:
    . forvalues r=0/11 {
      2.     generate res`r' = ustrregexm(rank1,"\b`r'\b")
      3. }
    
    . list, clean noobs
    
              rank1    rank2   res0   res1   res2   res3   res4   res5   res6   res7   res8   res9   res10   res11  
        0 1 10 2 11   110211      1      1      1      0      0      0      0      0      0      0       1       1  
           1 11 0 9    11109      1      1      0      0      0      0      0      0      0      1       0       1  
          2 9 11 10   291110      0      0      1      0      0      0      0      0      0      1       1       1
    The second approach uses the Unicode regular expression functions introduced in Stata 14, which have a much more powerful definition of regular expressions than the non-Unicode functions. In the Statalist post linked here we are told that Stata's Unicode regular expression parser is the ICU regular expression engine documented here. A comprehensive discussion of regular expressions can be found here.

    A good introduction to Stata's Unicode regular expression functions is given by Asjad Naqvi at The Stata Guide. Hua Peng (StataCorp) provides additional examples of advanced techniques in his github blog.
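
    To illustrate why the pattern in the second approach is wrapped in \b word-boundary anchors, here is a minimal sketch on the example data from #1. The variable names naive1 and exact1 are made up for this illustration. Without the anchors, a bare "1" also matches inside 10 and 11.

    ```stata
    clear
    input str11 rank1
    "0 1 10 2 11"
    "1 11 0 9"
    "2 9 11 10"
    end

    * Bare digit: matches any value that merely contains a 1 (10, 11, ...)
    generate byte naive1 = ustrregexm(rank1, "1")

    * Word boundaries: matches only the standalone value 1
    generate byte exact1 = ustrregexm(rank1, "\b1\b")

    * In the third observation ("2 9 11 10") naive1 is 1 but exact1 is 0
    list, clean noobs
    ```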

    • #3
      Hi @William,
      I tried your code. It works fine for the rank1 variable, but it fails for the rank2 variable, which has no spaces. You may think of rank1 and rank2 as the same, except that rank1 has spaces and rank2 does not.
      Anticipating your help.
      Ashish

      • #4
        If I understand correctly, rank2 is just rank1 mangled, and uselessly so: 1 and 2 juxtaposed could be 12, and 1112 could be 1 and 11 and 2, or 11 and 12, or 11 and 1 and 2. Why bother with it?
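
        A quick way to see the ambiguity is to strip the spaces from several different rank lists and compare the results. This is only a sketch with literal strings chosen to match the readings of 1112 listed above.

        ```stata
        * Three different rank lists, each with its spaces removed
        display subinstr("1 11 2", " ", "", .)
        display subinstr("11 12", " ", "", .)
        display subinstr("11 1 2", " ", "", .)
        * All three lines display 1112, so the original list cannot be recovered
        ```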

        • #5
          Worse: the implication of #1 is that 0 as first item disappeared.

          • #6
            Nick Cox expresses the reason I ignored rank2. It is ambiguous and hence useless. Creating rank2 from rank1 by using destring and telling it to ignore spaces was a mistake.
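
            As a sketch of how this happens, assuming rank2 was produced roughly as below (the variable names rank2_num and rank2_str are hypothetical), destring with ignore(" ") yields a number, so a leading 0 is lost. Removing the spaces with subinstr() keeps the result as a string and preserves the leading 0, although the result is still ambiguous for the reasons given in #4.

            ```stata
            clear
            input str11 rank1
            "0 1 10 2 11"
            "1 11 0 9"
            "2 9 11 10"
            end

            * Numeric conversion: "0 1 10 2 11" becomes the number 110211
            destring rank1, generate(rank2_num) ignore(" ")

            * String version: "0 1 10 2 11" becomes "0110211", leading 0 kept
            generate rank2_str = subinstr(rank1, " ", "", .)

            list, clean noobs
            ```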

            • #7
              William Lisowski, I agree with you that rank2 is a numeric form of rank1, but we may have a case where numbers are placed without spaces or other delimiters. Let us pretend rank2 is a string variable with no spaces between the values in each observation. Is there any way to split it?

              @Nick Cox Thanks for your reply.

              Ashish.

              • #8
                No, because there is no way to know if "12" is meant to be 12 or 1 and 2, etc., as Nick Cox describes in post 4.

                • #9
                  Even in the example in #1 there was moderate lawlessness as some values included 4 integers and some 5. Also, somehow ranks like 9, 10 and 11 make sense even with 4 or 5 items. Also, somehow rank 0 makes sense.

                  Perhaps this makes perfect sense for your data rules. Perhaps you just made up examples, but it is still true that real or realistic examples are best.

                  The bottom line is that spaces separating single characters are disposable as 123456789 is easily seen to be equivalent to 1 2 3 4 5 6 7 8 9. But all bets are off once different solutions are possible in decoding.
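
                  For the unambiguous case, where every code is a single character, a minimal sketch could look like the following. The variable codes and the has1 to has9 indicator names are invented for this illustration and are not from the data in #1.

                  ```stata
                  clear
                  input str9 codes
                  "123456789"
                  "259"
                  end

                  * With single-digit codes, each digit can only mean itself,
                  * so a plain substring search is enough
                  forvalues d = 1/9 {
                      generate byte has`d' = strpos(codes, "`d'") > 0
                  }

                  list, clean noobs
                  ```

                  This works only because no code is longer than one character; it would break down as soon as codes such as 10 or 11 were allowed.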

                  • #10
                    Nick Cox, yes, the different numbers of values are due to the presence of -99 (no response), which was later cleaned up because we don't want to keep such values. Thanks for the detailed clarification. I think I made a mistake in the data collection tool by not assigning any delimiter. Thank you.
