How to generate dummy variables in Stata

Ashish Bandhu

Join Date: Oct 2020

Posts: 40
#1

18 Feb 2022, 23:22

Hi Statalist,

I have a data set containing many variables. One of them is a multiple-response variable (called rank1) holding ranks from 0 to 11 separated by spaces. Another multiple-response variable (called rank2) holds ranks from 0 to 11 run together, without spaces, commas or any other separator, like

rank1
0 1 10 2 11 ' ' '

rank2

0110211 ' ' '

I want to generate a dummy variable for each rank in both multiple-response variables. I tried regexm but failed, and I don't know how to proceed. Any help will be highly appreciated.
Ashish

* Example generated by -dataex-. To install: ssc install dataex
clear
input str11 rank1 float rank2
"0 1 10 2 11" 110211
"1 11 0 9" 11109
"2 9 11 10" 291110
end
Tags: string, syntax

William Lisowski

Join Date: Dec 2014
Posts: 10150

19 Feb 2022, 09:09

Here are two approaches that may start you in a useful direction.

Code:

generate id = _n
split rank1, generate(resp) destring
reshape long resp, i(id) j(seq)
drop if resp==.
drop seq
generate res = 1
reshape wide res, i(id) j(resp)
mvencode res*, mv(.=0)
list, clean noobs

Code:

. generate id = _n

. split rank1, generate(resp) destring
variables born as string:
resp1  resp2  resp3  resp4  resp5
resp1: all characters numeric; replaced as byte
resp2: all characters numeric; replaced as byte
resp3: all characters numeric; replaced as byte
resp4: all characters numeric; replaced as byte
resp5: all characters numeric; replaced as byte
(2 missing values generated)

. reshape long resp, i(id) j(seq)
(j = 1 2 3 4 5)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations                3   ->   15
Number of variables                   8   ->   5
j variable (5 values)                     ->   seq
xij variables:
                  resp1 resp2 ... resp5   ->   resp
-----------------------------------------------------------------------------

. drop if resp==.
(2 observations deleted)

. drop seq

. generate res = 1

. reshape wide res, i(id) j(resp)
(j = 0 1 2 9 10 11)

Data                               Long   ->   Wide
-----------------------------------------------------------------------------
Number of observations               13   ->   3
Number of variables                   5   ->   9
j variable (6 values)              resp   ->   (dropped)
xij variables:
                                    res   ->   res0 res1 ... res11
-----------------------------------------------------------------------------

. mvencode res*, mv(.=0)
        res0: 1 missing value recoded
        res1: 1 missing value recoded
        res2: 1 missing value recoded
        res9: 1 missing value recoded
       res10: 1 missing value recoded

. list, clean noobs

    id   res0   res1   res2   res9   res10   res11         rank1    rank2  
     1      1      1      1      0       1       1   0 1 10 2 11   110211  
     2      1      1      0      1       0       1      1 11 0 9    11109  
     3      0      0      1      1       1       1     2 9 11 10   291110
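Note that reshape wide creates dummies only for ranks that actually occur in the data (here 0, 1, 2, 9, 10 and 11). If a dummy is wanted for every rank from 0 to 11, a minimal sketch along the following lines could fill in the missing ones with zeros. It starts from a typed-in copy of the listing above so that it runs on its own; the loop itself is an illustration, not part of the approach as posted.

```stata
* Typed-in copy of the reshaped result shown above
clear
input byte(id res0 res1 res2 res9 res10 res11)
1 1 1 1 0 1 1
2 1 1 0 1 0 1
3 0 0 1 1 1 1
end

* Create an all-zero dummy for every rank 0..11 that has no variable yet.
* The exact option stops res1 from matching res10 or res11 by abbreviation.
forvalues r = 0/11 {
    capture confirm variable res`r', exact
    if _rc {
        generate byte res`r' = 0
    }
}
list, clean noobs
```

After the loop, res3 through res8 exist and are 0 in every observation, matching what the second approach produces.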

Code:

forvalues r=0/11 {
    generate res`r' = ustrregexm(rank1,"\b`r'\b")
}
list, clean noobs

Code:

. forvalues r=0/11 {
  2.     generate res`r' = ustrregexm(rank1,"\b`r'\b")
  3. }

. list, clean noobs

          rank1    rank2   res0   res1   res2   res3   res4   res5   res6   res7   res8   res9   res10   res11  
    0 1 10 2 11   110211      1      1      1      0      0      0      0      0      0      0       1       1  
       1 11 0 9    11109      1      1      0      0      0      0      0      0      0      1       0       1  
      2 9 11 10   291110      0      0      1      0      0      0      0      0      0      1       1       1

The second approach uses the Unicode regular expression functions introduced in Stata 14, which have a much more powerful definition of regular expressions than the non-Unicode functions. In the pattern, \b marks a word boundary, so "\b1\b" matches the rank 1 but not the 1 inside 10 or 11. In the Statalist post linked here we are told that Stata's Unicode regular expression parser is the ICU regular expression engine documented here. A comprehensive discussion of regular expressions can be found here.
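To see why the word boundaries matter for this task, here is a small sketch using display with literal strings. The test strings are made up for illustration; only ustrregexm() and the \b syntax come from the code above.

```stata
* Displays 1: the standalone 1 is bounded by spaces on both sides
display ustrregexm("0 1 10 2 11", "\b1\b")

* Displays 0: every 1 here sits inside 10 or 11, so no boundary on both sides
display ustrregexm("10 2 11", "\b1\b")

* Displays 1: without \b, the 1 inside 10 is enough for a match
display ustrregexm("10 2 11", "1")
```

The third line shows the kind of false positive that a pattern without boundaries would give for rank 1.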

A good introduction to Stata's Unicode regular expression functions is given by Asjad Naqvi at The Stata Guide. Hua Peng (StataCorp) provides additional examples of advanced techniques in his GitHub blog.
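For completeness, the same dummies can probably be had without regular expressions by padding the string with spaces and searching for each rank surrounded by spaces. This is only a sketch: it assumes the values in rank1 are separated by single spaces, and the variable names alt0 to alt11 are hypothetical, chosen so they do not clash with res0 to res11.

```stata
clear
input str11 rank1
"0 1 10 2 11"
"1 11 0 9"
"2 9 11 10"
end

* Pad both ends with a space so every rank, including the first and last,
* is surrounded by spaces; then " 1 " cannot match inside " 10 " or " 11 "
forvalues r = 0/11 {
    generate byte alt`r' = strpos(" " + rank1 + " ", " `r' ") > 0
}
list, clean noobs
```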


Ashish Bandhu

Join Date: Oct 2020

Posts: 40
#3

20 Feb 2022, 03:26

Hi William,
I tried your code. It works fine for the rank1 variable but fails for rank2, which has no spaces. You may think of rank1 and rank2 as the same, except that rank1 has spaces and rank2 does not.
Anticipating your help.
Ashish
Nick Cox

Join Date: Mar 2014

Posts: 35211
#4

20 Feb 2022, 04:08

If I understand correctly, rank2 is just rank1 mangled, and uselessly so: 1 and 2 juxtaposed could be 12, and 1112 could be 1, 11 and 2; or 11 and 12; or 11, 1 and 2. Why bother with it?
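The ambiguity can be made concrete with a minimal sketch that strips the spaces from three different sets of ranks. The input strings are illustrative, not taken from the posted data.

```stata
* Each line displays 1112: three different sets of ranks collapse
* to the same string once the spaces are removed
display subinstr("1 11 2", " ", "", .)
display subinstr("11 12", " ", "", .)
display subinstr("11 1 2", " ", "", .)
```

Once the spaces are gone, nothing in 1112 records which of the three was meant.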
Nick Cox

Join Date: Mar 2014

Posts: 35211
#5

20 Feb 2022, 07:44

Worse: the implication of #1 is that 0 as first item disappeared.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

20 Feb 2022, 07:48

Nick Cox expresses the reason I ignored rank2. It is ambiguous and hence useless. Creating rank2 from rank1 by using destring and telling it to ignore spaces was a mistake.
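As a sketch of how that could have happened, and assuming rank2 was indeed produced this way (the thread does not show the actual command), destring with ignore(" ") applied to the posted rank1 values reproduces the posted rank2 values, including the lost leading zero. The name rank2_guess is hypothetical.

```stata
clear
input str11 rank1
"0 1 10 2 11"
"1 11 0 9"
"2 9 11 10"
end

* Remove the spaces and convert what is left to a number.
* "0 1 10 2 11" becomes "0110211", which is stored as the number 110211.
destring rank1, generate(rank2_guess) ignore(" ")
list, clean noobs
```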
Ashish Bandhu

Join Date: Oct 2020

Posts: 40
#7

21 Feb 2022, 03:30

William Lisowski, I agree with you that rank2 is a numerical form of rank1, but we may have cases where numbers are recorded without spaces or other delimiters. Let us pretend rank2 is a string variable with no spaces between the values in each observation. Is there any way to split it?

@Nick Cox Thanks for your reply.

Ashish.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

21 Feb 2022, 06:09

No, because there is no way to know if "12" is meant to be 12 or 1 and 2, etc., as Nick Cox describes in post 4.
Nick Cox

Join Date: Mar 2014

Posts: 35211
#9

21 Feb 2022, 06:22

Even in the example in #1 there was moderate lawlessness as some values included 4 integers and some 5. Also, somehow ranks like 9, 10 and 11 make sense even with 4 or 5 items. Also, somehow rank 0 makes sense.

Perhaps this makes perfect sense for your data rules. Perhaps you just made up examples, but it is still true that real or realistic examples are best.

The bottom line is that spaces separating single characters are disposable as 123456789 is easily seen to be equivalent to 1 2 3 4 5 6 7 8 9. But all bets are off once different solutions are possible in decoding.
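For the unambiguous case, where every code is a single character from 1 to 9, a minimal sketch of the dummies could look like this. The variable codes and the dummies has1 to has9 are hypothetical names, and the data are made up for illustration.

```stata
clear
input str9 codes
"123456789"
"2468"
"915"
end

* With single-character codes, finding the digit anywhere in the string
* is enough, because no code can be part of a longer code
forvalues d = 1/9 {
    generate byte has`d' = strpos(codes, "`d'") > 0
}
list, clean noobs
```

The same loop would give wrong answers as soon as two-digit codes such as 10 or 11 are allowed, which is exactly the problem with rank2.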
Ashish Bandhu

Join Date: Oct 2020

Posts: 40
#10

21 Feb 2022, 07:36

Nick Cox, yes, the differing numbers of values are due to the presence of -99 (No response), which was later cleaned out as we don't want to keep such values. Thanks for the detailed clarification. I think I made a mistake in the data collection tool by not assigning any delimiter. Thank you.