How can I split a variable?

Tesky Koba

Join Date: Mar 2020

Posts: 17
#1

14 Mar 2020, 17:34

Hello everybody

I have a problem that I am trying to solve. In my dataset I have a variable whose values each contain several responses. I want to split it so that each
response becomes a new binary variable, coded 1 for yes and 0 for no. A screenshot is attached.
Thank you for helping me.

William Lisowski

Join Date: Dec 2014
Posts: 10150

14 Mar 2020, 18:12

Your picture is not useful for testing code, so the example below uses invented data; you can learn from it and modify it to solve your problem.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int id str15 names
101 "Alice Bob"      
102 "Alice Chris"    
103 "Bob Alice"      
104 "Alice Bob Chris"
105 "Fred"          
end
split names, generate(name)
list, clean
reshape long name, i(id) j(j)
drop if missing(name)
drop j
list, clean
generate byte value = 1
reshape wide value, i(id) j(name) string
list, clean abbreviate(12)
mvencode value*, mv(0)
rename (value*) (*)
list, clean

Code:

. split names, generate(name)
variables created as string:
name1  name2  name3

. list, clean

        id             names   name1   name2   name3  
  1.   101         Alice Bob   Alice     Bob          
  2.   102       Alice Chris   Alice   Chris          
  3.   103         Bob Alice     Bob   Alice          
  4.   104   Alice Bob Chris   Alice     Bob   Chris  
  5.   105              Fred    Fred                  

. reshape long name, i(id) j(j)
(note: j = 1 2 3)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                        5   ->      15
Number of variables                   5   ->       4
j variable (3 values)                     ->   j
xij variables:
                      name1 name2 name3   ->   name
-----------------------------------------------------------------------------

. drop if missing(name)
(5 observations deleted)

. drop j

. list, clean

        id             names    name  
  1.   101         Alice Bob   Alice  
  2.   101         Alice Bob     Bob  
  3.   102       Alice Chris   Alice  
  4.   102       Alice Chris   Chris  
  5.   103         Bob Alice     Bob  
  6.   103         Bob Alice   Alice  
  7.   104   Alice Bob Chris   Alice  
  8.   104   Alice Bob Chris     Bob  
  9.   104   Alice Bob Chris   Chris  
 10.   105              Fred    Fred  

. generate byte value = 1

. reshape wide value, i(id) j(name) string
(note: j = Alice Bob Chris Fred)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                       10   ->       5
Number of variables                   4   ->       6
j variable (4 values)              name   ->   (dropped)
xij variables:
                                  value   ->   valueAlice valueBob ... valueFred
-----------------------------------------------------------------------------

. list, clean abbreviate(12)

        id   valueAlice   valueBob   valueChris   valueFred             names  
  1.   101            1          1            .           .         Alice Bob  
  2.   102            1          .            1           .       Alice Chris  
  3.   103            1          1            .           .         Bob Alice  
  4.   104            1          1            1           .   Alice Bob Chris  
  5.   105            .          .            .           1              Fred  

. mvencode value*, mv(0)
  valueAlice: 1 missing value recoded
    valueBob: 2 missing values recoded
  valueChris: 3 missing values recoded
   valueFred: 4 missing values recoded

. rename (value*) (*)

. list, clean

        id   Alice   Bob   Chris   Fred             names  
  1.   101       1     1       0      0         Alice Bob  
  2.   102       1     0       1      0       Alice Chris  
  3.   103       1     1       0      0         Bob Alice  
  4.   104       1     1       1      0   Alice Bob Chris  
  5.   105       0     0       0      1              Fred

To improve the quality of your future posts, please now take a few moments to review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question. It's particularly helpful to copy commands and output from your Stata Results window and paste them into your Statalist post using code delimiters [CODE] and [/CODE], and to use the dataex command to provide sample data, as described in section 12 of the FAQ.

The more you help others understand your problem, the more likely others are to be able to help you solve your problem.

Added afterwards: I see now that some of your data contains names that are not likely to work as Stata variable names. Perhaps after the first reshape you can include a command that will alter the inappropriate values. Perhaps something like this will help

Code:

replace name = ustrtoname(name)

although I have not tested it on your data because I do not have your data.

Another reason to provide usable example data with your question.
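As a hedged illustration of where that command could go, here is a minimal sketch on invented data. The values radio-tv and 2amis are made up for the example, and this has not been run against the poster's file. ustrtoname() turns each value into a legal Stata name; because two different values could end up with the same name, the sketch also drops duplicates within id before reshaping wide, which is an extra step not in the code above.

```stata
* Invented example data: "radio-tv" and "2amis" are not legal in variable names
clear
input int id str20 names
1 "radio-tv ecole"
2 "2amis radio-tv"
end
split names, generate(name)
reshape long name, i(id) j(j)
drop if missing(name)
drop j
* convert each value to a legal Stata name
replace name = ustrtoname(name)
* distinct values may collapse to the same name; keep one copy per id
duplicates drop id name, force
generate byte value = 1
reshape wide value, i(id) j(name) string
mvencode value*, mv(0)
describe value*, fullnames
```

The final describe shows which variable names were actually created, which is worth checking whenever the values had to be altered.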

Last edited by William Lisowski; 14 Mar 2020, 18:19.

Comment

Tesky Koba

Join Date: Mar 2020

Posts: 17
#3

15 Mar 2020, 05:12

Hello William

Thanks a lot for your help.
I take note.
Comment
Tesky Koba

Join Date: Mar 2020

Posts: 17
#4

15 Mar 2020, 16:08

Hello William

Please, what are i(id) and j(j) in this piece of code: reshape long name, i(id) j(j)? I can't quite understand it, and that's where I'm stuck.
Thank you for helping me.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

15 Mar 2020, 16:25

Have you read the output of help reshape?

The i(id) option specifies that the existing variable id in my example data is the required distinct identifier for each observation. If you don't have an identifier (or several variables that taken together are a distinct identifier), you can create one with

Code:

generate id = _n

which assigns the observation number to the identifier variable. The variable does not need to be named id.

The j(j) option creates a variable in the reshaped data that records whether each name came from name1, name2, or name3 (in this example). That is of no use for what we are doing, so I drop it from the dataset. Again, the variable does not need to be named j.
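To make the roles of i() and j() concrete, here is a small sketch on invented data. The variable names id, name1, name2, and j are only for illustration, and the isid check is an extra safeguard, not part of the earlier code.

```stata
* Invented wide data: one observation per person, names in name1 and name2
clear
input str5 name1 str5 name2
"Alice" "Bob"
"Chris" ""
end
* no identifier yet, so create one from the observation number
generate int id = _n
* confirm that id uniquely identifies the observations
isid id
* i(id) records which original observation each row came from;
* j(j) creates j = 1 for values from name1 and j = 2 for values from name2
reshape long name, i(id) j(j)
list, clean
```

After the reshape there are two rows per id, one for name1 and one for name2, and j records which of the two each row came from.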
Comment
Tesky Koba

Join Date: Mar 2020

Posts: 17
#6

15 Mar 2020, 16:35

I am a beginner in Stata. I am sending you my dataset; could you please help me split q404_ou_avezvous_entendu so that each category becomes a new variable, and attach the do-file? With the do-file I will find it easier to follow your approach and understand the process. The categories are separated by spaces. Please.
Attached Files

Base des données étude CAP Tesky KOBA ESP 2020.csv (349.6 KB, 1 view)
Comment
Tesky Koba

Join Date: Mar 2020

Posts: 17
#7

15 Mar 2020, 17:24

Thank you
With great difficulty I succeeded, it works now.
God bless you.

You can still send me the do-file so that I can compare the two results. Thank you very much once again.
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

15 Mar 2020, 18:47

Here is my code, with annotation.

Code:

// note the option "encoding(utf8)" to correctly read non-ASCII characters
import delimited "~/Downloads/Base des données étude CAP Tesky KOBA ESP 2020.csv", encoding(utf8)
// generate an id variable for reshape
generate int id = _n
// split into entendu1 entendu2 ... entendu10 for the 10 possible values
split q404_ou_avezvous_entendu, generate(entendu)
// reshape long - 10 observations for each original observation - one for each value
reshape long entendu, i(id) j(j)
// drop observations with a blank value - because there weren't that many values given
drop if entendu==""
// we don't need the j variable - it doesn't matter what order the values were in
drop j
// generate an indicator variable
generate byte entendu_ = 1
// reshape wide - one observation for each id, as it was when we began
reshape wide entendu_, i(id) j(entendu) string
// replace missing values with 0 - those were the values that were not chosen
mvencode entendu_*, mv(0)
// don't need the id variable - it was only to put the pieces together again
drop id
// reshape assigns variable labels that are not helpful to you, so we remove them
foreach v of varlist entendu_* {
    label variable `v'
    }
// this is what we have
describe entendu*, fullnames

Here are the results.

Code:

. // note the option "encoding(utf8)" to correctly read non-ASCII characters
. import delimited "~/Downloads/Base des données étude CAP Tesky KOBA ESP 2020.csv", encoding(
> utf8)
(184 vars, 348 obs)

. // generate an id variable for reshape
. generate int id = _n

. // split into entendu1 entendu2 ... entendu10 for the 10 possible values
. split q404_ou_avezvous_entendu, generate(entendu)
variables created as string: 
entendu1   entendu3   entendu5   entendu7   entendu9
entendu2   entendu4   entendu6   entendu8   entendu10

. // reshape long - 10 observations for each original observation - one for each value
. reshape long entendu, i(id) j(j)
(note: j = 1 2 3 4 5 6 7 8 9 10)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                      348   ->    3480
Number of variables                 195   ->     187
j variable (10 values)                    ->   j
xij variables:
        entendu1 entendu2 ... entendu10   ->   entendu
-----------------------------------------------------------------------------

. // drop observations with a blank value - because there weren't that many values given
. drop if entendu==""
(2,453 observations deleted)

. // we don't need the j variable - it doesn't matter what order the values were in
. drop j

. // generate an indicator variable
. generate byte entendu_ = 1

. // reshape wide - one observation for each id, as it was when we began
. reshape wide entendu_, i(id) j(entendu) string
(note: j = amies autres ecole eglise frères lecture_personnelle nesaitpas parents personnel_medi
> cal radio_tv reco reseaux_sociaux sœurs)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                     1027   ->     342
Number of variables                 187   ->     198
j variable (13 values)          entendu   ->   (dropped)
xij variables:
                               entendu_   ->   entendu_amies entendu_autres ... entendu_sœurs
-----------------------------------------------------------------------------

. // replace missing values with 0 - those were the values that were not chosen
. mvencode entendu_*, mv(0)
entendu_am~s: 139 missing values recoded
entendu_au~s: 336 missing values recoded
entendu_ec~e: 191 missing values recoded
entendu_eg~e: 296 missing values recoded
entendu_fr~s: 324 missing values recoded
entendu_le~e: 229 missing values recoded
entendu_ne~s: 329 missing values recoded
entendu_pa~s: 293 missing values recoded
entendu_pe~l: 247 missing values recoded
entendu_ra~v: 212 missing values recoded
entendu_reco: 330 missing values recoded
entendu_re~x: 205 missing values recoded
entendu_sœ~s: 288 missing values recoded

. // don't need the id variable - it was only to put the pieces together again
. drop id

. // reshape assigns variable labels that are not helpful to you, so we remove them
. foreach v of varlist entendu_* {
  2.     label variable `v'
  3.     }

. // this is what we have
. describe entendu*, fullnames

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------
entendu_amies   byte    %8.0g                 
entendu_autres  byte    %8.0g                 
entendu_ecole   byte    %8.0g                 
entendu_eglise  byte    %8.0g                 
entendu_frères  byte    %8.0g                 
entendu_lecture_personnelle
                byte    %8.0g                 
entendu_nesaitpas
                byte    %8.0g                 
entendu_parents byte    %8.0g                 
entendu_personnel_medical
                byte    %8.0g                 
entendu_radio_tv
                byte    %8.0g                 
entendu_reco    byte    %8.0g                 
entendu_reseaux_sociaux
                byte    %8.0g                 
entendu_sœurs   byte    %8.0g
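One thing the reshape output above shows is that 348 observations went in and 342 came out: an observation with no value in q404_ou_avezvous_entendu loses all of its rows at the drop step and does not return. If such observations should be kept, one possible approach, sketched here on invented data and not on the poster's file, is to save the original data first and merge it back at the end. The tempfile name full is my own choice, and coding the indicators as 0 (not missing) for those observations is an assumption about what is wanted.

```stata
* Invented example data: id 102 gave no values at all
clear
input int id str15 names
101 "Alice Bob"
102 ""
103 "Chris"
end
* keep a copy of every original observation
tempfile full
save `full'
split names, generate(name)
reshape long name, i(id) j(j)
drop if missing(name)
drop j
generate byte value = 1
reshape wide value, i(id) j(name) string
* bring back ids that had no values; their indicators start as missing
merge 1:1 id using `full', nogenerate
mvencode value*, mv(0)
sort id
list, clean
```

Because tempfile stores its name in a local macro, this is best run as a single do-file and not line by line from separate selections. If "no answer given" should stay distinct from "not chosen", leave out the mvencode step for the merged-in observations.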

Comment

Tesky Koba

Join Date: Mar 2020

Posts: 17
#9

16 Mar 2020, 04:06

Thank you very much, William, for your help. God bless you.
Everything is clear now, thank you.
Comment
