Difficulty using substring function

Chris Boulis

Join Date: Feb 2019

Posts: 355
#1

Difficulty using substring function

18 Oct 2019, 03:06

Hi StataList.

I am trying to generate a wave variable in a panel dataset using substring (as I had seen used elsewhere). A wave prefix letter is attached to the beginning of each variable denoting the wave, e.g. 'a' as in asex = wave 1, 'b' as in bsex = wave 2, etc hence the string "a, b, c, ...". I read a number of previous posts but I was not able to find a solution or to know whether this function is the right one for the code I'm trying to write. Help is appreciated.

Code:

local wave = substr(c(alpha), 1, 17) // I tried this without success //local waveprefix = substr("abcdefghijklmnopq","a" , 17) // I tried this without success also foreach waveprefix in `wave'{ quietly{ use `origdatadir'/Combined_`waveprefix'170c.dta, clear } keep `list' gen wave = index("abcdefghijklmnopqrstuvwxyz","`waveprefix'") lab var wave "Wave of the survey" sort hhpxid save "`mergedata'", replace }
Tags: None
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#2

18 Oct 2019, 04:36

I'm not sure what you're trying to do, but it feels as if you want to loop over the first 17 letters of the alphabet? The problem in your current example is that Stata has no way of knowing it needs to loop over the individual letters and instead just sees the big block.

I think what you need is something along the following lines

Code:

local waves "a b c d e f g h i j k l m n o p q" foreach wave of local waves { di "`wave'" }

In this example, the local wave will take value a in the first iteration, b in the second etc. You can then plug `wave' into your filenames, variablenames, etc.
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35173
#3

18 Oct 2019, 04:44

"without success" is just a variant on the dreaded (by people who answer questions) "doesn't work", which is warned against in the FAQ Advice.

What happened exactly? Why wasn't it what you wanted/needed/expected?

Use display to help debug your code.

Code:

. di substr(c(alpha), 1, 17) a b c d e f g h i

The first construct is legal, but the first 17 characters include 8 spaces, not what you want here.

Code:

. di substr("abcdefghijklmnopq","a" , 17) type mismatch r(109);

The second construct is illegal. substr() expects (an expression that evaluates to) a number where you typed "a". There is no point in fixing that as there is a much simpler way of getting closer to what you want.

Here is some code:

Code:

forval w = 1/17 { local wave = word(c(alpha), `w') quietly use `origdatadir'/Combined_`wave'170c.dta, clear keep `list' gen wave = `w' lab var wave "Wave of the survey" sort hhpxid save "`mergedata'", replace }

It looks as if your filenames have different elements a b c but you want a variable that records whichever of 1 2 3 fits the case.

But, but, but:

1. You don't define local macro list so that we can see it, so that is your call. As you're implying to me that the variable names differ from file to file, that may be wrong.

2. More worrying, you are repeatedly overwriting the file with name in local mergedata. Unless there is more code that you are not showing us, that looks quite wrong. That file will just end up with the data for the last wave, and only that.
2 likes
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

18 Oct 2019, 11:14

I recognize these conventions as those used by the HILDA survey from Australia. Each file has the wave prefix as part of the name, and each variable in that file has the wave prefix as the first character, as indicated in post #1 for asex, bsex, ... .

I hope the following untested code will start you in a useful direction, producing a dataset in a "long layout" suitable for use in longitudinal analysis as described in the Stata Longitudinal-Data/Panel-Data Reference Manual PDF included in your Stata installation and accessible from Stata's Help menu.

Code:

// directory paths local origdatadir ... local savingdir ... // create empty output dataset local filename allwaves clear save "`savingdir'/`filename'", replace emptyok // variables wanted - ahgage ahgsex bhgage bhgsex ... local variables hgage hgsex forvalues wave=1/17 { local waveprefix = word(c(alpha), `wave') use "`origdatadir'/Combined_`waveprefix'170c.dta", clear rename `waveprefix'* * keep xwaveid `variables' generate byte wave = `wave' append using "`savingdir'/`filename'" save "`savingdir'/`filename'", replace } sort xwaveid wave xtset xwaveid wave save "`savingdir'/`filename'", replace
3 likes
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 355
#5

19 Oct 2019, 06:27

Thank you for your help Jesse Wursten I appreciate your reply.
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 355
#6

19 Oct 2019, 07:35

Thank you Nick Cox. Point taken, I need to be clearer in stating the problem. To reply to your questions, I haven't used substring before and even after reading Statalist and elsewhere I wasn't able to figure out how to use it, nor did I know if it was what I needed to create the wave variable in my code. Yes, as noted in my post above

a wave prefix letter is attached to the beginning of each variable denoting the wave, e.g. 'a' as in asex = wave 1, 'b' as in bsex = wave 2, etc

Yes you are correct, I would like to convert the wave identifier from a b c ... to 1 2 3 ... By the way, the "keep list" code is part of the code I used from some 'proposed code' I found in my research so it is likely not relevant to my use as such I will delete this line from my code.
The variable names as noted are the same, it is just that each variable has a single letter prefix which denotes the wave from which it relates. That is, age is age, but the prefix to wave changes each wave, but age in wave 1 is aage, age in wave 2 is bage, age in wave 3 is cage. With reference to

More worrying, you are repeatedly overwriting the file with name in local mergedata.

can you suggest what I should do about this? There is no other code for the 'generating a wave' syntax. Thank you for showing me the errors in my code and for your suggested code. However, when I ran the code Stata gave the following response

file /Combined_a170c.dta not found r(601)

but it is in my folder. I'll try William's suggested code and reply shortly.

Last edited by Chris Boulis; 19 Oct 2019, 07:40.
Comment

Chris Boulis

Join Date: Feb 2019
Posts: 355

20 Oct 2019, 23:56

Thank you for your reply and code William Lisowski. Yes I am using HILDA survey data. One issue with the local variables line of code, three variables of interest only appear in four waves of data (4, 7, 10, 14) so when I run the code it says "variable 'relb' not found". That said, this code appears to relate to the data. My code to pull in the partner/respondent data from the original dataset follows (I appreciate your thoughts):

Code:

* partner data
local wave a b c d e f g h i j k l m n o p q
foreach `waveprefix' of local wave {
use `origdatadir'/Combined_`x'170c.dta, clear
    rename `x'* p_*                            // replaces wave-prefix with partner prefix
    rename p_hhpxid hhpxid             // above code converts `x'hhpxid to p_hhpxid     rename xwaveid hhpxid              // xwaveid is unique individual identifier for all waves; hhpxid - partner's cross wave identifier
 sort hhpxid

* respondent data
use `origdatadir'/Combined_`x'170c.dta, clear
    rename `x'* *                    
    drop if hhpxid=="" 
sort hhpxid
merge 1:1 hhpxid using "`mergedata'", replace 
tempfile append
append using "`append'", replace emptyok
save "`append'", replace
}
keep xwaveid hhpxid age sex empstat educ wage marstat religb relimp relat ///
p_age p_sex p_empstat p_educ p_wage p_marstat p_relb p_relimp p_relat // vars of interest
     destring hhpxid, replace
    sort hhpxid
tsset hhpxid wave
save `newdatadir'/partner_rperson.dta, replace emptyok

Could you kindly include a link to the Stata Longitudinal-Data/Panel-Data Reference Manual PDF as I was not able to find it in Stata's help menu.

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

21 Oct 2019, 07:03

I do not at the moment have the time to address your code. But I can tell you that on my setup, when I choose PDF Documentation from Stata's Help menu, Acrobat Reader opens a PDF whose first page declares it to be the Stata Glossary And Index. If the Bookmarks sidebar is not displayed, I choose View > Show/Hide > Navigation Panes > Bookmarks and scroll to the bookmark with the title of the PDF. Clicking on that opens the PDF in Acrobat Reader.
1 like
Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

21 Oct 2019, 16:22

Here is a demonstration of how I accomplish what I think you want from your code in post #7. Comments follow,

Code:

// directory paths
local origdatadir /Volumes/Research/HILDA/Data170c
local savingdir   /Volumes/Research/testing

// variables wanted
local variables xwaveid hhpxid hgage hgsex religb relimp relat

// create empty output dataset
local filename allwaves
clear
save "`savingdir'/`filename'", replace emptyok

// gather individual data for members of couples
// just 3 waves to demonstrate
forvalues wave=3/5 {
    local waveprefix = word(c(alpha), `wave')
    use "`origdatadir'/Combined_`waveprefix'170c.dta", clear
    rename `waveprefix'* *
    drop if hhpxid==""
    quietly ds
    local allvars `r(varlist)'
    local thiswave : list variables & allvars
    generate byte wave = `wave'
    keep xwaveid hhpxid wave `thiswave'
    display "Wave `wave' (`waveprefix') - kept `thiswave'
    append using "`savingdir'/`filename'"
    save "`savingdir'/`filename'", replace
}

// create partner data
tempfile partners
drop hhpxid
rename * p_*
rename (p_xwaveid p_wave) (hhpxid wave)
save `partners'

// merge
use "`savingdir'/`filename'"
merge 1:1 hhpxid wave using `partners', nolabel
// confirm that every observation matched
assert _merge==3
drop _merge

// make into panel data
destring xwaveid hhpxid, replace
format %07.0f xwaveid hhpxid
sort xwaveid wave
xtset xwaveid wave
save "`savingdir'/`filename'", replace

describe

Code:

. // directory paths
. local origdatadir /Volumes/Research/HILDA/Data170c

. local savingdir /Volumes/Research/testing

.
. // variables wanted
. local variables xwaveid hhpxid hgage hgsex religb relimp relat

.
. // create empty output dataset
. local filename allwaves

. clear

. save "`savingdir'/`filename'", replace emptyok
(note: dataset contains 0 observations)
file /Volumes/Research/testing/allwaves.dta saved

.
. // gather individual data for members of couples
. // just 3 waves to demonstrate
. forvalues wave=3/5 {
2. local waveprefix = word(c(alpha), `wave')
3. use "`origdatadir'/Combined_`waveprefix'170c.dta", clear
4. rename `waveprefix'* *
5. drop if hhpxid==""
6. quietly ds
7. local allvars `r(varlist)'
8. local thiswave : list variables & allvars
9. generate byte wave = `wave'
10. keep xwaveid hhpxid wave `thiswave'
11. display "Wave `wave' (`waveprefix') - kept `thiswave'
12. append using "`savingdir'/`filename'"
13. save "`savingdir'/`filename'", replace
14. }
(-,--- observations deleted)
Wave 3 (c) - kept xwaveid hhpxid hgage hgsex
file /Volumes/Research/testing/allwaves.dta saved
(-,--- observations deleted)
Wave 4 (d) - kept xwaveid hhpxid hgage hgsex religb relimp relat
file /Volumes/Research/testing/allwaves.dta saved
(-,--- observations deleted)
Wave 5 (e) - kept xwaveid hhpxid hgage hgsex
file /Volumes/Research/testing/allwaves.dta saved

.
. // create partner data
. tempfile partners

. drop hhpxid

. rename * p_*

. rename (p_xwaveid p_wave) (hhpxid wave)

. save `partners'
file /var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//S_82715.000001 saved

.
. // merge
. use "`savingdir'/`filename'"

. merge 1:1 hhpxid wave using `partners', nolabel

Result # of obs.
-----------------------------------------
not matched 0
matched --,--- (_merge==3)
-----------------------------------------

. // confirm that every observation matched
. assert _merge==3

. drop _merge

.
. // make into panel data
. destring xwaveid hhpxid, replace
xwaveid: all characters numeric; replaced as long
hhpxid: all characters numeric; replaced as long

. format %07.0f xwaveid hhpxid

. sort xwaveid wave

. xtset xwaveid wave
panel variable: xwaveid (unbalanced)
time variable: wave, 3 to 5, but with gaps
delta: 1 unit

. save "`savingdir'/`filename'", replace
file /Volumes/Research/testing/allwaves.dta saved

.
. describe

Contains data from /Volumes/Research/testing/allwaves.dta
obs: --,---
vars: 13 21 Oct 2019 17:02
------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------------------------------
xwaveid long %07.0f XW Cross wave ID
hgsex byte %31.0g ESEX HF5 Sex
hgage byte %31.0g EHGAGE DV: Age last birthday at June 30 2005
hhpxid long %07.0f DV: Partner's xwaveid (text, 7-digit)
wave byte %8.0g
relimp byte %49.0g DRELIMP SCQ:B18 Importance of religion
religb int %55.0g DRELIGB DV: [SCQ] Religion - broad
relat byte %31.0g DRELAT SCQ:B20 Frequency of attendance at religious
services
p_hgsex byte %31.0g ESEX HF5 Sex
p_hgage byte %31.0g EHGAGE DV: Age last birthday at June 30 2005
p_relimp byte %49.0g DRELIMP SCQ:B18 Importance of religion
p_religb int %55.0g DRELIGB DV: [SCQ] Religion - broad
p_relat byte %31.0g DRELAT SCQ:B20 Frequency of attendance at religious
services
------------------------------------------------------------------------------------------------
Sorted by: xwaveid wave

end of do-file

I do not understand why your code showed variables age and sex when the HILDA documentation (Subject Level Coding Framework 170c.pdf) shows hgage and hgsex, so I went with what was available in the dataset.

In the output I have obscured all counts because data protection.

Note that the variable labels and their value labels are taken from the last wave in which the variable appeared. Some of the variable labels need editing (like hgage) and it is up to you to confirm that the value labels are appropriate for all waves, or to eliminate them.

In any event you will need to recode the negative values that indicate missing data to Stata missing valuesee (see the output of help mvencode, for example).

I'm not sure why you chose to use the partner's xwaveid as your panel variable, I chose to use the xwaveid. It doesn't make a substantive difference, but confirms more closely to common practice.

Comment

Chris Boulis

Join Date: Feb 2019

Posts: 355
#10

23 Oct 2019, 03:20

Thank you William Lisowski I now have a wave variable. Regarding your code to pull in the partner/respondent data, is it the case that because you keep the vars upfront (this is the respondent data obviously) you only need code to pull in the partner data then merge with the respondent data? Also I don't understand why you don't pull the partner data from the original data file (Combined file)? To me it seems that the respondent data is recoded as partner data - surely this is not the case. I also do not understand how the partner cross-wave id (hhpxid) and the unique individual id (xwaveid) in being used to link the partner to the respondent and how this is then merged using xwaveid or hhpxid. I would really appreciate it if you could kindly explain this code to me. Thank you in advance.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#11

23 Oct 2019, 09:42

You are not thinking correctly about the contents of the Combined dataset.

Suppose that the Combined dataset contains the following individual, with their age.
xwaveid hhxpid hgage

1 2 31

It will also contain an observation for the partner individual with their age.
xwaveid hhxpid hgage

1 2 31

2 1 42

And these observations will both appear in the "`savingdir'/`filename'" dataset which you think of as the "respondents".

So the temporary `partners' dataset my code creates using the "`savingdir'/`filename'" dataset as input will contain the following observations.
xwaveid p_hgage

2 31

1 42

And after merging the "`savingdir'/`filename'" dataset will contain
xwaveid hhxpid hgage p_hgage

1 2 31 42

2 1 42 31

I suggest you run my example code as shown - on waves 3 4 5 - and then

Code:

list xwaveid hhpxid wave hgage p_hgage in 1/14, sepby(xwaveid)

and review the output. After that,

Code:

browse

and see the full details to build your understanding of what HILDA data is like and what my code has produced.

Last edited by William Lisowski; 23 Oct 2019, 09:57.
1 like
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 355
#12

23 Oct 2019, 23:55

Thank you William Lisowski. Your explanation is very helpful. That makes more sense now.

Is the following code saying to drop any individuals without partners?

drop if hhpxid==""

so if I want to create a dataset of people with and without partners, would I need to delete that line of code and choose a different variable to merge on? On that I tried to merge on 'hhpxid', then 'xwaveid' however both do "not uniquely identify observations in the master (or using) data"? I appreciate help with this.

Last edited by Chris Boulis; 24 Oct 2019, 00:15.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#13

24 Oct 2019, 07:18

You are one-third of the way there.

Yes, if you remove

Code:

drop if hhpxid==""

from its current location, the "master' dataset will now have observations for all individuals, not just those in couples.

But the "partners" dataset (created using the "master" dataset) will now also have observations for all individuals, not just those who are partners. You need to add that command into the the code that creates the partners dataset.

Code:

... drop if hhpxid=="" drop hhpxid ...

Finally, when you merge the master dataset to the partners dataset, the individuals in the master dataset who do not have partners will not match, You need to change the assert command.

Code:

... // confirm that every observation with a partner matched or had no partner assert (hhpxid!="" & _merge==3) | (hhpxid=="" & _merge==1) drop _merge ...

I haven't tested the code with these changes applied, but they should start you on your way.
1 like
Comment
Chris Boulis

Join Date: Feb 2019

Posts: 355
#14

24 Oct 2019, 22:42

Thanks William Lisowski that worked nicely. Appreciate your help. Best Chris
Comment

Announcement