Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Difficulty using substring function

    Hi StataList.

    I am trying to generate a wave variable in a panel dataset using substring (as I had seen used elsewhere). A wave prefix letter is attached to the beginning of each variable denoting the wave, e.g. 'a' as in asex = wave 1, 'b' as in bsex = wave 2, etc hence the string "a, b, c, ...". I read a number of previous posts but I was not able to find a solution or to know whether this function is the right one for the code I'm trying to write. Help is appreciated.

    Code:
    local wave = substr(c(alpha), 1, 17)                      // I tried this without success
    //local waveprefix = substr("abcdefghijklmnopq","a" , 17)  // I tried this without success also
        foreach waveprefix in `wave'{
            quietly{
                use `origdatadir'/Combined_`waveprefix'170c.dta, clear
            }
                keep `list'
                gen wave = index("abcdefghijklmnopqrstuvwxyz","`waveprefix'")
                lab var wave "Wave of the survey"
                sort hhpxid
                save "`mergedata'", replace
            }

  • #2
    I'm not sure what you're trying to do, but it feels as if you want to loop over the first 17 letters of the alphabet? The problem in your current example is that Stata has no way of knowing it needs to loop over the individual letters and instead just sees the big block.

    I think what you need is something along the following lines
    Code:
    local waves "a b c d e f g h i j k l m n o p q"
    
    foreach wave of local waves {
        di "`wave'"
    }
    In this example, the local wave will take value a in the first iteration, b in the second etc. You can then plug `wave' into your filenames, variablenames, etc.

    Comment


    • #3
      "without success" is just a variant on the dreaded (by people who answer questions) "doesn't work", which is warned against in the FAQ Advice.

      What happened exactly? Why wasn't it what you wanted/needed/expected?

      Use display to help debug your code.

      Code:
      . di substr(c(alpha), 1, 17)
      a b c d e f g h i
      The first construct is legal, but the first 17 characters include 8 spaces, not what you want here.


      Code:
      . di substr("abcdefghijklmnopq","a" , 17)
      type mismatch
      r(109);
      The second construct is illegal. substr() expects (an expression that evaluates to) a number where you typed "a". There is no point in fixing that as there is a much simpler way of getting closer to what you want.

      Here is some code:

      Code:
      forval w = 1/17 {
      
           local wave = word(c(alpha), `w')
           quietly use `origdatadir'/Combined_`wave'170c.dta, clear
           keep `list'
           gen wave = `w'
           lab var wave "Wave of the survey"
           sort hhpxid
           save "`mergedata'", replace
      
      }
      It looks as if your filenames have different elements a b c but you want a variable that records whichever of 1 2 3 fits the case.

      But, but, but:

      1. You don't define local macro list so that we can see it, so that is your call. As you're implying to me that the variable names differ from file to file, that may be wrong.

      2. More worrying, you are repeatedly overwriting the file with name in local mergedata. Unless there is more code that you are not showing us, that looks quite wrong. That file will just end up with the data for the last wave, and only that.




      Comment


      • #4
        I recognize these conventions as those used by the HILDA survey from Australia. Each file has the wave prefix as part of the name, and each variable in that file has the wave prefix as the first character, as indicated in post #1 for asex, bsex, ... .

        I hope the following untested code will start you in a useful direction, producing a dataset in a "long layout" suitable for use in longitudinal analysis as described in the Stata Longitudinal-Data/Panel-Data Reference Manual PDF included in your Stata installation and accessible from Stata's Help menu.
        Code:
        // directory paths
        local origdatadir ...
        local savingdir ...
        
        // create empty output dataset
        local filename allwaves
        clear
        save "`savingdir'/`filename'", replace emptyok
        
        // variables wanted - ahgage ahgsex bhgage bhgsex ...
        local variables hgage hgsex
        
        forvalues wave=1/17 {
            local waveprefix = word(c(alpha), `wave')
            use "`origdatadir'/Combined_`waveprefix'170c.dta", clear
            rename `waveprefix'* *
            keep xwaveid `variables'
            generate byte wave = `wave'
            append using "`savingdir'/`filename'"
            save "`savingdir'/`filename'", replace
        }
        
        sort xwaveid wave
        xtset xwaveid wave
        save "`savingdir'/`filename'", replace

        Comment


        • #5
          Thank you for your help Jesse Wursten I appreciate your reply.

          Comment


          • #6
            Thank you Nick Cox. Point taken, I need to be clearer in stating the problem. To reply to your questions, I haven't used substring before and even after reading Statalist and elsewhere I wasn't able to figure out how to use it, nor did I know if it was what I needed to create the wave variable in my code. Yes, as noted in my post above
            a wave prefix letter is attached to the beginning of each variable denoting the wave, e.g. 'a' as in asex = wave 1, 'b' as in bsex = wave 2, etc
            Yes you are correct, I would like to convert the wave identifier from a b c ... to 1 2 3 ... By the way, the "keep list" code is part of the code I used from some 'proposed code' I found in my research so it is likely not relevant to my use as such I will delete this line from my code.
            The variable names as noted are the same, it is just that each variable has a single letter prefix which denotes the wave from which it relates. That is, age is age, but the prefix to wave changes each wave, but age in wave 1 is aage, age in wave 2 is bage, age in wave 3 is cage. With reference to
            More worrying, you are repeatedly overwriting the file with name in local mergedata.
            can you suggest what I should do about this? There is no other code for the 'generating a wave' syntax. Thank you for showing me the errors in my code and for your suggested code. However, when I ran the code Stata gave the following response
            file /Combined_a170c.dta not found r(601)
            but it is in my folder. I'll try William's suggested code and reply shortly.
            Last edited by Chris Boulis; 19 Oct 2019, 07:40.

            Comment


            • #7
              Thank you for your reply and code William Lisowski. Yes I am using HILDA survey data. One issue with the local variables line of code, three variables of interest only appear in four waves of data (4, 7, 10, 14) so when I run the code it says "variable 'relb' not found". That said, this code appears to relate to the data. My code to pull in the partner/respondent data from the original dataset follows (I appreciate your thoughts):
              Code:
              * partner data
              local wave a b c d e f g h i j k l m n o p q
              foreach `waveprefix' of local wave {
              use `origdatadir'/Combined_`x'170c.dta, clear
                  rename `x'* p_*                            // replaces wave-prefix with partner prefix
                  rename p_hhpxid hhpxid             // above code converts `x'hhpxid to p_hhpxid     rename xwaveid hhpxid              // xwaveid is unique individual identifier for all waves; hhpxid - partner's cross wave identifier
               sort hhpxid
              
              * respondent data
              use `origdatadir'/Combined_`x'170c.dta, clear
                  rename `x'* *                    
                  drop if hhpxid=="" 
              sort hhpxid
              merge 1:1 hhpxid using "`mergedata'", replace 
              tempfile append
              append using "`append'", replace emptyok
              save "`append'", replace
              }
              keep xwaveid hhpxid age sex empstat educ wage marstat religb relimp relat ///
              p_age p_sex p_empstat p_educ p_wage p_marstat p_relb p_relimp p_relat // vars of interest
                   destring hhpxid, replace
                  sort hhpxid
              tsset hhpxid wave
              save `newdatadir'/partner_rperson.dta, replace emptyok
              Could you kindly include a link to the Stata Longitudinal-Data/Panel-Data Reference Manual PDF as I was not able to find it in Stata's help menu.

              Comment


              • #8
                I do not at the moment have the time to address your code. But I can tell you that on my setup, when I choose PDF Documentation from Stata's Help menu, Acrobat Reader opens a PDF whose first page declares it to be the Stata Glossary And Index. If the Bookmarks sidebar is not displayed, I choose View > Show/Hide > Navigation Panes > Bookmarks and scroll to the bookmark with the title of the PDF. Clicking on that opens the PDF in Acrobat Reader.

                Comment


                • #9
                  Here is a demonstration of how I accomplish what I think you want from your code in post #7. Comments follow,
                  Code:
                  // directory paths
                  local origdatadir /Volumes/Research/HILDA/Data170c
                  local savingdir   /Volumes/Research/testing
                  
                  // variables wanted
                  local variables xwaveid hhpxid hgage hgsex religb relimp relat
                  
                  // create empty output dataset
                  local filename allwaves
                  clear
                  save "`savingdir'/`filename'", replace emptyok
                  
                  // gather individual data for members of couples
                  // just 3 waves to demonstrate
                  forvalues wave=3/5 {
                      local waveprefix = word(c(alpha), `wave')
                      use "`origdatadir'/Combined_`waveprefix'170c.dta", clear
                      rename `waveprefix'* *
                      drop if hhpxid==""
                      quietly ds
                      local allvars `r(varlist)'
                      local thiswave : list variables & allvars
                      generate byte wave = `wave'
                      keep xwaveid hhpxid wave `thiswave'
                      display "Wave `wave' (`waveprefix') - kept `thiswave'
                      append using "`savingdir'/`filename'"
                      save "`savingdir'/`filename'", replace
                  }
                  
                  // create partner data
                  tempfile partners
                  drop hhpxid
                  rename * p_*
                  rename (p_xwaveid p_wave) (hhpxid wave)
                  save `partners'
                  
                  // merge
                  use "`savingdir'/`filename'"
                  merge 1:1 hhpxid wave using `partners', nolabel
                  // confirm that every observation matched
                  assert _merge==3
                  drop _merge
                  
                  // make into panel data
                  destring xwaveid hhpxid, replace
                  format %07.0f xwaveid hhpxid
                  sort xwaveid wave
                  xtset xwaveid wave
                  save "`savingdir'/`filename'", replace
                  
                  describe
                  Code:
                  . // directory paths
                  . local origdatadir /Volumes/Research/HILDA/Data170c
                  
                  . local savingdir   /Volumes/Research/testing
                  
                  . 
                  . // variables wanted
                  . local variables xwaveid hhpxid hgage hgsex religb relimp relat
                  
                  . 
                  . // create empty output dataset
                  . local filename allwaves
                  
                  . clear
                  
                  . save "`savingdir'/`filename'", replace emptyok
                  (note: dataset contains 0 observations)
                  file /Volumes/Research/testing/allwaves.dta saved
                  
                  . 
                  . // gather individual data for members of couples
                  . // just 3 waves to demonstrate
                  . forvalues wave=3/5 {
                    2.     local waveprefix = word(c(alpha), `wave')
                    3.     use "`origdatadir'/Combined_`waveprefix'170c.dta", clear
                    4.     rename `waveprefix'* *
                    5.     drop if hhpxid==""
                    6.     quietly ds
                    7.     local allvars `r(varlist)'
                    8.     local thiswave : list variables & allvars
                    9.     generate byte wave = `wave'
                   10.     keep xwaveid hhpxid wave `thiswave'
                   11.     display "Wave `wave' (`waveprefix') - kept `thiswave'
                   12.     append using "`savingdir'/`filename'"
                   13.     save "`savingdir'/`filename'", replace
                   14. }
                  (-,--- observations deleted)
                  Wave 3 (c) - kept xwaveid hhpxid hgage hgsex
                  file /Volumes/Research/testing/allwaves.dta saved
                  (-,--- observations deleted)
                  Wave 4 (d) - kept xwaveid hhpxid hgage hgsex religb relimp relat
                  file /Volumes/Research/testing/allwaves.dta saved
                  (-,--- observations deleted)
                  Wave 5 (e) - kept xwaveid hhpxid hgage hgsex
                  file /Volumes/Research/testing/allwaves.dta saved
                  
                  . 
                  . // create partner data
                  . tempfile partners
                  
                  . drop hhpxid
                  
                  . rename * p_*
                  
                  . rename (p_xwaveid p_wave) (hhpxid wave)
                  
                  . save `partners'
                  file /var/folders/xr/lm5ccr996k7dspxs35yqzyt80000gp/T//S_82715.000001 saved
                  
                  . 
                  . // merge
                  . use "`savingdir'/`filename'"
                  
                  . merge 1:1 hhpxid wave using `partners', nolabel
                  
                      Result                           # of obs.
                      -----------------------------------------
                      not matched                             0
                      matched                            --,---  (_merge==3)
                      -----------------------------------------
                  
                  . // confirm that every observation matched
                  . assert _merge==3
                  
                  . drop _merge
                  
                  . 
                  . // make into panel data
                  . destring xwaveid hhpxid, replace
                  xwaveid: all characters numeric; replaced as long
                  hhpxid: all characters numeric; replaced as long
                  
                  . format %07.0f xwaveid hhpxid
                  
                  . sort xwaveid wave
                  
                  . xtset xwaveid wave
                         panel variable:  xwaveid (unbalanced)
                          time variable:  wave, 3 to 5, but with gaps
                                  delta:  1 unit
                  
                  . save "`savingdir'/`filename'", replace
                  file /Volumes/Research/testing/allwaves.dta saved
                  
                  . 
                  . describe
                  
                  Contains data from /Volumes/Research/testing/allwaves.dta
                    obs:        --,---                          
                   vars:            13                          21 Oct 2019 17:02
                  ------------------------------------------------------------------------------------------------
                                storage   display    value
                  variable name   type    format     label      variable label
                  ------------------------------------------------------------------------------------------------
                  xwaveid         long    %07.0f                XW Cross wave ID
                  hgsex           byte    %31.0g     ESEX       HF5 Sex
                  hgage           byte    %31.0g     EHGAGE     DV: Age last birthday at June 30 2005
                  hhpxid          long    %07.0f                DV: Partner's xwaveid (text, 7-digit)
                  wave            byte    %8.0g                 
                  relimp          byte    %49.0g     DRELIMP    SCQ:B18 Importance of religion
                  religb          int     %55.0g     DRELIGB    DV: [SCQ] Religion - broad
                  relat           byte    %31.0g     DRELAT     SCQ:B20 Frequency of attendance at religious
                                                                  services
                  p_hgsex         byte    %31.0g     ESEX       HF5 Sex
                  p_hgage         byte    %31.0g     EHGAGE     DV: Age last birthday at June 30 2005
                  p_relimp        byte    %49.0g     DRELIMP    SCQ:B18 Importance of religion
                  p_religb        int     %55.0g     DRELIGB    DV: [SCQ] Religion - broad
                  p_relat         byte    %31.0g     DRELAT     SCQ:B20 Frequency of attendance at religious
                                                                  services
                  ------------------------------------------------------------------------------------------------
                  Sorted by: xwaveid  wave
                  
                  end of do-file
                  I do not understand why your code showed variables age and sex when the HILDA documentation (Subject Level Coding Framework 170c.pdf) shows hgage and hgsex, so I went with what was available in the dataset.

                  In the output I have obscured all counts because data protection.

                  Note that the variable labels and their value labels are taken from the last wave in which the variable appeared. Some of the variable labels need editing (like hgage) and it is up to you to confirm that the value labels are appropriate for all waves, or to eliminate them.

                  In any event you will need to recode the negative values that indicate missing data to Stata missing valuesee (see the output of help mvencode, for example).

                  I'm not sure why you chose to use the partner's xwaveid as your panel variable, I chose to use the xwaveid. It doesn't make a substantive difference, but confirms more closely to common practice.

                  Comment


                  • #10
                    Thank you William Lisowski I now have a wave variable. Regarding your code to pull in the partner/respondent data, is it the case that because you keep the vars upfront (this is the respondent data obviously) you only need code to pull in the partner data then merge with the respondent data? Also I don't understand why you don't pull the partner data from the original data file (Combined file)? To me it seems that the respondent data is recoded as partner data - surely this is not the case. I also do not understand how the partner cross-wave id (hhpxid) and the unique individual id (xwaveid) in being used to link the partner to the respondent and how this is then merged using xwaveid or hhpxid. I would really appreciate it if you could kindly explain this code to me. Thank you in advance.

                    Comment


                    • #11
                      You are not thinking correctly about the contents of the Combined dataset.

                      Suppose that the Combined dataset contains the following individual, with their age.
                      xwaveid hhxpid hgage
                      1 2 31
                      It will also contain an observation for the partner individual with their age.
                      xwaveid hhxpid hgage
                      1 2 31
                      2 1 42
                      And these observations will both appear in the "`savingdir'/`filename'" dataset which you think of as the "respondents".

                      So the temporary `partners' dataset my code creates using the "`savingdir'/`filename'" dataset as input will contain the following observations.
                      xwaveid p_hgage
                      2 31
                      1 42
                      And after merging the "`savingdir'/`filename'" dataset will contain
                      xwaveid hhxpid hgage p_hgage
                      1 2 31 42
                      2 1 42 31
                      I suggest you run my example code as shown - on waves 3 4 5 - and then
                      Code:
                      list xwaveid hhpxid wave hgage p_hgage in 1/14, sepby(xwaveid)
                      and review the output. After that,
                      Code:
                      browse
                      and see the full details to build your understanding of what HILDA data is like and what my code has produced.
                      Last edited by William Lisowski; 23 Oct 2019, 09:57.

                      Comment


                      • #12
                        Thank you William Lisowski. Your explanation is very helpful. That makes more sense now.

                        Is the following code saying to drop any individuals without partners?
                        drop if hhpxid==""
                        so if I want to create a dataset of people with and without partners, would I need to delete that line of code and choose a different variable to merge on? On that I tried to merge on 'hhpxid', then 'xwaveid' however both do "not uniquely identify observations in the master (or using) data"? I appreciate help with this.
                        Last edited by Chris Boulis; 24 Oct 2019, 00:15.

                        Comment


                        • #13
                          You are one-third of the way there.

                          Yes, if you remove
                          Code:
                              drop if hhpxid==""
                          from its current location, the "master' dataset will now have observations for all individuals, not just those in couples.

                          But the "partners" dataset (created using the "master" dataset) will now also have observations for all individuals, not just those who are partners. You need to add that command into the the code that creates the partners dataset.
                          Code:
                          ...
                          drop if hhpxid==""
                          drop hhpxid
                          ...
                          Finally, when you merge the master dataset to the partners dataset, the individuals in the master dataset who do not have partners will not match, You need to change the assert command.
                          Code:
                          ...
                          // confirm that every observation with a partner matched or had no partner
                          assert (hhpxid!="" & _merge==3) | (hhpxid=="" & _merge==1)
                          drop _merge
                          ...
                          I haven't tested the code with these changes applied, but they should start you on your way.

                          Comment


                          • #14
                            Thanks William Lisowski that worked nicely. Appreciate your help. Best Chris

                            Comment

                            Working...
                            X