Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reshape long NLSY97

    Dear all,

    I am having difficulties with reshape long for the NLSY97 data in Stata 14.0.
    I have read all previous posts and guides but I cannot figure out how to reshape my data.
    I have downloaded over 6.000 variables from the INvestigator webpage. I do not plan to use them all but I want to explore them in the appropiate format.
    I am uploading a sample of the variables. Some have the year in the middle of the name, others at the end, and others have the year in a 2 digits format i.e. CVC_HOURS_WK_YR_SE_00_XRND (00 refers to 2000)


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(PUBID_1997 EMP_STATUS_1994_01_XRND EMP_STATUS_1994_02_XRND EMP_STATUS_1995_01_XRND EMP_STATUS_1995_02_XRND YEMP_101104_01_2000 YEMP_101104_02_2000 YEMP_101106_01_2000 YEMP_101106_02_2000 CVC_HOURS_WK_YR_SE_00_XRND CVC_HOURS_WK_YR_SE_01_XRND)
      1 -4 -4 -4 -4 -4 -4 -4 -4    0    0
      2 -4 -4 -4 -4 -4 -4 -4 -4    0    0
      3 -4 -4 -4 -4 -4 -4 -4 -4    0    0
      4 -4 -4 -4 -4 -4 -4 -4 -4    0    0
      5 -4 -4 -4 -4 -4 -4 -4 -4    0    0
      6 -4 -4 -4 -4 -4 -4 -4 -4    0    0
      7 -4 -4 -4 -4 -4 -4 -4 -4    0    0
      8 -4 -4 -4 -4 -4 -4 -4 -4    0    0
      9 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     10 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     11 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     12 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     13 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     14 -4 -4  5  5 -5 -5 -5 -5    0    0
     15 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     16 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     17 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     18 -4 -4 -4 -4 -4 -4 -4 -4 5194 5096
     19 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     20 -4 -4  5  5 -4 -4 -4 -4    0    0
     21 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     22 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     23 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     24 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     25 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     26 -4 -4  5  5 -4 -4 -4 -4    0    0
     27 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     28 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     29 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     30 -4 -4 -4 -4 -5 -5 -5 -5   -4   -4
     31 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     32 -4 -4 -4 -4 -5 -5 -5 -5    0    0
     33 -4 -4  5  5 -4 -4 -4 -4    0    0
     34 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     35 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     36 -4 -4  5  5 -4 -4 -4 -4    0    0
     37 -4 -4  5  5 -4 -4 -4 -4    0    0
     38 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     39 -4 -4  5  5 -4 -4 -4 -4    0    0
     40 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     41 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     42 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     43 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     44 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     45 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     46 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     47 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     48 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     49 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     50 -4 -4 -4 -4 -4 -4 -4 -4    0   -3
     51 -4 -4  5  5 -4 -4 -4 -4    0    0
     52 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     53 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     54 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     55 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     56 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     57 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     58 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     59 -4 -4 -4 -4 -4 -4 -4 -4    0   -3
     60 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     61 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     62 -4 -4  5  5 -4 -4 -4 -4    0    0
     63 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     64 -4 -4  5  5 -4 -4 -4 -4    0    0
     65 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     66 -4 -4  0  0 -4 -4 -4 -4    0    0
     67 -4 -4  1  1 -4 -4 -4 -4    0    0
     68 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     69 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     70 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     71 -4 -4  5  5 -4 -4 -4 -4    0    0
     72 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     73 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     74 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     75 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     76 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     77 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     78 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     79 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     80 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     81 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     82 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     83 -4 -4 -4 -4 -4 -4 -4 -4    0 1160
     84 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     85 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     86 -4 -4  5  5 -4 -4 -4 -4    0    0
     87 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     88 -4 -4 -4 -4 -4 -4 -4 -4    0 1698
     89 -4 -4 -4 -4 -4 -4  1 -4  782  150
     90 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     91 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     92 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     93 -4 -4 -4 -4 -4 -4 -4 -4    0    0
     94 -4 -4  4  4 -4 -4 -4 -4    0    0
     95 -4 -4 -4 -4 -5 -5 -5 -5   31   52
     96 -4 -4 -4 -4 -5 -5 -5 -5    0    0
     97 -4 -4  5  5 -4 -4 -4 -4    0    0
     98 -4 -4  5  5 -4 -4 -4 -4    0    0
     99 -4 -4 -4 -4 -4 -4 -4 -4    0    0
    100 -4 -4 -4 -4 -4 -4 -4 -4    0    0
    end
    label values PUBID_1997 vlR0000100
    label values EMP_STATUS_1994_01_XRND vlE0011401
    label values EMP_STATUS_1994_02_XRND vlE0011402
    label values EMP_STATUS_1995_01_XRND vlE0011501
    label def vlE0011501 0 "0: No information reported to account for week; job dates indeterminate", modify
    label def vlE0011501 1 "1: Not associated with an employer, not actively searching for an employer job", modify
    label def vlE0011501 4 "4: Unemployed", modify
    label def vlE0011501 5 "5: Out of the labor force", modify
    label values EMP_STATUS_1995_02_XRND vlE0011502
    label def vlE0011502 0 "0: No information reported to account for week; job dates indeterminate", modify
    label def vlE0011502 1 "1: Not associated with an employer, not actively searching for an employer job", modify
    label def vlE0011502 4 "4: Unemployed", modify
    label def vlE0011502 5 "5: Out of the labor force", modify
    label values YEMP_101104_01_2000 vlR4768500
    label values YEMP_101104_02_2000 vlR4768600
    label values YEMP_101106_01_2000 vlR4769300
    label def vlR4769300 1 "RUN BUSINESS FROM HOME", modify
    label values YEMP_101106_02_2000 vlR4769400
    label values CVC_HOURS_WK_YR_SE_00_XRND vlZ9071800
    label def vlZ9071800 0 "0", modify
    label values CVC_HOURS_WK_YR_SE_01_XRND vlZ9071900
    label def vlZ9071900 0 "0", modify
    So far my problems is:
    1. I have too many variables with different name formats so I think I need to rename my variables.

    I would be grateful if you can give me some guidance on how to proceed with reshape with these type of variables. If you have used NLSY97 before I would also appreciate learning from your experience reshaping this data.

    thank you in advance for your help.

    Alejandra.

  • #2
    You are right that the key difficulty here is to get the names into a consistent format. Below, I show a way to do at least part of that by making all your variable names have a 4-digit year at the end. *However,* my solution crucially depends on the variable names being as you show them here. That is so because the whole trick of what I show below is to find string expressions that distinctly identify only the variable names of interest. From my own experience working with the NLSY, I know that there are huge numbers of variables, with complicated and meaningless names. So, what I show below might not be perfect for your situation, and there might be easier ways depending on the exact list of variable names that you have. All that being said, what I have below shows some technique for renaming the variables into a format that will facilitate -reshape long- . My advice would be to extract much less "mixed" lists of variables, rather than trying to work with all of them at once at you are trying to do year. It would be easier, for example, to only deal with a file containing the variable names with 4 digit years in the middle; then, work on a file with only the 2-digit years, etc. My offering also presumes that you only need your data to be "long" with respect to *year*, and I'm not sure if that is true for you. I'm not dealing with the reshape part here, which should be relatively straightforward.

    /
    Code:
    // variable names with 4 digit year in the middle
    foreach v of varlist * {
       local start = strpos("`v'", "199")  // you only showed 19?? names
       if (`start' != 0 ) {
          local digit4 = substr("`v'", `start', 4)
           local new = subinstr("`v'", "_" + "`digit4'", "", 1) + "_" + "`digit4'"
           di "old: `v'"
           di "new: `new'" _newline(2)
           rename `v' `new'
       }    
    }
    //
    // variable names with 2 digit year in the middle
    foreach v of varlist * {
       local match2 = regexm("`v'", "(.*)(_[0-9][0-9])(_XRND)$")
       if (`match2' == 1) {
          local x1 = regexs(1)
          local x2 = regexs(2)
          local x3 = regexs(3)
          // convert 2 digit to 4 digit, assuming 2000 years those in 2000-20018
          local digit2 = subinstr("`x2'", "_", "", . )  
          if (`digit2' < 18) {
              local suffix = "_20`digit2'"
          }
          else {
             local suffix = "_19`digit2'"
          }
          local new = "`x1'" + "`x3'" + "`suffix'"
          di "old: `v'"
          di "new: `new'" _newline(2)
          rename `v' `new'
       }
    }
    Last edited by Mike Lacy; 16 Jul 2018, 09:53.

    Comment


    • #3
      Wow! It isn't even clear in some cases which part of the variable names is the year. In the YEMP_* variables, is the year the two digit 01/02, or are those perhaps item sequence numbers and the actual year part is 2000? I assume the survey documentation will tell you that.

      Given the pleiomorphic names here, you will probably have to rename each batch of variables one at a time. If you are not familiar with all of the ways that -rename- can do bulk renaming of similar groups of items, take a look at -help rename group-. (This is somewhat like the help files and documentation on date-time variables: there is a lot here and you don't remember the finer points that you seldom use. The key is to just remember that they are there so you can look up the details when you do need them.)

      Now, it is not strictly necessary to have the year at the end of the variable name to use -reshape-. If you read the documentation on the -reshape- command you will see that the @ character can be used in the stubs to tell Stata where to look for the value of the -j()- variable. However, you have the additional problem in your real data of a large number of variables, and creating the list of stubs for -reshape- will be easier if the variables all have their names end with the 4 digit year. So something like this works in the data you show

      Code:
      rename EMP_STATUS_(####)_(##)_XRND EMP_STATUS_(##)[2]_XRND_(####)[1]
      //    I ASSUME THE 2000 AT END OF YEMP* IS ALREADY THE YEAR
      rename CVC_HOURS_WK_YR_SE_(##)_XRND CVC_HOURS_WK_YR_SE_XRND_20(##)
      
      //    CREATE STUBS
      local stubs
      ds *_1995 *_2000
      local stubs1 `r(varlist)'
      foreach s of local stubs1 {
          local stubs `stubs' `=substr("`s'", 1, length("`s'")-4)'
      }
      display `"`stubs'"'
      
      reshape long `stubs', i(PUBID_1997) j(year)
      rename *_ *
      Note: In the above, I assume that PUBID_1997 is not part of a series of variables PUBID_* to be -reshape-d long, but is simply a unique identifying variable that, perhaps has some historical connection to 1997.

      Added: Crossed with #2, which shows a different approach. I have no experience working with NLSY97,and Mike's advice is always reliable, so I think I would go with Mike's advice over mine, unless you run into trouble using it.
      Last edited by Clyde Schechter; 16 Jul 2018, 10:06.

      Comment


      • #4
        Clyde's approach looks a lot simpler, so if it will work, I'd go with it. My approach might be preferable depending on complexity of names. I've only worked with the j() variable at the end of the name, so I'd also defer to him there. I think that he and I would both agree, though, that dealing with files that only contain one kind of variable ("batches") would be easier and less subject to error.
        One last comment, which may be wrong: My recollection of working with the NLSY97 is that the variable names are completely meaningless, with names like "R79407." I don't recall the names having the year indicators in them; rather, that came from the user making such names. Perhaps you have some choices in this regard while using the Investigator webpage, which I never personally used. Don't let this suggestion confuse you, though, as I'm not at all certain my memory was right here.
        And I wish Clyde was right about my advice usually being "reliable." I'd say that's true with about probability = 0.5. <grin>

        Comment


        • #5
          Thank you Clyde and Mike for your useful suggestions.
          What I did to get those names was to run the do-file that comes with the database once you download it, but I think it would be simpler to work with one kind of variable as you suggested.
          My original idea was to download all variables and clean the data so I can use it in more than one paper. However, this has been a challenging database and I think the best is to download only the variables of interest.
          Thank you again for your help!

          Comment

          Working...
          X