Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Weird collapse/reshape: move observations one row below for a variable

    I have a long dataset that contains stays in classes of a series of schools. The data look like this:
    ID school area startdt enddt
    464 KC U16 1/6/2016 2:10 5/4/2016 8:32
    464 KC U15 5/4/2016 8:32 5/25/2016 1:10
    464 KC U16 5/25/2016 1:10 6/29/2016 10:20
    464 KC U2 6/29/2016 10:20 7/2/2016 3:02
    464 KC L11 7/2/2016 3:02 7/26/2016 3:30
    So for example this student stayed in U16 from 1/6/16 to 5/4/16, then he/she stayed in U15 from 5/4/16 to 5/25/16. My goal is to generate moves instead of stays. So the first movement was in 5/4/2016 which is the end date of the first stay. Then I need to create a variable that show the area they were and the are the went to, in this case U16 to U15. I was able to do this by:

    Code:
    bysort ID (startdt): gen  MOVEMENT=enddt[_n-1] 
    format MOVEMENT %tc
    
    bysort ID (startdt): gen school_from=school[_n-1] 
    bysort ID (startdt)): gen area_from =area[_n-1] 
    bysort ID (startdt)): gen school_to=school[_n+1] 
    bysort ID (startdt)): gen area_to=area[_n+1]
    ID school area startdt enddt MOVEMENT school_from area_from school_to area_to
    464 KC U16 1/6/2016 2:10 5/4/2016 8:32 KC U15
    464 KC U15 5/4/2016 8:32 5/25/2016 1:10 4-May-16 KC U16 KC U16
    464 KC U16 5/25/2016 1:10 6/29/2016 10:20 25-May-16 KC U15 KC U2
    464 KC U2 6/29/2016 10:20 7/2/2016 3:02 29-Jun-16 KC U16 KC L11
    464 KC L11 7/2/2016 3:02 7/26/2016 3:30 2-Jul-16 KC U2
    The result looks ok to me. The only problem is that the "to" variables such as school_to are one row before the corresponding "from" variable such as from_to and they should be in the same row. Perhaps this is not the best approach to do this. Any ideas on how to fix this issue or a new approach? One fix could be to just move the values of the "to" variables one row more (_n + 1???).


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int ID str2 school str3 area double startdt str16 enddt int MOVEMENT str2 school_from str3 area_from str4 school_to str8 area_to
    464 "KC" "U16" 1767665400000.0002 "   5/4/2016 8:32"     . ""   ""    "KC"   "U15"     
    464 "KC" "U15" 1777969920000.0002 "  5/25/2016 1:09" 20578 "KC" "U16" "KC"   "U16"     
    464 "KC" "U16" 1779757799999.9998 " 6/29/2016 10:20" 20599 "KC" "U15" "KC"   "U2"      
    464 "KC" "U2"       1.7828148e+12 "   7/2/2016 3:01" 20634 "KC" "U16" "KC"   "L11"     
    464 "KC" "L11" 1783047719999.9998 "  7/26/2016 3:30" 20637 "KC" "U2"  ""     ""        
    452 "DC" "6L"        1.771665e+12 " 5/22/2016 10:02"     . ""   ""    "BKDC" "7L"      
    452 "DC" "7L"  1779530520000.0002 "  6/3/2016 11:00" 20596 "DC" "6L"  "BKDC" "8L"      
    452 "DC" "8L"  1780570859999.9998 "  6/16/2016 2:38" 20608 "DC" "7L"  "BKDC" "8L"      
    452 "DC" "8L"       1781663880000 "."                20621 "DC" "8L"  ""     "        "
    end
    format %tcnn/dd/ccYY_hh:MM startdt
    format %tddd-Mon-YY MOVEMENT
    Thank you!

  • #2
    This is much easier to do if you have school and area as numerically encoded variables with value labels.

    Code:
    set more off
    clear*
    
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int ID str2 school str3 area double startdt str16 enddt int MOVEMENT
    464 "KC" "U16" 1767665400000.0002 "   5/4/2016 8:32"     .
    464 "KC" "U15" 1777969920000.0002 "  5/25/2016 1:09" 20578
    464 "KC" "U16" 1779757799999.9998 " 6/29/2016 10:20" 20599
    464 "KC" "U2"       1.7828148e+12 "   7/2/2016 3:01" 20634
    464 "KC" "L11" 1783047719999.9998 "  7/26/2016 3:30" 20637
    452 "DC" "6L"        1.771665e+12 " 5/22/2016 10:02"     .
    452 "DC" "7L"  1779530520000.0002 "  6/3/2016 11:00" 20596
    452 "DC" "8L"  1780570859999.9998 "  6/16/2016 2:38" 20608
    452 "DC" "8L"       1781663880000 "."                20621
    end
    format %tcnn/dd/ccYY_hh:MM startdt
    format %tddd-Mon-YY MOVEMENT
    
    //    MAKE SCHOOL AND AREA NUMERIC VARIABLES BY USING encode
    foreach v of varlist school area {
        encode `v', gen(_`v')
        order _`v', before(`v')
        drop `v'
        rename _`v' `v'
    }
    //    CFREATE AN INTEGER TIME VARIABLE
    by ID (startdt), sort: gen seq = _n
    
    xtset ID seq
    gen school_from = L.school
    gen area_from = L.area
    gen school_to = school
    gen area_to = area
    label values school_* _school
    label values area_* _area
    
    list, noobs sepby(ID)
    If you really need the results to have school* and area* as string variables, you can always -decode- them at the end.

    Comment


    • #3
      Thank you so much Clyde! I was exploring your codes. So you made the my data time series/panel by the xtset command. Then the L. lag one time before. This is great! After your command, the first row is useless, tight? I just deleted it. One thing that I do not understand is the encode macro syntax. When one just utilizes the encode command, it creates numeric values for the categories of a string variable. But when you used this new variable to create other variables, the label of the values disappear. I guess you macro encode take care of this but I do not understand it. I barely have used macros. They seem complicated. I started reading one article of macros and foreach, forvalue, etc by Nick Cox but founded a little complicated. It would great if Stata would offer a training in just macros for beginners.

      Thank you again!


      Comment


      • #4
        To understand Clyde's code you need to follow it through step by step:


        Code:
        foreach v of varlist school area {    
             encode `v', gen(_`v')    
             order _`v', before(`v')    
             drop `v'    
             rename _`v' `v'
        }
        becomes

        Code:
        encode school, gen(_school)    
        order _school, before(school)    
        drop school    
        rename _school school        
        
        encode area, gen(_area)    
        order _area, before(area)    
        drop area    
        rename _area area
        That's all that macros are doing here: letting you loop over a set of variable names, here just two of them

        Comment


        • #5
          When people want basic documentation, most of the time it then turns out that they are unaware of [U] or haven't tried it reading slowly and carefully.

          In the menu go Help > PDF Documentation > User's Guide.

          That's where most of the more experienced people here picked up most of their understanding, or so I guess from their comments over the years.

          Comment


          • #6
            Thank you so much Nick!

            Comment


            • #7
              Clyde Schechter Hi Clyde,

              Let me know if I should open another post of if this relates to my original post. I have a panel data and need to create a new location variable when the location is in era 1. My code does work because I loose the label values.
              ID time place era
              12357 1 RN 0
              12357 2 RN 0
              12357 3 FMO 1
              12357 4 FMO 1
              12357 5 FMO 1
              12357 6 DC 2
              12357 7 DC 2
              12357 8 DC 2
              Code:
              encode place, gen(placeN)
              gen place1=placeN if era==1
              bysort id: egen place1Final=max(place1)

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input int id byte time str3 place byte era long placeN float(place1 place1Final)
              12357 1 "RN"  0 3 . 2
              12357 2 "RN"  0 3 . 2
              12357 3 "FMO" 1 2 2 2
              12357 4 "FMO" 1 2 2 2
              12357 5 "FMO" 1 2 2 2
              12357 6 "DC"  2 1 . 2
              12357 7 "DC"  2 1 . 2
              12357 8 "DC"  2 1 . 2
              end
              label values placeN placeN
              label def placeN 1 "DC", modify
              label def placeN 2 "FMO", modify
              label def placeN 3 "RN", modify
              I think I need to save the value label in a macro.

              BEst,
              MArvin

              Comment


              • #8
                This is really a different topic and it would have been better to start a new thread, but since I don't think it will end up being a long chain of posts, let's just leave it here.

                So the problem is just that when you -gen- or -egen- a new variable from an existing one, the new variable does not take on the labeling of the original variable. You have to explicitly assign that label to it. So all you need to do to make the code you wrote produce the result you are looking for is add
                Code:
                label values place1 place1Final placeN
                at the bottom.

                As an aside, you don't need the intermediate variable place1. You could get place1Final directly with
                Code:
                bysort id: egen place1Final=max(cond(era == 1, placeN, .))
                And as another aside: if you use -clonevar- instead of -gen- to create a new variable then the variable label and value label, and any variable characteristics that may be defined, all get copied to the new variable. So when you really want to reproduce every aspect of an existing variable, -clonevar- is a lot simpler than doing it piece by piece. Also note that at some fairly recent point in Stata's evolution, it became possible to assign an existing value label within generate. So, for example, when you created place1 (which, as I pointed out you didn't need to do--I'm just using this as an illustration): had you done it as -gen place1:placeN = placeN if era == 1-, then Stata would have created it with the value label placeN already attached. This same device, however, does not work with -egen-.

                Finally, you seem to assume that the value of place will be the same for all observations with era == 1. That is true in your example, and if it is true in your data as a whole, that's fine. But if there are exceptions, be aware that place1 will take on more than one value, and place1Final, whether generated by your code or by mine, will take on the value of the largest among those, which may or may not be what you want. Since data sets have a tendency to contain nasty surprises around issues like this, you might want to check that assumptioni before you proceed. You can do this with:
                Code:
                by id era (place), sort: assert place[1] == place[_N] if era == 1
                If your assumption is correct, Stata will run this with no output and proceed to your next commands. If the assumption is wrong, then Stata will throw an error message and halt.

                Comment


                • #9
                  Thank you so much Clyde Schechter

                  You absolute right about you last point (assuming that all values in era 1 are the same). In reality they are not. I would like to pick just the first value in the era. Similar to selecting the place in era==1, I need to select the age of the student when he/she entered era 1 but if age varies in era==1 just choose the first one. My take was to create an "order" variable per era.
                  Code:
                  bysort id era:gen order=_n
                  and then choose the number 1. But perhaps it is a faster way to do this such as include [1] somewhere in you original code
                  Code:
                  bysort id: egen place1Final=max(cond(era == 1, placeN, .))
                  . What would be the best solution?

                  I am sorry for the late reply but I was trying to figure this out.

                  Thanks,
                  Marvin

                  Comment


                  • #10
                    This will select the first (according to the current sort order of the data) place of era 1 for each id.

                    Code:
                    * Example generated by -dataex-. To install: ssc install dataex
                    clear
                    input int id byte time str3 place byte era
                    12357 1 "RN"  0
                    12357 2 "RN"  0
                    12357 3 "FMO" 1
                    12357 4 "QRS" 1
                    12357 5 "XYZ" 1
                    12357 6 "DC"  2
                    12357 7 "DC"  2
                    12357 8 "DC"  2
                    end
                    
                    encode place, gen(placeN)
                    
                    
                    gen long order = _n    // RECORD SORT ORDER OF DATA
                    
                    //    CREATE A VARIABLE WITH FIRST OBS OF ERA 1
                    by id era (order), sort: gen first_place_era_1:placeN = placeN[1] if era == 1
                    by id (first_place_era_1), sort: replace first_place_era_1 = first_place_era_1[1]
                    sort order
                    Notes:
                    1. I modified your earlier data so that there would be more than one place mentioned in era = 1. Also, I left place as a string variable and didn't bother with encoding.
                    2. Note that the -by id era(order), sort: gen first_place_era_1:placeN = ...- command uses the facility for creating a new variable with a pre-existing label already attached. I don't recall at which version this was introduced to Stata. If you are running an older version of Stata, this command may give you a syntax error. If that happens, just take out the -:placeN- from the command, and at the end of the code add -label vales first_place_era_1 placeN-.

                    The code to select the first age would be entirely analogous, and I will leave it to you to write out the details.

                    Comment

                    Working...
                    X