Weird collapse/reshape: move observations one row below for a variable

Marvin Aliaga

Join Date: Feb 2015
Posts: 255

Weird collapse/reshape: move observations one row below for a variable

11 Aug 2016, 13:48

I have a long dataset that contains stays in classes of a series of schools. The data look like this:

ID	school	area	startdt	enddt
464	KC	U16	1/6/2016 2:10	5/4/2016 8:32
464	KC	U15	5/4/2016 8:32	5/25/2016 1:10
464	KC	U16	5/25/2016 1:10	6/29/2016 10:20
464	KC	U2	6/29/2016 10:20	7/2/2016 3:02
464	KC	L11	7/2/2016 3:02	7/26/2016 3:30

So for example this student stayed in U16 from 1/6/16 to 5/4/16, then he/she stayed in U15 from 5/4/16 to 5/25/16. My goal is to generate moves instead of stays. So the first movement was in 5/4/2016 which is the end date of the first stay. Then I need to create a variable that show the area they were and the are the went to, in this case U16 to U15. I was able to do this by:

Code:

bysort ID (startdt): gen  MOVEMENT=enddt[_n-1] 
format MOVEMENT %tc

bysort ID (startdt): gen school_from=school[_n-1] 
bysort ID (startdt)): gen area_from =area[_n-1] 
bysort ID (startdt)): gen school_to=school[_n+1] 
bysort ID (startdt)): gen area_to=area[_n+1]

ID	school	area	startdt	enddt	MOVEMENT	school_from	area_from	school_to	area_to
464	KC	U16	1/6/2016 2:10	5/4/2016 8:32				KC	U15
464	KC	U15	5/4/2016 8:32	5/25/2016 1:10	4-May-16	KC	U16	KC	U16
464	KC	U16	5/25/2016 1:10	6/29/2016 10:20	25-May-16	KC	U15	KC	U2
464	KC	U2	6/29/2016 10:20	7/2/2016 3:02	29-Jun-16	KC	U16	KC	L11
464	KC	L11	7/2/2016 3:02	7/26/2016 3:30	2-Jul-16	KC	U2

The result looks ok to me. The only problem is that the "to" variables such as school_to are one row before the corresponding "from" variable such as from_to and they should be in the same row. Perhaps this is not the best approach to do this. Any ideas on how to fix this issue or a new approach? One fix could be to just move the values of the "to" variables one row more (_n + 1???).

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int ID str2 school str3 area double startdt str16 enddt int MOVEMENT str2 school_from str3 area_from str4 school_to str8 area_to
464 "KC" "U16" 1767665400000.0002 "   5/4/2016 8:32"     . ""   ""    "KC"   "U15"     
464 "KC" "U15" 1777969920000.0002 "  5/25/2016 1:09" 20578 "KC" "U16" "KC"   "U16"     
464 "KC" "U16" 1779757799999.9998 " 6/29/2016 10:20" 20599 "KC" "U15" "KC"   "U2"      
464 "KC" "U2"       1.7828148e+12 "   7/2/2016 3:01" 20634 "KC" "U16" "KC"   "L11"     
464 "KC" "L11" 1783047719999.9998 "  7/26/2016 3:30" 20637 "KC" "U2"  ""     ""        
452 "DC" "6L"        1.771665e+12 " 5/22/2016 10:02"     . ""   ""    "BKDC" "7L"      
452 "DC" "7L"  1779530520000.0002 "  6/3/2016 11:00" 20596 "DC" "6L"  "BKDC" "8L"      
452 "DC" "8L"  1780570859999.9998 "  6/16/2016 2:38" 20608 "DC" "7L"  "BKDC" "8L"      
452 "DC" "8L"       1781663880000 "."                20621 "DC" "8L"  ""     "        "
end
format %tcnn/dd/ccYY_hh:MM startdt
format %tddd-Mon-YY MOVEMENT

Thank you!

Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 29963

11 Aug 2016, 15:32

This is much easier to do if you have school and area as numerically encoded variables with value labels.

Code:

set more off
clear*

* Example generated by -dataex-. To install: ssc install dataex
clear
input int ID str2 school str3 area double startdt str16 enddt int MOVEMENT
464 "KC" "U16" 1767665400000.0002 "   5/4/2016 8:32"     .
464 "KC" "U15" 1777969920000.0002 "  5/25/2016 1:09" 20578
464 "KC" "U16" 1779757799999.9998 " 6/29/2016 10:20" 20599
464 "KC" "U2"       1.7828148e+12 "   7/2/2016 3:01" 20634
464 "KC" "L11" 1783047719999.9998 "  7/26/2016 3:30" 20637
452 "DC" "6L"        1.771665e+12 " 5/22/2016 10:02"     .
452 "DC" "7L"  1779530520000.0002 "  6/3/2016 11:00" 20596
452 "DC" "8L"  1780570859999.9998 "  6/16/2016 2:38" 20608
452 "DC" "8L"       1781663880000 "."                20621
end
format %tcnn/dd/ccYY_hh:MM startdt
format %tddd-Mon-YY MOVEMENT

//    MAKE SCHOOL AND AREA NUMERIC VARIABLES BY USING encode
foreach v of varlist school area {
    encode `v', gen(_`v')
    order _`v', before(`v')
    drop `v'
    rename _`v' `v'
}
//    CFREATE AN INTEGER TIME VARIABLE
by ID (startdt), sort: gen seq = _n

xtset ID seq
gen school_from = L.school
gen area_from = L.area
gen school_to = school
gen area_to = area
label values school_* _school
label values area_* _area

list, noobs sepby(ID)

If you really need the results to have school* and area* as string variables, you can always -decode- them at the end.

Comment

Marvin Aliaga

Join Date: Feb 2015

Posts: 255
#3

12 Aug 2016, 09:13

Thank you so much Clyde! I was exploring your codes. So you made the my data time series/panel by the xtset command. Then the L. lag one time before. This is great! After your command, the first row is useless, tight? I just deleted it. One thing that I do not understand is the encode macro syntax. When one just utilizes the encode command, it creates numeric values for the categories of a string variable. But when you used this new variable to create other variables, the label of the values disappear. I guess you macro encode take care of this but I do not understand it. I barely have used macros. They seem complicated. I started reading one article of macros and foreach, forvalue, etc by Nick Cox but founded a little complicated. It would great if Stata would offer a training in just macros for beginners.

Thank you again!
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35451

12 Aug 2016, 09:26

To understand Clyde's code you need to follow it through step by step:

Code:

foreach v of varlist school area {    
     encode `v', gen(_`v')    
     order _`v', before(`v')    
     drop `v'    
     rename _`v' `v'
}

becomes

Code:

encode school, gen(_school)    
order _school, before(school)    
drop school    
rename _school school        

encode area, gen(_area)    
order _area, before(area)    
drop area    
rename _area area

That's all that macros are doing here: letting you loop over a set of variable names, here just two of them

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35451
#5

12 Aug 2016, 10:06

When people want basic documentation, most of the time it then turns out that they are unaware of [U] or haven't tried it reading slowly and carefully.

In the menu go Help > PDF Documentation > User's Guide.

That's where most of the more experienced people here picked up most of their understanding, or so I guess from their comments over the years.
1 like
Comment
Marvin Aliaga

Join Date: Feb 2015

Posts: 255
#6

12 Aug 2016, 12:14

Thank you so much Nick!
Comment

Marvin Aliaga

Join Date: Feb 2015
Posts: 255

27 Sep 2016, 14:59

Clyde Schechter Hi Clyde,

Let me know if I should open another post of if this relates to my original post. I have a panel data and need to create a new location variable when the location is in era 1. My code does work because I loose the label values.

ID	time	place	era
12357	1	RN	0
12357	2	RN	0
12357	3	FMO	1
12357	4	FMO	1
12357	5	FMO	1
12357	6	DC	2
12357	7	DC	2
12357	8	DC	2

Code:

encode place, gen(placeN)
gen place1=placeN if era==1
bysort id: egen place1Final=max(place1)

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int id byte time str3 place byte era long placeN float(place1 place1Final)
12357 1 "RN"  0 3 . 2
12357 2 "RN"  0 3 . 2
12357 3 "FMO" 1 2 2 2
12357 4 "FMO" 1 2 2 2
12357 5 "FMO" 1 2 2 2
12357 6 "DC"  2 1 . 2
12357 7 "DC"  2 1 . 2
12357 8 "DC"  2 1 . 2
end
label values placeN placeN
label def placeN 1 "DC", modify
label def placeN 2 "FMO", modify
label def placeN 3 "RN", modify

I think I need to save the value label in a macro.

BEst,
MArvin

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29963
#8

27 Sep 2016, 15:17

This is really a different topic and it would have been better to start a new thread, but since I don't think it will end up being a long chain of posts, let's just leave it here.

So the problem is just that when you -gen- or -egen- a new variable from an existing one, the new variable does not take on the labeling of the original variable. You have to explicitly assign that label to it. So all you need to do to make the code you wrote produce the result you are looking for is add

Code:

label values place1 place1Final placeN

at the bottom.

As an aside, you don't need the intermediate variable place1. You could get place1Final directly with

Code:

bysort id: egen place1Final=max(cond(era == 1, placeN, .))

And as another aside: if you use -clonevar- instead of -gen- to create a new variable then the variable label and value label, and any variable characteristics that may be defined, all get copied to the new variable. So when you really want to reproduce every aspect of an existing variable, -clonevar- is a lot simpler than doing it piece by piece. Also note that at some fairly recent point in Stata's evolution, it became possible to assign an existing value label within generate. So, for example, when you created place1 (which, as I pointed out you didn't need to do--I'm just using this as an illustration): had you done it as -gen place1:placeN = placeN if era == 1-, then Stata would have created it with the value label placeN already attached. This same device, however, does not work with -egen-.

Finally, you seem to assume that the value of place will be the same for all observations with era == 1. That is true in your example, and if it is true in your data as a whole, that's fine. But if there are exceptions, be aware that place1 will take on more than one value, and place1Final, whether generated by your code or by mine, will take on the value of the largest among those, which may or may not be what you want. Since data sets have a tendency to contain nasty surprises around issues like this, you might want to check that assumptioni before you proceed. You can do this with:

Code:

by id era (place), sort: assert place[1] == place[_N] if era == 1

If your assumption is correct, Stata will run this with no output and proceed to your next commands. If the assumption is wrong, then Stata will throw an error message and halt.
Comment
Marvin Aliaga

Join Date: Feb 2015

Posts: 255
#9

28 Sep 2016, 13:34

Thank you so much Clyde Schechter

You absolute right about you last point (assuming that all values in era 1 are the same). In reality they are not. I would like to pick just the first value in the era. Similar to selecting the place in era==1, I need to select the age of the student when he/she entered era 1 but if age varies in era==1 just choose the first one. My take was to create an "order" variable per era.

Code:

bysort id era:gen order=_n

and then choose the number 1. But perhaps it is a faster way to do this such as include [1] somewhere in you original code

Code:

bysort id: egen place1Final=max(cond(era == 1, placeN, .))

. What would be the best solution?

I am sorry for the late reply but I was trying to figure this out.

Thanks,
Marvin
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29963
#10

28 Sep 2016, 14:40

This will select the first (according to the current sort order of the data) place of era 1 for each id.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input int id byte time str3 place byte era 12357 1 "RN" 0 12357 2 "RN" 0 12357 3 "FMO" 1 12357 4 "QRS" 1 12357 5 "XYZ" 1 12357 6 "DC" 2 12357 7 "DC" 2 12357 8 "DC" 2 end encode place, gen(placeN) gen long order = _n // RECORD SORT ORDER OF DATA // CREATE A VARIABLE WITH FIRST OBS OF ERA 1 by id era (order), sort: gen first_place_era_1:placeN = placeN[1] if era == 1 by id (first_place_era_1), sort: replace first_place_era_1 = first_place_era_1[1] sort order

Notes:
1. I modified your earlier data so that there would be more than one place mentioned in era = 1. Also, I left place as a string variable and didn't bother with encoding.
2. Note that the -by id era(order), sort: gen first_place_era_1:placeN = ...- command uses the facility for creating a new variable with a pre-existing label already attached. I don't recall at which version this was introduced to Stata. If you are running an older version of Stata, this command may give you a syntax error. If that happens, just take out the -:placeN- from the command, and at the end of the code add -label vales first_place_era_1 placeN-.

The code to select the first age would be entirely analogous, and I will leave it to you to write out the details.
Comment

Announcement

Weird collapse/reshape: move observations one row below for a variable

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment