Matching strings across waves in longitudinal data

Anne Todd

Join Date: Dec 2018
Posts: 157


20 Feb 2024, 11:10

Hi all, I am trying to wrangle some survey data (reproduced below with entirely fabricated data that mirrors the structure of my actual data). My goal is to figure out two things.

The first is to create a variable with 3 mutually exclusive levels: (1) in their final survey entry, the respondent is still in the same program as Wave 1; (2) in their final survey entry, the person is in a different program from Wave 1; (3) in their final survey entry, the person is not in any program (indicated by their program entry being missing). My first pass at this used strmatch() but I struggled to figure out how to specify comparing a respondent's first survey entry to their final survey entry.

The second is to flag the wave and date that they last appear in the survey. You'll notice that when someone doesn't take a survey at all, they have no row for that date, and that the final wave is wave 5. There are also some waves (wave 3 in this example data) where there were multiple surveys within that wave. Put simply, I just want to know when their final survey response took place.

Example data:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int person_id byte survey_wave str9 survey_date str11 program
100 1 "3/3/2010"  "Springfield"
100 2 "4/1/2010"  "Springfield"
100 3 "6/1/2010"  "Springfield"
100 3 "6/5/2010"  "Springfield"
100 3 "6/20/2010" "Springfield"
100 4 "7/5/2010"  "Springfield"
100 5 "1/1/2011"  "Springfield"
101 1 "3/3/2010"  "Springfield"
101 3 "6/1/2010"  "Springfield"
101 3 "6/4/2010"  ""           
101 4 "7/5/2010"  "Springfield"
101 5 "1/1/2011"  "Springfield"
102 1 "3/3/2010"  "Joyce"      
102 2 "4/1/2010"  "Joyce"      
102 3 "6/1/2010"  "Joyce"      
102 3 "6/5/2010"  "Joyce"      
102 3 "6/20/2010" "Joyce"      
103 1 "3/3/2010"  "Capitol"    
103 2 "4/1/2010"  "Capitol"    
103 3 "6/1/2010"  "Capitol"    
103 3 "6/4/2010"  "Capitol"    
103 4 "7/5/2010"  "Green"      
103 5 "1/1/2011"  "Green"      
104 1 "3/3/2010"  "Target"     
104 2 "4/1/2010"  "Breakers"   
104 3 "6/1/2010"  "Breakers"   
104 3 "6/4/2010"  "Breakers"   
104 3 "6/20/2010" "Breakers"   
104 3 "6/25/2010" "Breakers"   
104 4 "7/5/2010"  "Breakers"   
104 5 "1/1/2011"  ""           
105 1 "3/3/2010"  "Capitol"    
105 2 "4/1/2010"  "Capitol"    
105 4 "7/5/2010"  "Springfield"
106 1 "3/4/2010"  "Mill Park"  
106 2 "4/1/2010"  "Mill Park"  
106 3 "6/1/2010"  ""           
106 3 "6/4/2010"  ""           
end


George Ford

Join Date: Aug 2014
Posts: 3044

20 Feb 2024, 11:53

Code:

bys person_id: g program1 = program[1]
g same = program==program1
g diff = program != program1
g out = program==""


Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#3

20 Feb 2024, 12:22

The first command in #2, -bys person_id: g program1 = program[1]-, is potentially wrong. Because person_id does not uniquely identify observations in this data, sorting on person_id alone may randomize the order of the data within person_id. That's a big problem, since the whole point is to ensure that we identify the first program. -bys person_id (survey_wave): g program1 = program[1]- will ensure that the observations remain sorted by wave within person.
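A minimal sketch of that sorting idea, assuming survey_date is in month/day/year order. The numeric date sdate is a name introduced here for illustration; adding it to the sort key also fixes the order of the several surveys that fall inside wave 3. The rows are a subset of the example data in #1, deliberately entered out of order.

```stata
* Subset of the example data from #1 (persons 101 and 106), shuffled
clear
input int person_id byte survey_wave str9 survey_date str11 program
101 3 "6/4/2010"  ""
101 1 "3/3/2010"  "Springfield"
101 5 "1/1/2011"  "Springfield"
101 3 "6/1/2010"  "Springfield"
101 4 "7/5/2010"  "Springfield"
106 3 "6/4/2010"  ""
106 2 "4/1/2010"  "Mill Park"
106 1 "3/4/2010"  "Mill Park"
106 3 "6/1/2010"  ""
end

* Assumption: dates are month/day/year; sdate is a hypothetical helper variable
gen sdate = daily(survey_date, "MDY")
format sdate %td

* Stops with an error unless person, wave and date together identify each row
isid person_id survey_wave sdate

* Sort within person by wave, then date, so [1] is the earliest response
bysort person_id (survey_wave sdate): gen program1 = program[1]
list person_id survey_wave sdate program program1, sepby(person_id)
```

Sorting on survey_wave alone is enough to put the wave 1 row first as long as each person has only one wave 1 row; the date only matters for ordering rows within a wave.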
Anne Todd

Join Date: Dec 2018

Posts: 157
#4

20 Feb 2024, 12:34

Thanks George Ford and Clyde Schechter. One small thing: the final line seems to flag every missing value, but respondent 106, for example, has multiple missings and I only want to flag the last response, with a separate indicator for when that final response is itself blank. Any intermediate missings can be ignored.
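One way this request could be sketched, assuming dates are month/day/year: mark each person's final row, then code a single three-level variable on that row only. The names sdate, last, and final_status are introduced here for illustration and are not from the replies above; the rows are a subset of the example data in #1.

```stata
* Subset of the example data from #1 (persons 102, 103 and 106)
clear
input int person_id byte survey_wave str9 survey_date str11 program
102 1 "3/3/2010"  "Joyce"
102 2 "4/1/2010"  "Joyce"
102 3 "6/1/2010"  "Joyce"
102 3 "6/5/2010"  "Joyce"
102 3 "6/20/2010" "Joyce"
103 1 "3/3/2010"  "Capitol"
103 2 "4/1/2010"  "Capitol"
103 3 "6/1/2010"  "Capitol"
103 3 "6/4/2010"  "Capitol"
103 4 "7/5/2010"  "Green"
103 5 "1/1/2011"  "Green"
106 1 "3/4/2010"  "Mill Park"
106 2 "4/1/2010"  "Mill Park"
106 3 "6/1/2010"  ""
106 3 "6/4/2010"  ""
end

* Assumption: dates are month/day/year; sdate is a hypothetical helper variable
gen sdate = daily(survey_date, "MDY")
format sdate %td

* Wave 1 program, and a marker for each person's final row
bysort person_id (survey_wave sdate): gen program1 = program[1]
bysort person_id (survey_wave sdate): gen byte last = _n == _N

* 1 = same program as wave 1, 2 = different program, 3 = no program
* Defined on the final row only; intermediate blanks are ignored
gen byte final_status = .
replace final_status = 1 if last & program == program1 & program != ""
replace final_status = 2 if last & program != program1 & program != ""
replace final_status = 3 if last & program == ""
label define final_status 1 "Same program" 2 "Different program" 3 "No program"
label values final_status final_status
list person_id survey_wave sdate program final_status if last
```

Checking for a blank program before comparing keeps the three levels mutually exclusive, since a blank final entry would otherwise also count as "different" from wave 1.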

George Ford

Join Date: Aug 2014
Posts: 3044

20 Feb 2024, 12:47

Setting aside Clyde's good point,

Code:

egen wavemax = max(survey_wave), by(person_id)
replace same = 0 if survey_wave != wavemax
replace diff = 0 if survey_wave != wavemax
replace out = 0 if survey_wave != wavemax

Or, to set the non-final rows to missing instead of 0:

Code:

replace same = . if survey_wave != wavemax
replace diff = . if survey_wave != wavemax
replace out = . if survey_wave != wavemax


George Ford

Join Date: Aug 2014
Posts: 3044

20 Feb 2024, 13:54

Again, if Clyde is right about replicated person_id, you've got to deal with that.

To check that, try

Code:

xtset person_id survey_wave

and see if you get an error.
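As a rough sketch of that check on a subset of the example data from #1, -duplicates tag- shows which person-wave combinations repeat, and -capture- lets a do-file continue past the -xtset- error. The variable name dup is introduced here for illustration.

```stata
* Subset of the example data from #1 (persons 105 and 106)
clear
input int person_id byte survey_wave str9 survey_date str11 program
105 1 "3/3/2010"  "Capitol"
105 2 "4/1/2010"  "Capitol"
105 4 "7/5/2010"  "Springfield"
106 1 "3/4/2010"  "Mill Park"
106 2 "4/1/2010"  "Mill Park"
106 3 "6/1/2010"  ""
106 3 "6/4/2010"  ""
end

* dup counts the other rows sharing the same person and wave (0 = unique)
duplicates tag person_id survey_wave, generate(dup)
list person_id survey_wave survey_date if dup > 0

* xtset refuses repeated time values within a panel; _rc is nonzero if so
capture xtset person_id survey_wave
display "xtset return code: " _rc
```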

Here's a condensed version.

Code:

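* Last wave in which each person appears
egen wavemax = max(survey_wave), by(person_id)
* Wave 1 program; sorting on survey_wave within person keeps wave 1 first (see #3)
bys person_id (survey_wave): g program1 = program[1]
* Flags are defined only on rows from the person's last wave, missing elsewhere
* Note: diff is also 1 when program is blank, so check out before diff
g same = program==program1 if survey_wave == wavemax
g diff = program != program1 if survey_wave == wavemax
g out = program=="" if survey_wave == wavemax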
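For the second part of the question in #1 (the wave and date of each person's last response), the same within-person sort could be reused. This is only a sketch on a subset of the example data, assuming month/day/year dates; sdate, last_wave, and last_date are names introduced here for illustration.

```stata
* Subset of the example data from #1 (persons 102 and 105)
clear
input int person_id byte survey_wave str9 survey_date str11 program
102 1 "3/3/2010"  "Joyce"
102 2 "4/1/2010"  "Joyce"
102 3 "6/1/2010"  "Joyce"
102 3 "6/5/2010"  "Joyce"
102 3 "6/20/2010" "Joyce"
105 1 "3/3/2010"  "Capitol"
105 2 "4/1/2010"  "Capitol"
105 4 "7/5/2010"  "Springfield"
end

* Assumption: dates are month/day/year; sdate is a hypothetical helper variable
gen sdate = daily(survey_date, "MDY")
format sdate %td

* After sorting by wave and date within person, row _N is the final response
bysort person_id (survey_wave sdate): gen byte last_wave = survey_wave[_N]
bysort person_id (survey_wave sdate): gen last_date = sdate[_N]
format last_date %td
list person_id survey_wave sdate last_wave last_date, sepby(person_id)
```

Because the sort key includes the date, a final wave with several surveys (person 102 in wave 3) resolves to the latest of them rather than to every row in that wave.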


Anne Todd

Join Date: Dec 2018

Posts: 157
#7

20 Feb 2024, 14:56

Thanks George Ford. It seems like Clyde's proposed solution (bys person_id (survey_wave): g program1 = program[1]) deals with the potential randomization within person_id.