
  • Matching strings across waves in longitudinal data

    Hi all, I am trying to wrangle some survey data (reproduced below with entirely fabricated data that mirrors the structure of my actual data). My goal is to figure out two things.

    The first is to create a variable with 3 mutually exclusive levels: (1) in their final survey entry, the respondent is still in the same program as in Wave 1; (2) in their final survey entry, the respondent is in a different program from Wave 1; (3) in their final survey entry, the respondent is not in any program (indicated by a missing program entry). My first pass at this used strmatch(), but I struggled to figure out how to compare a respondent's first survey entry to their final survey entry.

    The second is to flag the wave and date that they last appear in the survey. You'll notice that when someone doesn't take a survey at all, they have no row for that date, and that the final wave is wave 5. There are also some waves (wave 3 in this example data) where there were multiple surveys within that wave. Put simply, I just want to know when their final survey response took place.


    Example data:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int person_id byte survey_wave str9 survey_date str11 program
    100 1 "3/3/2010"  "Springfield"
    100 2 "4/1/2010"  "Springfield"
    100 3 "6/1/2010"  "Springfield"
    100 3 "6/5/2010"  "Springfield"
    100 3 "6/20/2010" "Springfield"
    100 4 "7/5/2010"  "Springfield"
    100 5 "1/1/2011"  "Springfield"
    101 1 "3/3/2010"  "Springfield"
    101 3 "6/1/2010"  "Springfield"
    101 3 "6/4/2010"  ""           
    101 4 "7/5/2010"  "Springfield"
    101 5 "1/1/2011"  "Springfield"
    102 1 "3/3/2010"  "Joyce"      
    102 2 "4/1/2010"  "Joyce"      
    102 3 "6/1/2010"  "Joyce"      
    102 3 "6/5/2010"  "Joyce"      
    102 3 "6/20/2010" "Joyce"      
    103 1 "3/3/2010"  "Capitol"    
    103 2 "4/1/2010"  "Capitol"    
    103 3 "6/1/2010"  "Capitol"    
    103 3 "6/4/2010"  "Capitol"    
    103 4 "7/5/2010"  "Green"      
    103 5 "1/1/2011"  "Green"      
    104 1 "3/3/2010"  "Target"     
    104 2 "4/1/2010"  "Breakers"   
    104 3 "6/1/2010"  "Breakers"   
    104 3 "6/4/2010"  "Breakers"   
    104 3 "6/20/2010" "Breakers"   
    104 3 "6/25/2010" "Breakers"   
    104 4 "7/5/2010"  "Breakers"   
    104 5 "1/1/2011"  ""           
    105 1 "3/3/2010"  "Capitol"    
    105 2 "4/1/2010"  "Capitol"    
    105 4 "7/5/2010"  "Springfield"
    106 1 "3/4/2010"  "Mill Park"  
    106 2 "4/1/2010"  "Mill Park"  
    106 3 "6/1/2010"  ""           
    106 3 "6/4/2010"  ""           
    end

  • #2
    Code:
    bys person_id: g program1 = program[1]
    g same = program==program1
    g diff = program != program1
    g out = program==""


    • #3
      The first command in #2, -bys person_id: g program1 = program[1]-, is potentially wrong. Because person_id does not uniquely identify observations in this data set, sorting on person_id alone may randomize the order of the observations within person_id. That's a big problem, since the whole point is to make sure we identify the first program. -bys person_id (survey_wave): g program1 = program[1]- ensures that the observations are sorted by wave within person.
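
      A minimal sketch of that sort, assuming the variable names in the example data. Because survey_date is a string and some waves contain several surveys, the sketch first builds a numeric date and sorts on that, so the order within person is fully determined. The name sdate is hypothetical and does not appear in the original data.

      Code:
      * sdate is a hypothetical name; "MDY" matches strings such as "6/20/2010"
      gen sdate = date(survey_date, "MDY")
      format sdate %td
      * stops with an error if person_id and sdate do not uniquely identify rows
      isid person_id sdate
      * the first row within person is now the earliest survey
      bysort person_id (sdate): gen program1 = program[1]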


      • #4
        Thanks George Ford and Clyde Schechter. One small thing: the final line seems to flag every missing program. Respondent 106, for example, has multiple missings, and I only want to flag the last response, while also creating a separate indicator for when that final response is itself blank. Any intermediate missings can be ignored.


        • #5
          Setting aside Clyde's good point,

          Code:
          egen wavemax = max(survey_wave), by(person_id)
          replace same = 0 if survey_wave != wavemax
          replace diff = 0 if survey_wave != wavemax
          replace out = 0 if survey_wave != wavemax
          or

          Code:
          replace same = . if survey_wave != wavemax
          replace diff = . if survey_wave != wavemax
          replace out = . if survey_wave != wavemax


          • #6
            Again, if Clyde is right about replicated person_id, you've got to deal with that.

            To check that, try

            Code:
            xtset person_id survey_wave
            and see if you get an error.
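
            In the example data that check should fail, since some waves contain more than one survey per person. A sketch of a more direct way to see the extent of the repetition, assuming the variable names in the example data:

            Code:
            * reports how many person_id/survey_wave combinations occur more than once
            duplicates report person_id survey_wave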

            Here's a condensed version.

            Code:
            egen wavemax = max(survey_wave), by(person_id)
            bys person_id (survey_wave): g program1 = program[1]
            g same = program==program1 if survey_wave == wavemax
            g diff = program != program1 if survey_wave == wavemax
            g out = program=="" if survey_wave == wavemax
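
            Note that survey_wave == wavemax is true for every row of the final wave, so a respondent whose last wave contains several surveys (102 and 106 in the example data) gets several flagged rows. Below is a sketch that flags only the single last response and records its wave and date, assuming the variable names in the example data. The names sdate, last, last_wave, last_date and status are hypothetical, and the sketch creates its own program1, so it is meant to be run on the original data rather than after the code above.

            Code:
            * numeric version of the string date; "MDY" matches "6/20/2010"
            gen sdate = date(survey_date, "MDY")
            format sdate %td
            * last = 1 on each respondent's final survey row, 0 elsewhere
            bysort person_id (sdate): gen byte last = _n == _N
            by person_id: gen program1 = program[1]
            * wave and date of the final response, copied to every row of the person
            by person_id: gen last_wave = survey_wave[_N]
            by person_id: gen last_date = sdate[_N]
            format last_date %td
            * 1 = same program as first entry, 2 = different program, 3 = no program
            gen byte status = cond(program == "", 3, cond(program == program1, 1, 2)) if last
            list person_id last_wave last_date status if last

            Here status is missing on every row except the final one, so intermediate blanks (such as the one for respondent 101) are ignored.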


            • #7
              Thanks George Ford. It seems like Clyde's proposed solution (bys person_id (survey_wave): g program1 = program[1]) deals with the potential randomization within person_id.
