combining retrospective and prospective information

Yara Issa

Join Date: Nov 2020

Posts: 42
#1

14 Jun 2021, 15:09

Hello all,

I have a question about how to combine retrospective and prospective information in Stata. I have panel data covering an observed period of about 10 years, 2009-2019 (Waves 1 to 9). I also have retrospective variables on employment history and marital history, which are time-invariant for each person over the observed period (for example: ever married, age at first marriage, age at first job after leaving school).
This information was collected in Wave 1 and Wave 5. I want to combine it with the prospective data so that I can use panel data analysis.
Can anyone help me with Stata commands, or share any knowledge on how this should be done?

Kind regards,
Yara
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

14 Jun 2021, 15:39

First, it matters whether the retrospective variables collected in Wave 1 and Wave 5 are the same, overlapping, or altogether different. If they are the same or overlapping, then the first task is to verify that, except perhaps for filling in missing values, the data are consistent in those two waves, or, if they differ, that the differences are due to something actually changing. (And, if inconsistencies are found, they need to be reconciled before proceeding further.) If the retrospective data are consistent from wave 1 to 5, then appending those two data sets and collapsing to a single observation per participant should be done first, and the result then merged with the prospective data.

If the variables obtained in waves 1 and 5 are completely different, then each can be separately merged into the prospective survey data.

If you want specific commands, you have to show example data: imaginary code written for imaginary data usually doesn't work well when confronted with the real data.
Yara Issa

Join Date: Nov 2020

Posts: 42
#3

14 Jun 2021, 16:05

Dear Clyde,

Thank you very much for your reply. The retrospective variables collected in Wave 1 and Wave 5 are the same. They are economic activity histories, collected in Wave 1 (2009-2010) and Wave 5 (2013-2015), and lifetime marital and relationship histories, collected in Wave 1 (2009-2010; for the new entrants) and Wave 6 (2014-2016). So, as I understand your comment, I should append those two data sets, collapse to a single observation per participant, and then merge the result with the prospective data.
I am not sure which command to use for "collapsing to a single observation per participant". Could you help with that, please?
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

14 Jun 2021, 18:14

See -help collapse-. After scanning that, click on the Remarks and examples link that's about in the middle of the page, and read a few of those so you get a sense of how it works.

But you skipped over an important part: the consistency check. What will you do if somebody's age at first marriage is said to be 24 at wave 1 and 27 at wave 5? And you will be very lucky, indeed, if nothing like that ever happens. On the assumption that there is no way to actually go back and get "the truth" from the people who curate the survey data, you will need to choose rules about how to resolve conflicts when they appear.

The first step would be to look at some descriptive statistics and remove impossible/highly implausible values. For example, if somebody's age at first marriage is reported as 5, that's almost surely incorrect and you should replace results like that with missing values. Similarly if somebody is said to have completed 125 years of education. With the surviving values, -collapse- will help you select which you want according to simple rules. You might always prefer Wave 1 (assuming it's not missing). You might always prefer wave 5. You might prefer the older age. You might prefer the younger age. You might decide to "split the difference" and average the ages reported. All of these have corresponding operators available in the -collapse- command: with the data starting out sorted on participant ID and wave number, these correspond to (firstnm), (lastnm), (max), (min), (mean), respectively. Or you might have reason to apply more complicated rules that might not be representable directly in -collapse-. In that case you might have to calculate new variables that reflect those rules before proceeding with -collapse-. These decisions will be based on your understanding of how the data were gathered and the implications that has for which of two conflicting values would be most likely to be correct (or most likely to be useful).
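As a minimal sketch of those rules on made-up data (the variable name agemar, the cutoff of 12, and all values are hypothetical, not taken from the real survey), one might first blank out implausible ages and then compute every candidate rule side by side for comparison:

```stata
* Toy data: two retrospective reports per person (hypothetical values)
clear
input long pidp byte wave int agemar
1 1 24
1 5 27
2 1 .
2 5 31
3 1 5
3 5 22
end

* Treat implausibly young ages as missing (the cutoff of 12 is an assumption)
replace agemar = . if agemar < 12

* Sort first so that (firstnm)/(lastnm) mean earliest/latest nonmissing wave
sort pidp wave
collapse (firstnm) agemar_first = agemar ///
         (lastnm)  agemar_last  = agemar ///
         (min)     agemar_min   = agemar ///
         (max)     agemar_max   = agemar ///
         (mean)    agemar_mean  = agemar, by(pidp)

list, noobs
```

In real use you would keep only the one rule you have decided on; computing all five here is just a way to see how they differ for people with conflicting reports.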

Last edited by Clyde Schechter; 14 Jun 2021, 18:18.
Yara Issa

Join Date: Nov 2020

Posts: 42
#5

23 Jun 2021, 14:14

Dear Clyde,

Thank you so much for your reply. I am in the process of checking the consistency of the data. I have the employment history from wave 1 for 25% of the sample; the rest was collected in wave 5. I am using 3 employment history variables taken from waves 1 and 5. Would it be correct to merge the variables I am using from waves 1 and 5 and then collapse them by pidp? Or should I collapse each variable by pidp first and then merge them together?
Yara Issa

Join Date: Nov 2020

Posts: 42
#6

23 Jun 2021, 14:25

To check the consistency of the answers, I am using these commands:
bys pidp: egen check=mean(agejob_dv)

gen check1=check-agejob_dv

tab pidp if check1!=. & check1!=0

where agejob_dv is the variable that captures the age at first job after leaving full-time education. Are these commands correct for checking the answers in waves 1 and 5, or do I need to specify the waves in the command?
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#7

23 Jun 2021, 15:06

Re #5. First -append- waves 1 and 5 together into a single data set. Then do the -collapse-. Then -merge- that with the all-waves data set.
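A rough sketch of that sequence, using tiny made-up files in place of the real ones (the tempfile names, the values, and the choice of (firstnm) as the conflict rule are all assumptions for illustration):

```stata
* Hypothetical wave 1 retrospective file
clear
input long pidp int agejob_dv
1 17
2 .
end
gen byte wave = 1
tempfile w1
save `w1'

* Hypothetical wave 5 retrospective file
clear
input long pidp int agejob_dv
2 19
3 21
end
gen byte wave = 5
tempfile w5
save `w5'

* Hypothetical all-waves panel: one row per person-wave
clear
input long pidp byte wave
1 1
1 2
2 1
2 5
3 5
3 6
end
tempfile panel
save `panel'

* Step 1: stack the two retrospective files
use `w1', clear
append using `w5'

* Step 2: reduce to one row per person (earliest nonmissing report)
sort pidp wave
collapse (firstnm) agejob_dv, by(pidp)

* Step 3: attach the time-invariant value to every wave of the panel;
* assert(match) stops with an error if any pidp fails to match
merge 1:m pidp using `panel', assert(match) nogenerate
sort pidp wave
list, sepby(pidp) noobs
```

With real data, assert(match) may be too strict if some panel members never answered the retrospective module; in that case it can be dropped and _merge inspected instead.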

Re #6. That looks good.
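For what it is worth, here is the same check run on a few made-up rows (values are hypothetical), with -list- in place of -tab- so the wave and the reported age appear next to each flagged pidp:

```stata
* Hypothetical appended wave 1 + wave 5 data
clear
input long pidp byte wave int agejob_dv
1 1 17
1 5 17
2 1 16
2 5 18
3 5 21
end

* Same logic as in #6: reports agree if each one equals the person's mean
bysort pidp: egen check = mean(agejob_dv)
gen check1 = check - agejob_dv

* Show only the people whose reports disagree, with the wave of each report
list pidp wave agejob_dv if check1 != 0 & !missing(check1), sepby(pidp) noobs
```

Because the mean is taken within pidp over whatever rows are present, there is no need to name the waves in the command as long as the data set contains only the wave 1 and wave 5 records.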
Yara Issa

Join Date: Nov 2020

Posts: 42
#8

23 Jun 2021, 15:09

Great thank you so much
Yara Issa

Join Date: Nov 2020

Posts: 42
#9

30 Jun 2021, 16:58

Dear Clyde,

I want to calculate the mortality rate for men who were aged 45+ at wave 1, that is, the proportion of them who died during the follow-up period (up to wave 9). I have a variable (deceased) that indicates whether the respondent was reported dead during the survey. I am not sure which command to use for this task. Also, is there any way to present the result in a chart to make it easy to read?
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#10

30 Jun 2021, 17:16

Something like this:

Code:

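* 1 if the respondent is ever recorded as deceased in any wave
by respondent_id (wave), sort: egen died_during_follow_up = max(deceased)
* -gen-, not -egen-, because this is a plain expression; age[1] is the age at the respondent's first observed wave
by respondent_id (wave): gen male_age_ge_45_wave_1 = (age[1] >= 45) & !missing(age[1]) & sex == "Male"
* Tag one observation per respondent so each person is counted once
egen respondent_flag = tag(respondent_id)
tab died_during_follow_up if male_age_ge_45_wave_1 & respondent_flag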

This will give you the proportion of men age 45+ at wave 1 who died during follow-up.

Note: You have not shown example data, so I don't know the actual names of your variables, and I'm making some assumptions here about how your data look and are organized. For further help with writing code, please use the -dataex- command to show example data. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Yara Issa

Join Date: Nov 2020
Posts: 42

#11

30 Jun 2021, 18:12

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long pidp float wave int age_dv byte dcsedfl_dv
  15645 .  . 2
  22445 .  . 2
  29925 .  . 2
  76165 .  . 2
 184965 .  . 2
 223725 .  . 2
 261125 .  . 2
 274047 .  . 2
 274731 .  . 2
 280165 .  . 2
 299885 .  . 2
 333205 .  . 2
 387605 .  . 2
 419567 .  . 2
 420251 .  . 2
 469205 .  . 2
 496405 .  . 2
 499127 .  . 2
 499811 .  . 2
 509327 .  . 2
 510011 .  . 2
 537205 .  . 2
 541285 .  . 2
 541965 .  . 2
 571887 .  . 2
 599765 .  . 2
 665045 .  . 2
 688847 .  . 2
 689531 .  . 2
 690215 .  . 2
 732365 .  . 2
 760925 .  . 2
 813285 2 40 2
 813285 3 41 2
 813285 4 42 2
 813285 5 43 2
 813285 6 44 2
 813285 7 45 2
 813285 8 46 2
 813285 9 47 2
 827565 .  . 2
 842525 .  . 2
 847287 .  . 2
 847971 .  . 2
 848659 .  . 2
 850005 .  . 2
 850027 .  . 2
 933647 .  . 2
 934331 .  . 2
 937047 .  . 2
 940445 .  . 2
 945205 .  . 2
 952005 .  . 2
 952685 .  . 2
 956765 2 55 2
 956765 3 56 2
 956765 4 57 2
 956765 5 58 2
 956765 6 59 2
 956765 7 60 2
 956765 8 61 2
 956765 9 62 2
 966967 .  . 2
 986685 .  . 2
 987365 .  . 2
 993491 .  . 2
1039047 .  . 2
1039735 .  . 2
1114525 2 36 2
1126765 .  . 2
1137647 .  . 2
1138331 .  . 2
1139703 .  . 2
1217402 .  . 2
1275687 .  . 2
1276367 .  . 2
1371567 .  . 2
1372251 .  . 2
1372935 .  . 2
1390605 .  . 2
1448407 .  . 2
1449091 .  . 2
1458607 .  . 2
1459291 .  . 2
1459975 .  . 2
1488527 .  . 2
1489211 .  . 2
1558565 .  . 2
1576245 .  . 2
1587125 .  . 2
1697285 .  . 2
1731965 .  . 2
1740127 .  . 2
1740811 .  . 2
1793847 .  . 2
1794531 .  . 2
1833965 2 45 2
1833965 3 46 2
1833965 4 47 2
1833965 5 48 2
end
label values age_dv a_age_dv
label values dcsedfl_dv dcsedfl_dv
label def dcsedfl_dv 2 "No", modify

Dear Clyde,
Thanks for your reply. For those who were 45 at wave 1, I need to define an upper age limit in order to restrict selective survival effects (as the childless have higher mortality than fathers) and also to ease interpretation of cohort effects. So I need to do some preliminary analysis looking at the proportions who died during the follow-up period. Does the above command capture this? I am sorry if I was not clear before.


Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#12

30 Jun 2021, 18:25

Well, it's not quite right. But your data trouble me, and I need to understand them better to get the code right. At least in the example you show, there are no data from wave 1. And what does it mean when wave has a missing value?
Yara Issa

Join Date: Nov 2020

Posts: 42
#13

01 Jul 2021, 09:00

Retrospective information on marital history was collected in waves 1 and 6. For employment history, 25% of the sample was collected in wave 1, and the remaining 75% was collected in wave 5 from those who did not answer the question in wave 1. I set the age for defining childlessness at 45.
I have two groups. A): those who participated in wave 1 and were already 45; I need them to be present in waves 1, 5 and 6, as I can only study them retrospectively. B): those who joined in wave 1 and were younger than 45; I have their retrospective and prospective information until they reach age 45, at which point they are counted as childless.

For A), I have to select an upper age limit to avoid sample bias, and for that I need to know how many of them died during the follow-up period, as these individuals might have different characteristics that allowed them to live longer than others. For example, men aged 45-70 in wave 1.

In the data example I provided, I realised there is an error: when I merged in the variable deceased, I used merge m:1 instead of 1:1, so now I have extra pidp with missing values. Can I solve this without redoing the data from scratch? That is, can I remove those unneeded pidp?
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#14

01 Jul 2021, 09:49

"In the data example I provided, I realised there is an error: when I merged in the variable deceased, I used merge m:1 instead of 1:1, so now I have extra pidp with missing values. Can I solve this without redoing the data from scratch? That is, can I remove those unneeded pidp?"

I'm not sure what to tell you about that. If the two data sets actually are uniquely identified by the variable(s) used as the merge key, then even if you used m:1 in the command, you will get the same result as if you used 1:1.

If, however, the master data set actually contains multiple observations for some or all values of the merge key variable(s), then there is the problem of which one to keep. The safer way to proceed in this case is to back up to the unmerged data sets and, within the one that has multiple observations, apply appropriate commands to reduce it to one observation per merge key using whatever the appropriate selection criterion is, and then re-run the -merge- as 1:1 with that reduced data set. This will probably take more time than just ripping some surplus observations out of the (mis-)merged data set you already have, but it is much less likely to introduce new errors. And I assume you are not in a hurry to get the wrong answers.
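To make that concrete, here is a small sketch on invented data of how one might test whether the merge key is unique and then reduce to one observation per key. The variable deceased is assumed to be coded 0/1 here, and the "any record says deceased" rule is only one possible selection criterion:

```stata
* Hypothetical deceased-flag file that accidentally repeats one pidp
clear
input long pidp byte deceased
1 0
2 0
2 1
3 0
end

* How many pidp values appear more than once?
duplicates report pidp

* -isid- exits with an error if pidp does not uniquely identify the rows;
* -capture- lets the do-file continue so the return code can be inspected
capture isid pidp
display "isid return code: " _rc

* Assumed rule for this sketch: deceased if any record says so
collapse (max) deceased, by(pidp)

* Now passes silently, so merging on pidp from this file is safe
isid pidp
```

If -isid- passes on both files before the merge, then m:1 and 1:1 give the same result, and any extra pidp must have come from somewhere else.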
Yara Issa

Join Date: Nov 2020

Posts: 42
#15

02 Jul 2021, 12:52

Dear Clyde,

If I want to drop those who were aged 45 in wave 1 (variable age_45) and did not survive to waves 5 and 6, would these commands be correct?

by pidp, sort: egen first56_wavee = min(wave)

drop if first56_wavee <= 5 & first56_wavee <= 6