Replacing observations appearing multiple times on the same variable on the same participant. Longitudinal study.

Ingvild Lappegaard

Join Date: Nov 2020

Posts: 3
#1

22 Jan 2021, 08:05

Hi again. I have a new problem, and I'll try to explain it as best I can.

First, some background info. I am studying retirement age using a longitudinal study collected in three rounds. Each participant has a reference number (ref_nr), which also tells me how many rounds they participated in. I wanted to know each participant's retirement age, and therefore generated a new variable, retirementage, from the variables I had: the participant's year of retirement, the year the interview was held, and the participant's age at the time of the interview. I subtracted the participant's age at the time of the interview from the year of the interview, which gave me the participant's birth year. I could then subtract the year of birth from the year of retirement to create my new variable retirementage. All of this went fine, I thought...

Some of the participants ended up with two different retirement ages, one year apart, in my new variable retirementage. This is because the interviews in the three rounds were not necessarily done at the same time of year. For example, if a participant with a birthday in April was interviewed in February 2002 at age 50, and then interviewed again in June 2007, he/she would have turned 56 (and not 55) because of the timing of the interviews. As you may understand, the calculation then gave me different years of birth, and therefore different values of retirementage.
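The arithmetic in this example can be reproduced in a few lines. This is only an illustrative sketch: the variable names interview_year, age_at_interview, and retirement_year are assumed, and the retirement year of 2006 is made up and simply carried on both rows.

```stata
* Toy data for the example above: one participant, interviewed in 2002 (age 50)
* and again in 2007 (age 56). retirement_year = 2006 is a made-up value.
clear
input ref_nr round interview_year age_at_interview retirement_year
1 1 2002 50 2006
1 2 2007 56 2006
end

* Implied birth year differs by one between rounds (1952 vs. 1951)
gen birth_year = interview_year - age_at_interview

* ...so the derived retirement age differs by one as well (54 vs. 55)
gen retirementage = retirement_year - birth_year

list
```

The two rows disagree only because the first interview took place before that year's birthday and the second took place after it.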

So all in all, IF the same reference number (ref_nr), i.e. the same participant, has two values of retirementage, I want to replace one of the values with ".", because this is giving me non-existent values of retirementage. Alternatively, I could use the average of the two, but I still want it to be represented only once for each participant.

OR

Perhaps one of you can suggest another way to calculate retirement age from the variables I have, one that accounts for birth month somehow, so that I avoid this discrepancy.

I hope this was somewhat understandable, and that someone can help me!!

Thanks, in advance.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

22 Jan 2021, 10:50

Refining your estimation using the birth month would only be possible if you also have the month in which each survey was administered. Even so, you could still run into the same problem if two surveys were administered at different times in the same month that straddle the actual birth date, albeit this will be a much less frequent occurrence.

I don't see replacing one of the values as missing as making sense: you don't know which one is correct, and you are just as likely to clobber the correct one as the wrong one. I think using the average makes more sense.

Code:

by participant_id, sort: egen mean_retirement_age = mean(retirement_age)

Evidently, replace the variable names in the above code by the names of the corresponding variables in your data set. In the future, when asking for help with code, it is best to show example data, and to use the -dataex- command for that purpose. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
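A minimal sketch of how the averaging behaves, using made-up toy data and the placeholder names participant_id and retirement_age from the code above (they are not the names in the real data set):

```stata
* Toy data: participant 1 has two conflicting values,
* participant 2 has one value and one missing.
clear
input participant_id round retirement_age
1 1 66
1 2 67
2 1 61
2 2 .
end

* mean() ignores missing values within each participant and
* writes the result to every row of that participant
by participant_id, sort: egen mean_retirement_age = mean(retirement_age)

list, sepby(participant_id)
```

Note that this makes the values consistent within a participant but does not change the number of rows: a participant seen in two rounds still has two observations, both carrying the same mean.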

Ingvild Lappegaard

Join Date: Nov 2020
Posts: 3
#3

01 Feb 2021, 05:39

Thank you for your response, but it did not quite solve the problem. Below I have attached an example exported from my dataset, as you requested. I generated a new variable with the mean retirement age, as you suggested, but I am still stuck with two retirement ages for each ref_nr.

As you can see for ref_nr 27, 33, 43, I have two retirement ages for the same candidate. I fear that this causes "false" values in further analysis, because Stata thinks it represents two independent retirement ages.

So what I want to do is keep only the retirement age reported in the last round in which they participated. That is, among rows with the same ref_nr, keep the observation from the last round. Alternatively, replace retirement age with "." in the earlier round(s) (1 or 2, depending on whether they participated in rounds 1 and 2 or in rounds 2 and 3) among rows with the same ref_nr.

The following command represents what I want to do; however, it does not work.

replace A_retir_age == . if round == _n != _N & ref_nr == !=

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(ref_nr round) float A_retir_age
  5 2   70
  5 1   70
 10 2    .
 10 1    .
 15 2    .
 15 1    .
 18 3   60
 18 2   60
 27 2 66.5
 27 3 66.5
 33 3 61.5
 33 2 61.5
 43 3 64.5
 43 2 64.5
 54 1    .
 54 2    .
 70 2   61
 70 3   61
 70 1   61
 73 2    .
 73 1    .
 86 1    .
 86 2    .
100 2    .
100 1    .
104 1    .
104 2    .
107 3   67
107 2   67
121 1    .
121 2    .
123 2 58.5
123 3 58.5
131 3    .
131 1    .
134 2   63
134 3   63
135 2 61.5
135 3 61.5
137 2    .
137 1    .
141 3 66.5
141 2 66.5
144 2   63
144 3   63
146 2 66.5
146 3 66.5
166 2 56.5
166 3 56.5
185 2    .
185 1    .
200 2   62
200 3   62
210 3    .
210 1    .
210 2    .
212 1    .
212 2    .
216 2   62
216 3   62
216 1   62
219 2   63
219 3   63
220 2   66
220 3   66
220 1   66
229 2    .
229 1    .
240 2    .
240 1    .
245 1   62
245 2   62
255 1    .
255 2    .
256 1    .
256 2    .
258 2    .
258 1    .
264 3   67
264 2   67
265 2    .
265 1    .
268 2   67
268 3   67
268 1   67
269 2   62
269 1   62
270 1    .
270 2    .
272 2   66
272 3   66
275 3   67
275 2   67
280 3    .
280 2    .
283 2 62.5
283 3 62.5
291 1   64
291 3   64
291 2   64
end

I hope this made it more understandable!


Andrew Musau

Join Date: Oct 2014

Posts: 10194
#4

01 Feb 2021, 06:20

If you plan to use the retirement age variable in your analysis, what effectively you are doing is to restrict the sample to the last cross-sections where retirement age was recorded. Observations with missing values are discarded with listwise deletion. This will give you what you want:

Code:

bys ref_nr (round): replace A_retir_age=. if !missing(A_retir_age[_n+1]) | !missing(A_retir_age[_n+2])
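As a quick check of what this does, here is a sketch on a small subset of the example data from #3 (ref_nr 10, 27, and 70). The variable n_nonmiss is a hypothetical helper added only for verification.

```stata
* Subset of the -dataex- example from #3
clear
input double(ref_nr round) float A_retir_age
 27 2 66.5
 27 3 66.5
 70 2 61
 70 3 61
 70 1 61
 10 2 .
 10 1 .
end

* Blank out a value whenever a later round of the same participant has one
bys ref_nr (round): replace A_retir_age=. if !missing(A_retir_age[_n+1]) | !missing(A_retir_age[_n+2])

* Verify: each participant now has at most one non-missing retirement age
bys ref_nr: egen n_nonmiss = count(A_retir_age)
assert n_nonmiss <= 1

list, sepby(ref_nr)
```

Looking ahead by _n+1 and _n+2 is enough here because the study has only three rounds; under the by prefix, subscripts that run past the end of a participant's block evaluate to missing.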
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#5

01 Feb 2021, 12:41

In addition to the approach outlined in #4, you could also do this as:

Code:

bysort ref_nr (round): keep if _n == _N

This will actually reduce the data set to a single observation per ref_nr, rather than retaining observations you have no further use for and clobbering their values of A_retir_age with missing values.
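One caveat: keeping the last round keeps whatever A_retir_age holds there, missing or not. In the example data shown in #3 the value is either identical across rounds or missing in all of them, so nothing is lost. If that did not hold in the full data, a possible variant (a sketch on made-up data, with has_value as a hypothetical helper variable) would keep the last round that has a value:

```stata
* Made-up data: participant 1 has a value only in the earlier round
clear
input double(ref_nr round) float A_retir_age
1 1 62
1 2 .
2 2 64.5
2 3 64.5
3 1 .
3 2 .
end

* has_value is a hypothetical helper: 1 if retirement age was recorded in this round
gen byte has_value = !missing(A_retir_age)

* Within participant, rows with a value sort last; among those, the latest round sorts last
bysort ref_nr (has_value round): keep if _n == _N
drop has_value

* Confirm there is now exactly one observation per participant
isid ref_nr
list
```

Participants with no recorded retirement age in any round still keep one row (their last round), with A_retir_age missing.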