Dealing with reports of repeated time values within panel: drop variables

Rogier Jansen

Join Date: Jul 2018

Posts: 12
#1


16 Jul 2018, 11:03

I am preparing a data set (the Russia Longitudinal Monitoring Survey) for analysis. Individual and household data are merged, using the household file as master, on year and family ID (id_h). I want to focus on the household level but need some of the individual data as well.

I get the following error from -xtset- (Stata/SE 14.2 for Windows):

. xtset id_h year
repeated time values within panel
r(451);

The error is probably due to the fact that multiple individuals have been interviewed within each household. Unfortunately, I have already cleaned my data set, and it would take a long time to merge the individual and household data sets again.

I have to get rid of the individuals in the household who did not report a cost for any of the variables A, B, C, or D. A difficulty arises when none of the individuals report any costs. Also, because it is a panel data set, the rule has to be year specific: if person 1 gave an amount for one of the variables A, B, C, or D in 2004, I want to keep this person and get rid of the other individuals (who are in the same household and were interviewed in the same year).

If none of the individuals report any costs, then I want to keep the individual who does not have a value for idind.

I looked into the FAQ: How do I deal with a report of repeated time values within panel? Written by Nicholas J. Cox.

. duplicates list id_h year

. duplicates tag id_h year, gen(isdup)

But now I don't know how to keep the individual who reported costs for any of the variables or, if no costs are reported, how to drop the individuals with a value for idind.

Sorry, if it is a bit confusing and not written in proper format.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int year long(id_h idind OOP_Medicine) double(OOP_Inpatient OOP_Outpatient OOP_Dental_treatment)
2009 1001     1 .    .    .     .
2009 1001     . .    .    .     .
2009 1001 24101 .    .    .     .
2009 1001 11293 .    .    .     .
2009 1002     5 .    .    .     .
2009 1002     . .    .    .     .
2009 1003     7 .    .    .     .
2009 1003     . .    .    .     .
2009 1004     9 .    .    .     .
2009 1004     . .    .    .     .
2009 1006 11291 .    .    .     .
2009 1006     . .    . 6300   950
2009 1014 25031 .    .    .     .
2009 1014 31360 .    .    .     .
2009 1014     . .    . 3000     .
2009 1021 14369 .    .    .     .
2009 1021    29 .    .    .     .
2009 1021 14370 .    .    .     .
2009 1021 16255 .    .    .     .
2009 1021    28 .    .    .     .
2009 1021     . .    . 1200     .
2009 1021 11296 .    .    .     .
2009 1021    30 .    .    .     .
2009 1036 25037 .    .    .     .
2009 1036     . .    .    .     .
2009 1036 25036 .    .    .     .
2009 1037 25041 .    .    .     .
2009 1037     . . 1500 1000     .
2009 1037 25039 .    .    .     .
2009 1037 31343 .    .    .     .
2009 1037 25040 .    .    .     .
2009 1037 25038 .    .    .     .
2009 1038     . .    .    .     .
2009 1038 30123 .    .    .     .
2009 1044 11328 .    .    .     .
2009 1044     . .    . 3336 15000
2009 1044 11329 .    .    .     .
end
label values OOP_Medicine E13_3_1B
label values OOP_Inpatient E13_22B
label values OOP_Outpatient E13_23B
label values OOP_Dental_treatment E13_24B

Last edited by Rogier Jansen; 16 Jul 2018, 11:28.
Tags: panel data
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17672
#2

16 Jul 2018, 11:35

Rogier:
the work-around is to -xtset- your data with -panelid- only (provided that you do not plan to use time-series commands, such as lags and leads):

Code:

xtset id_h

That said, I would be much more concerned about the number of missing values (and how to deal with them), unless part (or all) of them are, in fact, zeros (that is, the person did not pay out of pocket (?) for medicine and so on during the span of time covered by your dataset).

Kind regards,
Carlo
(StataNow 18.5)
Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#3

16 Jul 2018, 11:40

Well, in your example data, within any year-id_h combination there is at most one idind (including idind == .) who responds to any of the OOP_* items. If that is true in your data overall, then a simple -collapse- can get you what you want:

Code:

// VERIFY ONLY ONE IDIND RESPONDS WITHIN YEAR-ID_H
reshape long OOP_, i(year id_h idind) j(item) string
by year id_h idind, sort: egen responded = count(OOP_)
replace responded = !!responded
reshape wide
by year id_h: egen responders = total(responded)
assert inlist(responders, 0, 1)

// NOW COLLAPSE TO ONE OBS PER YEAR ID_H
replace idind = . if !responded
sort year id_h idind
collapse (firstnm) OOP* (first) idind, by(year id_h)

Added: Crossed with #2. What Carlo says about -xtset- is quite true and often useful.

I interpreted the missing data quite differently. I assumed that the nature of the survey is that only one (at most) responder within a household was asked these items, so that the pattern of all but one (or all) responses within a household-year being missing is intentional and the original poster wishes to capture that one person's responses. I'm not sure what the missing idind is about; perhaps in some cases the particular person responding to these items is not identified for some reason. Of course, if I'm wrong and this pattern is just a fluke that occurred in a small sample, the -assert- in my code will fail and Rogier Jansen will know not to use it.

If my generalization about the missing data is false, then the question arises: if more than one person within an id_h year does respond to these items, which one do you want to retain (or do you want to combine the responses in some way such as adding them up)?
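If the answer to that last question turns out to be "add them up", one possible sketch is a -collapse (sum)-. The data below are made up purely for illustration (the id_h and idind values are hypothetical, and only two of the four OOP variables are shown). Note that -collapse (sum)- returns 0, not missing, for a household-year in which every value is missing, so this only makes sense if missing can be read as zero.

```stata
* Hypothetical data: household 1 has two responders, household 2 has none
clear
input int year long(id_h idind) double(OOP_Inpatient OOP_Outpatient)
2009 1 101 500   .
2009 1 102   . 200
2009 1 103 100   .
2009 2 201   .   .
2009 2 202   .   .
end

* Add up all reported amounts within each household-year
collapse (sum) OOP_*, by(id_h year)

* One observation per id_h-year now, so -xtset- accepts the time variable
xtset id_h year
list, noobs
```

The identity of the responder is lost in this version, which may or may not matter for the analysis.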

Last edited by Clyde Schechter; 16 Jul 2018, 11:45.
Nick Cox

Join Date: Mar 2014

Posts: 35405
#4

16 Jul 2018, 11:48

Credit where credit is due: the FAQ alluded to also has author Michael Mulcahy. https://www.stata.com/support/faqs/d...d-time-values/ is the precise URL.

For xtset with identifier and year to work at all, you can have only one observation for each household in each year. You give two rules:

1. if person 1 gave an amount of cost for one of the variables A, B, C, or D in 2004 [or any other year?] I want to keep this person and drop the other individuals in the same household and year

3. If none of the individuals report any costs, then I want to keep the individual who does not have a value for idind.

I have labelled these 1 and 3 because my guess is that you need to spell out what happens if the other persons gave a value for A or B or C or D, especially if two or more did. That would be rule 2.

Also, is it necessarily true that under 3 at most one individual qualifies?

In short, we need all your rules to have a chance of suggesting code.
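For what it is worth, rules 1 and 3 as stated can be sketched with a priority variable. The tie-break used here when two or more individuals report a cost (lowest idind wins) is purely a placeholder for the unstated rule 2, not something the original post specifies. The rows are adapted from the example in #1, keeping only two of the four OOP variables.

```stata
clear
input int year long(id_h idind) double(OOP_Inpatient OOP_Outpatient)
2009 1006 11291 .    .
2009 1006     . . 6300
2009 1036 25037 .    .
2009 1036     . .    .
2009 1036 25036 .    .
end

* Number of OOP items with a nonmissing answer in each row
egen byte n_items = rownonmiss(OOP_*)

* 1 = reported at least one cost (rule 1)
* 2 = reported nothing and has no idind (rule 3)
* 3 = everyone else
gen byte priority = cond(n_items > 0, 1, cond(missing(idind), 2, 3))

* Keep the best-ranked row per household-year; idind breaks ties (placeholder)
bysort id_h year (priority idind): keep if _n == 1

xtset id_h year
```

If rule 3 can match more than one row in a household-year, the same placeholder tie-break applies there too, so that case also needs a real rule.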
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17672
#5

16 Jul 2018, 11:51

I do appreciate and share Clyde and Nick's insightful takes on the original post.

Kind regards,
Carlo
(StataNow 18.5)
Rogier Jansen

Join Date: Jul 2018

Posts: 12
#6

16 Jul 2018, 13:19

Thanks for the prompt reply Mr Lazzaro.

I may want to see the development of OOP over time, not sure yet how to tackle the unobserved heterogeneity.

Thanks for the remark, the missing values are indeed 0.
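If missing really does mean zero, the recode itself is short. The sketch below assumes the data have already been reduced to one row per household-year; recoding before that step would make every individual look like a responder in the -assert- check in #3. The two rows are adapted from the example in #1, keeping only two of the four OOP variables.

```stata
clear
input int year long id_h double(OOP_Inpatient OOP_Outpatient)
2009 1006 . 6300
2009 1036 .    .
end

* Recode missing to zero in every OOP_* variable
foreach v of varlist OOP_* {
    replace `v' = 0 if missing(`v')
}

list, noobs
```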
Rogier Jansen

Join Date: Jul 2018

Posts: 12
#7

16 Jul 2018, 13:23

Thanks for the prompt reply, Mr Schechter.

I may not have been explicit enough.

It is most likely the case that there are multiple idind respondents in a year-id_h. This is because the data sets have been merged, and only the individual data set has the idind value. That is, the person interviewed for the household data set does not have this value.

The question for OOP is only asked in the household survey. When looking at the data, indeed only a person without idind seems to have answered the OOP question. When I run your command only idind == . are left (I saved a duplicate data set).

I just noticed that when merging the data, I did not get any individual characteristics for the person being interviewed for the household survey.

The reason I merged the household data with the individual data is that the household data does not have sociodemographic characteristics (age, sex, race, etc.). I also thought that only one individual per household was interviewed.

It seems that I will have to identify the head of the household among the individuals of the household who were interviewed, based on these criteria:

(1) the oldest working-age male in the household; (2) if there are no working-age males, the oldest working-age female; (3) if there are no working-age females, the youngest retirement-age male; (4) if there are no retirement-age males, the youngest retirement-age female; and finally (5) if there are no retirement-age females, the oldest child.

I hope you can help me out again.
Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#8

16 Jul 2018, 14:15

"I hope you can help me out again."

I probably can, but not without example data. And, if there isn't already a variable indicating who is of retirement age, then you need to tell me what age that is for males and females in Russia.

Rogier Jansen

Join Date: Jul 2018
Posts: 12
#9

16 Jul 2018, 15:22

Thanks.

The data consists of 700 variables. It would be nice to identify the household head and add the data inputs from the individual questionnaire such as:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long Religion byte urban int region byte(HH_size1 HH_size) long(b1_8 Voluntary_life_insurance MHI) byte borrow_for_HC
. 3 1 . . . . 1 .
. 3 1 . . . . 1 .
. 3 1 . . . . . .
. 3 1 3 . . . . .
. 3 1 . . . . 1 .
. 3 1 . . . . 2 .
. 3 1 . . . . . .
. 3 1 . . . . . .
. 3 1 4 . . . . .
. 3 1 . . . . 1 .
. 3 1 1 . . . . .
. 3 1 2 . . . . .
. 3 1 . . . . 2 .
. 3 1 . . . . 2 .
. 3 1 . . . . . .
. 3 1 3 . . . . .
. 3 1 . . . . 1 .
. 3 1 . 2 . . . .
. 3 1 . . . . 1 .
. 3 1 . . . . 1 .
. 3 1 2 . . . . .
. 3 1 . . . . 1 .
. 3 1 . . . . 1 .
. 3 1 . . . . 2 .
. 3 1 . . . . . .
. 3 1 4 . . . . .
. 3 1 . . . . 1 .
end
label values Religion J72_19
label values urban STATUS
label def STATUS 3 "pgt", modify
label values region REGION
label def REGION 1 "Leningrad Oblast: Volosovkij Rajon", modify
label values b1_8 B1_8
label values Voluntary_life_insurance J170_1
label values MHI L2
label def L2 1 "Yes", modify
label def L2 2 "No", modify
label values borrow_for_HC J201_06

So the row of the household head can be identified by idind == . Below I have added several variables that may be helpful for identifying who the household head is: Work_status, and BirthY_1stHH_member etc., which hold the birth years of the household members (starting from the oldest person). The retirement age is 60 for men and 55 for women.

The individuals interviewed will have a value for idind. The household can be identified by id_h.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(id_h idind) int YOB byte Gender long(Work_status Pension_retirement BirthY_1stHH_member BirthY_2ndHH_member BirthY_3rdHH_member BirthY_4th_HH_member BirthY_5th_HH_member) int BirthY_6th_HH_member
10101     1 1973 2 2 .    .    .    .    .    .    .
10101     2 1971 1 1 .    .    .    .    .    .    .
10101 11293 1995 1 . .    .    .    .    .    .    .
10101     .    . . . . 1973 1971 1995    .    .    .
10102     3 1955 2 1 .    .    .    .    .    .    .
10102     4 1949 1 5 .    .    .    .    .    .    .
10102     5 1984 2 . .    .    .    .    .    .    .
10102     6 1986 1 . .    .    .    .    .    .    .
10102     .    . . . . 1955 1949 1984 1986    .    .
10106     .    . . . . 1970 1965 1994    .    .    .
10107 11294 1972 1 1 .    .    .    .    .    .    .
10107     .    . . . . 1952 1972    .    .    .    .
10108    16 1938 2 5 .    .    .    .    .    .    .
10108    17 1934 1 1 .    .    .    .    .    .    .
10108     .    . . . . 1938 1934    .    .    .    .
10109    18 1954 2 1 .    .    .    .    .    .    .
10109    19 1954 1 1 .    .    .    .    .    .    .
10109    20 1976 1 5 .    .    .    .    .    .    .
10109    21 1986 2 . .    .    .    .    .    .    .
10109     .    . . . . 1954 1954 1976 1986    .    .
10201 11299 1959 2 1 .    .    .    .    .    .    .
10201 11300 1954 1 1 .    .    .    .    .    .    .
10201 11301 1977 2 5 .    .    .    .    .    .    .
10201 11302 1985 1 . .    .    .    .    .    .    .
10201 11303 1985 2 . .    .    .    .    .    .    .
10201 11304 1987 2 . .    .    .    .    .    .    .
10201 11305 1992 2 . .    .    .    .    .    .    .
10201     .    . . . . 1959 1954 1977 1985 1985 1987
10202    22 1961 2 5 .    .    .    .    .    .    .
10202    24 1981 2 . .    .    .    .    .    .    .
10202    25 1990 1 . .    .    .    .    .    .    .
10202 11306 1957 1 1 .    .    .    .    .    .    .
10202     .    . . . . 1961    . 1981 1990 1957    .
10203    26 1931 2 5 .    .    .    .    .    .    .
10203    27 1979 1 5 .    .    .    .    .    .    .
10203 11307 1953 2 1 .    .    .    .    .    .    .
10203     .    . . . . 1931 1979 1953    .    .    .
10204 11308 1970 1 1 .    .    .    .    .    .    .
10204 11309 1973 2 2 .    .    .    .    .    .    .
10204 11310 1994 1 . .    .    .    .    .    .    .
10204     .    . . . . 1970 1973 1994    .    .    .
end
label values YOB Year_of_birth
label values Gender H5
label def H5 1 "male", modify
label def H5 2 "female", modify
label values Work_status J1
label def J1 1 "You are currently working", modify
label def J1 2 "You are on paid leave: maternity leave or taking care of a child under 3 years of age", modify
label def J1 5 "You are not working", modify
label values Pension_retirement J74_1
label values BirthY_1stHH_member B1_5_BirthY_1stHH_M
label values BirthY_2ndHH_member B2_5
label values BirthY_3rdHH_member B3_5
label values BirthY_4th_HH_member B4_5
label values BirthY_5th_HH_member B5_5
label values BirthY_6th_HH_member B6_5

Once the household head is identified and the data from the respective person are added to the row of the household head, I would like to delete the other individuals from the data set.

Thanks a lot


Rogier Jansen

Join Date: Jul 2018

Posts: 12
#10

16 Jul 2018, 15:27

@Nick Cox (#4): You are right, I should have referenced the FAQ properly.

Also, both your other points are correct.

Thanks for clarifying and for the help.
Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#11

16 Jul 2018, 15:38

Still missing some crucial information. The birth years are fine as far as they go, but without knowing when the survey was administered, there is no way to calculate age. Also, even if I know the survey was carried out in 2000 and a man was born in 1940, that tells me only that he will reach retirement age at some point in 2000; it doesn't tell me whether the survey was before or after his birthday, so it is unclear whether to classify such a person as being of retirement age or not.

Also, I want to be clear that your definition of household head depends on whether a person is of retirement age, not on whether they are actually working/retired. If that's not the case, and it's actually based on work status, then please clarify how you want to handle the categories of work status that actually appear in the data, and, in particular, what to do with the numerous cases for which it is missing.
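To make the open questions concrete, here is a rough sketch of the five criteria from #7 under the purely age-based reading. Everything uncertain is an assumption: the survey year is taken to be a variable named year (set to 2009 here only for illustration), age is approximated as year minus YOB (which ignores the birthday problem just described), and 16 is a made-up cut-off between child and working age. The retirement ages of 60 and 55 are those given in #9, and the rows come from the example there.

```stata
clear
input long(id_h idind) int(year YOB) byte Gender
10101     1 2009 1973 2
10101     2 2009 1971 1
10101 11293 2009 1995 1
10101     . 2009    . .
10108    16 2009 1938 2
10108    17 2009 1934 1
10108     . 2009    . .
end

local adult 16   // ASSUMPTION: cut-off between child and working age
local ret_m 60   // retirement age for men, as stated in #9
local ret_f 55   // retirement age for women, as stated in #9

gen age = year - YOB   // approximate: ignores month of birth

* Priority group, 1 = highest, following criteria (1) to (5)
gen byte grp = .
replace grp = 1 if Gender == 1 & inrange(age, `adult', `ret_m' - 1)
replace grp = 2 if Gender == 2 & inrange(age, `adult', `ret_f' - 1)
replace grp = 3 if Gender == 1 & age >= `ret_m' & !missing(age)
replace grp = 4 if Gender == 2 & age >= `ret_f' & !missing(age)
replace grp = 5 if age < `adult'

* Oldest first in groups 1, 2, 5; youngest first in groups 3, 4
gen sortage = cond(inlist(grp, 3, 4), age, -age)

* The head sorts first; the household row (missing grp) sorts last
bysort id_h year (grp sortage idind): gen long head_idind = idind[1] if !missing(grp[1])

* Copy the head's characteristics onto the household row
foreach v of varlist YOB Gender {
    by id_h year: replace `v' = `v'[1] if missing(idind) & !missing(grp[1])
}

* Keep one row per household-year
keep if missing(idind)
drop age grp sortage
list, noobs
```

With these rows the head is idind 2 in household 10101 and idind 17 in household 10108. If the definition turns out to depend on actual work status rather than age, the grp assignments would need to be rewritten.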
Rogier Jansen

Join Date: Jul 2018

Posts: 12
#12

17 Jul 2018, 05:49

Thanks for your help Mr Schechter.

I found a solution.

(The definition given for head of the household is quoted from the database website)