Matching two variables

Stef Anie

Join Date: Dec 2016

Posts: 25
#1

Matching two variables

10 Dec 2016, 12:46

Hello everyone, I hope somebody can help me.

I have the following problem:

I have a panel data set consisting of 18 waves. Every individual has an ID number called "pid". The variable "mpid" holds the "pid" of the individual's mother. I want to link these two. In the end I want a data set that consists only of the individuals and their mothers, with the mothers marked, for example, by a dummy that is one if the individual is a mother and zero otherwise. All other individuals should be dropped.
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#2

10 Dec 2016, 13:01

Depending on how these variables are coded, there are different approaches. The possibilities are numerous. You need to post an example of your data. Also, you need to explain what you mean by saying you want to link the persons and their mothers. From what you describe, it sounds like they are already linked. So give a clear description of what you want to get.

In posting your example data, be sure to use the -dataex- command so that whoever responds can easily create a replica of the example in their own Stata set up. You can install the -dataex- command by running -ssc install dataex-. Instructions for using it are in -help dataex-.
Stef Anie

Join Date: Dec 2016

Posts: 25
#3

10 Dec 2016, 13:10

I'm new here and unfortunately I don't know how to use -dataex-. So I will try to describe the problem better:

I have a data set that looks like this:

Pid sex age mpid smoker
100 1 25 103 1
101 0 59 . 0
102 0 40 168 1
103 1 60 184 1

Here you can see that the mother of the individual with pid=100 is individual 103. You can also see that this mother (pid=103) smokes as well. My aim is to end up with a sample containing only individuals and their mothers, where the mothers are marked with a dummy called "mother" that is one if the individual is a mother. I need to link the smoking individuals to their mothers.
I hope it's clear now. I am referring to the BHPS.
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#4

10 Dec 2016, 13:30

Well, although your example didn't use -dataex-, it is easy enough to import to Stata. I'm still not entirely clear on what you want. But here is some code that will reduce the data set to those observations that either have a mother or are a mother. It is possible that there are people in the data who both have and are a mother. The code below leaves them in the data set twice, once as a mother and once as having a mother.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(Pid sex age mpid smoker)
100 1 25 103 1
101 0 59 . 0
102 0 40 168 1
103 1 60 184 1
end

//    CREATE A DATA SET WITH JUST MOTHERS
preserve
keep mpid
duplicates drop
tempfile mother_ids
rename mpid Pid
save `mother_ids'
restore, preserve
merge 1:1 Pid using `mother_ids', keep(match) nogenerate
gen mother = 1
tempfile mothers
save `mothers'

//    NOW ELIMINATE FROM ORIGINAL DATA
//    THOSE OBSERVATIONS WITH NO MOTHER
restore
drop if missing(mpid)
gen mother = 0

//    NOW COMBINE THE MOTHERS AND THEIR OFFSPRING
append using `mothers'

This may or may not be what you wanted. If it's not, please post back showing an example of what you would like the result to look like.

As for not knowing how to use -dataex-, in my earlier post I told you how to install it and indicated that the directions are in the associated help file. It is among the simplest of Stata's commands to use. I'm quite sure that if you put a few minutes' effort into it, you will easily learn it.

Stef Anie

Join Date: Dec 2016
Posts: 25
#5

10 Dec 2016, 13:36

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(pid mpid fpid) int age byte smoker
10020233 10020209 10020179 19 1
10048243 10048219 10048189 21 2
10048278 10048219 10048189 19 2
10079599 10079556 10079521 18 2
10101977 10101942 10101918 33 2
end
label values mpid ampid
label values fpid afpid
label values age aage
label values smoker asmoker
label def asmoker 1 "yes", modify
label def asmoker 2 "no", modify


Stef Anie

Join Date: Dec 2016

Posts: 25
#6

10 Dec 2016, 13:46

Thank you very much, I think that's it. I have to do the same for the fathers with fpid. So it would be best to have a dummy called parents at the end. Can I use the same code, just with fpid instead of mpid, for the fathers after I have run the one for the mothers?

Clyde Schechter

Join Date: Apr 2014
Posts: 29796
#7

10 Dec 2016, 15:11

Well, if you want to do it with both mothers and fathers, you could do it separately for each and then combine them. But it would be simpler to do it in one fell swoop:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long(pid mpid fpid) int age byte smoker
10020233 10020209 10020179 19 1
10048243 10048219 10048189 21 2
10048278 10048219 10048189 19 2
10079599 10079556 10079521 18 2
10101977 10101942 10101918 33 2
end
label values mpid ampid
label values fpid afpid
label values age aage
label values smoker asmoker
label def asmoker 1 "yes", modify
label def asmoker 2 "no", modify

//    CREATE A DATA SET WITH JUST PARENTS
preserve
keep pid mpid fpid
drop if missing(mpid) & missing(fpid)
rename pid key
reshape long @pid, i(key) j(parent) string
drop key
duplicates drop
tempfile parent_ids
save `parent_ids'
restore, preserve
merge 1:1 pid using `parent_ids', keep(match) nogenerate
tempfile parents
save `parents'

//    NOW ELIMINATE FROM ORIGINAL DATA
//    THOSE OBSERVATIONS WITH NO PARENT ID
restore
drop if missing(mpid) & missing(fpid)
gen parent = ""

//    NOW COMBINE THE PARENTS AND THEIR OFFSPRING
append using `parents'
gen byte mother = parent == "m"
gen byte father = parent == "f"

This code creates a single data set in which everybody is a parent or has a parent in the data set. It also provides a variable, parent, which contains m if the person is a mother, f if a father, and missing value if the person is not a parent. Finally, it includes variables mother and father which are 0/1 coded to indicate who is a mother and who a father, respectively. (In the case of your example data from #5, none of the mother or father id's appear as pid's, so the result is not very interesting--just the original data and an indication that nobody is a parent.)
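
To see a non-trivial result, here is a minimal sketch that runs the same steps on hypothetical toy data (the IDs and smoker values below are made up for illustration and are not from the BHPS). It assumes one observation per pid. Persons 1 and 2 are the mother and father of siblings 3 and 4, and person 5 has no parent IDs and is not anyone's parent.

```stata
* Hypothetical toy data, invented for illustration only
clear
input long(pid mpid fpid) byte smoker
1 . . 1
2 . . 2
3 1 2 1
4 1 2 2
5 . . 2
end

// build the list of distinct parent ids (siblings share parents,
// so -duplicates drop- is needed before the 1:1 merge)
preserve
keep pid mpid fpid
drop if missing(mpid) & missing(fpid)
rename pid key
reshape long @pid, i(key) j(parent) string
drop key
duplicates drop
tempfile parent_ids
save `parent_ids'

// pick up the parents' own observations from the full data
restore, preserve
merge 1:1 pid using `parent_ids', keep(match) nogenerate
tempfile parents
save `parents'

// keep the offspring and append their parents
restore
drop if missing(mpid) & missing(fpid)
gen parent = ""
append using `parents'
gen byte mother = parent == "m"
gen byte father = parent == "f"

list pid mpid fpid parent mother father
```

With these toy data, person 5 should drop out, persons 3 and 4 should remain as offspring, and persons 1 and 2 should come back flagged as mother and father respectively.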

Thank you for using -dataex-.


Stef Anie

Join Date: Dec 2016

Posts: 25
#8

10 Dec 2016, 15:24

Thank you very much for your help, Clyde. I highly appreciate it.
Stef Anie

Join Date: Dec 2016

Posts: 25
#9

10 Dec 2016, 16:19

With the second piece of code you sent, I get the error "variable pid does not uniquely identify observations in the master data".
Do you know what the problem is?
Stef Anie

Join Date: Dec 2016

Posts: 25
#10

10 Dec 2016, 16:35

Unfortunately, now I get this one:

drop if missing(mpid) & missing(fpid)
(0 observations deleted)

. rename pid key

. reshape long @pid, i(key) j(parent) string
(note: j = f m)
variable id does not uniquely identify the observations
Your data are currently wide. You are performing a reshape long. You
specified i(key) and j(parent). In the current wide form, variable key
should uniquely identify the observations. Remember this picture:

      long                                  wide
+---------------+                   +-------------------+
| i   j   a   b |                   | i  a1  a2  b1  b2 |
|---------------| <--- reshape ---> |-------------------|
| 1   1   1   2 |                   | 1   1   3   2   4 |
| 1   2   3   4 |                   | 2   5   7   6   8 |
| 2   1   5   6 |                   +-------------------+
| 2   2   7   8 |
+---------------+
Type reshape error for a list of the problem observations.
r(9);
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#11

10 Dec 2016, 16:58

OK. The variable -key- is just a rename of the original variable pid. So Stata is complaining because the -reshape- command expects pid to uniquely identify observations. I expected that, too, based on a) the fact that it is true in your example data, and b) it seems a sensible assumption in a data set that appears to be about people and doesn't look chronological. So I assumed that there was one observation per person. Apparently that is not the case: I have never known Stata to be wrong when it says this.

So the first question is whether the presence of multiple observations per person is appropriate or represents an error in your data.

If it is appropriate to have more than one observation for the same person in your data, then there must be some other variable that distinguishes them, such as a date of participation or something like that. In that case, the code would have to be changed to accommodate this. Assuming that there is a variable called date that, together with pid, uniquely identifies observations, the code would look like this:

Code:

//    CREATE A DATA SET WITH JUST PARENTS
preserve
keep pid mpid fpid
drop if missing(mpid) & missing(fpid)
rename pid key
reshape long @pid, i(key date) j(parent) string
drop key
duplicates drop
tempfile parent_ids
save `parent_ids'
restore, preserve
merge 1:1 pid date using `parent_ids', keep(match) nogenerate
tempfile parents
save `parents'

//    NOW ELIMINATE FROM ORIGINAL DATA
//    THOSE OBSERVATIONS WITH NO PARENT ID
restore
drop if missing(mpid) & missing(fpid)
gen parent = ""

//    NOW COMBINE THE PARENTS AND THEIR OFFSPRING
append using `parents'
gen byte mother = parent == "m"
gen byte father = parent == "f"

If there is no other variable in your data that, taken together with pid, uniquely identifies observations, then it is likely that your data contains errors. Why would you have two separate observations for the same person with nothing to distinguish them? If the two observations are identical on all variables, then all but one are superfluous and you could just eliminate them with -duplicates drop-. But if you have two observations on the same person, with no other identifiers for the observation, and they also disagree on something (e.g. they report different smoking status, or give different values for mpid or fpid) then at least one of them is clearly a mistake.

If it is not appropriate to have more than one observation for the same person in your data, then you need to identify these observations and reduce your data set to one per person either by deleting the surplus observations or combining the multiple observations in some way. There are too many different scenarios here to give concrete guidance. But to start, you can identify the surplus observations by running -duplicates list pid-, or -duplicates tag pid, gen(flag)- followed by -browse if flag-. If a solution to the problem is not apparent, do post back with examples.
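
As a rough sketch of that diagnostic step, assuming a hypothetical toy panel (the data and the variable year below are invented for illustration, not taken from your file), the commands might be used like this:

```stata
* Hypothetical toy panel: pid 1 appears in two waves, pid 2 in one
clear
input long pid int year byte smoker
1 1991 1
1 1992 1
2 1991 2
end

duplicates report pid            // how many pids occur once, twice, ...
duplicates tag pid, gen(flag)    // flag = number of other observations with the same pid
list if flag, sepby(pid)         // show the observations that share a pid

// -isid- exits with an error unless the listed variables
// jointly identify the observations
isid pid year
```

Here flag is 1 for both observations of pid 1 and 0 for pid 2, and -isid pid year- runs silently because pid and year together are unique in the toy data.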

Last edited by Clyde Schechter; 10 Dec 2016, 17:01. Reason: For some reason, the code block got smushed into one long line. Fixed that.
Stef Anie

Join Date: Dec 2016

Posts: 25
#12

10 Dec 2016, 17:20

It is panel data, so each person appears more than once. I created the panel like this:

sysuse bindresp.dta
renpfix b
gen year=1992
save wave2

clear
sysuse cindresp.dta
renpfix c
gen year=1993
save wave3

sysuse wave1
append using wave2
append using wave3

But I think the year variable is not there anymore and there is not other identifying variable I think
Stef Anie

Join Date: Dec 2016

Posts: 25
#13

10 Dec 2016, 17:23

year is still there, I found it. Should I replace every "date" with "year" in the code?
But then I get the error that year is not found...

Last edited by Stef Anie; 10 Dec 2016, 17:26.
Clyde Schechter

Join Date: Apr 2014

Posts: 29796
#14

10 Dec 2016, 17:26

But I think the year variable is not there anymore and there is not other identifying variable I think

Yes, it sounds like year is the other identifying variable. Assuming you are not going forward with analyses that actually need the year information, you can do a workaround as follows:

Code:

by pid, sort: gen int seq = _n

Now pid and seq will jointly identify unique observations. So you can replace date by seq in the code in #11 and you should be OK. Just remember later on that seq is just an arbitrary counter that does not necessarily run in the same order as the original years did. So if you later need to do an analysis that requires year information, you can't substitute seq for that purpose--you will need to go back and rebuild the data set and retain the year variable.
Stef Anie

Join Date: Dec 2016

Posts: 25
#15

10 Dec 2016, 17:36

The variable seq is not found even though I typed your additional command in before the code. Do you know why?
I also tried keeping "year" and using year in place of date, but then the 1:1 merge gave an error.

If I also keep "seq", this happens when merging:

. merge 1:1 pid seq using `parent_ids', keep(match) nogenerate

variables pid seq do not uniquely identify observations in the using data
r(459);

Last edited by Stef Anie; 10 Dec 2016, 17:41.
