  • Matching two variables

    Hello everyone, I hope somebody can help me.

    I have got the following problem:

    I have a panel data set consisting of 18 waves. Every individual has an identifier called "pid". The variable "mpid" holds the "pid" of the individual's mother. I want to link these two. In the end, I want a data set that contains only the individuals and their mothers, with the mothers marked, for example, by a dummy that is one if the individual is a mother and zero otherwise. All other individuals should be dropped.

  • #2
    Depending on how these variables are coded, there are different approaches. The possibilities are numerous. You need to post an example of your data. Also, you need to explain what you mean by saying you want to link the persons and their mothers. From what you describe, it sounds like they are already linked. So give a clear description of what you want to get.

    In posting your example data, be sure to use the -dataex- command so that whoever responds can easily create a replica of the example in their own Stata set up. You can install the -dataex- command by running -ssc install dataex-. Instructions for using it are in -help dataex-.
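
    As a rough sketch of what that looks like in practice (Stata's shipped auto data set is used purely as a stand-in, and the variable names and the five-observation range are illustrative choices, not taken from the poster's data):

    ```stata
    * Install dataex only if it is not already available
    * (newer Stata releases may already include it)
    capture which dataex
    if _rc {
        ssc install dataex
    }

    * Stand-in data; replace with your own data set
    sysuse auto, clear

    * Print a copy-and-paste example of three variables, first 5 observations
    dataex make price mpg in 1/5
    ```

    The output that -dataex- prints can be pasted directly into a post.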


    • #3
      I'm new here and unfortunately I don't know how to use dataex, so I will try to describe the problem more clearly:

      I have got a data set which looks like this:

      Pid sex age mpid smoker
      100 1 25 103 1
      101 0 59 . 0
      102 0 40 168 1
      103 1 60 184 1

      Here you can see that the mother of the individual with pid=100 is individual 103. You can also see that this mother smokes as well. My aim is to end up with a sample containing only individuals and their mothers, where the mothers are marked with a dummy called "mother" that is one if the individual is a mother. I have to link the mothers of the smoking individuals to them.
      I hope it's clear now. I am referring to the BHPS.


      • #4
        Well, although your example didn't use -dataex-, it is easy enough to import to Stata. I'm still not entirely clear on what you want. But here is some code that will reduce the data set to those observations that either have a mother or are a mother. It is possible that there are people in the data who both have and are a mother. The code below leaves them in the data set twice, once as a mother and once as having a mother.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float(Pid sex age mpid smoker)
        100 1 25 103 1
        101 0 59   . 0
        102 0 40 168 1
        103 1 60 184 1
        end
        
        //    CREATE A DATA SET WITH JUST MOTHERS
        preserve
        keep mpid
        duplicates drop
        tempfile mother_ids
        rename mpid Pid
        save `mother_ids'
        restore, preserve
        merge 1:1 Pid using `mother_ids', keep(match) nogenerate
        gen mother = 1
        tempfile mothers
        save `mothers'
        
        //    NOW ELIMINATE FROM ORIGINAL DATA
        //    THOSE OBSERVATIONS WITH NO MOTHER
        restore
        drop if missing(mpid)
        gen mother = 0
        
        //    NOW COMBINE THE MOTHERS AND THEIR OFFSPRING
        append using `mothers'
        This may or may not be what you wanted. If it's not, please post back showing an example of what you would like the result to look like.

        As for not knowing how to use -dataex-, in my earlier post I told you how to install it and indicated that the directions are in the associated help file. It is among the simplest of Stata's commands to use. I'm quite sure that if you put a few minutes effort into it, you will easily learn it.


        • #5
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input long(pid mpid fpid) int age byte smoker
          10020233 10020209 10020179 19 1
          10048243 10048219 10048189 21 2
          10048278 10048219 10048189 19 2
          10079599 10079556 10079521 18 2
          10101977 10101942 10101918 33 2
          end
          label values mpid ampid
          label values fpid afpid
          label values age aage
          label values smoker asmoker
          label def asmoker 1 "yes", modify
          label def asmoker 2 "no", modify


          • #6
            Thank you very much, I think that's it. I have to do the same for the fathers with fpid, so it would be best to end up with a dummy called parents. Can I use the same code, just with fpid instead of mpid, for the fathers after I have run the one for the mothers?


            • #7
              Well, if you want to do it with both mothers and fathers, you could do it separately for each and then combine them. But it would be simpler to do it in one fell swoop:

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input long(pid mpid fpid) int age byte smoker
              10020233 10020209 10020179 19 1
              10048243 10048219 10048189 21 2
              10048278 10048219 10048189 19 2
              10079599 10079556 10079521 18 2
              10101977 10101942 10101918 33 2
              end
              label values mpid ampid
              label values fpid afpid
              label values age aage
              label values smoker asmoker
              label def asmoker 1 "yes", modify
              label def asmoker 2 "no", modify
              
              //    CREATE A DATA SET WITH JUST PARENTS (MOTHERS AND FATHERS)
              preserve
              keep pid mpid fpid
              drop if missing(mpid) & missing(fpid)
              rename pid key
              reshape long @pid, i(key) j(parent) string
              drop key
              duplicates drop
              tempfile parent_ids
              save `parent_ids'
              restore, preserve
              merge 1:1 pid using `parent_ids', keep(match) nogenerate
              tempfile parents
              save `parents'
              
              //    NOW ELIMINATE FROM ORIGINAL DATA
              //    THOSE OBSERVATIONS WITH NEITHER MOTHER NOR FATHER
              restore
              drop if missing(mpid) & missing(fpid)
              gen parent = ""
              
              //    NOW COMBINE THE PARENTS AND THEIR OFFSPRING
              append using `parents'
              gen byte mother = parent == "m"
              gen byte father = parent == "f"
              This code creates a single data set in which everybody is a parent or has a parent in the data set. It also provides a variable, parent, which contains m if the person is a mother, f if a father, and an empty string if the person is not a parent. Finally, it includes variables mother and father, which are 0/1 coded to indicate who is a mother and who is a father, respectively. (In the case of your example data from #5, none of the mother or father IDs appear as pids, so the result is not very interesting--just the original data and an indication that nobody is a parent.)
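
              To see the same steps produce a non-trivial result, here is a small sketch on a made-up data set in which parents do appear as pids (the five pids and the smoker values are hypothetical). It follows the code above with one added guard, -drop if missing(pid)-, which is an added assumption rather than part of the original code: it keeps a missing parent ID from showing up under both m and f, which would make pid non-unique in the using file.

              ```stata
              * Hypothetical toy data: 1 and 2 are children of mother 3; 4 is the father of 1
              clear
              input long(pid mpid fpid) byte smoker
              1 3 4 1
              2 3 . 0
              3 . . 1
              4 . . 0
              5 . . 0
              end

              * Build the list of parent IDs, one row per (parent type, ID)
              preserve
              keep pid mpid fpid
              drop if missing(mpid) & missing(fpid)
              rename pid key
              reshape long @pid, i(key) j(parent) string
              drop key
              drop if missing(pid)    // added guard (assumption, see text)
              duplicates drop
              tempfile parent_ids
              save `parent_ids'

              * Pull the parents' own records from the original data
              restore, preserve
              merge 1:1 pid using `parent_ids', keep(match) nogenerate
              tempfile parents
              save `parents'

              * Keep offspring, then stack the parents underneath
              restore
              drop if missing(mpid) & missing(fpid)
              gen parent = ""
              append using `parents'
              gen byte mother = parent == "m"
              gen byte father = parent == "f"
              list pid mpid fpid parent mother father
              ```

              Person 5, who neither has nor is a parent in the data, is dropped; persons 3 and 4 are kept and flagged as mother and father.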

              Thank you for using -dataex-.


              • #8
                Thank you very much for your help, Clyde. I highly appreciate it


                • #9
                  With the second code you sent, I get the error "variable pid does not uniquely identify observations in the master data".
                  Do you know what the problem is?


                  • #10
                    Unfortunately, now I get this error:

                    drop if missing(mpid) & missing(fpid)
                    (0 observations deleted)

                    . rename pid key

                    . reshape long @pid, i(key) j(parent) string
                    (note: j = f m)
                    variable id does not uniquely identify the observations
                    Your data are currently wide. You are performing a reshape long. You
                    specified i(key) and j(parent). In the current wide form, variable key
                    should uniquely identify the observations. Remember this picture:

                         long                                wide
                        +---------------+                   +------------------+
                        | i   j   a   b |                   | i   a1 a2  b1 b2 |
                        |---------------| <--- reshape ---> |------------------|
                        | 1   1   1   2 |                   | 1   1   3   2  4 |
                        | 1   2   3   4 |                   | 2   5   7   6  8 |
                        | 2   1   5   6 |                   +------------------+
                        | 2   2   7   8 |
                        +---------------+
                    Type reshape error for a list of the problem observations.
                    r(9);


                    • #11
                      OK. The variable -key- is just a rename of the original variable pid. So Stata is complaining because the -reshape- command expects pid to uniquely identify observations. I expected that, too, based on a) the fact that it is true in your example data, and b) it seems a sensible assumption in a data set that appears to be about people and doesn't look chronological. So I assumed that there was one observation per person. Apparently that is not the case: I have never known Stata to be wrong when it says this.

                      So the first question is whether the presence of multiple observations per person is appropriate or represents an error in your data.

                      If it is appropriate to have more than one observation for the same person in your data, then there must be some other variable that distinguishes them, such as a date of participation or something like that. In that case, the code would have to be changed to accommodate this. Assuming that there is a variable called date that, together with pid, uniquely identifies observations, the code would look like this:

                      Code:
                      //    CREATE A DATA SET WITH JUST PARENTS
                      preserve
                      //    date MUST BE KEPT BECAUSE -reshape- USES IT IN i()
                      keep pid mpid fpid date
                      drop if missing(mpid) & missing(fpid)
                      rename pid key
                      reshape long @pid, i(key date) j(parent) string
                      drop key
                      duplicates drop
                      tempfile parent_ids
                      save `parent_ids'
                      restore, preserve
                      merge 1:1 pid date using `parent_ids', keep(match) nogenerate
                      tempfile parents
                      save `parents'
                      
                      //    NOW ELIMINATE FROM ORIGINAL DATA
                      //    THOSE OBSERVATIONS WITH NEITHER MOTHER NOR FATHER
                      restore
                      drop if missing(mpid) & missing(fpid)
                      gen parent = ""
                      
                      //    NOW COMBINE THE PARENTS AND THEIR OFFSPRING
                      append using `parents'
                      gen byte mother = parent == "m"
                      gen byte father = parent == "f"
                      If there is no other variable in your data that, taken together with pid, uniquely identifies observations, then it is likely that your data contains errors. Why would you have two separate observations for the same person with nothing to distinguish them? If the two observations are identical on all variables, then all but one are superfluous and you could just eliminate them with -duplicates drop-. But if you have two observations on the same person, with no other identifiers for the observation, and they also disagree on something (e.g. they report different smoking status, or give different values for mpid or fpid) then at least one of them is clearly a mistake.

                      If it is not appropriate to have more than one observation for the same person in your data, then you need to identify these observations and reduce your data set to one per person, either by deleting the surplus observations or by combining the multiple observations in some way. There are too many different scenarios here to give concrete guidance. But to start, you can identify the surplus observations by running -duplicates list pid-, or -duplicates tag pid, gen(flag)- followed by -browse if flag-. If a solution to the problem is not apparent, do post back with examples.
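
                      A minimal sketch of these checks on a made-up three-observation panel (the pid, year, and smoker values are hypothetical, and year stands in for whatever second identifier the real data may have):

                      ```stata
                      * Hypothetical toy panel: pid 1 observed in two years, pid 2 once
                      clear
                      input long pid int year byte smoker
                      1 1991 1
                      1 1992 1
                      2 1991 0
                      end

                      * Summarize how many observations share a pid
                      duplicates report pid

                      * flag = number of OTHER observations with the same pid (0 if unique)
                      duplicates tag pid, gen(flag)
                      list if flag, sepby(pid)

                      * Check whether pid and year jointly identify observations;
                      * -isid- returns a nonzero code if they do not
                      capture isid pid year
                      display cond(_rc == 0, "pid and year identify observations", "pid and year are not unique")
                      ```

                      Using -list- instead of -browse- keeps the sketch runnable in batch mode; interactively, -browse if flag- shows the same observations.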
                      Last edited by Clyde Schechter; 10 Dec 2016, 17:01. Reason: For some reason, the code block got smushed into one long line. Fixed that.


                      • #12
                        It is panel data, so each person appears more than once. I created the panel like this:

                        sysuse bindresp.dta
                        renpfix b
                        gen year=1992
                        save wave2

                        clear
                        sysuse cindresp.dta
                        renpfix c
                        gen year=1993
                        save wave3

                        sysuse wave1
                        append using wave2
                        append using wave3

                        But I think the year variable is not there anymore, and I don't think there is any other identifying variable.


                        • #13
                          year is still there, I found it. Shall I replace every "date" in the code with "year"?
                          When I do that, though, I get the error that year is not found...
                          Last edited by Stef Anie; 10 Dec 2016, 17:26.


                          • #14
                            But I think the year variable is not there anymore, and I don't think there is any other identifying variable.
                            Yes, it sounds like year is the other identifying variable. Assuming you are not going forward with analyses that actually need the year information, you can do a workaround as follows:

                            Code:
                            by pid, sort: gen int seq = _n
                            Now pid and seq will jointly identify unique observations. So you can replace date by seq in the code in #11 and you should be OK. Just remember later on that seq is just an arbitrary counter that does not necessarily run in the same order as the original years did. So if you later need to do an analysis that requires year information, you can't substitute seq for that purpose--you will need to go back and rebuild the data set and retain the year variable.
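
                            If year does turn out to be available, one variant worth considering is to number the observations within pid in year order, so that seq is no longer arbitrary. A minimal sketch on made-up data (the pid and year values are hypothetical):

                            ```stata
                            * Hypothetical toy panel, deliberately not sorted by year
                            clear
                            input long pid int year
                            1 1992
                            1 1991
                            2 1991
                            end

                            * Number observations within pid, ordered by year
                            bysort pid (year): gen int seq = _n

                            * Stops with an error if pid and seq do not uniquely identify observations
                            isid pid seq
                            list, sepby(pid)
                            ```

                            Here seq follows the chronological order within each person, and -isid- confirms that pid and seq together identify observations.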


                            • #15
                              The variable seq is not found, even though I ran your additional command before the code. Do you know why?
                              I also tried keeping "year" and replacing date with year everywhere, but then the 1:1 merge gave an error.

                              If I keep "seq" as well, then this happens at the merge:

                              . merge 1:1 pid seq using `parent_ids', keep(match) nogenerate

                              variables pid seq do not uniquely identify observations in the using data
                              r(459);

                              Last edited by Stef Anie; 10 Dec 2016, 17:41.
