Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging multilevel data using an identifier that is not unique

    I am trying to merge 3 census datasets. Personal, family and household. The first one has 3 identifiers HID, PID and FID. The second one has two HID and PID while the last one HID as the only identifier.

    HID is a string and unique only in the 2nd dataset (household) but not in the others. I can merge successfully the second file with either using HID and i get a complete 100% merge.

    But whenever i try to merge this new merged dataset with the remaining datafile having dropped my _merge variable, i get this error
    "variable HID does not uniquely identify observations in the master data"

    Can anyone help with ideas on how to get around this one.

    Thanks
    Sunganani Kalemba
    PhD Student.
    Queensland

  • #2
    I am a bit confused.

    I understand the personal dataset contains identifiers PID (person ID), FID (family ID), and HID (household ID).

    I understand the household dataset contains only the identifier HID.

    You state that the family dataset contains the identifiers HID and PID. I expect you meant it contains identifiers FID and HID.

    In general, I would expect your merging to include the following
    Code:
    use personal, clear
    ...
    merge m:1 FID HID using family
    ...
    merge m:1 HID using household
    ...

    Comment


    • #3
      William,

      Thanks for your response. Yes you are right on the identifiers. My data files refuse to upload, but here is how the data looks like

      Code:
      [ use "individual.dta"

      . dataex HID FID HID FID PID AREAP AGEP ALSP BPFP in 1/10

      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str9 HID byte(FID PID AREAP AGEP ALSP BPFP)
      "PC9636522" 1 1 20 28 6 7
      "PC9600813" 1 1 1 34 5 1
      "PC9659736" 1 1 32 27 5 1
      "PC9666703" 2 2 37 3 10 1
      "PC9671295" 1 1 39 32 3 1
      "PC9601184" 1 1 1 34 3 18
      "PC9612648" 1 1 7 28 6 1
      "PC9624692" 1 1 13 32 3 1
      "PC9614047" 1 3 8 7 10 1
      "PC9605204" 1 1 3 30 4 2
      end

      clear all

      . use "household.dta"

      . dataex HID HIND AREAD CPAD DWTD FUFD HHTD in 1/10

      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str9 HID byte(HIND AREAD CPAD DWTD FUFD HHTD)
      "PC9653904" 6 29 5 1 5 3
      "PC9639199" 13 21 1 1 5 1
      "PC9613610" 16 8 1 1 5 1
      "PC9617019" 6 9 5 1 2 3
      "PC9638371" 7 21 1 1 5 1
      "PC9625033" 16 13 1 1 3 1
      "PC9612768" 12 7 1 1 3 1
      "PC9615131" 7 9 1 1 3 1
      "PC9632121" 7 17 5 1 5 3
      "PC9673798" 8 40 1 1 3 4
      end
      [/CODE]


      clear all

      . use "family.dta"

      . dataex HID CPAF FID AREAF CDCAF CDSAF CNDAF CPAF in 1/10

      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str9 HID byte(CPAF FID AREAF CDCAF CDSAF CNDAF)
      "PC9626738" 1 1 14 3 3 3
      "PC9625761" 1 1 14 1 3 3
      "PC9675335" 3 1 41 3 3 3
      "PC9657703" 3 1 31 3 3 3
      "PC9605144" 1 1 3 3 3 3
      "PC9627119" 1 1 14 1 3 3
      "PC9634359" 3 1 18 3 3 3
      "PC9600860" 3 2 1 3 3 3
      "PC9601032" 3 1 1 3 3 3
      "PC9672084" 2 1 39 3 3 3
      end

      When i run the following merge commands as you advice


      Code:
      [clear all

      . use "individual.dta"

      . sort HID PID PID

      . merge m:1 HID FID using "family.dta"

      Result # of obs.
      -----------------------------------------
      not matched 0
      matched 178,198 (_merge==3)
      -----------------------------------------

      . keep if _merge==3
      (0 observations deleted)

      . drop _merge

      . merge m:1 HID using "household.dta"

      Result # of obs.
      -----------------------------------------
      not matched 6,633
      from master 0 (_merge==1)
      from using 6,633 (_merge==2)

      matched 178,198 (_merge==3)
      -----------------------------------------

      .

      I am still getting the same error i.e. zero merges with the third (household) file.

      Your thoughts will be greatly appreciated.

      Best regards,

      Sunga
      Sunganani Kalemba
      PhD Student.
      Queensland

      Comment


      • #4
        You have misinterpreted the output of your second merge.

        It first tells you that there were 6,633 observations in the merged dataset that appeared in only one of the input datasets and were unmatched in the other.

        It then tells you that 0 unmatched observations came from your "master" dataset which in this case was the output of your first merge. So every observation in the output of your first merge was matched in the household dataset.

        It then tells you that 6,633 unmatched observations came from your "using" dataset - the household dataset. So the household dataset had 6,633 observations that did not correspond to anyone in the output of your first merge. I wonder if they were households that failed to participate in the survey.

        It then tells you that 178,198 observations - the total number in your master dataset - were matched.

        Comment


        • #5
          William,

          You are right. My apologies, i did misinterpret the results. Its my first time in a long time using stata and the first time using survey data.

          I drop the merge variables and keep the matched vars after each merge. Then I run the "describe" command and see the number of variables has progressively increased with each merge.

          Thanks a bunch!

          Best wishes.
          Sunganani Kalemba
          PhD Student.
          Queensland

          Comment

          Working...
          X