Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging Datasets Without Uniquely Identified Observations

    Hello,
    I am using StataMP13 and have a question regarding the merging of two datasets.

    My master dataset has 59 variables and 624,929 observations. Two of the variables are County and Year.

    My secondary dataset has 8 variables and 1,785 observations, each observation corresponds to a County and Year.

    I am attempting to combine the two datasets so that each observation in my master dataset includes the corresponding data for the matching County and Year from the secondary dataset. I ran merge 1:m wua_year county using "..." and received error message r(459): variables wua_year county do not uniquely identify observations in the master data.

    I understand that the purpose of the merge command is to match observations by uniquely identifying observations in the master dataset. Is there a way to merge the datasets by applying the each observation for County, Year to multiple observations in my master dataset?

    Thank you.


  • #2
    Hi Bjorn,

    see -help merge-: There is not only "merge 1:m", but also its opposite "merge m:1". I think it does what you ask for, at least if I understood the question correctly.
    (Note that Stata technically also supports "merge m:m", but it should not be used. Never. Actually, no one ever showed an example to me where "m:m" would yield appropriately interpretable results.)

    As an additional note: You can use more than one variable as your merging identifier; you do not need to create a singe joint identifier variable. I guess
    Code:
    merge m:1 County Year using ...
    could do the trick for you.

    Regards
    Bela
    Last edited by Daniel Bela; 15 Sep 2017, 01:44. Reason: emphasize on not performing m:m merges

    Comment


    • #3
      Hi Daniel,

      Thank you for your reply. I have now tried "merge 1:1", "merge 1:m", and "merge m:1". As I reported in my previous post, "merge 1:m" has provided me with error message r(459), which is the same result I get for "merge 1:1".

      Finally, I ran "merge m:1". The code ran, however, none of the data were matched. The following is the output I receive from Stata.

      Click image for larger version

Name:	Merge.png
Views:	1
Size:	24.7 KB
ID:	1410579


      As you can see, both datasets remain unmatched. My understanding is that this is due to the fact that the observations are not uniquely identified by County, Year (which is by design because I need all single observations from the master dataset to be matched with the data from the secondary dataset that correspond to the correct County, Year). To reiterate, I am looking to match each observation from the secondary dataset to multiple observations in the master dataset, which at this point doesn't seem to be a function that merge supports. Does anyone have an idea of any other codes that might make this work?

      *Note: Your point concerning not using "merge m:m" was well taken; however, I did run "merge m:m", just to exhaust all possibilities, and received similar results to when running "merge m:1".

      Thanks,

      Bjorn

      Comment


      • #4
        Hi again,

        Originally posted by Bjorn Kallerud View Post
        As you can see, both datasets remain unmatched. My understanding is that this is due to the fact that the observations are not uniquely identified by County, Year (which is by design because I need all single observations from the master dataset to be matched with the data from the secondary dataset that correspond to the correct County, Year). To reiterate, I am looking to match each observation from the secondary dataset to multiple observations in the master dataset, which at this point doesn't seem to be a function that merge supports. Does anyone have an idea of any other codes that might make this work?
        your intuition leads you the wrong path here: If, as you made clear, your using dataset contains unique County-Year-combinations, and your using dataset contains uniquely defined County-Year-combinations, -merge m:1- is the way to go. This is why your merge-operation worked (i.e.: did not produce an error in Stata). And I want to repeat, as an addition, that you should never merge m:m. It will not yield interpretable data, except by coincidence.

        There is a second aspect in merging data, and this is what Stata tells you here: Even if a merge m:1 can be successfully executed, it may happen that observations do not match between the two datasets you're combining. In other terms, not all (or none at all) County-Year-Combinations can be found in both data sources.

        Without a proper example of the data you work with (please have a look at the FAQ to learn how to share a good data example using dataex) we can only guess the reason behind this problem.

        Regards
        Bela

        Comment


        • #5
          Thanks Bela,
          You were spot on, in that my intuition led me down the wrong path. I immediately saw why there were no matches when I ran when merge m:1, as I posted my data into the reply.

          The following is a small sample from the master dataset:

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input long wr_id str3 county int wua_year
          30854 " AL" 2002
           3216 " AL" 2013
          30854 " AL" 2000
          30854 " AL" 2008
          29175 " AL" 2003
          end
          The following is a small set from the secondary dataset:

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str2 county int wua_year double D0D4
          "AL" 2000 33.895961538461535
          "AL" 2001  12.88923076923077
          "AL" 2002  53.68037735849057
          "AL" 2003  45.75730769230769
          "AL" 2004 1.0555769230769232
          end
          The datasets were mismatched because of the leading space in the county variable in the master dataset. I used the ltrim function to get rid of the leading space in the master dataset, and successfully merged the datasets using merge m:1. Thank you very much for taking the time to help me solve this elementary problem. I am very grateful.

          Best,

          Bjorn

          Comment

          Working...
          X