Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Error when using merge

    Dear All,

    I have a master data set, called master, with approx. 100.000 firms. One of the variables is the city where the firm is located. There are multiple firms within each city in my data. I want to add some city characteristics to the master data set. I have a separate data set, called mergedata where the cities are listed uniquely with different variables. I have tried the following:

    use master
    merge m:1 city using mergedata

    This gives the error: "variable city does not uniquely identify observations in the using data"

    What am I doing wrong? I have checked that the variable "city" is uniquely identified in mergedata.

    Thanks for your help!

    Best,
    Fredrik Bakkemo

  • #2
    Hi, first of all you should use your real name, see Stata FAQ (which most definitely isn't ladf?!), you can contact the admins to change that. Also: This is a fairly trivial question (no offence, but there are tons of Youtube tutorials about it and the question also has been asked multiple times on this forum). See below.
    Last edited by Jon Hoefer; 10 Feb 2020, 16:21.

    Comment


    • #3
      I'm sorry to say, but a merge m:m is not the right thing to do.

      The following is copied word-for-word from the documentation of the merge command in the Stata Data Management Reference Manual PDF included in the Stata installation and accessible from Stata's Help menu.

      m:m merges

      m:m specifies a many-to-many merge and is a bad idea. In an m:m merge, observations are matched within equal values of the key variable(s), with the first observation being matched to the first; the second, to the second; and so on. If the master and using have an unequal number of observations within the group, then the last observation of the shorter group is used repeatedly to match with subsequent observations of the longer group. Thus m:m merges are dependent on the current sort order—something which should never happen.

      Because m:m merges are such a bad idea, we are not going to show you an example. If you think that you need an m:m merge, then you probably need to work with your data so that you can use a 1:m or m:1 merge. Tips for this are given in Troubleshooting m:m merges below.
      If a merge m:m seems like the what you need, it is a near certainty that at least one of the following is true:

      1. Your merge key actually does uniquely identify the observations in one of your data sets, so you can use merge 1:m or merge m:1 or maybe even merge 1:1.

      2. You are failing to take account of one or more additional variables in your data set that, combined with the variables you are trying to use as the merge key, uniquely identify the observations in one or both of your data sets, so you can use merge 1:m or merge m:1 or merge 1:1 with the expanded merge key.

      3. You are really trying to accomplish what joinby, a different command, does: create in the output dataset every possible combination of an observation from the first dataset and an observation from the second dataset, both having the same key. (SQL users take note! I fell for this myself: to Stata m-to-m does not mean m-by-m.)

      4. The data sets you are trying to merge are incorrectly configured or contain data errors that need to be fixed.

      So the problem is, it seems to me, that you haven't told us how you "have checked that the variable "city" is uniquely identified in mergedata" but I suspect you overlooked something.

      When you have your mergedata dataset in memory,
      Code:
      isid city
      will confirm or refute that city uniquely identifies observations. Perhaps you need a combination of city and a higher level of geography. If as I expect isid confirms that you have a problem with city, then the duplicates command can help you identify the problem.
      Code:
      duplicates report city
      duplicates examples city
      See the output of help duplicates for more features of this command.

      Comment


      • #4
        si tacuisses, philosophus mansisses... thanks for correcting. I in fact always sorted the data first and then used m:m but as you point out, this is suboptimal.

        Comment


        • #5
          Originally posted by Jon Hoefer View Post
          Hi, first of all you should use your real name, see Stata FAQ (which most definitely isn't ladf?!), you can contact the admins to change that. Also: This is a fairly trivial question (no offence, but there are tons of Youtube tutorials about it and the question also has been asked multiple times on this forum). See below.
          Thank's for pointing that out - I wasn't aware. I will make sure to contact admins to change the nick to my real name! Speaking of good forum practice, you should read the question carefully and make sure you actually know the answer to the question before you frame it as trivial. Furthermore, you should avoid editing mistakes in previous posts when corrected by other users. Such editing is confusing to both me and others who might find the thread helpful.

          Comment


          • #6
            Originally posted by William Lisowski View Post
            I'm sorry to say, but a merge m:m is not the right thing to do.

            [...]

            When you have your mergedata dataset in memory,
            Code:
            isid city
            will confirm or refute that city uniquely identifies observations. Perhaps you need a combination of city and a higher level of geography. If as I expect isid confirms that you have a problem with city, then the duplicates command can help you identify the problem.
            Code:
            duplicates report city
            duplicates examples city
            See the output of help duplicates for more features of this command.
            Thank you for a really instructive answer! -isid city- made me aware of my problem. I had only checked for duplicates in my raw data in Excel. It turns out Stata imported two additional (supposedly) empty rows, which caused the variable "city" not to be uniquely identified. Thank's again!

            Best,

            Fredrik Bakkemo

            Comment


            • #7
              It turns out Stata imported two additional (supposedly) empty rows ...
              You will find this is not an uncommon problem when importing Excel worksheets. I believe the problem lies in the mechanism Excel uses to keep track of the lower-right corner of the worksheet and that, were you to examine the xlsx file, you would find those two blank rows there as well.

              When using import excel it's a good idea to use the cellrange() and firstrow() options when possible to ensure you get what you're expecting, with no bonuses added in. Or, after importing the worksheet, use Stata's Data Browser and confirm that the rightmost variables and bottommost observations of your data are as you expect them to be.

              Comment


              • #8
                Thanks for the tip! I’ll certainly be careful to check this next time.

                Comment


                • #9
                  And thank you, Fredrik Bakkemo, for taking the time to have the admins change your identity. We do strive for collegiality here on Statalist.

                  Comment

                  Working...
                  X