Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Mismatches in a panel data set after appending new data/merging two sets

    Dear all,

    I have a problem with appending data to my master panel data set. Both datasets contain the same number of variables and each observation is uniquely identified through three variables in both data sets (checked it with isid). I tried to merge the two data sets with "append" and "merge 1:1" and in both cases there are several mismatches in the appended data with no obvious pattern and the data from the master set stays unchanged. All variables have the same type in both data sets. Is there any other function I could use? Or do you have any other idea what the source of the problem could be?

    Many thanks in advance.

  • #2
    I don't understand what you did, nor what happened when you did it.

    Are you -append-ing or are you -merge-ing the two data sets? Those are very different operations, as I'm sure you know. With -append- there is no issue of matching: nothing in the two data sets needs to match for a successful append. In fact, with respect to the three identifier variables, you would ordinarily expect there not to be matches when you append, because if there are, you will no longer have those three variables as unique identifiers.

    If you are -merge-ing the data sets, then the expectation is usually that there will be some matches on those three key variables. But there can also be observations in the master only, or in the using only. If your expectation is that there shouldn't be, but Stata tells you there are, then the only conclusion is that something is wrong with one or both of the data sets and you need to investigate that.

    If you can clarify this, perhaps more specific and helpful advice can be given. I suggest using -dataex- to show example data from both data sets, and show the exact command(s) you are using. And choose your example data so that it illustrates the problem you are encountering. Also, explain what it is about the results you are getting that differs from what you expect/want. That would make it possible to troubleshoot the situation.

    If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use -dataex-.

    Comment


    • #3
      Hello Clyde,

      Thanks for your feedback. At the moment I am not allowed to publish my data. So using -dataex- is probably not an option right now. At the moment I am just trying to add observations to my data to increase the probability of significant results in my regression analysis. Therefore, I have cleansed additional data from an excel sheet and saved them as stata data.

      In the following, I present the commands I used.

      Using "merge":

      . use "C:\Users\Eric Schröder\OneDrive\Daten1821 AKTUELL Hautfarbe.dta"

      . sort Player season gameday

      . isid Player season gameday
      .

      . merge 1:1 season gameday Player using "C:\Users\Eric Schröder\OneDrive\Daten1617 AKTUELL Nationalität.dta"
      (label club_name already defined)
      (label Player already defined)

      Result Number of obs
      -----------------------------------------
      Not matched 43,431
      from master 34,937 (_merge==1)
      from using 8,494 (_merge==2)

      Matched 0 (_merge==3)
      -----------------------------------------

      Using "append":

      use "C:\Users\Eric Schröder\OneDrive\Daten1821 AKTUELL Hautfarbe.dta"

      .
      . sort id season gameday

      .
      . isid Player season gameday

      .
      . order Player id club_name season position ethn_ID date gameday ppg m_played quote type nationality status minutes ethnicity year skin_color

      . append using "C:\Users\Eric Schröder\OneDrive\Daten1617 AKTUELL Nationalität.dta"
      (label club_name already defined)
      (label Player already defined)


      In both cases the 8494 observations added get messy across all observations of the categorical variable club_name.

      Hopefully this will help.


      Comment


      • #4
        I will guess that you used the encode command to create numeric variables Player and club_name from string variables. This was a mistake, because the results of using encode depend very much on the actual data in each dataset. So if you run the command
        Code:
        label list club_name Player
        in each of your two datasets you will find that the same club or player may be encoded to different numeric values.

        Similarly when you append the data, the two different encodings don't match, and the one in the first dataset takes precedence.

        You first merge or append (whichever makes sense, it seems to me that append is what you want since your identifiers are player, season, and gameday) and then encode the string values into numeric values.
        Last edited by William Lisowski; 10 Mar 2022, 10:27.

        Comment


        • #5
          Hello William,

          this seems reasonable. I will give it a try. Thanks a lot!

          Comment

          Working...
          X