Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • A Data Merging Problem



    I am merging different datasets using "merge 1:1" option in STATA 17. I merge on two key variables: "country" or "ccode" and year. I want to import "uaeexport" variable into my first dataset. Country codes are unique. I really don't understand why I couldn't do it.

    Year: Int
    Iso: Float



    My first dataset: (Master data)

    country UAEODA iso ccode year turkeyoda

    Aruba 0 533 ABW 2000 0
    Aruba 0 533 ABW 2001 0
    Aruba 0 533 ABW 2002 0
    Aruba 0 533 ABW 2003 0
    Aruba 0 533 ABW 2004 0


    My Second Dataset:

    country year uaeexport iso ccode
    Afghanistan 2000 129559.63 4 AFG
    Afghanistan 2001 80161.099 4 AFG
    Afghanistan 2002 190925.73 4 AFG
    Afghanistan 2003 163179.8 4 AFG
    Afghanistan 2004 262686.2 4 AFG
    Afghanistan 2005 469253.61 4 AFG

    (the order of the countries is different)

    My code:

    merge 1:1 ccode year using uaeexport

    Error:

    variables ccode year do not uniquely identify observations in the master data


  • #2
    Stata is telling you that the variables ccode and year do not uniquely identify observations in the master dataset. This indicates that there is at least one observation in that dataset which is a duplicate of a previously occurring country-year pair.

    In order to solve this problem, first identify the problematic observations using:

    Code:
    duplicates list ccode year
    or

    Code:
    duplicates tag ccode year, gen(dups)
    browse if dups
    Identifying those observations will be the first clue to diagnosing how the data error came to be. It may have been introduced during a previous stage of the data cleaning process, or during your import of the data, or it could be an error within the original data source. In any event, you should retrace your steps to find out where the problem originated, and make appropriate adjustments to your process to fix it before attempting the merge again.
    Last edited by Ali Atia; 19 Apr 2022, 18:14.

    Comment


    • #3
      I don't want a verbal description of your data, I just want your data. Please give your example data using the dataex command. For us to provide meaningful feedback, you must provide your example data using the dataex command, the real data from an easily importable source (i.e., Github), or the equivalent of a toy example.
      Otherwise, anything we say is simply a waste of time. Note, that I'm not trying to be mean in saying this, I'm saying this because if we can't see your dataset as you do with a minimal worked example, anything we suggest is just guesswork. The reason that I'm emphasizing this is because questions like this one likely have a relatively simple fix, but even simple fixes can be wildly overcomplicated without a minimal worked example of a dataset and code that you've tried to accomplish your task.

      So please, provide us with your example data that encapsulates the problem and I'm more than okay with helping you solve this.

      Specifically, give one dataex for your master data and another dataex for your using data. Nihat Mugurtay

      Comment


      • #4
        Dear Jared, this is what dataex command gives me. The first one is master. Actually, the number of obs is different. This is okay for me. I just want 3696 obs in my main dataset. Therefore, I want to have countries (from using data) that are located in my master data. @Jared Greathouse

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str45 country str18 UAEODA float iso str3 ccode int year str17    turkeyoda
        "Aruba"       "0"                 533 "ABW" 2000 "0"                
        "Aruba"       "0"                 533 "ABW" 2001 "0"                
        "Aruba"       "0"                 533 "ABW" 2002 "0"                
        "Aruba"       "0"                 533 "ABW" 2003 "0"                
        "Aruba"       "0"                 533 "ABW" 2004 "0"                
        "Aruba"       "0"                 533 "ABW" 2005 "0"                
        "Aruba"       "0"                 533 "ABW" 2006 "0"                
        "Aruba"       "0"                 533 "ABW" 2007 "0"                
        "Aruba"       "0"                 533 "ABW" 2008 "0"                
        "Aruba"       "0"                 533 "ABW" 2009 "0"                
        "Aruba"       "0"                 533 "ABW" 2010 "0"                
        "Aruba"       "0"                 533 "ABW" 2011 "0"                
        "Aruba"       "0"                 533 "ABW" 2012 "0"                
        "Aruba"       "0"                 533 "ABW" 2013 "0"                
        "Aruba"       "0"                 533 "ABW" 2014 "0"                
        "Aruba"       "0"                 533 "ABW" 2015 "0"                
        "Aruba"       "0"                 533 "ABW" 2016 "0"                
        "Aruba"       "0"                 533 "ABW" 2017 "0"                
        "Aruba"       "0"                 533 "ABW" 2018 "0"                
        "Aruba"       "0"                 533 "ABW" 2019 "0"                
        "Aruba"       "1.21"              533 "ABW" 2020 "0"                
        "Afghanistan" ".06"                 4 "AFG" 2000 ".07"              
        "Afghanistan" "8.550000000000001"   4 "AFG" 2001 ".59"              
        "Afghanistan" "31.79"               4 "AFG" 2002 ".47"              
        "Afghanistan" "3.79"                4 "AFG" 2003 ".6899999999999999"
        "Afghanistan" "9.85"                4 "AFG" 2004 "7.32"            
        "Afghanistan" "22.61"               4 "AFG" 2005 "20.94"            
        "Afghanistan" "6.54"                4 "AFG" 2006 "41.24"            
        "Afghanistan" "1026.68"             4 "AFG" 2007 "43.85"            
        "Afghanistan" "12.97"               4 "AFG" 2008 "78.11"            
        "Afghanistan" "47.46"               4 "AFG" 2009 "59.46"            
        "Afghanistan" "21.47"               4 "AFG" 2010 "59.94"            
        "Afghanistan" "36.13"               4 "AFG" 2011 "75.34999999999999"
        "Afghanistan" "93.81999999999999"   4 "AFG" 2012 "87.16"            
        "Afghanistan" "57.33"               4 "AFG" 2013 "51.56"            
        "Afghanistan" "78.45999999999999"   4 "AFG" 2014 "40.11"            
        "Afghanistan" "60.33"               4 "AFG" 2015 "40.27"            
        "Afghanistan" "44.35"               4 "AFG" 2016 "23.8"            
        "Afghanistan" "30.91"               4 "AFG" 2017 "28.92"            
        "Afghanistan" "11.56"               4 "AFG" 2018 "29.1"            
        "Afghanistan" "4.95"                4 "AFG" 2019 "30.6"            
        "Afghanistan" "122.02"              4 "AFG" 2020 "36.47"            
        "Angola"      "0"                  24 "AGO" 2000 "0"                
        "Angola"      "0"                  24 "AGO" 2001 "0"                
        "Angola"      "0"                  24 "AGO" 2002 "0"                
        "Angola"      "0"                  24 "AGO" 2003 "0"                
        "Angola"      ".05"                24 "AGO" 2004 "0"                
        "Angola"      ".02"                24 "AGO" 2005 "0"                
        "Angola"      ".01"                24 "AGO" 2006 "0"                
        "Angola"      "0"                  24 "AGO" 2007 ".01"              
        "Angola"      "0"                  24 "AGO" 2008 ".01"              
        "Angola"      "0"                  24 "AGO" 2009 "0"                
        "Angola"      "0"                  24 "AGO" 2010 "0"                
        "Angola"      ".02"                24 "AGO" 2011 ".02"              
        "Angola"      "0"                  24 "AGO" 2012 ".01"              
        "Angola"      "0"                  24 "AGO" 2013 ".05"              
        "Angola"      "0"                  24 "AGO" 2014 ".04"              
        "Angola"      "0"                  24 "AGO" 2015 ".06"              
        "Angola"      ".01"                24 "AGO" 2016 ".04"              
        "Angola"      "0"                  24 "AGO" 2017 ".02"              
        "Angola"      "0"                  24 "AGO" 2018 ".05"              
        "Angola"      "0"                  24 "AGO" 2019 ".11"              
        "Angola"      "2.47"               24 "AGO" 2020 ".11"              
        "Anguilla"    "0"                 660 "AIA" 2000 "0"                
        "Anguilla"    "0"                 660 "AIA" 2001 "0"                
        "Anguilla"    "0"                 660 "AIA" 2002 "0"                
        "Anguilla"    "0"                 660 "AIA" 2003 "0"                
        "Anguilla"    "0"                 660 "AIA" 2004 "0"                
        "Anguilla"    "0"                 660 "AIA" 2005 "0"                
        "Anguilla"    "0"                 660 "AIA" 2006 "0"                
        "Anguilla"    "0"                 660 "AIA" 2007 "0"                
        "Anguilla"    "0"                 660 "AIA" 2008 "0"                
        "Anguilla"    "0"                 660 "AIA" 2009 "0"                
        "Anguilla"    "0"                 660 "AIA" 2010 "0"                
        "Anguilla"    "0"                 660 "AIA" 2011 "0"                
        "Anguilla"    "0"                 660 "AIA" 2012 "0"                
        "Anguilla"    "0"                 660 "AIA" 2013 "0"                
        "Anguilla"    "0"                 660 "AIA" 2014 "0"                
        "Anguilla"    "0"                 660 "AIA" 2015 "0"                
        "Anguilla"    "0"                 660 "AIA" 2016 "0"                
        "Anguilla"    "0"                 660 "AIA" 2017 "0"                
        "Anguilla"    "0"                 660 "AIA" 2018 "0"                
        "Anguilla"    "0"                 660 "AIA" 2019 "0"                
        "Anguilla"    "11.72"             660 "AIA" 2020 "0"                
        "Albania"     ".58"                 8 "ALB" 2000 ".45"              
        "Albania"     ".61"                 8 "ALB" 2001 ".71"              
        "Albania"     ".59"                 8 "ALB" 2002 "2.05"            
        "Albania"     ".98"                 8 "ALB" 2003 ".45"              
        "Albania"     ".72"                 8 "ALB" 2004 "3.95"            
        "Albania"     "1.07"                8 "ALB" 2005 "5.12"            
        "Albania"     "1.48"                8 "ALB" 2006 "5.11"            
        "Albania"     "1.4"                 8 "ALB" 2007 "2.63"            
        "Albania"     "1.75"                8 "ALB" 2008 "2.4"              
        "Albania"     ".8100000000000001"   8 "ALB" 2009 "5.31"            
        "Albania"     ".14"                 8 "ALB" 2010 "4.33"            
        "Albania"     "0"                   8 "ALB" 2011 "2.06"            
        "Albania"     "17.35"               8 "ALB" 2012 "4.53"            
        "Albania"     "26.97"               8 "ALB" 2013 "5.31"            
        "Albania"     "5.6"                 8 "ALB" 2014 "7.87"            
        "Albania"     "5.91"                8 "ALB" 2015 "60.8"            
        end
        copy up to and including the previous line ------ --------- -

        Listed 100 out of 3696 observations
        Use the count() option to list more


        The using Data:



        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str30 country int year double uaeexport float iso str3 ccode
        "Afghanistan"    2000          129559.63  4 "AFG"
        "Afghanistan"    2001          80161.099  4 "AFG"
        "Afghanistan"    2002          190925.73  4 "AFG"
        "Afghanistan"    2003         163179.804  4 "AFG"
        "Afghanistan"    2004         262686.196  4 "AFG"
        "Afghanistan"    2005         469253.611  4 "AFG"
        "Afghanistan"    2006         279935.872  4 "AFG"
        "Afghanistan"    2007         355729.571  4 "AFG"
        "Afghanistan"    2008         712980.032  4 "AFG"
        "Afghanistan"    2009        1127472.354  4 "AFG"
        "Afghanistan"    2010        2238404.679  4 "AFG"
        "Afghanistan"    2011        1713395.488  4 "AFG"
        "Afghanistan"    2012        2578721.891  4 "AFG"
        "Afghanistan"    2013   2138731.60901543  4 "AFG"
        "Afghanistan"    2014   1829166.16692711  4 "AFG"
        "Afghanistan"    2015   2175131.08403845  4 "AFG"
        "Afghanistan"    2016   1596610.73590114  4 "AFG"
        "Afghanistan"    2017 2175392.0286175897  4 "AFG"
        "Afghanistan"    2018 1832789.5474848398  4 "AFG"
        "Afghanistan"    2019 1620238.4089977501  4 "AFG"
        "Albania"        2000            292.231  8 "ALB"
        "Albania"        2001           1096.097  8 "ALB"
        "Albania"        2002           3803.337  8 "ALB"
        "Albania"        2003           3133.156  8 "ALB"
        "Albania"        2004            387.881  8 "ALB"
        "Albania"        2005            334.701  8 "ALB"
        "Albania"        2006            716.628  8 "ALB"
        "Albania"        2007           1983.239  8 "ALB"
        "Albania"        2008           2740.868  8 "ALB"
        "Albania"        2009           3535.364  8 "ALB"
        "Albania"        2010           5414.752  8 "ALB"
        "Albania"        2011           4181.619  8 "ALB"
        "Albania"        2012           6081.387  8 "ALB"
        "Albania"        2013    9302.1643468402  8 "ALB"
        "Albania"        2014  9840.942869939061  8 "ALB"
        "Albania"        2015   11300.4452803933  8 "ALB"
        "Albania"        2016   11867.2634878085  8 "ALB"
        "Albania"        2017 11816.271377086001  8 "ALB"
        "Albania"        2018 19667.759412091098  8 "ALB"
        "Albania"        2019   17271.3670516113  8 "ALB"
        "Algeria"        2000         186246.464 12 "DZA"
        "Algeria"        2001         131168.922 12 "DZA"
        "Algeria"        2002         226970.444 12 "DZA"
        "Algeria"        2003         333671.476 12 "DZA"
        "Algeria"        2004         409771.462 12 "DZA"
        "Algeria"        2005         355100.804 12 "DZA"
        "Algeria"        2006         410023.623 12 "DZA"
        "Algeria"        2007         432272.707 12 "DZA"
        "Algeria"        2008         344649.124 12 "DZA"
        "Algeria"        2009         295110.662 12 "DZA"
        "Algeria"        2010         303995.872 12 "DZA"
        "Algeria"        2011         371869.346 12 "DZA"
        "Algeria"        2012         664977.516 12 "DZA"
        "Algeria"        2013   417679.543392882 12 "DZA"
        "Algeria"        2014   480468.341606987 12 "DZA"
        "Algeria"        2015   440129.006791925 12 "DZA"
        "Algeria"        2016   374836.964792945 12 "DZA"
        "Algeria"        2017  551068.4169413709 12 "DZA"
        "Algeria"        2018  688539.9963628381 12 "DZA"
        "Algeria"        2019   642730.487111004 12 "DZA"
        "American Samoa" 2000                  . 16 "ASM"
        "American Samoa" 2001                  . 16 "ASM"
        "American Samoa" 2002                  . 16 "ASM"
        "American Samoa" 2003                  . 16 "ASM"
        "American Samoa" 2004                  . 16 "ASM"
        "American Samoa" 2005                  . 16 "ASM"
        "American Samoa" 2006                  . 16 "ASM"
        "American Samoa" 2007                  . 16 "ASM"
        "American Samoa" 2008                  . 16 "ASM"
        "American Samoa" 2009                  . 16 "ASM"
        "American Samoa" 2010            135.414 16 "ASM"
        "American Samoa" 2011                  . 16 "ASM"
        "American Samoa" 2012                  . 16 "ASM"
        "American Samoa" 2013                  . 16 "ASM"
        "American Samoa" 2014                  . 16 "ASM"
        "American Samoa" 2015        25.00339655 16 "ASM"
        "American Samoa" 2016                  . 16 "ASM"
        "American Samoa" 2017        7.054592952 16 "ASM"
        "American Samoa" 2018                  . 16 "ASM"
        "American Samoa" 2019   322.864927849035 16 "ASM"
        "Andorra"        2000             12.555 20 "AND"
        "Andorra"        2001                  . 20 "AND"
        "Andorra"        2002                  . 20 "AND"
        "Andorra"        2003            103.229 20 "AND"
        "Andorra"        2004            364.755 20 "AND"
        "Andorra"        2005           1360.197 20 "AND"
        "Andorra"        2006             1076.7 20 "AND"
        "Andorra"        2007            925.669 20 "AND"
        "Andorra"        2008           1726.153 20 "AND"
        "Andorra"        2009           1312.169 20 "AND"
        "Andorra"        2010           3998.507 20 "AND"
        "Andorra"        2011           1103.479 20 "AND"
        "Andorra"        2012             32.996 20 "AND"
        "Andorra"        2013     148.6170168681 20 "AND"
        "Andorra"        2014     181.1131001867 20 "AND"
        "Andorra"        2015 4.7106862000000005 20 "AND"
        "Andorra"        2016       89.855930824 20 "AND"
        "Andorra"        2017       98.172334172 20 "AND"
        "Andorra"        2018    23.447089743363 20 "AND"
        "Andorra"        2019     4.683039187086 20 "AND"
        end
        copy up to and including the previous line ---- -------------

        Listed 100 out of 4600 observations
        Use the count() option to list more

        Last edited by Nihat Mugurtay; 19 Apr 2022, 18:32.

        Comment


        • #5
          Okay so as per #2, what happens when you do
          Code:
          duplicates list ccode year

          Comment


          • #6
            Well, even just glancing at the example data shown it is blatantly obvious that ccode does not uniquely identify observations in either data set. Nor does country. Clearly, both of these are panel data sets and unique identification of observations requires not just the panel but the time variable.

            Code:
            merge 1:1 ccode year using uaeexport
            works with the example data. But the fact that it gives you an error message implies that somewhere in the full data sets you have some observations that agree on country and year: that should not happen! Find them with
            Code:
            duplicates tag ccode year, gen(flag)
            browse if flag
            Then you have to go about fixing the data by eliminating the offending surplus observations. Just how to go about that will depend on how those got there in the first place and whether there is some way to pick one correct one for each ccode-year group, or whether they must be combined in someway. In the end, though, you have to get your data down to a single observation for each ccode-year group in each data set. And you should not be content to just purge the problems: you should also investigate how they got there in the first place. Something evidently went wrong in putting these data sets together, and there may well be other errors that have not yet tripped you up. Better to find and fix them now than get tripped up by them further on in your work when it will be harder to figure out, or might even have deleterious consequences.
            Last edited by Clyde Schechter; 19 Apr 2022, 18:39.

            Comment


            • #7
              Along with Clydes comments, I'll give a little more advice.

              I break up my Stata Code into multiple files. MULTIPLE files. One Master file to rule them all, a file which runs two master files for analysis and data cleaning. Each master file for cleaning and analysis runs sub files which are intended to do a very specific job, beginning with obtaining the data (yes I integrate this into my code, Python and all), as well as the code to clean it and ensure it does as I want.

              If I have a file called 2_0Pop where I import county level population data, after my cleaning, I will do the equivalent of
              Code:
              isid sid cid year
              
              // state id, county id
              To ensure that state, county and year ID uniquely identify each other. If this test passes, I proceed. If not, Stata throws an error. This ENSURES beyond doubt that all your identifiers are unique and. Additionally, I'll do things like
              Code:
              as pop >0
              to ensure that I, by some improbable chance, don't have values that're nonsense.

              Writing simple unit tests like these WHILE YOU'RE MAKING AND CLEANING YOUR DATA go quite a long way in diagnosing issues like this.

              I should mention that I'm not perfect about this, cuz I'm a human too, but simple assert tests and others will go a very long way in ensuring that you're not pulling your hair out because "I just knew that these are in fact unique IDs" when in reality, you changed your code from two days ago and now, surprise surprise, something went wrong.

              If you can run code that you wrote 3 months ago, and it gives you the exact same result (barring certain technical errors, ie bad URLs), you've done your job well.

              Comment


              • #8
                Dear all,

                I tried what you told me, and yes, I worked it out for some datasets, not for others. I easily merged datasets from one source (i.e, worldbank), but I couldn't merge them with other datasets from different sources (such as OECD). I think inconsistencies result from data sources, and ı will fix them. Thank you, @Ali Atia, @Clyde Schechter, @Jared Greathouse.

                Best,

                Comment

                Working...
                X