Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging datasets

    Hello,

    I have some questions regarding merging datasets. In one dataset, I have industry level data on value added, employment, wages at the 4 and 3 digit industry level by country and year. In the other dataset, I also have data at the 4 digit but on exports and imports. My question is how to merge these datasets.

    The code I am trying is merge 1:1 country isic year {link for using dataset}. The thing is that I do think a lot of variables are not matched when they should be (obviously the 3 digit level in the master dataset and isic=="D" will not match). Any suggestions? Does it matter which one I pick as the master and the using? Because I tried both ways and matches are different

    Master dataset

    Code:
     * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int(country year) str4 isic str5 isiccomb byte sourcecode long(Establishments Employment) double(Wages OutputINDSTAT4 ValueAdded GrossFixed) long FemaleEmployees
    4 1998 "D"    "D"    1   .     .          .          . . .   231
    4 1999 "D"    "D"    1   .     .          .          . . .    20
    4 2001 "1511" "1511" 1   1     5          .          . . .     .
    4 2001 "1513" "1513" 1  10    15          .          . . .     .
    4 2001 "1514" "1514" 1   6    14          .          . . .     .
    4 2001 "1541" "1541" 1   1     5          .          . . .     .
    4 2001 "1554" "1554" 1   7     8          .          . . .     .
    4 2001 "1712" "1712" 1   2    15          .          . . .     .
    4 2001 "1722" "1722" 1   2    86          .          . . .     .
    4 2001 "1920" "1920" 1   1    30          .          . . .     .
    4 2001 "2211" "2211" 1   3     .          .          . . .     .
    4 2001 "2212" "2212" 1   .   129          .          . . .     .
    4 2001 "2219" "2219" 1   9    13          .          . . .     .
    4 2001 "2412" "2412" 1   1   298          .          . . .     .
    4 2001 "2520" "2520" 1  44     .          .          . . .     .
    4 2001 "2694" "2694" 1   2    96          .          . . .     .
    4 2001 "3699" "3699" 1  53    25          .          . . .     .
    4 2001 "D"    "D"    1 153   739          .          . . .    52
    4 2002 "154"  "154"  1  57  1398   58716000  3.650e+08 . .   270
    4 2002 "1722" "1722" 1  11   830   38860000   67000000 . .   710
    4 2002 "2029" "2029" 1  14   723   30780000    9000000 . .     .
    4 2002 "221"  "221"  1  33   680   28260000    2000000 . .    77
    4 2002 "242"  "242"  1   5   296   11628000          . . .   150
    4 2002 "2520" "2520" 1  69  1212   67704000   40000000 . .     .
    4 2002 "2695" "2695" 1   .  2140   68880000          . . .     .
    4 2002 "289"  "289"  1  32   344   11248000   11000000 . .     .
    4 2002 "3699" "3699" 1   6  1389   63444000    2000000 . .   106
    4 2002 "D"    "D"    1 232  9012  379520000  4.960e+08 . .  1313
    4 2003 "154"  "154"  1  68  1986   90564000  1.460e+08 . .   384
    4 2003 "1722" "1722" 1  22  2371  119134000  1.660e+08 . .  1800
    4 2003 "2029" "2029" 1  10   690   31464000   32000000 . .     .
    4 2003 "221"  "221"  1  34   715   42604000   32000000 . .   125
    4 2003 "242"  "242"  1   6   921   41000000   95000000 . .   380
    4 2003 "2520" "2520" 1  52  1130   51528000   97000000 . .     .
    4 2003 "2695" "2695" 1  48  1823   83124000   65000000 . .     .
    4 2003 "289"  "289"  1  58   997   45468000   35000000 . .     .
    4 2003 "3699" "3699" 1  14  2078   95172000   31000000 . .   699
    4 2003 "D"    "D"    1 312 12711  600058000  6.990e+08 . .  3388
    4 2004 "154"  "154"  1  60  2170  104160000  3.580e+08 . .   720
    4 2004 "1722" "1722" 1   8  1870   89760000  1.370e+08 . .  1709
    4 2004 "2029" "2029" 1  18   970   46560000  2.170e+08 . .     .
    4 2004 "221"  "221"  1  44   990   34320000   73000000 . .   360
    4 2004 "242"  "242"  1  11  1170   56160000  2.040e+08 . .   720
    4 2004 "2520" "2520" 1  74  1500   72000000  1.840e+08 . .     .
    4 2004 "2695" "2695" 1  36  1640   78720000  1.270e+08 . .     .
    4 2004 "289"  "289"  1  68  1200   57600000   83000000 . .     .
    4 2004 "3699" "3699" 1  28  2625  1.260e+08   46000000 . .  1020
    4 2004 "D"    "D"    1 347 14135  665280000  1.429e+09 . .  4529
    4 2005 "154"  "154"  1 152  4376  2.208e+08  1.208e+09 . .   561
    4 2005 "1722" "1722" 1  31  4474  2.256e+08  7.880e+08 . .  3900
    4 2005 "2029" "2029" 1  43  1500   70560000  3.700e+08 . .     .
    4 2005 "221"  "221"  1  47  1380   81552000  1.200e+08 . .   345
    4 2005 "242"  "242"  1   7  1270   53928000  1.670e+08 . .   790
    4 2005 "2520" "2520" 1  45  1600   80680000  6.460e+08 . .     .
    4 2005 "2695" "2695" 1  65  3330  167832000  1.397e+09 . .     .
    4 2005 "289"  "289"  1  54  1190   54936000  2.930e+08 . .     .
    4 2005 "3699" "3699" 1 123  3584  180636000  1.156e+09 . .  1090
    4 2005 "D"    "D"    1 567 22704 1136524000  6.145e+09 . .  6686
    4 2006 "154"  "154"  1 199  5278  269616000  1.419e+09 . .   901
    4 2006 "1722" "1722" 1  49  5808  313632000  9.640e+08 . .  5000
    4 2006 "2029" "2029" 1  47  1785   91260000  3.820e+08 . .     .
    4 2006 "221"  "221"  1  56  2180  1.134e+08  1.360e+08 . .   459
    4 2006 "242"  "242"  1  10  1800   97200000  1.900e+08 . .  1102
    4 2006 "2520" "2520" 1  58  1930  104220000  7.160e+08 . .     .
    4 2006 "2695" "2695" 1  84  4900  2.838e+08  1.670e+09 . .     .
    4 2006 "289"  "289"  1  78  2320  125280000  3.540e+08 . .     .
    4 2006 "3699" "3699" 1 161  6190  334260000  1.202e+09 . .  3692
    4 2006 "D"    "D"    1 742 32191 1732668000  7.033e+09 . . 11154
    4 2007 "154"  "154"  1 206  5300  305280000  1.804e+09 . .  1237
    4 2007 "1722" "1722" 1  57  6700  385920000  9.970e+08 . .  5908
    4 2007 "2029" "2029" 1  50  1600   92160000  3.920e+08 . .     .
    4 2007 "221"  "221"  1  60  2290  131904000  1.800e+08 . .   983
    4 2007 "242"  "242"  1  12  1920  110592000  2.400e+08 . .  1207
    4 2007 "2520" "2520" 1  61  2500  132480000  8.170e+08 . .     .
    4 2007 "2695" "2695" 1  93  2400  270720000  1.743e+09 . .     .
    4 2007 "289"  "289"  1  86  2570  148042000  3.920e+08 . .     .
    4 2007 "3699" "3699" 1 183  6327  364428000  1.268e+09 . .  2856
    4 2007 "D"    "D"    1 808 31607 1941526000  7.833e+09 . . 12191
    4 2008 "154"  "154"  1 212  5431  322376000  1.791e+09 . .   899
    4 2008 "1722" "1722" 1  64  6917  423324000  9.150e+08 . .  6010
    4 2008 "2029" "2029" 1  45  1520   96672000  3.660e+08 . .     .
    4 2008 "221"  "221"  1  63  2290  140148000  1.800e+08 . .  1120
    4 2008 "242"  "242"  1  14  2133  130536000  8.171e+08 . .  1309
    4 2008 "2520" "2520" 1  58  2030  124236000  8.780e+08 . .     .
    4 2008 "2695" "2695" 1 101  5503  326780000  1.762e+09 . .     .
    4 2008 "289"  "289"  1  78  2440  149328000  3.910e+08 . .     .
    4 2008 "3699" "3699" 1 196  7056  431832000 1.4476e+09 . .  3791
    4 2008 "D"    "D"    1 831 35320 2145232000 8.5477e+09 . . 13129
    4 2009 "154"  "154"  1 199  5400  343440000 2.2688e+09 . .   870
    4 2009 "1549" "1549" 1 199  5400  343440000 2.2688e+09 . .   870
    4 2009 "172"  "172"  1  66  6902  438969000  4.450e+08 . .  6030
    4 2009 "1722" "1722" 1  66  6902  438969000  4.450e+08 . .  6030
    4 2009 "202"  "202"  1  37  1497   95202000  3.710e+08 . .     .
    4 2009 "2029" "2029" 1  37  1497   95202000  3.710e+08 . .     .
    4 2009 "221"  "221"  1  63  2290  145644000  1.930e+08 . .  1117
    4 2009 "2212" "2212" 1  63  2290  145644000  1.930e+08 . .  1117
    4 2009 "242"  "242"  1  11  2133  135028000  8.778e+08 . .  1325
    4 2009 "2429" "2429" 1   .  2133  135028000  8.778e+08 . .  1325
    4 2009 "2520" "2520" 1  58  2030  129168000  9.210e+08 . .     .
    4 2009 "269"  "269"  1 105  5510  385190000  1.845e+09 . .     .
    end
    Using Dataset
    Code:
     * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int(country year isic) long(Output ImportsWorld ExportsWorld ApparentConsumption ImportsDeveloping ImportsIndustrialized ExportsDeveloping ExportsIndustrialized)
    4 2002 1511    . . . . . . . .
    4 2002 1512    . . . . . . . .
    4 2002 1513    . . . . . . . .
    4 2002 1514    . . . . . . . .
    4 2002 1520    . . . . . . . .
    4 2002 1531    . . . . . . . .
    4 2002 1532    . . . . . . . .
    4 2002 1533    . . . . . . . .
    4 2002 1541    . . . . . . . .
    4 2002 1542    . . . . . . . .
    4 2002 1543    . . . . . . . .
    4 2002 1544    . . . . . . . .
    4 2002 1549    . . . . . . . .
    4 2002 1551    . . . . . . . .
    4 2002 1552    . . . . . . . .
    4 2002 1553    . . . . . . . .
    4 2002 1554    . . . . . . . .
    4 2002 1600    . . . . . . . .
    4 2002 1711    . . . . . . . .
    4 2002 1712    . . . . . . . .
    4 2002 1721    . . . . . . . .
    4 2002 1722 1616 . . . . . . .
    4 2002 1723    . . . . . . . .
    4 2002 1729    . . . . . . . .
    4 2002 1730    . . . . . . . .
    4 2002 1810    . . . . . . . .
    4 2002 1820    . . . . . . . .
    4 2002 1911    . . . . . . . .
    4 2002 1912    . . . . . . . .
    4 2002 1920    . . . . . . . .
    4 2002 2010    . . . . . . . .
    4 2002 2021    . . . . . . . .
    4 2002 2022    . . . . . . . .
    4 2002 2023    . . . . . . . .
    4 2002 2029  217 . . . . . . .
    4 2002 2101    . . . . . . . .
    4 2002 2102    . . . . . . . .
    4 2002 2109    . . . . . . . .
    4 2002 2211    . . . . . . . .
    4 2002 2212    . . . . . . . .
    4 2002 2213    . . . . . . . .
    4 2002 2219    . . . . . . . .
    4 2002 2221    . . . . . . . .
    4 2002 2222    . . . . . . . .
    4 2002 2230    . . . . . . . .
    4 2002 2310    . . . . . . . .
    4 2002 2320    . . . . . . . .
    4 2002 2330    . . . . . . . .
    4 2002 2411    . . . . . . . .
    4 2002 2412    . . . . . . . .
    4 2002 2413    . . . . . . . .
    4 2002 2421    . . . . . . . .
    4 2002 2422    . . . . . . . .
    4 2002 2423    . . . . . . . .
    4 2002 2424    . . . . . . . .
    4 2002 2429    . . . . . . . .
    4 2002 2430    . . . . . . . .
    4 2002 2511    . . . . . . . .
    4 2002 2519    . . . . . . . .
    4 2002 2520  965 . . . . . . .
    4 2002 2610    . . . . . . . .
    4 2002 2691    . . . . . . . .
    4 2002 2692    . . . . . . . .
    4 2002 2693    . . . . . . . .
    4 2002 2694    . . . . . . . .
    4 2002 2695    . . . . . . . .
    4 2002 2696    . . . . . . . .
    4 2002 2699    . . . . . . . .
    4 2002 2710    . . . . . . . .
    4 2002 2720    . . . . . . . .
    4 2002 2731    . . . . . . . .
    4 2002 2732    . . . . . . . .
    4 2002 2811    . . . . . . . .
    4 2002 2812    . . . . . . . .
    4 2002 2813    . . . . . . . .
    4 2002 2891    . . . . . . . .
    4 2002 2892    . . . . . . . .
    4 2002 2893    . . . . . . . .
    4 2002 2899    . . . . . . . .
    4 2002 2911    . . . . . . . .
    4 2002 2912    . . . . . . . .
    4 2002 2913    . . . . . . . .
    4 2002 2914    . . . . . . . .
    4 2002 2915    . . . . . . . .
    4 2002 2919    . . . . . . . .
    4 2002 2921    . . . . . . . .
    4 2002 2922    . . . . . . . .
    4 2002 2923    . . . . . . . .
    4 2002 2924    . . . . . . . .
    4 2002 2925    . . . . . . . .
    4 2002 2926    . . . . . . . .
    4 2002 2927    . . . . . . . .
    4 2002 2929    . . . . . . . .
    4 2002 2930    . . . . . . . .
    4 2002 3000    . . . . . . . .
    4 2002 3110    . . . . . . . .
    4 2002 3120    . . . . . . . .
    4 2002 3130    . . . . . . . .
    4 2002 3140    . . . . . . . .
    4 2002 3150    . . . . . . . .
    end

    Thanks!


  • #2
    Well, in the using data set, before attempting to merge them, you should -tostring isic, replace- so that it will be type-compatible with isic in the master data.

    Evidently, an isic code like "D" in the master can never match with anything in the using data because the values of isic in the using data are completely numeric. Nothing you can do about that unless the "D" is a data error and can be replaced with a valid numeric code.

    The other thing you may want to check in your real data sets is the range of years. In the example data you show, there are only four observations that can match with anything in the using data. That's because your using data is all for year 2002, whereas year 2002 only appears in four observations in the master data set. If the range of years in your master and using data sets are mismatched in that way, then very few of the observations will successfully match.

    If this doesn't solve your problem, please post back with example data sets that contain observations that you believe should be able to match with each other, and also show us the code you are using to try to merge the data sets.

    Comment


    • #3
      Hi! Thank you for your help!

      In fact, the master dataset has data from 1973 until 2018 while the using dataset has data from 1990 until 2018. Both are unbalanced panel. Clearly, I expect years before 1990 not to merge in the masterdataset and my expectations is that all information regarding dataset in the using dataset should merge. Doing it step by step. I open the master dataset, I drop variables that have a sourcecode different than 0 (in the using dataset, I do not have that problem). Ergo, the code should be

      use "/Users/hugovaca-pereirarocha/Desktop/Research Final/INDSTAT4Rev3/INDSTAT44 copy 5.dta", clear <
      keep if sourcecode==1
      duplicates tag country isic year, gen(flag)// country year isic do not uniquely identify observations
      keep if flag==0 // (Sample gets reduced to 165,760 observations other flag identifiers drop observations substantially)
      destring isic, replace force

      merge 1:1 country year isic using "/Users/hugovaca-pereirarocha/Desktop/Research Final/IDSB Clean> /CleanIDSB copy 5.dta (using dataset)" (129, 499 observations matched).
      // Attached you can you a set of columns that should have matched and did not match

      One question:Does it matter which I use as using dataset or which I use as master dataset?

      Dataex after the datasets are merged (without dropping unmerged data)

      Code:
       * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int(country year isic) str5 isiccomb byte sourcecode long(Establishments Employment) double(Wages OutputINDSTAT4 ValueAdded GrossFixed) long FemaleEmployees byte flag long(Output ImportsWorld ExportsWorld ApparentConsumption)
      4 1998    . "D"    1   .     .          .          . . .   231 0     .     .      .     .
      4 1999    . "D"    1   .     .          .          . . .    20 0     .     .      .     .
      4 2001 1511 "1511" 1   1     5          .          . . .     . 0     .     .      .     .
      4 2001 1513 "1513" 1  10    15          .          . . .     . 0     .     .      .     .
      4 2001 1514 "1514" 1   6    14          .          . . .     . 0     .     .      .     .
      4 2001 1541 "1541" 1   1     5          .          . . .     . 0     .     .      .     .
      4 2001 1554 "1554" 1   7     8          .          . . .     . 0     .     .      .     .
      4 2001 1712 "1712" 1   2    15          .          . . .     . 0     .     .      .     .
      4 2001 1722 "1722" 1   2    86          .          . . .     . 0     .     .      .     .
      4 2001 1920 "1920" 1   1    30          .          . . .     . 0     .     .      .     .
      4 2001 2211 "2211" 1   3     .          .          . . .     . 0     .     .      .     .
      4 2001 2212 "2212" 1   .   129          .          . . .     . 0     .     .      .     .
      4 2001 2219 "2219" 1   9    13          .          . . .     . 0     .     .      .     .
      4 2001 2412 "2412" 1   1   298          .          . . .     . 0     .     .      .     .
      4 2001 2520 "2520" 1  44     .          .          . . .     . 0     .     .      .     .
      4 2001 2694 "2694" 1   2    96          .          . . .     . 0     .     .      .     .
      4 2001 3699 "3699" 1  53    25          .          . . .     . 0     .     .      .     .
      4 2001    . "D"    1 153   739          .          . . .    52 0     .     .      .     .
      4 2002  154 "154"  1  57  1398   58716000  3.650e+08 . .   270 0     .     .      .     .
      4 2002  221 "221"  1  33   680   28260000    2000000 . .    77 0     .     .      .     .
      4 2002  242 "242"  1   5   296   11628000          . . .   150 0     .     .      .     .
      4 2002  289 "289"  1  32   344   11248000   11000000 . .     . 0     .     .      .     .
      4 2002 1722 "1722" 1  11   830   38860000   67000000 . .   710 0  1616     .      .     .
      4 2002 2029 "2029" 1  14   723   30780000    9000000 . .     . 0   217     .      .     .
      4 2002 2520 "2520" 1  69  1212   67704000   40000000 . .     . 0   965     .      .     .
      4 2002 2695 "2695" 1   .  2140   68880000          . . .     . 0     .     .      .     .
      4 2002 3699 "3699" 1   6  1389   63444000    2000000 . .   106 0    48     .      .     .
      4 2002    . "D"    1 232  9012  379520000  4.960e+08 . .  1313 0     .     .      .     .
      4 2003  154 "154"  1  68  1986   90564000  1.460e+08 . .   384 0     .     .      .     .
      4 2003  221 "221"  1  34   715   42604000   32000000 . .   125 0     .     .      .     .
      4 2003  242 "242"  1   6   921   41000000   95000000 . .   380 0     .     .      .     .
      4 2003  289 "289"  1  58   997   45468000   35000000 . .     . 0     .     .      .     .
      4 2003 1722 "1722" 1  22  2371  119134000  1.660e+08 . .  1800 0  3404     .      .     .
      4 2003 2029 "2029" 1  10   690   31464000   32000000 . .     . 0   656     .      .     .
      4 2003 2520 "2520" 1  52  1130   51528000   97000000 . .     . 0  1989     .      .     .
      4 2003 2695 "2695" 1  48  1823   83124000   65000000 . .     . 0  1333     .      .     .
      4 2003 3699 "3699" 1  14  2078   95172000   31000000 . .   699 0   636     .      .     .
      4 2003    . "D"    1 312 12711  600058000  6.990e+08 . .  3388 0     .     .      .     .
      4 2004  154 "154"  1  60  2170  104160000  3.580e+08 . .   720 0     .     .      .     .
      4 2004  221 "221"  1  44   990   34320000   73000000 . .   360 0     .     .      .     .
      4 2004  242 "242"  1  11  1170   56160000  2.040e+08 . .   720 0     .     .      .     .
      4 2004  289 "289"  1  68  1200   57600000   83000000 . .     . 0     .     .      .     .
      4 2004 1722 "1722" 1   8  1870   89760000  1.370e+08 . .  1709 0  2863     .      .     .
      4 2004 2029 "2029" 1  18   970   46560000  2.170e+08 . .     . 0  4535     .      .     .
      4 2004 2520 "2520" 1  74  1500   72000000  1.840e+08 . .     . 0  3846     .      .     .
      4 2004 2695 "2695" 1  36  1640   78720000  1.270e+08 . .     . 0  2654     .      .     .
      4 2004 3699 "3699" 1  28  2625  1.260e+08   46000000 . .  1020 0   961     .      .     .
      4 2004    . "D"    1 347 14135  665280000  1.429e+09 . .  4529 0     .     .      .     .
      4 2005  154 "154"  1 152  4376  2.208e+08  1.208e+09 . .   561 0     .     .      .     .
      4 2005  221 "221"  1  47  1380   81552000  1.200e+08 . .   345 0     .     .      .     .
      4 2005  242 "242"  1   7  1270   53928000  1.670e+08 . .   790 0     .     .      .     .
      4 2005  289 "289"  1  54  1190   54936000  2.930e+08 . .     . 0     .     .      .     .
      4 2005 1722 "1722" 1  31  4474  2.256e+08  7.880e+08 . .  3900 0 15921     .      .     .
      4 2005 2029 "2029" 1  43  1500   70560000  3.700e+08 . .     . 0  7476     .      .     .
      4 2005 2520 "2520" 1  45  1600   80680000  6.460e+08 . .     . 0 13052     .      .     .
      4 2005 2695 "2695" 1  65  3330  167832000  1.397e+09 . .     . 0 28225     .      .     .
      4 2005 3699 "3699" 1 123  3584  180636000  1.156e+09 . .  1090 0 23356     .      .     .
      4 2005    . "D"    1 567 22704 1136524000  6.145e+09 . .  6686 0     .     .      .     .
      4 2006  154 "154"  1 199  5278  269616000  1.419e+09 . .   901 0     .     .      .     .
      4 2006  221 "221"  1  56  2180  1.134e+08  1.360e+08 . .   459 0     .     .      .     .
      4 2006  242 "242"  1  10  1800   97200000  1.900e+08 . .  1102 0     .     .      .     .
      4 2006  289 "289"  1  78  2320  125280000  3.540e+08 . .     . 0     .     .      .     .
      4 2006 1722 "1722" 1  49  5808  313632000  9.640e+08 . .  5000 0 19309     .      .     .
      4 2006 2029 "2029" 1  47  1785   91260000  3.820e+08 . .     . 0  7651     .      .     .
      4 2006 2520 "2520" 1  58  1930  104220000  7.160e+08 . .     . 0 14341     .      .     .
      4 2006 2695 "2695" 1  84  4900  2.838e+08  1.670e+09 . .     . 0 33450     .      .     .
      4 2006 3699 "3699" 1 161  6190  334260000  1.202e+09 . .  3692 0 24076     .      .     .
      4 2006    . "D"    1 742 32191 1732668000  7.033e+09 . . 11154 0     .     .      .     .
      4 2007  154 "154"  1 206  5300  305280000  1.804e+09 . .  1237 0     .     .      .     .
      4 2007  221 "221"  1  60  2290  131904000  1.800e+08 . .   983 0     .     .      .     .
      4 2007  242 "242"  1  12  1920  110592000  2.400e+08 . .  1207 0     .     .      .     .
      4 2007  289 "289"  1  86  2570  148042000  3.920e+08 . .     . 0     .     .      .     .
      4 2007 1722 "1722" 1  57  6700  385920000  9.970e+08 . .  5908 0 19955     .      .     .
      4 2007 2029 "2029" 1  50  1600   92160000  3.920e+08 . .     . 0  7846     .      .     .
      4 2007 2520 "2520" 1  61  2500  132480000  8.170e+08 . .     . 0 16352     .      .     .
      4 2007 2695 "2695" 1  93  2400  270720000  1.743e+09 . .     . 0 34887     .      .     .
      4 2007 3699 "3699" 1 183  6327  364428000  1.268e+09 . .  2856 0 25379     .      .     .
      4 2007    . "D"    1 808 31607 1941526000  7.833e+09 . . 12191 0     .     .      .     .
      4 2008  154 "154"  1 212  5431  322376000  1.791e+09 . .   899 0     .     .      .     .
      4 2008  221 "221"  1  63  2290  140148000  1.800e+08 . .  1120 0     .     .      .     .
      4 2008  242 "242"  1  14  2133  130536000  8.171e+08 . .  1309 0     .     .      .     .
      4 2008  289 "289"  1  78  2440  149328000  3.910e+08 . .     . 0     .     .      .     .
      4 2008 1722 "1722" 1  64  6917  423324000  9.150e+08 . .  6010 0 18209     . 149623     .
      4 2008 2029 "2029" 1  45  1520   96672000  3.660e+08 . .     . 0  7284     .      .     .
      4 2008 2520 "2520" 1  58  2030  124236000  8.780e+08 . .     . 0 17473  8803      .     .
      4 2008 2695 "2695" 1 101  5503  326780000  1.762e+09 . .     . 0 35065     .      .     .
      4 2008 3699 "3699" 1 196  7056  431832000 1.4476e+09 . .  3791 0 28808 10599      .     .
      4 2008    . "D"    1 831 35320 2145232000 8.5477e+09 . . 13129 0     .     .      .     .
      4 2009  154 "154"  1 199  5400  343440000 2.2688e+09 . .   870 0     .     .      .     .
      4 2009  172 "172"  1  66  6902  438969000  4.450e+08 . .  6030 0     .     .      .     .
      4 2009  202 "202"  1  37  1497   95202000  3.710e+08 . .     . 0     .     .      .     .
      4 2009  221 "221"  1  63  2290  145644000  1.930e+08 . .  1117 0     .     .      .     .
      4 2009  242 "242"  1  11  2133  135028000  8.778e+08 . .  1325 0     .     .      .     .
      4 2009  269 "269"  1 105  5510  385190000  1.845e+09 . .     . 0     .     .      .     .
      4 2009  289 "289"  1  80  2432  154675000  3.824e+08 . .     . 0     .     .      .     .
      4 2009  369 "369"  1 192  7042  447871000  1.468e+09 . .  3800 0     .     .      .     .
      4 2009 1549 "1549" 1 199  5400  343440000 2.2688e+09 . .   870 0 45083  9302  30299 24086
      4 2009 1722 "1722" 1  66  6902  438969000  4.450e+08 . .  6030 0  8843     .  67860     .
      4 2009 2029 "2029" 1  37  1497   95202000  3.710e+08 . .     . 0  7372     .      .     .
      4 2009 2212 "2212" 1  63  2290  145644000  1.930e+08 . .  1117 0  3835     .      .     .
      end






      Attached Files
      Last edited by Hugo Rocha; 02 Feb 2022, 12:11.

      Comment


      • #4
        Thanks for the code. While I could criticize certain details of the way you have done this, they are not things that would have prevented the data from merging properly. Unfortunately, showing me a data set that is the result of the incomplete merging does not help find the cause of the incomplete merging. I would also need to see the corresponding observations from the master and using data sets to try to figure out why they did not merge with each other. Otherwise put, I think it is far more likely that the problem you are facing comes from errors in the data than from errors in the code. Please post back with example master and using data (separate data sets) that contain observations which should merge with each other but don't.

        Let me answer clearly your specific question about whether it matters which data set is master and which is using: no, it does not matter. In a 1:m or m:1 merge you would need to change that to m:1 and 1:m, respectively. But in a 1:1 merge it makes no difference whatsoever.

        A couple of comments on the code:

        Code:
        duplicates tag country isic year, gen(flag)// country year isic do not uniquely identify observations
        keep if flag==0 // (Sample gets reduced to 165,760 observations other flag identifiers drop observations substantially)
        Are you sure this is what you want to do? You are identifying any combinations of country year and isic for which there are multiple observations. I want to emphasize that you are dropping all of those observations. You are not even keeping one from each combination. So those combinations are completely purged from the data at that point (and any observations in the using data set that have that same combination will go unmatched). There is nothing illegal about this--but be sure it is what you want. The more usual situation when you have multiple observations per combination is to identify one observation from each combination, keep that one and drop the others. If that is what you thought you were doing, be advised that it is not.

        Code:
        destring isic, replace force
        Two things about this. In any situation, the use of the -force- option is risky. There may be values of ISIC codes that are not just obviously unmatchable with a numeric ISIC code. Sure "D" is hopeless. But what if you have some ISIC codes with typos like "12 75" or "12&5", either of which is a plausible typo for "1275"? You shouldn't just be purging all of these You should be identifying all the non-numeric codes with
        Code:
        tab isic if missing(real(isic))
        and verifying that all of them are truly unsalvagable before blindly sweeping them away. Now, perhaps you did that somewhere previously--in which case, you are fine. But the indiscriminate use of the -force- option is very unwise.

        The other thing about this is that, particularly when you are preparing to merge two data sets, it's generally a good idea to harmonize conflicting data types in favor of the one that has the greatest flexibility in terms of what it can hold. So I would have done this by converting the numeric ISICs in the using data set to strings, rather than what you did here. I think in your particular situation, there's no real harm done since, presumably (though see what I said above) the non-numeric ISIC codes are, if not entirely invalid, ineligible to merge with the using data anyway. But in other situations, violating this principle can cause problems. For example, if one data type is an int and the other is a float, converting the floats to ints could discard information and result in false matches, or loss of matches.

        Comment


        • #5
          Hi Professor Schechter!

          You raised really good points. I do not know if this is a proper example. But I am attaching pictures from a data point (country 4, isic 2899 year 2009) that is present in both datasets as an example but once I merge, it only says "master only(2)". In other words, it did not merge. My apologies if that's not sufficient, I can, for sure, try to provide more information. I tried to provide the three scenarios (master, using, and unsuccessful merge)

          Regarding your firs comment. Yes, I think that's where part of my problem (if not all) lies . I just want to drop one observation per tag, definitely not all. I was trying to keep the observations where you find the least amount of missing data. Hence, more useful to provide information on my estimations. Any advice on how to proceed? Checking observation by observation is just intractable given the huge sample size (in the master dataset more than 400,000 observations plus also more than 400,00 observations in the using data).

          Regarding your second comment, this is definitely something I have to keep in mind when I am using other datasets. For this particular case, it is not an issue since there are no typos in isic. Thank though, this is also very important

          Regarding your third comment, this is definitely important for the same reason of #2. Luckily for these datasets, I do not have that problem. I can have missing variables but not missing industries (In fact both panels are unbalanced, which makes the process a little harder).

          Thank you so much for your comments,

          Hugo

          Attached Files

          Comment


          • #6
            OK. I think that my point about handling the duplicate observations hits on the problem with your merge. I notice that in the master data, that combination of country 4, year 2009, isic 2899 is among those with duplicate observations. Your code then drops all of those. So by the time you get to the merge, there is no observation with that combination in the master data. It is instantiated in the using data, though, so in the merged result you see it there as using only. (You say it says "master only(2)"--but that is not possible. After -merge-, master only is 1, not 2. 2 is using only. Also in the result file that you show, the number 7599 is clearly from the using file.

            So, here's how I would approach this:
            Code:
            //  PREPARE THE USING DATA BY CONVERTING ISIC TO A STRING
            use using_data, clear
            tostring isic, replace
            tempfile holding
            save `holding'
            
            //  PREPARE THE MASTER DATA BY REDUCING TO ONE OBSERVATION
            //  PER COUNTRY#YEAR#ISIC, RETAINING THE ONE WITH THE LEAST
            //  AMOUNT OF MISSING DATA
            use master_data, clear
            egen miss_values = rowmiss(_all)
            by country year isic (miss_values), sort: keep if _n == 1
            drop miss_values
            
            //  MERGE THEM
            merge 1:1 country year isic using `holding'
            I think that will do the trick.

            Comment


            • #7
              Thanks, Professor Schechter! I will try your way now!

              Comment


              • #8
                Your adjustment actually increases the sample merged substantially.
                Attached Files

                Comment

                Working...
                X