Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging three datasets

    Hello,
    I have my master panel dataset with 14 variables and want to merge Dataset B (10 Variables) and Dataset C (1 Variable).

    I have tried to merge with:
    Merge 1:1 id fyear using DatasetB
    Drop _merge
    Merge 1:1 id fyear using DatasetC

    I have prepared all datasets, so that each observation is uniquely identified by id and fyear ( deleted duplicates) and declared panel datasets and sorted id fyear.

    However, I get a large number of unmatched observations. Am I doing anything wrong?

    Dataset A:
    obs: 31,640
    vars: 14
    size: 3,417,120
    ------------------------------------------------------------------------------------------------------------------------------
    storage display value
    variable name type format label variable label
    ------------------------------------------------------------------------------------------------------------------------------
    fyear double %6.0g Data Year - Fiscal
    at double %18.0g Assets - Total
    csho double %18.0g Common Shares Outstanding
    dlc double %18.0g Debt in Current Liabilities - Total
    dltt double %18.0g Long-Term Debt - Total
    dvc double %18.0g Dividends Common/Ordinary
    ib double %18.0g Income Before Extraordinary Items
    ni double %18.0g Net Income (Loss)
    oiadp double %18.0g Operating Income After Depreciation
    prstkc double %18.0g Purchase of Common and Preferred Stock
    sale double %18.0g Sales/Turnover (Net)
    xsga double %18.0g Selling, General and Administrative Expense
    prcc_f double %18.0g Price Close - Annual - Fiscal
    id long %8.0g id Standard and Poor's Identifier
    ------------------------------------------------------------------------------------------------------------------------------
    Sorted by: id fyear

    Dataset B:

    Contains data from /Users/Guest/Desktop/Stata/Agencypart2prepared.dta
    obs: 34,842
    vars: 10 20 Apr 2017 15:43
    size: 2,369,256
    ----------------------------------------------------------------------------------------
    storage display value
    variable name type format label variable label
    ----------------------------------------------------------------------------------------
    datadate long %d Data Date
    fyear double %6.0g Data Year - Fiscal
    dvt double %18.0g Dividends - Total
    oibdp double %18.0g Operating Income Before Depreciation
    txt double %18.0g Income Taxes - Total
    xad double %18.0g Advertising Expense
    xint double %18.0g Interest and Related Expense - Total
    xrd double %18.0g Research and Development Expense
    id long %8.0g id Standard a

    ----------+--------------------------------------------------------
    datadate | 34842 19628.96 692.1245 17531 20819
    fyear | 34840 2012.78 1.893907 2010 2016
    dvt | 29645 150.1909 803.536 -255.458 67643.8
    oibdp | 29616 890.8212 3714.066 -21913 81730
    txt | 30924 132.2633 871.8798 -34831 31051
    -------------+--------------------------------------------------------
    xad | 12427 107.2393 460.2595 0 9729
    xint | 26051 112.075 498.2739 -.664 18562.28
    xrd | 15855 178.3253 820.7956 -.202 16085
    id | 34842 15752.67 7018.648 5 24304


    DatasetC
    obs: 35,981
    vars: 5 21 Apr 2017 19:56
    size: 1,007,468
    ----------------------------------------------------------------------------------------
    storage display value
    variable name type format label variable label
    ----------------------------------------------------------------------------------------
    datadate long %d Data Date
    fyear double %6.0g Data Year - Fiscal
    re double %18.0g Retained Earnings
    id long %8.0g id Standard and Poor's Identifier
    dup float %9.0g


    Variable | Obs Mean Std. Dev. Min Max
    -------------+--------------------------------------------------------
    datadate | 35981 19666.62 712.2195 17531 20819
    fyear | 35979 2012.882 1.947104 2010 2016
    re | 31033 1613.649 10959.86 -117956.9 389427
    id | 35981 15782.92 7018.259 5 24361
    dup | 35981 .0094494 .0967492 0

    My results :

    . merge 1:1 id fyear using SetB.dta
    (label id already defined)

    Result # of obs.
    -----------------------------------------
    not matched 59,138
    from master 27,968 (_merge==1)
    from using 31,170 (_merge==2)

    matched 3,672 (_merge==3)
    -----------------------------------------

    . drop _merge

    . merge 1:1 id fyear using SetC.dta
    (label id already defined)

    Result # of obs.
    -----------------------------------------
    not matched 60,315
    from master 43,572 (_merge==1)
    from using 16,743 (_merge==2)

    matched 19,238 (_merge==3)
    -----------------------------------------


    Thank you for any help.
    Last edited by sladmin; 11 May 2017, 08:46. Reason: Anonymize poster

  • #2
    However, I get a large number of unmatched observations. Am I doing anything wrong?
    There is no way to really answer that question. Your commands appear to be correct for merging the three data sets on the assumption that in each one the observations are uniquely identified by the combination of id and fyear. There is nothing wrong with that part of things.

    So Stata is telling you that there isn't all that much overlap of id-fyear combinations in the three data sets. The question is why. It appears you expect that there should be substantial overlap; you expected not to see many _merge == 1 or 2. Where does that expectation come from? If the nature and origin of these data sets is such that your expectation should be true, then you have identified problems with your data sets. There are a few things you can troubleshoot easily:

    1. id's can be tricky to work with. If they were social security numbers, for example, there might be inconsistent coding of 123-45-6789 vs 123456789 which could cause mismatches due to different formatting of the "same" number. The solution would then be to clean each of the data sets to impose a uniform format on the ids in every observation of all three data sets. Similarly some financial ID coding systems offer versions with different numbers of digits; if these have been mixed together, they will not match properly.

    2. Though it is less likely, failure to match fiscal year could arise if there is inconsistency in recording them as two digits or four digits.

    If you do not find obvious difficulties like this, then it is possible that errors were made in the creation of data sets A, B, and C so that their contents are not as advertised. If you created them yourself, then you will need to revisit that code and fix it. If not, you need to take it up with the people who gave them to you.

    Comment


    • #3
      Guest, inspecting non-matched cases usually hints at the problem. So, list if _merge==1 or list if _merge ==2 and try to identify what the problem is. As Clyde has mentioned already there is no obvious problem with your commands. Best, Sergiy
      Last edited by sladmin; 11 May 2017, 08:46. Reason: Anonymize poster

      Comment


      • #4
        Thank you Sergiy and Clyde ! They are no social security numbers. Just firm id,so the formatting should be ok. Also the fiscal years are recorded consistently. They initially were in a string variable, so to be able to declare a panel dataset, i did :

        encode gvkey, gen(id) ( did this for all datasets)

        could something have gone wrong with this command ?


        Comment


        • #5
          Yes, that is almost certainly a serious mistake.

          -encode- takes a string variable, gvkey in this case, and it creates a numeric encoding starting with 1. The alphabetically first gvkey in the data set is coded as 1. The alphabetically second gvkey is coded as 2, etc. If the three data sets do not have exactly the same set of gvkey's, then the coding is different. When you do the -merge-, Stata merges them on these numeric 1, 2, 3,... codes, not on the gvkey itself.

          When working with gvkeys, it is often fine to just leave them as string variables. If you need a numeric version of it, say to use as a panelvar in an -xtset- command, you can use -encode- to do that, but you should not do that until you have already done all the -merge-ing and -append-ing. Using -encode- in separate data sets and then putting them together almost always ends badly.

          Comment

          Working...
          X