Merging three datasets

Guest
#1

Merging three datasets

22 Apr 2017, 00:56

Hello,
I have my master panel dataset with 14 variables and want to merge Dataset B (10 Variables) and Dataset C (1 Variable).

I have tried to merge with:
Merge 1:1 id fyear using DatasetB
Drop _merge
Merge 1:1 id fyear using DatasetC

I have prepared all datasets, so that each observation is uniquely identified by id and fyear ( deleted duplicates) and declared panel datasets and sorted id fyear.

However, I get a large number of unmatched observations. Am I doing anything wrong?

Dataset A:
obs: 31,640
vars: 14
size: 3,417,120
------------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------------------------------------------------------------
fyear double %6.0g Data Year - Fiscal
at double %18.0g Assets - Total
csho double %18.0g Common Shares Outstanding
dlc double %18.0g Debt in Current Liabilities - Total
dltt double %18.0g Long-Term Debt - Total
dvc double %18.0g Dividends Common/Ordinary
ib double %18.0g Income Before Extraordinary Items
ni double %18.0g Net Income (Loss)
oiadp double %18.0g Operating Income After Depreciation
prstkc double %18.0g Purchase of Common and Preferred Stock
sale double %18.0g Sales/Turnover (Net)
xsga double %18.0g Selling, General and Administrative Expense
prcc_f double %18.0g Price Close - Annual - Fiscal
id long %8.0g id Standard and Poor's Identifier
------------------------------------------------------------------------------------------------------------------------------
Sorted by: id fyear

Dataset B:

Contains data from /Users/Guest/Desktop/Stata/Agencypart2prepared.dta
obs: 34,842
vars: 10 20 Apr 2017 15:43
size: 2,369,256
----------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------
datadate long %d Data Date
fyear double %6.0g Data Year - Fiscal
dvt double %18.0g Dividends - Total
oibdp double %18.0g Operating Income Before Depreciation
txt double %18.0g Income Taxes - Total
xad double %18.0g Advertising Expense
xint double %18.0g Interest and Related Expense - Total
xrd double %18.0g Research and Development Expense
id long %8.0g id Standard a

----------+--------------------------------------------------------
datadate | 34842 19628.96 692.1245 17531 20819
fyear | 34840 2012.78 1.893907 2010 2016
dvt | 29645 150.1909 803.536 -255.458 67643.8
oibdp | 29616 890.8212 3714.066 -21913 81730
txt | 30924 132.2633 871.8798 -34831 31051
-------------+--------------------------------------------------------
xad | 12427 107.2393 460.2595 0 9729
xint | 26051 112.075 498.2739 -.664 18562.28
xrd | 15855 178.3253 820.7956 -.202 16085
id | 34842 15752.67 7018.648 5 24304

DatasetC
obs: 35,981
vars: 5 21 Apr 2017 19:56
size: 1,007,468
----------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------
datadate long %d Data Date
fyear double %6.0g Data Year - Fiscal
re double %18.0g Retained Earnings
id long %8.0g id Standard and Poor's Identifier
dup float %9.0g

Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
datadate | 35981 19666.62 712.2195 17531 20819
fyear | 35979 2012.882 1.947104 2010 2016
re | 31033 1613.649 10959.86 -117956.9 389427
id | 35981 15782.92 7018.259 5 24361
dup | 35981 .0094494 .0967492 0

My results :

. merge 1:1 id fyear using SetB.dta
(label id already defined)

Result # of obs.
-----------------------------------------
not matched 59,138
from master 27,968 (_merge==1)
from using 31,170 (_merge==2)

matched 3,672 (_merge==3)
-----------------------------------------

. drop _merge

. merge 1:1 id fyear using SetC.dta
(label id already defined)

Result # of obs.
-----------------------------------------
not matched 60,315
from master 43,572 (_merge==1)
from using 16,743 (_merge==2)

matched 19,238 (_merge==3)
-----------------------------------------

Thank you for any help.

Last edited by sladmin; 11 May 2017, 08:46. Reason: Anonymize poster
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29964
#2

22 Apr 2017, 10:53

However, I get a large number of unmatched observations. Am I doing anything wrong?

There is no way to really answer that question. Your commands appear to be correct for merging the three data sets on the assumption that in each one the observations are uniquely identified by the combination of id and fyear. There is nothing wrong with that part of things.

So Stata is telling you that there isn't all that much overlap of id-fyear combinations in the three data sets. The question is why. It appears you expect that there should be substantial overlap; you expected not to see many _merge == 1 or 2. Where does that expectation come from? If the nature and origin of these data sets is such that your expectation should be true, then you have identified problems with your data sets. There are a few things you can troubleshoot easily:

1. id's can be tricky to work with. If they were social security numbers, for example, there might be inconsistent coding of 123-45-6789 vs 123456789 which could cause mismatches due to different formatting of the "same" number. The solution would then be to clean each of the data sets to impose a uniform format on the ids in every observation of all three data sets. Similarly some financial ID coding systems offer versions with different numbers of digits; if these have been mixed together, they will not match properly.

2. Though it is less likely, failure to match fiscal year could arise if there is inconsistency in recording them as two digits or four digits.

If you do not find obvious difficulties like this, then it is possible that errors were made in the creation of data sets A, B, and C so that their contents are not as advertised. If you created them yourself, then you will need to revisit that code and fix it. If not, you need to take it up with the people who gave them to you.
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#3

22 Apr 2017, 11:34

Guest, inspecting non-matched cases usually hints at the problem. So, list if _merge==1 or list if _merge ==2 and try to identify what the problem is. As Clyde has mentioned already there is no obvious problem with your commands. Best, Sergiy

Last edited by sladmin; 11 May 2017, 08:46. Reason: Anonymize poster
Comment
Guest
#4

23 Apr 2017, 01:35

Thank you Sergiy and Clyde ! They are no social security numbers. Just firm id,so the formatting should be ok. Also the fiscal years are recorded consistently. They initially were in a string variable, so to be able to declare a panel dataset, i did :

encode gvkey, gen(id) ( did this for all datasets)

could something have gone wrong with this command ?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29964
#5

23 Apr 2017, 09:36

Yes, that is almost certainly a serious mistake.

-encode- takes a string variable, gvkey in this case, and it creates a numeric encoding starting with 1. The alphabetically first gvkey in the data set is coded as 1. The alphabetically second gvkey is coded as 2, etc. If the three data sets do not have exactly the same set of gvkey's, then the coding is different. When you do the -merge-, Stata merges them on these numeric 1, 2, 3,... codes, not on the gvkey itself.

When working with gvkeys, it is often fine to just leave them as string variables. If you need a numeric version of it, say to use as a panelvar in an -xtset- command, you can use -encode- to do that, but you should not do that until you have already done all the -merge-ing and -append-ing. Using -encode- in separate data sets and then putting them together almost always ends badly.
Comment

Announcement

Merging three datasets

Comment

Comment

Comment

Comment