Problems with merge command

albert master

Join Date: Mar 2022

Posts: 23
#1

Problems with merge command

01 Apr 2022, 15:31

Hi!

I have a cross-sectional time-series dataset and would like to add some firm-specific information from another dataset.

The variables I intend to use for merging is a company identifier (ds_code) and year. Unfortunately I have no idea why I receive 0 matched observations I checked for duplicates and the format should be fitting as well?

. merge m:1 ds_code year using "C:\Users\Desktop\Stata - Analysis\All_Financials.dta"
(note: variable ds_code was str6, now str7 to accommodate using data's values)

Result # of obs.
-----------------------------------------
not matched 1,992,565
from master 3,998 (_merge==1)
from using 1,988,567 (_merge==2)

matched 0 (_merge==3)
-----------------------------------------

Can anyone give me an advice from outside what might be the cause for this problem?

Thanks a lot!
Tags: None
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#2

01 Apr 2022, 15:37

Your ds_code variable doesn't have anything in common with the one you're merging it with in the other dataset. If I have one unit whose ID is 42, and another dataset who's ID is 43, I can't merge on those because the numbers must be the exact same.

EDIT: As per Clyde's advice, I didn't think to comment on the fact that the id is a string. Generally, we shouldn't want to merge on these, but if we must, they must be spelled the same, down to capitalization and spelling.

Recently, I was making a dataset of Brazilian districts (well, districts in Sao Paulo and Rio) and differences in the spelling of the words mattered. Sao is not the same as São, Stata takes that kind of thing seriously, and rightly so.

So, if this were my problem, I would check and see some IDs that I know should be in the same dataset (let's say, "Wal Mart" and "Wal-Mart") and see what the real issue is here.

Last edited by Jared Greathouse; 01 Apr 2022, 15:43.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29807
#3

01 Apr 2022, 15:38

Without showing example data from the two data sets, using the -dataex- command, you make it impossible to give specific advice. The problem is not with the code--it is that the data and the code are not suited to each other in some way. But without seeing the data, it's anybody's guess what way that might be.

Speaking generically, the commonest cause for this is that the variable ds_code is coded differently in the two data sets. Perhaps one file uses upper case letters and the other uses lower case. Perhaps one file has more digits of precision than the other (it is telling that you were warned by Stata that ds_code is 6 characters long in one of the files but 7 in the other: what's that extra character doing?). Perhaps the extra character is blank padding in the second file that doesn't appear in the first--this is something that would be difficult to spot with your eye. Perhaps one data set breaks the code with a hyphen somewhere in the middle and the other doesn't. I could go on. The point is that -merge- will only pair up observations if the ds_code values match exactly. Evidently, in your data sets, they don't. But knowing the difference between them requires looking at the data.

A less likely possibility, but I have seen it, is that unbeknownst to you until now, you are actually working with data sets that are about completely different companies.

Added: Crossed with #2, whose suggestion is in line with what I consider the less likely possibility.

Last edited by Clyde Schechter; 01 Apr 2022, 15:41.
1 like
Comment
albert master

Join Date: Mar 2022

Posts: 23
#4

01 Apr 2022, 15:56

Originally posted by Jared Greathouse View Post

Your ds_code variable doesn't have anything in common with the one you're merging it with in the other dataset. If I have one unit whose ID is 42, and another dataset who's ID is 43, I can't merge on those because the numbers must be the exact same.

EDIT: As per Clyde's advice, I didn't think to comment on the fact that the id is a string. Generally, we shouldn't want to merge on these, but if we must, they must be spelled the same, down to capitalization and spelling.

Recently, I was making a dataset of Brazilian districts (well, districts in Sao Paulo and Rio) and differences in the spelling of the words mattered. Sao is not the same as São, Stata takes that kind of thing seriously, and rightly so.

So, if this were my problem, I would check and see some IDs that I know should be in the same dataset (let's say, "Wal Mart" and "Wal-Mart") and see what the real issue is here.

Thanks for your answer.
Prior ths merge I merged both datasets in the same way like above and it worked perfectly.
So I just added more company oberservation and like to merge them with financial informations.
Because of that the ds_code in combination with the year should be the right identifier?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29807
#5

01 Apr 2022, 16:11

Because of that the ds_code in combination with the year should be the right identifier?

No. That will not improve things at all. If you cannot find exact matches on just ds_code, there is no possibility of finding exact matches on both ds_code and year.

If you were getting a "ds_code does not uniquely identify observations in the using data" error message, then you might solve that by extending the merge key to include year. But adding still more requirements for matching cannot possibly help here.

And you still have shown nothing of the data, so still no specific advice can be given.

Prior ths merge I merged both datasets in the same way like above and it worked perfectly.

I doubt this is actually the case. Something has changed or you would get the same result as before. Perhaps you changed one of the data sets in some subtle way whose implications you did not foresee. Or perhaps somebody else changed it since you last used it.
Comment

albert master

Join Date: Mar 2022
Posts: 23

02 Apr 2022, 02:06

Hi, sorry for that I hope the data below give you the neccessary overview:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str7 ds_code float year
"130062"  7
"130062"  9
"130062" 10
"130062" 12
"130062" 14
"130062"  8
"130062" 11
"130062" 13
"130298"  9
"130298"  5
"130298"  8
"130298"  1
"130298" 12
"130298"  3
"130298"  7
"130298" 13
"130298" 14
"130298"  2
"130298"  6
"130298" 10
"130298" 11
"130298"  4
"130375" 12
"130375"  8
"130375"  9
"130375" 11
"130375" 14
"130375"  7
"130375" 10
"130375" 13
"130775"  8
"130775"  7
"130775"  2
"130775"  1
"130775" 14
"130775" 11
"130775"  5
"130775" 13
"130775"  4
"130775"  6
"130775" 12
"130775" 10
"130775"  9
"130775"  3
"13466Q" 12
"13466Q"  9
"13466Q"  7
"13466Q" 13
"13466Q" 14
"13466Q"  8
"13466Q" 10
"13466Q" 11
"13471D"  8
"13471D"  7
"13471D"  9
"13471D" 14
"13471D" 12
"13471D" 10
"13471D" 13
"13471D" 11
"13906D"  7
"13906D" 10
"13906D"  1
"13906D" 11
"13906D"  6
"13906D"  3
"13906D" 13
"13906D"  4
"13906D"  8
"13906D"  5
"13906D" 14
"13906D"  2
"13906D" 12
"13906D"  9
"13950L" 12
"13950L"  7
"13950L"  8
"13950L" 11
"13950L" 13
"13950L"  4
"13950L"  3
"13950L"  5
"13950L" 14
"13950L"  1
"13950L" 10
"13950L"  2
"13950L"  6
"13950L"  9
"13951F" 13
"13951F"  8
"13951F"  7
"13951F" 10
"13951F"  9
"13951F" 11
"13951F" 14
"13951F" 12
"13967C"  1
"13967C" 13
"13967C"  8
"13967C"  5
end
label values year year1
label def year1 1 "2004", modify
label def year1 2 "2005", modify
label def year1 3 "2006", modify
label def year1 4 "2007", modify
label def year1 5 "2008", modify
label def year1 6 "2009", modify
label def year1 7 "2010", modify
label def year1 8 "2011", modify
label def year1 9 "2012", modify
label def year1 10 "2013", modify
label def year1 11 "2014", modify
label def year1 12 "2015", modify
label def year1 13 "2016", modify
label def year1 14 "2017", modify

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str7 ds_code float year
"130062" 1999
"130062" 2000
"130062" 2001
"130062" 2002
"130062" 2003
"130062" 2004
"130062" 2005
"130062" 2006
"130062" 2007
"130062" 2008
"130062" 2009
"130062" 2010
"130062" 2011
"130062" 2012
"130062" 2013
"130062" 2014
"130062" 2015
"130062" 2016
"130062" 2017
"130062" 2018
"130063" 2015
"130063" 2016
"130063" 2017
"130065" 1999
"130065" 2000
"130065" 2001
"130065" 2002
"130065" 2003
"130065" 2004
"130065" 2005
"130065" 2006
"130065" 2007
"130065" 2008
"130065" 2009
"130065" 2010
"130065" 2011
"130065" 2012
"130065" 2013
"130065" 2014
"130065" 2015
"130065" 2016
"130065" 2017
"130066" 2015
"130066" 2016
"130066" 2017
"130067" 2015
"130067" 2016
"130067" 2017
"130072" 2015
"130072" 2016
"130072" 2017
"130073" 2015
"130073" 2016
"130073" 2017
"130079" 1999
"130079" 2000
"130079" 2001
"130079" 2002
"130079" 2003
"130079" 2004
"130079" 2005
"130079" 2006
"130079" 2007
"130079" 2008
"130079" 2009
"130079" 2010
"130079" 2011
"130079" 2012
"130079" 2013
"130079" 2014
"130079" 2015
"130079" 2016
"130079" 2017
"130083" 2015
"130083" 2016
"130083" 2017
"130086" 1999
"130086" 2000
"130086" 2001
"130086" 2002
"130086" 2003
"130086" 2004
"130086" 2005
"130086" 2006
"130086" 2007
"130086" 2008
"130086" 2009
"130086" 2010
"130086" 2011
"130086" 2012
"130086" 2013
"130086" 2014
"130086" 2015
"130086" 2016
"130086" 2017
"130088" 1999
"130088" 2000
"130088" 2001
"130088" 2002
"130088" 2003
end

Actually you might be right, that the format of the variable "year" in my master is different compared to my using datasample?
Thanks a lot for your help!

Last edited by albert master; 02 Apr 2022, 02:20.

Comment

Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#7

02 Apr 2022, 06:55

For dataex 1, why not just have the year as a number? No need to label it. If I were you, I would just add year+2003 and that gives us the real years we're interested in, no value labels needed.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

02 Apr 2022, 10:00

Elaborating on Jared's answer, the problem with your first dataset is that it originally had the year as a string variable, and you used the encode command to create a numeric variable.

The encode command is designed for assigning numerical codes to non-numeric strings like "France", "Germany", "United States". The output of help encode instructs us

Do not use encode if varname contains numbers that merely happen to be stored as strings; instead, use generate newvar = real(varname) or destring; see real() or [D] destring.

You should review your work and see if you have made this mistake elsewhere. Jared's workaround to correct the data will fix this case, although my preference would be to correct the program with the error and rerun it, and any subsequent programs. That way if you were use that program on new data, or as the basis of another program to process different data, you will not repeat the mistake.
1 like
Comment

Announcement

Problems with merge command

Comment

Comment

Comment

Comment

Comment

Comment

Comment