merging two panel datasets on name and year using matchit or reclink

Fleur Ponne

Join Date: Apr 2018
Posts: 6

merging two panel datasets on name and year using matchit or reclink

11 Apr 2018, 05:23

Hi all

I am trying to merge two panel data sets and therefore I am wondering in what way I could best use either reclink or matchit to merge them?

The first dataset contains variables/information on assurance services and the second one contains financial data.
I need to merge the data on company name and year. The two datasets differ with regards to the "name" variable but I want to merge and look into the companies that occur in both datasets. It is however so that the names of the companies are not written completely the same in both datasets (e.g. "ABN AMRO holding" in first set and "abn amro holding NV" in the second)For this variable I thus need a fuzzy match. The year variable, on the other hand, should be merged 1:1.

A final important remark is that both datasets do not contain the same amount of observations. It could be that a company is found in the first dataset but not in the second and vice versa.

Here’s an example of what the data looks like:

Source A

name
AUDI AG
AUDI AG
AUDI AG
AVINOR
AVINOR
AVINOR
AVINOR
AWISTA GmbH
AWISTA GmbH
AXA
AXA
AXA
AXA
AXA
AXA

year
2013
2015
2017
2014
2015
2016
2017
2015
2016
2012
2013
2014
2015
2016
2017

Source B

Name
AUDI AG
AUDI AG
AUDI AG
AUDI AG
AUDI AG
AUDI AG
AUDIKA GROUPE
AUDIKA GROUPE
AUDIKA GROUPE
AUDINATE PTY LTD
AUDINATE PTY LTD
AUDINATE PTY LTD
AUDINATE PTY LTD
AUDIO PIXELS HOLDINGS LTD
AUDIO PIXELS HOLDINGS LTD

year
2012
2013
2014
2015
2016
2017
2012
2013
2014
2014
2015
2016
2017
2012
2013

I hope anyone has a good tip for me!

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 29963
#2

11 Apr 2018, 10:26

The need to match on a time variable as well as the name somewhat complicates matters. I think I would do this by first creating a crosswalk between the names in dataset A and dataset B, and then subsequently do a merge. So something like this:

Code:

use dataset_A keep name duplicates drop gen obs_no = _n preserve use dataset_B, clear keep name rename name name_b duplicates drop gen obs_no = _n tempfile b save `b' restore matchit obs_no name using `b', idusing(obs_no) txtusing(name_b)

At this point you will have a data set in memory containing all of the names from both data sets and a score for how well they match. You then need to inspect this and decide what threshold on the similarity score gives you reasonable sensitivity and specificity for matching. Then just keep those observations that are above that score and save the data set in a crosswalk file.

Now you can go back and use this to merge the data sets:

Code:

use dataset_A merge 1:m name using crosswalk_file, nogenerate rename name name_a rename name_b name merge m:1 name year using dataset_B rename name name_b

Notes:

1. None of this is tested, so you may have to work out some bugs. This is the gist of it.

2. Your data examples are posted in some way that I have not encountered previously, so I was unable to find a way to import it to Stata short of typing it in manually, which I did not do. Please read the Forum FAQ for information about the helpful ways to show example data, Stata code, and Stata outputs here. In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

3. This code assumes that your panel data sets are properly structured and, specifically, that each firm has only a single observation in any given year.
Comment

Fleur Ponne

Join Date: Apr 2018
Posts: 6

11 Apr 2018, 14:23

Dear Clyde, thank you very much. I am following up your tips, however, I get an error at the following point:

Code:

. save b, replace
file b.dta saved

. restore

. matchit obs_no name using "b.dta", idusing(obs_no) txtusing(name_b)
Matching current dataset with b.dta
 
(!) Unsaved changes will be destroyed after running matching procedure.
    (note: use OVERRIDE option to bypass warning)
no; data in memory would be lost
r(4);

How do I use the override option to work around this error?

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29963
#4

11 Apr 2018, 15:03

Sorry for not anticipating that. You just add it to the command, after the -txtusing()- option.

So:

Code:

matchit obs_no name using "b.dta", idusing(obs_no) txtusing(name_b) override

You did not identify yourself as a Stata beginner, and from your post I imagined that you have some experience using Stata. So I am surprised that you did not know how to add an option to the command. You are embarking on a project that is fairly advanced. I think you are going to encounter difficulties as you proceed if you do not stop and familiarize yourself with the fundamentals of Stata. From Stata's Help menu select PDF documentation. Read the Getting Started [GS] and User's Guide [U] volumes. It's a lengthy read, but it will expose you to the basics of Stata command syntax, Stata's overall data model and approach to management and analysis, and the commands that every Stata programmer needs to use on a regular basis. You won't remember everything, but you will know enough to be able understand code and make simple changes to it. You will also, in most situations, be able to recognize what commands are likely to be needed for a given task, and you can then get the details you don't remember from the help files or the PDF documentation.

I'll also point out that had you posted your example data with -dataex-, as is requested in the Forum FAQ, in #1, I would have been able to do a test run with my code and I would have caught my error before posting my response. It really does make sense to show example data. I know you intended to do that. But you really need to use -dataex- so that good intentions lead to useful actions.
Comment

Jeff Tree

Join Date: May 2019
Posts: 33

28 May 2020, 12:11

Hi all,

I'm working through a similar problem as the original post, but with a slightly additional layer of complexity of merging on whether a year falls within a particular range. More specifically, I wanted to add a variable to an existing dataset if there was a matching string and the year falls within a specified range. Minimal working example below. I wanted to post here for posterity in case others had a similar question. That said, although it appears that I achieved the desired output can I simplify my coding or perhaps use a more efficient method? Thanks!

Code:

clear
input byte id1 str14 name1 int year int variable
1 "Doe, Jon"           1980            1
2 "Will Green"         1988              0
3 "Jone, Emily"        1985             0
4 "Sara Brown"         1980            1
5 "Daniel Kennedy"     1992             0
6 "Bobby Quest"         1995            1        
end
save file1.dta, replace

clear
input byte id2 str13 name2 int startyear    int endyear
7 "John Doe"            1980    1982
8 "William Green"        1989    1991            
9 "Emily Jones"         1980    1989
10 "Sarah Brown"          1980    1980
11 "Dan Kennedy"           1990    1991
12 "Sandra Lee"            1992    1998
end
save file2.dta, replace

matchit id2 name2 using file1.dta, idusing(id1) txtusing(name1) threshold(.4)
merge m:m id1 using file1.dta, nogenerate
save crosswalk.dta, replace

clear
use file2.dta
isid id2 , sort
rangejoin year startyear endyear using "crosswalk.dta", by(id2)
by id2 startyear endyear, sort: keep if _n == 1
drop name2_U id1 name1 year similscore
list

Comment

Jae Li

Join Date: May 2017

Posts: 184
#6

25 May 2024, 07:36

@Clyde Schechter Hi Clyde, thank you for sharing your thoughts in this thread. Can I ask a quick question? In #2 post, there is no year variable in either dataset_A or crosswalk dataset. How to do the m:1 merge with dataset_B?
Comment

Announcement

merging two panel datasets on name and year using matchit or reclink

Comment

Comment

Comment

Comment

Comment