Record linkage with quantitative variables

paulvonhippel

Join Date: Apr 2014

Posts: 496
#1

Record linkage with quantitative variables

04 Aug 2020, 22:49

I have two datasets describing the same 80 schools.

Some variables are similar -- for example, both lists include the schools' enrollment and % African American -- but there is no common ID or other variable that matches the two lists perfectly.

Is there a command that will find the best way to link the two datasets?

If not, I've done much of the work. I've made an 80x80=6400 row dataset that gives the Mahalanobis distance between every possible combination of a row from the first dataset with a row from the second. Now how do I optimally sort and deduplicate 6400 rows to get the best 80 matches, with no duplicates from either list?
Tags: None
Joro Kolev

Join Date: Aug 2018

Posts: 3047
#2

05 Aug 2020, 01:46

Consider sorting on the variables that you think that (almost?) identify observations in your datasets, and then matching by observations number.
Comment
wbuchanan

Join Date: Mar 2014

Posts: 1361
#3

05 Aug 2020, 08:14

paulvonhippel
Are the data based on schools in the US? Just wondering if using any publicly available data might be helpful/useful in that instance.
Comment
paulvonhippel

Join Date: Apr 2014

Posts: 496
#4

05 Aug 2020, 10:03

Yes, it's a small sample of US schools which I am linking to the Common Core of Data. The linking variables match imperfectly, hence my question.
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4439
#5

05 Aug 2020, 11:25

I clearly don't completely understand your data, but you might want to take a look at two user-written commands: -reclink- and -matchit-; use -search- to find and download
Comment
paulvonhippel

Join Date: Apr 2014

Posts: 496
#6

05 Aug 2020, 12:25

Thanks for the suggestions, Rich Goldstein. I believe that both -reclink- and -matchit- are for string variables. The problem is that variables that are similar when construed as numbers are not necessarily similar when construed as strings. Consider 199 and 200 -- they are similar numbers but not similar strings.

I would love to find a record linkage command that works for numeric variables. Is there one?
Comment

Mike Lacy

Join Date: Apr 2014
Posts: 2404

05 Aug 2020, 14:39

Here's a brute force approach using Mata, which I have not particularly checked. There are probably more elegant ways to do this, but this seems to be quick enough even at N = 80. It results in a set of variables, closeID1-closeIDN, where closeID_i contains the observation number of the the i_th closest observation. Note that closeID1 always == the observation itself.

Code:

// Simulate data and distance matrix.
set seed 8466
clear
local nobs = 4 // 80 in real life
set obs `nobs'
gen int id = _n  // id with observation numbers.
// Put your distance matrix into Mata. I assume that its row and col indexes
// correspond to the id variable (observation numbers).
// First,  I simulate that distance matrix in Mata.
mata: D = makesymmetric(runiform(`nobs',`nobs'))
mata: _diag(D, 0) // i, i is self-distance and must be 0
// Just for display to check the results.
mata: st_matrix("D",D)
mat list D, nohalf
//
// Make an id column vector that will be appended on the right of each
// observation's distance column vector.
sort id
putmata id = id
// closeID[i, j] will hold the obs. number of the j_th closest id to obs. i
mata: closeID = J(`=_N', `=_N', .)
//
forval i = 1/`=_N' {
   // Sort each observation's column of distances and its associated id vector
   mata: sortcol = sort((D[.,`i'],id), 1)
   // 2nd col of sortcol are now the closest IDs, in order,  for the i_th observation,
   // so save it.
   mata: closeID[`i', .] = sortcol[.,2]'
}
// Variables in Stata
getmata (closeID*) = closeID
list id closeID2-closeID`=_N'

Last edited by Mike Lacy; 05 Aug 2020, 14:56.

Comment

paulvonhippel

Join Date: Apr 2014

Posts: 496
#8

05 Aug 2020, 23:37

Cool, Mike Lacy! I encourage you to develop this into a command and publish it on SSC. It will have a lot of applications.
Comment

Announcement

Record linkage with quantitative variables

Comment

Comment

Comment

Comment

Comment

Comment

Comment