Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Record linkage with quantitative variables

    I have two datasets describing the same 80 schools.

    Some variables are similar -- for example, both lists include the schools' enrollment and % African American -- but there is no common ID or other variable that matches the two lists perfectly.

    Is there a command that will find the best way to link the two datasets?

    If not, I've done much of the work. I've made an 80x80=6400 row dataset that gives the Mahalanobis distance between every possible combination of a row from the first dataset with a row from the second. Now how do I optimally sort and deduplicate 6400 rows to get the best 80 matches, with no duplicates from either list?

  • #2
    Consider sorting on the variables that you think that (almost?) identify observations in your datasets, and then matching by observations number.

    Comment


    • #3
      paulvonhippel
      Are the data based on schools in the US? Just wondering if using any publicly available data might be helpful/useful in that instance.

      Comment


      • #4
        Yes, it's a small sample of US schools which I am linking to the Common Core of Data. The linking variables match imperfectly, hence my question.

        Comment


        • #5
          I clearly don't completely understand your data, but you might want to take a look at two user-written commands: -reclink- and -matchit-; use -search- to find and download

          Comment


          • #6
            Thanks for the suggestions, Rich Goldstein. I believe that both -reclink- and -matchit- are for string variables. The problem is that variables that are similar when construed as numbers are not necessarily similar when construed as strings. Consider 199 and 200 -- they are similar numbers but not similar strings.

            I would love to find a record linkage command that works for numeric variables. Is there one?

            Comment


            • #7
              Here's a brute force approach using Mata, which I have not particularly checked. There are probably more elegant ways to do this, but this seems to be quick enough even at N = 80. It results in a set of variables, closeID1-closeIDN, where closeID_i contains the observation number of the the i_th closest observation. Note that closeID1 always == the observation itself.
              Code:
              // Simulate data and distance matrix.
              set seed 8466
              clear
              local nobs = 4 // 80 in real life
              set obs `nobs'
              gen int id = _n  // id with observation numbers.
              // Put your distance matrix into Mata. I assume that its row and col indexes
              // correspond to the id variable (observation numbers).
              // First,  I simulate that distance matrix in Mata.
              mata: D = makesymmetric(runiform(`nobs',`nobs'))
              mata: _diag(D, 0) // i, i is self-distance and must be 0
              // Just for display to check the results.
              mata: st_matrix("D",D)
              mat list D, nohalf
              //
              // Make an id column vector that will be appended on the right of each
              // observation's distance column vector.
              sort id
              putmata id = id
              // closeID[i, j] will hold the obs. number of the j_th closest id to obs. i
              mata: closeID = J(`=_N', `=_N', .)
              //
              forval i = 1/`=_N' {
                 // Sort each observation's column of distances and its associated id vector
                 mata: sortcol = sort((D[.,`i'],id), 1)
                 // 2nd col of sortcol are now the closest IDs, in order,  for the i_th observation,
                 // so save it.
                 mata: closeID[`i', .] = sortcol[.,2]'
              }
              // Variables in Stata
              getmata (closeID*) = closeID
              list id closeID2-closeID`=_N'


              
              Last edited by Mike Lacy; 05 Aug 2020, 14:56.

              Comment


              • #8
                Cool, Mike Lacy! I encourage you to develop this into a command and publish it on SSC. It will have a lot of applications.

                Comment

                Working...
                X