Determining if two variables are a 1:1 match

Joe Canner

Join Date: Mar 2014

Posts: 580
#1

Determining if two variables are a 1:1 match

06 Jan 2016, 11:17

Dear Stata Colleagues,

Before I expend too much time and energy developing a useful tool, I wonder if anyone else has done this before or has any suggestions. I'm not even sure what the appropriate terminology is, so I'm not sure if I have exhaustively searched the web for an answer. Here is what I want to do:

Suppose I have a large data set with two descriptive variables (say, both text or one text and one numeric) that have many unique values. I want to determine if the two variables always match up together in the same way (e.g, A always goes with 1, B always goes with 2, etc.). I might want to do this to eliminate redundancy if the two variables tell me the same thing. Further, if they are not a 1:1 match, for descriptive purposes I may want to know if the two variables are related 1:m, m:1, or m:m. It may even be useful to know what "m" is (what is the min, max, mean, or median number of unique values of one variable that are matched with a single value of the other variable?), again for descriptive purpose, or for cleaning purposes if "m" is small and a few fixes could make the match 1:1.

Note that because this is a large data with a lot of unique values, it is not practical to do a cross tabulation of the two variables to see if all of the rows and columns have only one non-zero entry.

Is there a command in Stata that does this? Does anyone have any suggestions for how to do this? Here is one possibility that I came up with (suppose the two variables are X and Y):

Code:

gen ID=_n // or any other nonmissing value collapse (count) ID, by(X Y) // could also do duplicates drop X Y, force bys X: gen n1=_N bys Y: gen n2=_N tab1 n1 n2

If n1 is always 1 and n2 is always 1 then X and Y are a 1:1 match. Otherwise, they can be tabulated and summarized to describe the values of "m" (as defined above) for X and Y.

Is there a better way?

Regards,
Joe
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35436
#2

06 Jan 2016, 11:36

http://www.stata.com/support/faqs/da...ions-in-group/ is, I think, a start on the easy part of your problem

Code:

bysort X (Y) : gen match1 = Y[1] == Y[_N]

yields 1s in every observation if and only if Y is identical given X. But there is the converse case too. Thus every mother is female, but not every female is a mother.

As often, I'd advise distinguishing clearly between unique values (which occur once only) and distinct values.

Your discussion focuses on the case in which values are unique as well as distinct, but (unless you are otherwise inclined), values could match even if not
unique.

assert can be very useful for checks.
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#3

06 Jan 2016, 11:46

Nick,

You are correct, I was using "unique" when I meant "distinct". I would like to develop a tool that would look at distinct values, whether or not they are unique. The collapse (or duplicates drop) portion of my proposed solution is an attempt to reduce the data set to the more manageable case where all values are unique. I suppose someone might find it useful to know if there is a 1:1 match between unique values. That is not my immediate concern, but I suppose it might be worth expanding to that scenario at some point.

Thank you for the code suggestion. It certainly looks useful from an efficiency standpoint, especially since it does not destroy the original data set.

Regards,
Joe
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35436
#4

06 Jan 2016, 12:01

You're right to point out diplomatically that you collapse first, so distinct values all become unique.

Extending my example, you could have

Code:

bysort X (Y) : gen matchXY = Y[1] == Y[_N] bysort Y (X) : gen matchYX = X[1] == X[_N] tab matchYX mathXY gen match = matchXY & matchYX

The 2 x 2 table could also be the focus of interest.

For a program I would suggest

1. Doing it in Mata and spitting out the 2 x 2 table.

2. Thinking first what you want to do about missings. You can ignore them or include them.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35436

06 Jan 2016, 15:25

Here's a program. (My second thoughts were that Mata isn't needed.) Does this help?

Code:

*! 1.0.0 NJC 6 Jan 2016
program matchcount, sortpreserve
    version 8.2
    syntax varlist(min=2 max=2) [if] [in] [, MISSing * ]

    if "`missing'" != "" marksample touse, novarlist strok
    else marksample touse, strok  

    qui count if `touse'
    if r(N) == 0 error 2000

    tokenize "`varlist'"
    args x y

    tempvar m1 m2  
    bysort `touse' `x' (`y') : gen byte `m1' = `y'[1] == `y'[_N]
    bysort `touse' `y' (`x') : gen byte `m2' = `x'[1] == `x'[_N]
    label var `m1' "`x' => `y'"
    label var `m2' "`y' => `x'"

    tab `m1' `m2' if `touse', `options'  
end

Some silly examples:

Code:

. sysuse auto, clear
(1978 Automobile Data)

. matchcount turn trunk

   turn => |     trunk => turn
     trunk |         0          1 |     Total
-----------+----------------------+----------
         0 |        66          4 |        70
         1 |         3          1 |         4
-----------+----------------------+----------
     Total |        69          5 |        74

. matchcount make price

           |  price =>
   make => |    make
     price |         1 |     Total
-----------+-----------+----------
         1 |        74 |        74
-----------+-----------+----------
     Total |        74 |        74

. matchcount make price, matcell(foo)

           |  price =>
   make => |    make
     price |         1 |     Total
-----------+-----------+----------
         1 |        74 |        74
-----------+-----------+----------
     Total |        74 |        74


. mat li foo

symmetric foo[1,1]
    c1
r1  74

. matchcount turn trunk, matcell(foo)

   turn => |     trunk => turn
     trunk |         0          1 |     Total
-----------+----------------------+----------
         0 |        66          4 |        70
         1 |         3          1 |         4
-----------+----------------------+----------
     Total |        69          5 |        74


. mat li foo

foo[2,2]
    c1  c2
r1  66   4
r2   3   1

Here "=>" has the precise meaning of "given a value of the first variable, then only one distinct value of the second variable is found", as in "Stata" => "good".

Last edited by Nick Cox; 06 Jan 2016, 15:28.

Announcement

Determining if two variables are a 1:1 match

Comment

Comment

Comment

Comment