Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Determining if two variables are a 1:1 match

    Dear Stata Colleagues,

    Before I expend too much time and energy developing a useful tool, I wonder if anyone else has done this before or has any suggestions. I'm not even sure what the appropriate terminology is, so I'm not sure if I have exhaustively searched the web for an answer. Here is what I want to do:

    Suppose I have a large data set with two descriptive variables (say, both text or one text and one numeric) that have many unique values. I want to determine if the two variables always match up together in the same way (e.g, A always goes with 1, B always goes with 2, etc.). I might want to do this to eliminate redundancy if the two variables tell me the same thing. Further, if they are not a 1:1 match, for descriptive purposes I may want to know if the two variables are related 1:m, m:1, or m:m. It may even be useful to know what "m" is (what is the min, max, mean, or median number of unique values of one variable that are matched with a single value of the other variable?), again for descriptive purpose, or for cleaning purposes if "m" is small and a few fixes could make the match 1:1.

    Note that because this is a large data with a lot of unique values, it is not practical to do a cross tabulation of the two variables to see if all of the rows and columns have only one non-zero entry.

    Is there a command in Stata that does this? Does anyone have any suggestions for how to do this? Here is one possibility that I came up with (suppose the two variables are X and Y):

    Code:
    gen ID=_n   // or any other nonmissing value
    collapse (count) ID, by(X Y)   // could also do duplicates drop X Y, force
    bys X: gen n1=_N
    bys Y: gen n2=_N
    tab1 n1 n2
    If n1 is always 1 and n2 is always 1 then X and Y are a 1:1 match. Otherwise, they can be tabulated and summarized to describe the values of "m" (as defined above) for X and Y.

    Is there a better way?

    Regards,
    Joe

  • #2
    http://www.stata.com/support/faqs/da...ions-in-group/ is, I think, a start on the easy part of your problem

    Code:
     
    bysort X (Y) : gen match1 = Y[1] == Y[_N]
    yields 1s in every observation if and only if Y is identical given X. But there is the converse case too. Thus every mother is female, but not every female is a mother.

    As often, I'd advise distinguishing clearly between unique values (which occur once only) and distinct values.

    Your discussion focuses on the case in which values are unique as well as distinct, but (unless you are otherwise inclined), values could match even if not
    unique.

    assert can be very useful for checks.

    Comment


    • #3
      Nick,

      You are correct, I was using "unique" when I meant "distinct". I would like to develop a tool that would look at distinct values, whether or not they are unique. The collapse (or duplicates drop) portion of my proposed solution is an attempt to reduce the data set to the more manageable case where all values are unique. I suppose someone might find it useful to know if there is a 1:1 match between unique values. That is not my immediate concern, but I suppose it might be worth expanding to that scenario at some point.

      Thank you for the code suggestion. It certainly looks useful from an efficiency standpoint, especially since it does not destroy the original data set.

      Regards,
      Joe

      Comment


      • #4
        You're right to point out diplomatically that you collapse first, so distinct values all become unique.

        Extending my example, you could have

        Code:
        bysort X (Y) : gen matchXY = Y[1] == Y[_N]  
        bysort Y (X) : gen matchYX = X[1] == X[_N]
        tab matchYX mathXY
        gen match = matchXY & matchYX
        The 2 x 2 table could also be the focus of interest.

        For a program I would suggest

        1. Doing it in Mata and spitting out the 2 x 2 table.

        2. Thinking first what you want to do about missings. You can ignore them or include them.

        Comment


        • #5
          Here's a program. (My second thoughts were that Mata isn't needed.) Does this help?

          Code:
          *! 1.0.0 NJC 6 Jan 2016
          program matchcount, sortpreserve
              version 8.2
              syntax varlist(min=2 max=2) [if] [in] [, MISSing * ]
          
              if "`missing'" != "" marksample touse, novarlist strok
              else marksample touse, strok  
          
              qui count if `touse'
              if r(N) == 0 error 2000
          
              tokenize "`varlist'"
              args x y
          
              tempvar m1 m2  
              bysort `touse' `x' (`y') : gen byte `m1' = `y'[1] == `y'[_N]
              bysort `touse' `y' (`x') : gen byte `m2' = `x'[1] == `x'[_N]
              label var `m1' "`x' => `y'"
              label var `m2' "`y' => `x'"
          
              tab `m1' `m2' if `touse', `options'  
          end
          Some silly examples:

          Code:
          . sysuse auto, clear
          (1978 Automobile Data)
          
          . matchcount turn trunk
          
             turn => |     trunk => turn
               trunk |         0          1 |     Total
          -----------+----------------------+----------
                   0 |        66          4 |        70
                   1 |         3          1 |         4
          -----------+----------------------+----------
               Total |        69          5 |        74
          
          . matchcount make price
          
                     |  price =>
             make => |    make
               price |         1 |     Total
          -----------+-----------+----------
                   1 |        74 |        74
          -----------+-----------+----------
               Total |        74 |        74
          
          . matchcount make price, matcell(foo)
          
                     |  price =>
             make => |    make
               price |         1 |     Total
          -----------+-----------+----------
                   1 |        74 |        74
          -----------+-----------+----------
               Total |        74 |        74
          
          
          . mat li foo
          
          symmetric foo[1,1]
              c1
          r1  74
          
          . matchcount turn trunk, matcell(foo)
          
             turn => |     trunk => turn
               trunk |         0          1 |     Total
          -----------+----------------------+----------
                   0 |        66          4 |        70
                   1 |         3          1 |         4
          -----------+----------------------+----------
               Total |        69          5 |        74
          
          
          . mat li foo
          
          foo[2,2]
              c1  c2
          r1  66   4
          r2   3   1
          Here "=>" has the precise meaning of "given a value of the first variable, then only one distinct value of the second variable is found", as in "Stata" => "good".
          Last edited by Nick Cox; 06 Jan 2016, 16:28.

          Comment

          Working...
          X