Dear Stata Colleagues,
Before I expend too much time and energy developing a useful tool, I wonder if anyone else has done this before or has any suggestions. I'm not even sure what the appropriate terminology is, so I'm not sure if I have exhaustively searched the web for an answer. Here is what I want to do:
Suppose I have a large data set with two descriptive variables (say, both text or one text and one numeric) that have many unique values. I want to determine if the two variables always match up together in the same way (e.g, A always goes with 1, B always goes with 2, etc.). I might want to do this to eliminate redundancy if the two variables tell me the same thing. Further, if they are not a 1:1 match, for descriptive purposes I may want to know if the two variables are related 1:m, m:1, or m:m. It may even be useful to know what "m" is (what is the min, max, mean, or median number of unique values of one variable that are matched with a single value of the other variable?), again for descriptive purpose, or for cleaning purposes if "m" is small and a few fixes could make the match 1:1.
Note that because this is a large data with a lot of unique values, it is not practical to do a cross tabulation of the two variables to see if all of the rows and columns have only one non-zero entry.
Is there a command in Stata that does this? Does anyone have any suggestions for how to do this? Here is one possibility that I came up with (suppose the two variables are X and Y):
If n1 is always 1 and n2 is always 1 then X and Y are a 1:1 match. Otherwise, they can be tabulated and summarized to describe the values of "m" (as defined above) for X and Y.
Is there a better way?
Regards,
Joe
Before I expend too much time and energy developing a useful tool, I wonder if anyone else has done this before or has any suggestions. I'm not even sure what the appropriate terminology is, so I'm not sure if I have exhaustively searched the web for an answer. Here is what I want to do:
Suppose I have a large data set with two descriptive variables (say, both text or one text and one numeric) that have many unique values. I want to determine if the two variables always match up together in the same way (e.g, A always goes with 1, B always goes with 2, etc.). I might want to do this to eliminate redundancy if the two variables tell me the same thing. Further, if they are not a 1:1 match, for descriptive purposes I may want to know if the two variables are related 1:m, m:1, or m:m. It may even be useful to know what "m" is (what is the min, max, mean, or median number of unique values of one variable that are matched with a single value of the other variable?), again for descriptive purpose, or for cleaning purposes if "m" is small and a few fixes could make the match 1:1.
Note that because this is a large data with a lot of unique values, it is not practical to do a cross tabulation of the two variables to see if all of the rows and columns have only one non-zero entry.
Is there a command in Stata that does this? Does anyone have any suggestions for how to do this? Here is one possibility that I came up with (suppose the two variables are X and Y):
Code:
gen ID=_n // or any other nonmissing value collapse (count) ID, by(X Y) // could also do duplicates drop X Y, force bys X: gen n1=_N bys Y: gen n2=_N tab1 n1 n2
Is there a better way?
Regards,
Joe
Comment