I am having trouble working out how to check the association between two identifier variables. I will explain with the example dataset below (the real dataset is 400 x 500,000).
I want to know, for each unique ID1, how many different ID2 values are associated with it. For instance, "165498" corresponds to both "12abc" and "ef4". On the other hand, "798402" corresponds only to "12abc". Dots represent missing values, which I have in my dataset.
How can I find out if there are any situations where a certain ID1 is associated with more than one unique ID2? And can I then tag those particular ID1s to investigate further?
Many thanks in advance
ID1 | ID2 |
. | 12abc |
165498 | 12abc |
. | 12abc |
798402 | 12abc |
165498 | ef4 |
. | ef4 |
. | ef4 |
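To make the result I am after concrete, here is a minimal sketch of the logic in Python, using only the standard library. It is an illustration under assumptions, not a tested solution for the real data: the names `rows`, `MISSING`, `distinct_id2_per_id1`, and `tag_rows` are made up for this example, the data is assumed to fit in memory as (ID1, ID2) string pairs, and rows with a missing ID are assumed to be skipped rather than counted.

```python
from collections import defaultdict

# Example rows copied from the table above; "." marks a missing value.
rows = [
    (".", "12abc"),
    ("165498", "12abc"),
    (".", "12abc"),
    ("798402", "12abc"),
    ("165498", "ef4"),
    (".", "ef4"),
    (".", "ef4"),
]

MISSING = "."  # assumption: missing IDs are stored as a literal dot


def distinct_id2_per_id1(rows):
    """Map each non-missing ID1 to the set of non-missing ID2 values seen with it."""
    seen = defaultdict(set)
    for id1, id2 in rows:
        if id1 == MISSING or id2 == MISSING:
            continue  # assumption: missing IDs carry no association
        seen[id1].add(id2)
    return seen


def tag_rows(rows):
    """Append a flag to each row: True if its ID1 maps to more than one ID2."""
    seen = distinct_id2_per_id1(rows)
    return [(id1, id2, len(seen.get(id1, ())) > 1) for id1, id2 in rows]


# Number of distinct ID2 values per ID1, and the ID1s worth investigating.
counts = {id1: len(id2s) for id1, id2s in distinct_id2_per_id1(rows).items()}
flagged = sorted(id1 for id1, n in counts.items() if n > 1)

print(counts)   # {'165498': 2, '798402': 1}
print(flagged)  # ['165498']
```

In this sketch, `tag_rows(rows)` returns every original row with a True/False flag attached, so the flagged ID1s can be filtered out for a closer look. Rows with a missing ID1 are never flagged.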