Importing data, removing duplicate variables

Konrad Zdeb

Join Date: Apr 2014
Posts: 496

Importing data, removing duplicate variables

09 May 2014, 06:02

Colleagues,

I'm forced to merge two data sets with potentially duplicate variables. The situation looks like in the table below:

Observation	var1	var2	...	var59	IndicatorABC (same as var2)	IndicatorDEF	IndicatorXYZ (same as var59)
AAA1	1	23	...	63	23	54	63
AAA2	5	23	...	69	23	45	69
AAA3	63	5	...	98	5	654	98
AAA4	21	8	...	45	8	45	45
AAA5	54	2	...	4	2	24123	4
AAA6	2	1	...	1	1	54	1

What would be the smart way to manage this problem. Ideally I would like to keep the dataset with indicators and variable containing original values. I have no way of deciphering which data is already in the dataset using variable names. The only way to do it is to look at the variable content, then it becomes apparent to some indicators were entered into the data on multiple occasions unnecessarily.

Kind regards,
Konrad
Version: Stata/IC 13.1

Tags: duplicates, observations, variables

Rich Goldstein

Join Date: Mar 2014

Posts: 4464
#2

09 May 2014, 06:22

I'm assuming that, as above, the potentially duplicate variables have different names (if they have the same names in each file, then a judicious use of options "update" and "replace" in your -merge- command will get you what you want); with different names, I would probably set up a loop (-forval- or -foreach- depending on the varnames), possibly with a second nested loop depending on what you know about which variables are "potential duplicates" and use -assert- (see -h assert-); if assert is false move to next comparison and if assert is true (i.e., the variables are the same), then drop one of them and then move to the next comparison
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#3

09 May 2014, 06:25

Interesting question. But even all values of two variables being equal doesn't mean that either variable can be dropped without loss. There might be crucial detail in variable labels, value labels (for numeric variables) and characteristics.
Comment

Mike Lacy

Join Date: Apr 2014
Posts: 2416

09 May 2014, 06:46

I would loop over all possible pairs of variables formed by taking one from the original data set and one from the new set. If a particular comparison shows no differences, drop the one from the new data set.

Code:

local orig  varnameA-varnameB  // whatever the original and new variable lists are
local new varnameX-varnameY
// Compressing may help avoid problems with comparisons of variables
// differently typed in the two data sets
compress *
foreach o of local orig {
   foreach n of local new {
       capture assert `o' == `n'
       if (_rc ==9)  {
            drop `n'
      }
    }
}

Regards, Mike

Comment

Konrad Zdeb

Join Date: Apr 2014

Posts: 496
#5

09 May 2014, 06:47

Rich/Nick,

Thank you very much for your comments. Nick, I agree with what you are saying about the variable labels. In the context of thetask at hand the information on variable labels can be safely disregarded. The data files are imported from CSV files and contain only variable names and values. Rich, yes the difficulty is that variables have different names. The datasets were compiled independently and not follow the same naming convention but utilise the same measures and correspond to the same geographies (observations). My estimate is that the datasets overlap in approximately 20%, judging only by correlation coefficients equal 1 that should not appear in the data of this type. Of course, correlation is not sufficient to remove supposedly duplicated variables.

Kind regards,
Konrad
Version: Stata/IC 13.1
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#6

09 May 2014, 06:50

Konrad,

If your goal is to easily identify which variables duplicate other variables, you can use pwcorr to find the correlation between all pairs of variables and then look for correlations that are close to 1. If you would like to automate the search you could loop through all combinations of variables and display those that with correlations close to 1:

Code:

foreach var1 of varlist * { foreach var2 of varlist * { if ("`var1'" != "`var2'") { local r=0 capture corr `var1' `var2' if _rc==0 { matrix A=r(C) local r=A[2,1] if `r'>0.99 & `r'<=1 { di "`var1' = `var2' (corr=`r')" } } } } }

If the variables in question are not numeric, that's another story...

Regards,
Joe
Comment
Konrad Zdeb

Join Date: Apr 2014

Posts: 496
#7

09 May 2014, 06:51

Originally posted by Mike Lacy View Post

I would loop over all possible pairs of variables formed by taking one from the original data set and one from the new set. If a particular comparison shows no differences, drop the one from the new data set.

Mike,

thank you very much for this. I will explore this solution.

Kind regards,
Konrad
Version: Stata/IC 13.1
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#8

09 May 2014, 07:39

Konrad,

Mike's solution is probably more reliable and useful than mine, but I think you will want to compare _rc to 0 (assertion is true, i.e., two variables are equal) rather than to 9 (assertion is false, variables are different in at least one observation).

Incidentally, you might find it useful to remove the capture prefix from the assert command, because assert produces useful information on how many discrepancies there are between variables. You might find that some variables are different in only a few observations, which might be of interest.

Regards,
Joe
Comment
Konrad Zdeb

Join Date: Apr 2014

Posts: 496
#9

09 May 2014, 08:19

Thanks, I am running this code right now. It's interesting and rather time consuming. It returns rather interesting results, some variables are almost perfectly correlated:

Code:

cprneia8dc = cprneib8dc (corr=.9928705908535508) incscr10p = incdep11 (corr=.9991476085633249)

How to explain this, I understand that those variables are almost identical. I am wondering why look at correlations of >0.99 instead of only those equal 1? I would assume that either variables are identical or not?

Last edited by Konrad Zdeb; 09 May 2014, 08:21.

Kind regards,
Konrad
Version: Stata/IC 13.1
Comment
Sarah Edgington

Join Date: Apr 2014

Posts: 284
#10

09 May 2014, 11:32

How was the data created? If the data was created via some automatic process of importing values from an outside source then I would expect (assuming that the processes worked as intended) that the only difference between variables that ostensibly came from the same source but with different naming conventions would be variable name. However, the more human intervention was involved in the process the more I would expect differences to arise between the two datasets. If manual copying and pasting or actual data entry was involved in creating the data I would be dumbfounded if you didn't find small differences between the two data sets on some of the variables from the same source.

Do you have documentation for these datasets? Can you match up which variables you expect to be identical across the two sources? If not, can you request documentation for the two datasets? You really shouldn't have to guess which variables represent what.
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#11

09 May 2014, 12:04

Konrad,

You should try Mike Lacy's code with my suggested modifications; assert will tell you exactly how many observations are different. Alternatively, you could just list the cases that are different (list cprneia8dc cprneib8dc if cprneia8dc != cprneib8dc) and see how different they are. If they look the same, perhaps the difference is one of precision, so check the storage types.

Regards,
Joe
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#12

09 May 2014, 15:35

Joe, thanks for correcting my mistake. Note also that in my code fragment, it would be necessary to unabbreviate the varlists into orig and new:

Code:

unab orig: varnameA-varnameB unab new: varnameX-varnameY

Regards, Mike
Comment

Announcement

Importing data, removing duplicate variables

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment