Compare datasets with different n

llmiller

Join Date: Jul 2014

Posts: 10
#1


04 Sep 2014, 04:39

Can anyone recommend an approach to comparing two datasets when the n is different?

I have over 60 datasets that share many of the same variables. I have reduced these to 27 datasets by using the dta_equal command and merging after dealing with any discrepancies. I also looked at the cf command, but neither command works when n differs between the two files. I am going to try merging some files (saving the merged file only if _merge takes values 1-3), but I have already come across several files where _merge = 5, and I'm not sure how to approach this. If it were just a few variables, I would rename the variables in one file with a suffix such as _1, merge the files, and then compare the variables directly, but I have around 4,000 variables and 20,000 observations. I realise this is a somewhat clumsy approach, so any suggestions would be welcome.

Regards

Laura
Nick Cox

Join Date: Mar 2014

Posts: 35432
#2

04 Sep 2014, 05:02

Much depends on what you expect. In some problems appending datasets and looking for duplicates may help, but indirectly by indicating observations in only one original file as singletons.
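A minimal sketch of the append-and-look-for-duplicates idea, using toy data invented purely for illustration. The variable names id, x, source and dup are assumptions, as is the premise that neither file contains duplicate rows of its own. If a variable is string in one file and numeric in the other, append will refuse to combine them unless told otherwise, so such variables need attention first.

```stata
* Toy data, invented for illustration; replace with your own files
clear
input id x
1 10
2 20
3 30
end
tempfile first
save `first'

clear
input id x
2 20
3 35
4 40
end
tempfile second
save `second'

* Stack the two files and record where each row came from
use `first', clear
append using `second', generate(source)   // 0 = first file, 1 = second file

* Tag rows that have an identical twin on every variable except source
ds source, not
duplicates tag `r(varlist)', generate(dup)

* dup == 0: the row is in one file only, or its values differ between files
sort id source
list id x source if dup == 0
```

In the toy data only id 2 is identical in both files, so it is the only id tagged as a duplicate. Everything else is listed as a singleton, whether because the id is missing from one file or because a value differs.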
Konrad Zdeb

Join Date: Apr 2014

Posts: 496
#3

04 Sep 2014, 05:02

I would suggest that you explore cf2 and cf3, both available from SSC. You may also want to look at cfby to produce a discrepancy rate, and at cfvars (SSC) to compare lists of variable names.
As this is a Stata forum, it is not necessarily appropriate for me to suggest a solution that uses other software, but since I have no idea how to build something similar in Stata, I would suggest that you have a look at the compare package in R. Personally, I don't find it the easiest to use, but the package lets you specify a model in which certain transformations are applied to an object in order to check whether the transformed object matches the "ideal" object. For instance, suppose you have one clean dataset and some datasets received from data providers, and you presume that in the received datasets missing data could be coded as #, missing, -, or . and that names could be unnecessarily capitalised. You could instruct the compare package to apply those transformations to the new object and check whether the result matches the "ideal" clean dataset.

Last edited by Konrad Zdeb; 04 Sep 2014, 05:16. Reason: Content.

Kind regards,
Konrad
Version: Stata/IC 13.1
llmiller

Join Date: Jul 2014

Posts: 10
#4

04 Sep 2014, 06:31

Nick: Thanks for your response. I'm not sure how efficient append would be here. There are lots of issues that I'm expecting, including:
* variables with missings in one file but populated in the other (the update option of merge deals with these)
* variables with missings coded in detail in some files but just . in others
* variables with lowercase/uppercase discrepancies in value labels / strings
* variables with no and yes coded 0 and 1 in one file but 1 and 2 in another
* variables with different formats (storage type or length), as they were transferred from SPSS with Stat/Transfer at different times and the way Stat/Transfer did this changed
* and of course some variables where the values are just different (I haven't yet established why).

Unfortunately it may not be possible to deal with all of these in the same way. I'm sure there will be other issues that I haven't thought of too.
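Some of the issues listed above can be harmonised before any comparison is run. The sketch below uses toy data invented for illustration. The variable names sex, yesno and score are assumptions, and the recode assumes 1 = no and 2 = yes. It lowercases string values only, not value labels, and the recode does not touch any value label attached to the variable, so labels would need to be redefined separately.

```stata
* Toy data, invented for illustration; replace with your own file
clear
input id str3 sex yesno score
1 "F" 1 .a
2 "m" 2 12
3 "f" 2 .b
end

* Lowercase all string variables so that case differences are not flagged
ds, has(type string)
local strvars `r(varlist)'
foreach v of local strvars {
    quietly replace `v' = lower(`v')
}

* Collapse extended missing codes (.a to .z) to system missing
ds, has(type numeric)
local numvars `r(varlist)'
foreach v of local numvars {
    quietly replace `v' = . if missing(`v')
}

* Recode a yes/no variable stored as 1/2 to 0/1 (assumes 1 = no, 2 = yes)
recode yesno (1 = 0) (2 = 1)

* Give each variable the smallest storage type that holds its values
compress

tempfile cleaned
save `cleaned'
```

Collapsing the detailed missing codes discards information, so this is something to do on a copy used only for the comparison, not on the file that will eventually be kept.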

The data files came about from some work I did a while ago, where I kept separate files for different analyses, and I now need to merge them all into one big data file. However, during the analysis process I renamed some variables, recoded others, and created some only when they were useful for that particular analysis. The variables *should* be the same across the files, but I need to check before combining them, as some are different and I don't know why. I don't have all the syntax files that I used to create them.

I'd like to be able to say something like

dta_equal "file1" "file2", uniq1 uniq2 option3

where "option3" specifies that the values of all variables are checked only for observations that are in both files (e.g. by specifying an id variable), so that a different n would not be a problem.

The only way I can think of to do this is to rename them all with a suffix, as in my original post, and then compare them using the compare command after merging, but with the number of variables I have that is not viable by hand.
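The rename-merge-compare idea need not be done by hand, because both the renaming and the comparison can be looped over all variables. The sketch below uses toy data invented for illustration and assumes that a variable named id uniquely identifies observations in both files, and that adding the suffix _2 neither collides with an existing name nor exceeds Stata's 32-character limit on variable names. It is not a substitute for any of the commands mentioned in this thread, only an illustration of the logic.

```stata
* Toy data, invented for illustration; id is an assumed key variable
clear
input id x str1 sex
1 10 "f"
2 20 "m"
3 30 "f"
end
tempfile first
save `first'

clear
input id x str1 sex
2 20 "M"
3 35 "f"
4 40 "m"
end

* Add the suffix _2 to every variable except the key
ds id, not
local vars2 `r(varlist)'
foreach v of local vars2 {
    rename `v' `v'_2
}
tempfile second
save `second'

* Merge on the key; a different n only affects the value of _merge
use `first', clear
ds id, not
local vars `r(varlist)'
merge 1:1 id using `second'

* Among observations present in both files, count disagreements per variable
foreach v of local vars {
    capture confirm variable `v'_2, exact
    if _rc continue                       // variable absent from second file
    capture count if `v' != `v'_2 & _merge == 3
    if _rc {
        display as text "`v': could not compare (string vs numeric?)"
        continue
    }
    quietly count if `v' != `v'_2 & _merge == 3
    if r(N) > 0 display as text "`v': " as result r(N) as text " differences among matched ids"
}
```

Observations found in only one file get _merge equal to 1 or 2 and are left out of the counts, so the two files can have different numbers of observations. Note that Stata treats . and .a as different values, so detailed missing codes in one file will be reported as differences unless they are harmonised first.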

Does that make any other ideas come to mind?

Konrad: Many thanks for these suggestions - I will have a look through them and see what options they allow.

Laura
Matthew White

Join Date: Apr 2014

Posts: 29
#5

10 Sep 2014, 11:11

Another option is the SSC program cfout, which I just updated. It allows you to compare two datasets, outputting a dataset of differences.