Compare two datasets

Nicolas Orgeira

Join Date: Sep 2015

Posts: 165
#1

21 Dec 2017, 15:15

Hi,

I would like to compare two datasets which should be the same, except that one of them has some string variables encoded while the other does not. The datasets are very large (up to 30,000 variables each), so I would like to avoid tabulating each variable in each dataset and comparing the results by eye. Is there a faster way?

Thank you
Robert Picard

Join Date: Mar 2014

Posts: 1536
#2

21 Dec 2017, 15:52

See help cf.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

21 Dec 2017, 15:52

Perhaps the output of help cf will help you find a direction.

Added in edit - it's a tie! Both answers in at 52 minutes past the hour.
Nicolas Orgeira

Join Date: Sep 2015

Posts: 165
#4

21 Dec 2017, 16:02

Thanks Robert and William for your message. Impressive timing!

I checked cf but unfortunately it only compares the variable values, which wouldn't work if one variable is encoded and the other isn't. :/
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

21 Dec 2017, 18:01

Nothing will do what you want for the encoded variables without first decoding them back to string variables, or using their value labels to similarly encode the string variables in the other dataset (assuming you meant that the encoded variables were encoded by the Stata encode command and had value labels created that use the original string values).

The cf command will compare a subset of variables, so something like this untested code may get you started. I assume f1 is the file that has some string variables and f2 is the file in which they are encoded. This code compares just the numeric variables.

Code:

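* f1 holds the string variables; f2 holds their encoded counterparts
use f1, clear
* collect the names of the numeric variables in f1
ds, has(type numeric)
* no "=" here: the varlist is copied as text, not evaluated as an expression
local nv `r(varlist)'
* compare only those variables against the copy in f2
cf `nv' using f2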
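For the encoded variables themselves, the decoding route described above might look something like the sketch below. It is only an illustration under assumptions: the toy files built from the auto dataset stand in for f1 and f2, the encoded variables are assumed to keep the same names as their string originals, and both files are assumed to have the same observations in the same order (cf needs that). Any variable in the string list that is missing from the encoded file would make the final cf stop with an error.

```stata
* Toy setup (assumption): `f1' holds strings, `f2' holds the same data
* with the string variable run through -encode- under the same name.
sysuse auto, clear
keep make price mpg
tempfile f1 f2
save `f1'
encode make, generate(make_n)
drop make
rename make_n make
save `f2'

* Step 1: list the string variables in the unencoded file.
use `f1', clear
ds, has(type string)
local sv `r(varlist)'

* Step 2: in the encoded file, decode those that are value-labeled numerics.
use `f2', clear
foreach v of local sv {
    capture confirm numeric variable `v'
    if _rc == 0 {
        local lbl : value label `v'
        if "`lbl'" != "" {
            tempvar s
            decode `v', generate(`s')
            drop `v'
            rename `s' `v'
        }
    }
}

* Step 3: compare the restored strings against the original file.
cf `sv' using `f1', verbose
```

Decoding only recovers the original text if the value labels still carry the original strings, so a mismatch reported here can mean either a real data difference or an edited label.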
Mead Over

Join Date: Sep 2014

Posts: 110
#6

08 Feb 2021, 15:57

Comparing data files and variables

When a project’s data evolves over time, one frequently needs to compare two versions of a similar variable, either in the same dataset or in different datasets. This post compares and contrasts some of Stata’s utilities that are useful for this purpose and also offers three utilities I’ve written which attempt to enhance the features of Stata’s commands.

Stata's -compare- reports the differences and similarities between two variables with different names located in the same dataset. My wrapper program -compare2-, unlike -compare-, also returns stored results in Stata's return space for subsequent use by the programmer. With the added -reldif- option, -compare2- presents the summary statistics of the relative difference between the two variables as computed by the Stata function -reldif-. See help for -reldif-. For ease of use, -compare2- has a companion dialog.
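As a point of reference, here is a minimal sketch of what the built-in tools already give you, using only Stata's own -compare- command and -reldif()- function (not -compare2-, whose syntax is not shown here). The variable price2 is a made-up second version of price, created purely for illustration.

```stata
sysuse auto, clear

* Hypothetical second version of price that differs in the first five rows.
generate price2 = price
replace price2 = price * 1.01 in 1/5

* Built-in report of how the two variables differ.
compare price price2

* Relative difference, |x - y| / (|y| + 1), summarized by hand.
generate double rd = reldif(price, price2)
summarize rd, detail
```

The -summarize- call leaves its statistics in r(), which is one way to get at the numbers programmatically when using the built-in commands alone.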

As mentioned above, Stata's -cf- command is a powerful tool for comparing variables in a "master" data set in memory to identically named variables in a saved data set on disk. But -cf- fails when the two data sets have different numbers of observations or when the only difference between two data sets is the way they are sorted. -cf2- is a wrapper for Stata's -cf- which first sorts the two datasets and then compares identically named variables on only those observations that match according to the sorting variables. For ease of use, -cf2- has a companion dialog.
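The general idea of comparing only the observations that match on a key can be sketched with built-in commands alone. This is not how -cf2- is implemented, just an illustration of the approach under assumptions: the toy files come from the auto dataset, make plays the role of the key, and the _u suffix is an arbitrary naming choice.

```stata
* Toy data (assumption): two versions of a file keyed by make.
sysuse auto, clear
keep make price mpg
tempfile vold vnew
save `vold'

* Second version: one row dropped, one value changed, rows re-sorted.
drop in 1
replace price = price + 1 in 1
gsort -price
* Rename so both versions can sit side by side after the merge.
rename (price mpg) (price_u mpg_u)
save `vnew'

* Match on the key; -merge 1:1- stops with an error if the key is not unique.
use `vold', clear
merge 1:1 make using `vnew'
tabulate _merge

* Compare values only where an observation exists in both versions.
keep if _merge == 3
foreach v in price mpg {
    count if `v' != `v'_u
    display "`v': " r(N) " mismatches among matched observations"
}
```

The -tabulate _merge- line shows how many observations appear in only one of the two files, which is often the first thing worth knowing when the observation counts differ.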

The commands -cf- and -cf2- report mismatches between variables with the same name located in different data sets. When the variables being compared are numerical and are both in the current data set, -compare- or -compare2- provides a more complete analysis of differences. To obtain the more detailed comparison in the style of -compare- for variables with the same names located in different datasets, try the program -compuse- which also pops up a graph of one version of each variable against the version in the other dataset.

Both -cf2- and -compuse- have the required option sortvars(varlist) to specify the variable or variables which uniquely sort the two compared datasets. Stata refers to such a set of variables as an "ID", while others refer to them as "key" variables. Prior to executing either -cf2- or -compuse-, the user should confirm that the proposed sort variables do indeed uniquely identify the observations in both data sets. Stata's -isid- serves this purpose for both the master and for the -using- dataset. Also see Stata’s -dta_equal- and the community contributed commands -assertky- and -findunique-.
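A minimal sketch of that uniqueness check, run on both files before any comparison, might look like this. The toy file and the key make are assumptions for illustration; substitute your own file names and sort variables.

```stata
* Toy setup (assumption): the file on disk is a copy of the data in memory.
sysuse auto, clear
keep make price mpg
tempfile other
save `other'

local key make

* Check the data in memory (the master).
capture noisily isid `key'
local rc_master = _rc

* Check the file on disk, then put the master back in memory.
preserve
use `other', clear
capture noisily isid `key'
local rc_using = _rc
restore

if `rc_master' | `rc_using' {
    display as error "`key' does not uniquely identify observations in both files"
    exit 459
}
```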

The commands listed above should all be findable using Stata’s
search -commandname-
Those I’ve written can be downloaded from:
view net from "http://digital.cgdev.org/doc/stata/MO/Misc"
I welcome questions or bug reports - or the news that Stata has updated its own -cf- and -compare- commands to offer similar options.

Last edited by Mead Over; 08 Feb 2021, 16:00.
anjana rajendra

Join Date: Apr 2021

Posts: 36
#7

26 Apr 2021, 01:22

Hi Mead,

Thank you for the detailed explanation. I'm unable to find compuse through Stata's command search; instead it offers "ip17" to install. Can you please elaborate on this?
I'm trying to merge two longitudinal datasets using two different methods and software packages. The two methods give different total numbers of observations in the merged datasets. Hence, to validate the merges and see why they differ, I need to compare the merged datasets. Please let me know how to compare two datasets with different numbers of observations. Thank you so much!
Mead Over

Join Date: Sep 2014

Posts: 110
#8

16 May 2021, 11:27

Sorry not to have seen this earlier, @anjana rajendra.

Compuse can be found here:

Code:

view net describe compuse, from(http://digital.cgdev.org/doc/stata/MO/Misc)

And cf2 here:

Code:

view net describe cf2, from(http://digital.cgdev.org/doc/stata/MO/Misc)