  • Compare two datasets

    Hi,

    I would like to compare two datasets that should be identical, except that some string variables are encoded in one of them and not in the other. The datasets are very large (up to 30,000 variables each), so I would like to avoid tabulating each variable in each dataset and comparing the results by eye. Is there a faster way?

    Thank you

  • #2
    See help cf.


    • #3
      Perhaps the output of help cf will help you find a direction.

      Added in edit - it's a tie! Both answers in at 52 minutes past the hour.


      • #4
        Thanks Robert and William for your message. Impressive timing!

        I checked cf but unfortunately it only compares the variable values, which wouldn't work if one variable is encoded and the other isn't. :/


        • #5
          Nothing will do what you want for the encoded variables without first decoding them back to string variables, or using their value labels to encode the string variables in the other dataset in the same way. (I assume you meant that the encoded variables were created by the Stata encode command, so that their value labels hold the original string values.)

          The cf command will compare a subset of variables, so something like this untested code may get you started. I assume f1 is the file that has some string variables and f2 is the file in which they are encoded. This code compares just the numeric variables.
          Code:
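          * f1: file with the string variables; f2: file in which they are encoded
          use f1, clear
          ds, has(type numeric)
          * copy the list as text; "local nv = ..." would evaluate it as an expression
          local nv `r(varlist)'
          cf `nv' using f2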
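
          For the string variables themselves, the decoding route described above might be sketched as follows. This is equally untested and rests on assumptions: f1 and f2 are the same placeholder file names as before, every variable that is a string in f1 and numeric in f2 carries a value label created by encode, and both files have the same observations in the same order.

          ```stata
          * list the variables stored as strings in f1
          use f1, clear
          ds, has(type string)
          local strvars `r(varlist)'

          * in f2, decode the numeric counterparts back to strings
          use f2, clear
          foreach v of local strvars {
              capture confirm numeric variable `v'
              if _rc == 0 {
                  tempvar s
                  decode `v', generate(`s')   // fails if `v' has no value label
                  drop `v'
                  rename `s' `v'
              }
          }
          tempfile f2dec
          save `f2dec'

          * compare the string variables in f1 with the decoded copies
          use f1, clear
          if "`strvars'" != "" {
              cf `strvars' using `f2dec', verbose
          }
          ```

          Working through a temporary file leaves f2 on disk unchanged. Any variable that cf reports as mismatched can then be inspected individually.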


          • #6
            Comparing data files and variables

            When a project’s data evolves over time, one frequently needs to compare two versions of a similar variable, either in the same dataset or in different datasets. This post compares and contrasts some of Stata’s utilities that are useful for this purpose and also offers three utilities I’ve written which attempt to enhance the features of Stata’s commands.

            Stata's -compare- reports the differences and similarities between two variables with different names located in the same dataset. My wrapper program -compare2-, unlike -compare-, also returns stored results in Stata's return space for subsequent use by the programmer. With the added -reldif- option, -compare2- presents the summary statistics of the relative difference between the two variables as computed by the Stata function -reldif-. See help for -reldif-. For ease of use, -compare2- has a companion dialog.
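
            As a rough illustration using only built-in commands (this is not the code of -compare2-), the comparison and the relative-difference summary could look like this. The variable names x_old and x_new are hypothetical.

            ```stata
            * x_old and x_new are hypothetical numeric variables in the dataset in memory
            compare x_old x_new

            * relative difference as defined by Stata's reldif(): |x - y| / (|y| + 1)
            generate double rd = reldif(x_old, x_new)
            summarize rd, detail
            ```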

            As mentioned above, Stata's -cf- command is a powerful tool for comparing variables in a "master" data set in memory to identically named variables in a saved data set on disk. But -cf- fails when the two data sets have different numbers of observations or when the only difference between two data sets is the way they are sorted. -cf2- is a wrapper for Stata's -cf- which first sorts the two datasets and then compares identically named variables on only those observations that match according to the sorting variables. For ease of use, -cf2- has a companion dialog.
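
            The idea of restricting the comparison to matching observations can be sketched with built-in commands alone. This is only an illustration under assumptions, not the implementation of -cf2-: f1 and f2 are placeholder file names, and idvar is a hypothetical variable that uniquely identifies observations in both files.

            ```stata
            * keys present in f2
            use idvar using f2, clear
            tempfile keys2
            save `keys2'

            * keep the f1 observations that also appear in f2
            use f1, clear
            merge 1:1 idvar using `keys2'
            tabulate _merge          // 1 = only in f1, 2 = only in f2, 3 = in both
            keep if _merge == 3
            drop _merge
            sort idvar
            tempfile f1m
            save `f1m'

            * keys common to both files
            use idvar using `f1m', clear
            tempfile common
            save `common'

            * keep the same observations from f2, in the same order, then compare
            use f2, clear
            merge 1:1 idvar using `common', keep(match) nogenerate
            sort idvar
            cf _all using `f1m', verbose
            ```

            The tabulation of _merge shows how many observations are unique to each file, which is often the first thing to check when two versions of a dataset differ in size.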

            The commands -cf- and -cf2- report mismatches between variables with the same name located in different data sets. When the variables being compared are numerical and are both in the current data set, -compare- or -compare2- provides a more complete analysis of differences. To obtain the more detailed comparison in the style of -compare- for variables with the same names located in different datasets, try the program -compuse- which also pops up a graph of one version of each variable against the version in the other dataset.

            Both -cf2- and -compuse- have the required option sortvars(varlist) to specify the variable or variables which uniquely sort the two compared datasets. Stata refers to such a set of variables as an "ID", while others refer to them as "key" variables. Prior to executing either -cf2- or -compuse-, the user should confirm that the proposed sort variables do indeed uniquely identify the observations in both data sets. Stata's -isid- serves this purpose for both the master and for the -using- dataset. Also see Stata’s -dta_equal- and the community contributed commands -assertky- and -findunique-.
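
            A minimal check of this kind, assuming placeholder file names f1 and f2 and a hypothetical key variable idvar, might be:

            ```stata
            * idvar is a hypothetical key; f1 is the master, f2 the using dataset
            use f1, clear
            isid idvar              // exits with an error if idvar is not unique in f1
            isid idvar using f2     // the same check for the dataset on disk
            ```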

            The commands listed above should all be findable using Stata’s
            search -commandname-
            Those I’ve written can be downloaded from:
            view net from "http://digital.cgdev.org/doc/stata/MO/Misc"
            I welcome questions or bug reports - or the news that Stata has updated its own -cf- and -compare- commands to offer similar options.
            Last edited by Mead Over; 08 Feb 2021, 17:00.


            • #7
              Hi Mead,

              Thank you for the detailed explanation. I'm unable to find compuse with Stata's search command; instead, it offers "ip17" to install. Can you please elaborate on this?
              I'm merging two longitudinal datasets using two different methods and software packages, and the two methods give different total numbers of observations in the merged datasets. To validate the merges and see why the results differ, I need to compare the merged datasets. Please let me know how to compare two datasets with different numbers of observations. Thank you so much!


              • #8
                Sorry not to see this earlier, @anjana rajendra .

                Compuse can be found here:

                Code:
                view net describe compuse, from(http://digital.cgdev.org/doc/stata/MO/Misc)
                And cf2 here:

                Code:
                view net describe cf2, from(http://digital.cgdev.org/doc/stata/MO/Misc)
