Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Importing data, removing duplicate variables

    Colleagues,

    I'm forced to merge two data sets with potentially duplicate variables. The situation looks like in the table below:
    Observation var1 var2 ... var59 IndicatorABC (same as var2) IndicatorDEF IndicatorXYZ (same as var59)
    AAA1 1 23 ... 63 23 54 63
    AAA2 5 23 ... 69 23 45 69
    AAA3 63 5 ... 98 5 654 98
    AAA4 21 8 ... 45 8 45 45
    AAA5 54 2 ... 4 2 24123 4
    AAA6 2 1 ... 1 1 54 1
    What would be the smart way to manage this problem. Ideally I would like to keep the dataset with indicators and variable containing original values. I have no way of deciphering which data is already in the dataset using variable names. The only way to do it is to look at the variable content, then it becomes apparent to some indicators were entered into the data on multiple occasions unnecessarily.
    Kind regards,
    Konrad
    Version: Stata/IC 13.1

  • #2
    I'm assuming that, as above, the potentially duplicate variables have different names (if they have the same names in each file, then a judicious use of options "update" and "replace" in your -merge- command will get you what you want); with different names, I would probably set up a loop (-forval- or -foreach- depending on the varnames), possibly with a second nested loop depending on what you know about which variables are "potential duplicates" and use -assert- (see -h assert-); if assert is false move to next comparison and if assert is true (i.e., the variables are the same), then drop one of them and then move to the next comparison

    Comment


    • #3
      Interesting question. But even all values of two variables being equal doesn't mean that either variable can be dropped without loss. There might be crucial detail in variable labels, value labels (for numeric variables) and characteristics.

      Comment


      • #4
        I would loop over all possible pairs of variables formed by taking one from the original data set and one from the new set. If a particular comparison shows no differences, drop the one from the new data set.

        Code:
        local orig  varnameA-varnameB  // whatever the original and new variable lists are
        local new varnameX-varnameY
        // Compressing may help avoid problems with comparisons of variables
        // differently typed in the two data sets
        compress *
        foreach o of local orig {
           foreach n of local new {
               capture assert `o' == `n'
               if (_rc ==9)  {
                    drop `n'
              }
            }
        }
        Regards, Mike

        Comment


        • #5
          Rich/Nick,

          Thank you very much for your comments. Nick, I agree with what you are saying about the variable labels. In the context of thetask at hand the information on variable labels can be safely disregarded. The data files are imported from CSV files and contain only variable names and values. Rich, yes the difficulty is that variables have different names. The datasets were compiled independently and not follow the same naming convention but utilise the same measures and correspond to the same geographies (observations). My estimate is that the datasets overlap in approximately 20%, judging only by correlation coefficients equal 1 that should not appear in the data of this type. Of course, correlation is not sufficient to remove supposedly duplicated variables.
          Kind regards,
          Konrad
          Version: Stata/IC 13.1

          Comment


          • #6
            Konrad,

            If your goal is to easily identify which variables duplicate other variables, you can use pwcorr to find the correlation between all pairs of variables and then look for correlations that are close to 1. If you would like to automate the search you could loop through all combinations of variables and display those that with correlations close to 1:

            Code:
            foreach var1 of varlist * {
              foreach var2 of varlist * {
                 if ("`var1'" != "`var2'") {
                   local r=0
                   capture corr `var1' `var2'
                   if _rc==0 {
                     matrix A=r(C)
                     local r=A[2,1]
                     if `r'>0.99 & `r'<=1  {
                      di "`var1' = `var2' (corr=`r')"
                   }
                 }
               }
              }
            }
            If the variables in question are not numeric, that's another story...

            Regards,
            Joe

            Comment


            • #7
              Originally posted by Mike Lacy View Post
              I would loop over all possible pairs of variables formed by taking one from the original data set and one from the new set. If a particular comparison shows no differences, drop the one from the new data set.
              Mike,

              thank you very much for this. I will explore this solution.
              Kind regards,
              Konrad
              Version: Stata/IC 13.1

              Comment


              • #8
                Konrad,

                Mike's solution is probably more reliable and useful than mine, but I think you will want to compare _rc to 0 (assertion is true, i.e., two variables are equal) rather than to 9 (assertion is false, variables are different in at least one observation).

                Incidentally, you might find it useful to remove the capture prefix from the assert command, because assert produces useful information on how many discrepancies there are between variables. You might find that some variables are different in only a few observations, which might be of interest.

                Regards,
                Joe

                Comment


                • #9
                  Thanks, I am running this code right now. It's interesting and rather time consuming. It returns rather interesting results, some variables are almost perfectly correlated:

                  Code:
                  cprneia8dc = cprneib8dc (corr=.9928705908535508)
                  incscr10p = incdep11 (corr=.9991476085633249)
                  How to explain this, I understand that those variables are almost identical. I am wondering why look at correlations of >0.99 instead of only those equal 1? I would assume that either variables are identical or not?
                  Last edited by Konrad Zdeb; 09 May 2014, 09:21.
                  Kind regards,
                  Konrad
                  Version: Stata/IC 13.1

                  Comment


                  • #10
                    How was the data created? If the data was created via some automatic process of importing values from an outside source then I would expect (assuming that the processes worked as intended) that the only difference between variables that ostensibly came from the same source but with different naming conventions would be variable name. However, the more human intervention was involved in the process the more I would expect differences to arise between the two datasets. If manual copying and pasting or actual data entry was involved in creating the data I would be dumbfounded if you didn't find small differences between the two data sets on some of the variables from the same source.

                    Do you have documentation for these datasets? Can you match up which variables you expect to be identical across the two sources? If not, can you request documentation for the two datasets? You really shouldn't have to guess which variables represent what.

                    Comment


                    • #11
                      Konrad,

                      You should try Mike Lacy's code with my suggested modifications; assert will tell you exactly how many observations are different. Alternatively, you could just list the cases that are different (list cprneia8dc cprneib8dc if cprneia8dc != cprneib8dc) and see how different they are. If they look the same, perhaps the difference is one of precision, so check the storage types.

                      Regards,
                      Joe

                      Comment


                      • #12
                        Joe, thanks for correcting my mistake. Note also that in my code fragment, it would be necessary to unabbreviate the varlists into orig and new:

                        Code:
                        unab orig: varnameA-varnameB
                        unab new: varnameX-varnameY

                        Regards, Mike

                        Comment

                        Working...
                        X