Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Variable label overlap when merging datasets

    Hi,

    I am trying to merge two datasets and they happen to have the same variable label names.
    So when I merge 1:1 using individual id, the labels overlap.
    Below are parts of the datasets that contain the same name of variable labels.
    EX) 'label1' and 'label2' will be misleading.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte b1 long b2 byte(b3 b4)
    18     . .  .
    18     . .  .
    18     . .  .
    18     . .  .
    18     . .  .
    18     . .  .
    18     . .  .
     1     . .  .
    18     . .  .
    18     . .  .
     6  2026 1  4
    18     . .  .
    18     . .  .
    18     . .  .
    18     . .  .
     6  2208 1  6
     3  2208 1  3
     2  2839 1  0
     1     . .  .
    18     . .  .
     8  2326 1 11
     8  2190 1 10
    18     . .  .
    18     . .  .
     8  1028 1 11
    18     . .  .
    18     . .  .
    18     . .  .
     8  1033 1 11
    18     . .  .
     8  2055 1 11
     8  1033 1  8
    18     . .  .
    96     . .  .
     8  1038 1  8
     6  1035 1  3
     1     . .  .
    18     . .  .
    18     . .  .
    18     . .  .
    18     . .  .
     8  1028 1 10
     8  1033 1  9
     5  1014 1  6
    18     . .  .
    end
    label values b1 labels0
    label def labels0 1 "Nursery/day care", modify
    label def labels0 2 "Basic/infant/kindergarten", modify
    label def labels0 3 "Primary", modify
    label def labels0 5 "All age school", modify
    label def labels0 6 "Primary & Junior high", modify
    label def labels0 8 "Secondary High", modify
    label def labels0 11 "University", modify
    label def labels0 18 "None", modify
    label values b3 labels1
    label def labels1 1 "Public", modify
    label def labels1 2 "Private", modify
    label values b4 labels2
    label def labels2 2 "Primary 1-3", modify
    label def labels2 3 "Primary 4-6", modify
    label def labels2 4 "Grade 7", modify
    label def labels2 5 "Grade 8", modify
    label def labels2 6 "Grade 9", modify
    label def labels2 8 "Grade 11", modify
    label def labels2 9 "Grade 12", modify
    label def labels2 10 "Grade 13", modify
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(hhmember whynotme sex)
    2 4 2
    1 . 2
    2 1 2
    2 1 1
    2 1 1
    2 1 1
    1 . 2
    2 2 1
    1 . 2
    1 . 2
    1 . 2
    1 . 1
    1 . 2
    1 . 2
    1 . 2
    1 . 2
    1 . 2
    1 . 1
    1 . 1
    1 . 2
    1 . 1
    1 . 1
    1 . 2
    1 . 2
    2 6 1
    1 . 2
    1 . 2
    1 . 1
    1 . 1
    1 . 1
    3 . 2
    1 . 1
    1 . 1
    1 . 2
    1 . 2
    1 . 2
    1 . 1
    1 . 2
    1 . 1
    1 . 2
    1 . 2
    1 . 2
    end
    label values hhmember labels1
    label def labels1 1 "Still a member", modify
    label def labels1 2 "No Longer a member", modify
    label def labels1 3 "New member", modify
    label values whynotme labels2
    label def labels2 1 "Migrated to other household in Parish", modify
    label def labels2 2 "Migrated to another parish", modify
    label def labels2 4 "Died", modify
    label def labels2 6 "other", modify
    label values sex labels3
    label def labels3 1 "Male", modify
    label def labels3 2 "Female", modify

  • #2
    This is a task for Daniel Klein's -elabel- package, available from SSC.

    Code:
    use dataset1
    elabel dir
    elabel rename (`r(names)') =_1
    Do the same in dataset2, but with suffix _2 instead of _1. Now all the label names are distinct and you can combine them.

    That said, it is not clear how you can -merge- these two data sets: they may have common variable label names, but they don't have any common variables. So -merge- will simply fail no matter how you try to use it, and -append- will just produce a useless mishmosh. You can -cross- them, but that might or might not make any sense depending on what the variables in the first data set mean.

    Comment


    • #3
      This worked !! Thank you so much.
      One additional question. I would like to do that for all the datasets in one folder. I used the codes below:
      Code:
      cd "$stata_data/2010"
      local files:dir . files "*"
      display `files'
      foreach f of local files {
          use `f'
          elabel dir
          elabel rename (`r(names)') =_1
      }
      I don't think the loop worked because when I opened the datasets, the labels were still the same. May I get some help? Thank you.

      Comment


      • #4
        No, if you use =_1 in each file, then you end up with the same problem, and just longer label names! The key here is you have to use a different suffix in each file.
        Code:
        local files:dir . files "*"
        display `files'
        local i = 1
        foreach f of local files {
            use `f'
            elabel dir
            elabel rename (`r(names)') =_`i'
            local ++i
        }
        I don't think the loop worked because when I opened the datasets, the labels were still the same.
        Because you didn't save the files. Which brings me to the following paragraph:

        One caution about doing a mass rename of labels in files: if these files have already been used in other Stata analyses that make use of the existing label names, you are going to break those analyses by doing this. So you should be careful not to overwrite the original files when you do this. If the only reason for doing this is to create one large combined file, then you can save them as -tempfile-s and combine those. If you will actually need to save these revised files as new files so they can be used for other purposes, give them new names when you do that. Only save them under their original names if you are certain that no previous work will be broken by doing so.

        Last edited by Clyde Schechter; 02 Oct 2022, 17:56.

        Comment


        • #5
          Thank you for the feedback and clarification! I will use 'tempfile' to not overwrite the original datasets.

          Comment


          • #6
            Clyde Schechter , Hi Clyde,
            I used tempfile to save the datasets locally and tried to use the dataset (e.g. data1_1) but I got "invalid file specification".
            I ran the codes all together to avoid problem with the local macro, but cannot figure why I cannot call the dataset. Where could the problem be?

            Code:
            local files:dir . files "*"
            display `files'
            foreach f of local files {
            local i = 1
            use `f'
            elabel dir
            elabel rename (`r(names)')=_`i'
            tempfile `f'_1
            save "``f'_1'"
            local ++i
            }
            
            use "`data1_1'", clear

            Comment


            • #7
              I'm actually surprised you got this far through the code. It could have broken earlier. The problem likely arises because filenames can contain things that cannot be included in macro names. When you run -tempfile `f'_1 you are asking Stata to create a local macro whose name is the name of the current data set followed by _1. Well, the name of the data set may contains blanks, or special characters, and those would cause the code to break at the -tempfile- command. Even if all the names are relatively tame, see what happens if you have a file named results.dta:
              Code:
              . local f results.dta
              
              . tempfile `f'
              _results.dta invalid name
              r(198);
              And filenames don't really get any tamer than that. So you need to tame them more. There are some other things that are problematic, although apparently you have not been tripped up by them:
              1. It does not make sense to do -local files: dir . files "*"- because only *.dta files can be -use-d. If you have, say, some .csv or .xlsx files in the same directory (I guess in this instance you do not) those will cause -use- to break.
              2. (This error is my fault, because I didn't think of it when I wrote #4, and you just copied my code.) If any of the files has no value labels in it, `r(names)' will be an empty string after -elabel dir- runs, and then -elabel rename- will break. So, again, since you got as far as you did in the code, I guess that isn't the case in any of your files, but it is best not to rely on such assumptions when it is so easy to code around it.
              Putting this all together:
              Code:
              clear*
              
              local files:dir . files "*.dta"
              macro list _files
              
              local i = 1 // THIS ABSOLUTELY MUST BE OUTSIDE THE LOOP
              foreach f of local files {
                  use `f'
                  elabel dir
                  local all_labels `r(names)'
                  if `"`all_labels'"' != "" {
                      elabel rename (`r(names)')=_`i'
                  }
                  local ff = substr(strtoname(`"`f'"'), 1, 28)
                  tempfile `ff'_`i'
                  quietly save ``ff'_`i''
                  display `"`ff'_`i' saved"'
                  local ++i
              }
              use `chose a name from the list this code created', clear
              should work.

              The italicized -display ...- command is optional but will prove helpful. Your original idea was to make the tempfile names look like the original filenames. But, as already seen, that's not workable. What I've shown you here is about as close as you can get to that. But the transformation made by -strtoname()- is sometimes tricky to make in your own mind. And also, if you had two files whose names, after being transformed by strtoname() ended up agreeing in their first 28 characters, the second one would end up overwriting the first, and the first would be lost. By also including the number `i' in the file name, we are guaranteed that will never happen because the `i' part will always be distinctive. Anyway, the -display- command lists out the names used by Stata for each file, and so you can then copy/paste those into your subsequent -use- commands (surrounded by ` and ', of course) to refer to these files.
              Last edited by Clyde Schechter; 02 Oct 2022, 22:54.

              Comment


              • #8
                Hi Cylde,

                Thank you so much for the detailed response!
                It worked, and I realized that we need to be careful using -local- because it will create stops if we run the codes lines to lines through the do file.
                The trick with -strtoname- was proven to be very useful, thank you!

                Comment

                Working...
                X