Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge different variables with same variable name

    Halo everybody,

    I'm trying to merge varios datasets using merge command. Datasets contain company-informations for year x. Unfortunately despite beeing sets of different years the names of variable are the same among datasets. E.g. EBIT, Margin_x etc.

    What I'm loooking for is a way add the new data as a new variable without the overwriting the existing one, while ignoring the same fact, that both have the same name. Renaming of the variables manually is unfortunately not an option. Example:

    Dataset a (1999):
    company | EBIT
    A | 123
    B | 122
    C | .

    Dataset b (2000):
    company | EBIT
    A | 453
    C | 980
    D | 486

    Goal:
    company | EBIT_1 | EBIT_2
    A | 123 | 453
    B | 122 | .
    C | . | 980
    D | . | 486

    As you can see some values may be missing. In the merge's help file there is a way to update the file but I couldn't find a way to simply out the data in a new variable.

    Plese help


  • #2
    you appear to want your final data set in wide format - why? in general, this is not a good idea

    however, if you really want it that way I would do in three steps: (1) add a variable giving the year to each data set; (2) append the files (not merge them); (3) use the reshape command to change the structure; of course, if you actually want the data in long form, stop after step 2

    Comment


    • #3
      This solution does not work for me!

      I have two datasets of respondents of two members of households. Most of the variables in each dataset are unique, however some variables have exact same name (e.g. start time, end time, father name). I want to merge them in a way that the duplicate variable names take a suffix (e.g. start_x and start_y), but the rest of the variables names that are unique stay the same.

      Comment


      • #4
        If you read the Forum FAQ, you will learn, among other things, that saying that something "does not work" is pretty useless. There are thousands of ways in which code can fail to do what you expect of it (some relating to the code, and some to your expectations), and it is impossible to guess which problem you are facing.

        The description you give in your second paragraph contains nothing that would be a barrier to the solution in #2 working with your data. So if you want help, I suggest you post back showing the exact code that you used to try to implement the solution in #2, and examples of the data in the two data sets you are working with. And, if it is not very obvious, also explain why the results you get are not what you want.

        To show the example data, be sure to use the -dataex- command. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

        When asking for help with code, always show example data. When showing example data, always use -dataex-.

        Comment


        • #5
          Thanks Clyde for the note. I am going to explain my situation using a sample data.

          I have two datasets data1 and data2. There are three type of variables in each dataset: 1) an ID variable that matches in both dataset, 2) an income variable in both dataset but with different content which should be kept, and 3) the rest of variables.

          My question is whether we can merge the datasets with ID variable, and have variables with similar names kept by assigning suffix to their names. Here is a reproducible example:

          Code:
          clear
          input float income str4 id str1 sex
          2200 "hh1" "m"
          3000 "hh2" "f"
          3900 "hh3" "m"
          3500 "hh4" "m"
          2900 "hh5" "f"
          end
          save "~/data1.dta", replace
          
          clear
          input float income str4 id float q1
          4000 "hh1" 1
          3400 "hh2" 2
          4100 "hh3" 1
          2100 "hh4" 1
          2700 "hh5" 2
          end
          save "~/data2.dta", replace
          
          merge 1:1 id using "~/data1.dta"
          After merging, the variable income from the using dataset is not kept as there is a variable with the same name in the master dataset. What I want is to have keep both variables e.g. income_x and income_y (something that r does easily).

          Thanks.

          Comment


          • #6
            You can quickly rename all vars in data2 (and data1 if you prefer):

            Code:
            use "~/data2.dta", clear
            *rename all vars in data2
            foreach var of varlist _all {
                rename `var' `var'_2
                }
            *change id_2 back to id
            rename id_2 id
            
            merge 1:1 id using "~/data1.dta"
            Stata/MP 14.1 (64-bit x86-64)
            Revision 19 May 2016
            Win 8.1

            Comment


            • #7
              Following in Carole's path, you can use macro extended list functions and the describe command to limit the renaming to the variables that are duplicated.
              Code:
              quietly describe, varlist
              local var2 `r(varlist)'
              quietly describe using "~/data1.dta", varlist
              local var1 `r(varlist)'
              
              local same : list var2 & var1
              local idlist id
              local same : list same - idlist
              display "same name in both datasets: `same'"
              rename (`same') (=_2)
              
              merge 1:1 id using "~/data1.dta"
              rename (`same') (=_1)
              list, clean
              Code:
              . list, clean
              
                     income_2    id   q1   income_1   sex        _merge  
                1.       4000   hh1    1       2200     m   matched (3)  
                2.       3400   hh2    2       3000     f   matched (3)  
                3.       4100   hh3    1       3900     m   matched (3)  
                4.       2100   hh4    1       3500     m   matched (3)  
                5.       2700   hh5    2       2900     f   matched (3)

              Comment


              • #8
                Thanks William, this is exactly what I was after. There is a typo if you want to fix in the local same : list var2 & var1

                Comment


                • #9
                  Thank you for pointing that out for the benefit of those who may find this topic in the future, The forum software hung on my first attempt to post, and when I let it recover what I had tried to post, it recovered my ampersand & as the html-escaped sequence we see, without my having noticed the substitution.

                  Comment


                  • #10
                    I have somehow opposite problem to these. Survey was done in 2003 and collected 500 sample. Another survey was done in 2010 and collected 600 sample for the same household and some new. Now I have two sample files for 2003 and 2010. However, the name for each variable is different even if they capture same thing. For example, household number in 2003 is named as 'HH' while it was coded as 'v00_hno' for 2010. ID and all other variables are like similar, not following any specific pattern. In such scenario how can I append, to prepare workable panel data. How can I let Stata know these two different variable name is same while appending?

                    My data set for 2003 is like this

                    input int WWW byte HH int V00_SERL
                    3 1 10
                    3 6 63
                    3 11 191
                    3 12 127
                    3 14 148
                    3 17 180
                    3 18 201
                    4 1 3
                    4 4 13
                    4 5 16
                    4 7 23
                    4 8 26
                    4 10 33

                    My data set for 2010 is like

                    input int xhpsu byte(xhnum v00_hno)
                    3 1 1
                    3 6 6
                    3 11 11
                    3 12 12
                    3 14 14
                    3 17 17
                    3 18 11
                    3 19 19
                    3 21 21
                    17 1 1
                    17 2 2
                    17 3 3
                    17 4 4
                    17 7 7
                    17 8 8
                    17 10 10

                    How can I get something like

                    input int xhpsu byte(xhnum v00_hno)
                    3 1 1
                    3 1 10
                    3 6 6
                    3 6 63
                    3 11 11
                    3 11 191
                    3 12 12
                    3 12 127
                    3 14 14
                    3 14 148
                    3 17 17
                    3 17 180
                    3 18 11
                    3 18 201
                    3 19 19
                    3 21 21
                    4 1 3
                    4 4 13
                    4 5 16
                    4 7 23
                    4 8 26
                    4 10 33
                    17 1 1
                    17 2 2
                    17 3 3
                    17 4 4
                    17 7 7
                    17 8 8
                    17 10 10

                    Last edited by Ishwor Adhikari; 07 Mar 2020, 04:34.

                    Comment


                    • #11
                      Code:
                      clear*
                      input int xhpsu byte(xhnum v00_hno)
                      3 1 1
                      3 6 6
                      3 11 11
                      3 12 12
                      3 14 14
                      3 17 17
                      3 18 11
                      3 19 19
                      3 21 21
                      17 1 1
                      17 2 2
                      17 3 3
                      17 4 4
                      17 7 7
                      17 8 8
                      17 10 10
                      end
                      tempfile 2010data
                      save `2010data'
                      
                      clear
                      input int WWW byte HH int V00_SERL
                      3 1 10
                      3 6 63
                      3 11 191
                      3 12 127
                      3 14 148
                      3 17 180
                      3 18 201
                      4 1 3
                      4 4 13
                      4 5 16
                      4 7 23
                      4 8 26
                      4 10 33
                      end
                      tempfile 2003data
                      save `2003data'
                      
                      use `2003data', clear
                      unab 2003varnames: _all
                      use `2010data', clear
                      unab 2010varnames: _all
                      rename (`2010varnames') (`2003varnames')
                      append using `2003data'
                      NOTE: This assumes that the variables that mean the same thing are in the same order in the two data sets (as is the case in your example.) If they are not, then it won't be this easy.

                      Added: Oh, on re-reading more carefully, I see that the variables are not in the same order in both data sets. The household id HH appears second in 2003 but its corresponding v00_hno is third in the 2010 data. So the above code produces data salad. Sorry.

                      So, in this case it's more complicated. And since the changes in the names is unsystematic, it's going to be tedious. You have stated you want to use the 2010 variable names. So you need to compose a local macro that gives a crosswalk between the 2003 names and the corresponding 2010 names. The simplest approach to this is to just have that crosswalk list the 2010 names in the order that those variables appear in the 2003 data. So it would look like this:

                      Last edited by Clyde Schechter; 07 Mar 2020, 10:50.

                      Comment


                      • #12
                        Code:
                        local crosswalk xhpsu v00_hno xhnum
                        use `2003data', clear
                        rename (_all) (`crosswalk')
                        gen wave = 2003
                        append using `2010data'
                        replace wave = 2010 if missing(wave)
                        I imagine you have more than just three variables. It will be tedious and error-prone to create the crosswalk for all of the variables, but in the absence of any systematic relationship between old and new variable name, there is no alternative.

                        Comment


                        • #13
                          Thank you so much Clyde Schechter. Yeah given there is no systematic relationship between, even not the order, it seems creating crosswalk seems better option.

                          Comment

                          Working...
                          X