Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge specific columns

    Hi everyone,

    I am facing a problem with merging specific columns from various stata files. My base file looks like the following:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float date str4 id
    20068 "rd5"
    20115 "rd6"
    20116 "rd7"
    20123 "rd8"
    20124 "rd8"
    20124 "rd7"
    20124 "rd5"
    end
    format %td date

    Now I want to merge that file with the two following files (excerpt):

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(date rd12 rd7 rd8)
    20068 7 5 2
    20115 2 6 3
    20116 1 7 4
    20123 6 8 5
    20124 7 9 6
    20124 7 9 6
    20124 7 9 6
    end
    format %td date
    and


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(date rd5 rd6 rd99)
    20068 5 1 9
    20115 4 4 7
    20116 3 5 6
    20123 8 2 1
    20124 3 7 2
    20124 3 7 2
    20124 3 7 2
    end
    format %td date



    What is important to me is that I only add the columns which have the same name as the observations in the varlist „id“ from the base file. The original files have more columns than Stata can handle. Hence, I need to merge only the specific columns; it is no option to merge all the columns and then to delete those that do not match.


    My goal is to have the following file (leaving out the columns for rd12 and rd99 as these are not ids in the first file):

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float date str4 id float(rd5 rd6 rd8 rd7)
    20068 "rd5" 5 1 2 5
    20115 "rd6" 4 4 3 6
    20116 "rd7" 3 5 4 7
    20123 "rd8" 8 2 5 8
    20124 "rd8" 3 7 6 9
    20124 "rd7" 3 7 6 9
    20124 "rd5" 3 7 6 9
    end
    format %td date
    I would really appreciate it if someone knew how to do this as I tried various Things which all did not work.

    Thank you very much!

    Nina
    Last edited by Nina Eastwood; 14 Sep 2017, 17:04.

  • #2
    The problem of restricting the merge to retaining just those variables that are mentioned in the variable id in the base data set is not hard to solve. But you have a much bigger problem. Your data are not suitable at all for merge-ing because you have no identifying merge key in them. The only variable to common all of these data sets is date. But the value 20124 (05feb2015) appears three times in each of them. Stata will have no way to know which of the three in the base set goes with which of the three in each of the other two data sets. So these data sets cannot be merged at all. They can be joined, which would mean pairing up each of the date = 20124 observations in base with all three observations in the first merging data set that have date = 20124 and then combining alll 9 of those combinations with each of the three observations with date = 20124 in the second merging data set, for a total of 27 observations with 20124. But I don't have the sense that is what you want. Based on your desired output, you have some way in mind of picking out which 20124 observation in each data set goes with which in the others, but it is not derivable from anything in the data. Perhaps there is some additional variable that disambiguates the date = 20124 observations that you have not shown?

    Comment


    • #3
      Thank you for your Response! I just realized that I copied the wrong sets. I edited it in the previous post and it should now make sense. In fact, I really wanted to merge the tables as you already assumed. When I tried it by myself I managed to merge the shorter versions of the tables completely, so that was possible with the data. The only problem I had was how to merge a limited number of columns.

      Comment


      • #4
        OK, so as far as I can see, the resolution you have made is that the values of the variables r5, r6, etc. in the two merge tables are actually the same in each observation that has the same date. So the solution to the problem will be to eliminate those duplicate rows. That is incorporated in the solution shown below.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float date str4 id
        20068 "rd5"
        20115 "rd6"
        20116 "rd7"
        20123 "rd8"
        20124 "rd8"
        20124 "rd7"
        20124 "rd5"
        end
        format %td date
        tempfile base
        save `base'
        
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float(date rd12 rd7 rd8)
        20068 7 5 2
        20115 2 6 3
        20116 1 7 4
        20123 6 8 5
        20124 7 9 6
        20124 7 9 6
        20124 7 9 6
        end
        format %td date
        //    GET RID OF DUPLICATE OBSERVATIONS
        duplicates drop
        tempfile merge1
        save `merge1'
        
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float(date rd5 rd6 rd99)
        20068 5 1 9
        20115 4 4 7
        20116 3 5 6
        20123 8 2 1
        20124 3 7 2
        20124 3 7 2
        20124 3 7 2
        end
        format %td date
        //    AGAIN, ELIMINATE DUPLICATE OBSERVATIONS
        duplicates drop
        tempfile merge2
        save `merge2'
        
        //    BRING IN THE BASE DATA AND IDENTIFY THE VARIABLES NEEDED
        use `base', clear
        levelsof id, local(needed)
        
        //    IDENTIFY ALL VARIABLES IN THE MERGE1 DATA
        des using `merge1', varlist
        local vbles `r(varlist)'
        //    AND PICK OUT THE ONES THAT ARE MENTIONED IN id IN THE BASE DATA
        local keepers: list needed & vbles
        //    MERGE, RETAINING ONLY THOSE VARIABLES
        merge m:1 date using `merge1', keepusing(`keepers') gen(merge1)
        
        //    DO THE SAME WITH THE MERGE2 DATA SET
        des using `merge2', varlist
        local vbles `r(varlist)'
        local keepers: list needed & vbles
        merge m:1 date using `merge2', keepusing(`keepers') gen(merge2)
        order merge*, last
        Note: The rd* variables end up in a different order than you have them in your example. If that matters for what you will be doing, you can always reorder them using the -order- command.

        Comment


        • #5
          Thank you, Clyde this is very helpful. Assuming I have a third file which does not produce any matching values for the "keepers" macro, Stata would then merge the entire columns right? Is there a way to state that no merge will take place at all?
          I am asking because I might work with further tables which do not necessarily have columns for the ids. An exmaple would be the following table:

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float(date rd14 rd45 rd72)"
          20068 9 3 8
          20115 7 6 0
          20116 2 1 6
          20123 5 3 5
          20124 4 1 2
          20124 4 1 2
          20124 4 1 2
          end
          format %td date
          As there is none of the ids from the base case having its own column there should also be no merge. I hope it is understandable what I mean
          Thank you!

          Comment


          • #6
            Yes, very clear. So the problem arises because when Stata sees -keepusing()- with an empty argument, it interprets that as meaning "all variables in the data set." That is in fact the way an empty varlist is interpreted in most contexts. I had not anticipated that you would have files with no variables to be retained.

            So the trick is to make sure that `keepers' is never empty. A sure-fire way to do that is to include the date variable, because as the merge key it is part of every data set, and it is also never unwanted. So by adding it to the `needed' list, we can always be sure that `keepers' will contain, if nothing else, date.

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input float date str4 id
            20068 "rd5"
            20115 "rd6"
            20116 "rd7"
            20123 "rd8"
            20124 "rd8"
            20124 "rd7"
            20124 "rd5"
            end
            format %td date
            tempfile base
            save `base'
            
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input float(date rd12 rd7 rd8)
            20068 7 5 2
            20115 2 6 3
            20116 1 7 4
            20123 6 8 5
            20124 7 9 6
            20124 7 9 6
            20124 7 9 6
            end
            format %td date
            //    GET RID OF DUPLICATE OBSERVATIONS
            duplicates drop
            tempfile merge1
            save `merge1'
            
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input float(date rd5 rd6 rd99)
            20068 5 1 9
            20115 4 4 7
            20116 3 5 6
            20123 8 2 1
            20124 3 7 2
            20124 3 7 2
            20124 3 7 2
            end
            format %td date
            //    AGAIN, ELIMINATE DUPLICATE OBSERVATIONS
            duplicates drop
            tempfile merge2
            save `merge2'
            
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input float(date rd14 rd45 rd72)
            20068 9 3 8
            20115 7 6 0
            20116 2 1 6
            20123 5 3 5
            20124 4 1 2
            20124 4 1 2
            20124 4 1 2
            end
            format %td date
            duplicates drop
            tempfile merge3
            save `merge3'
            
            //    BRING IN THE BASE DATA AND IDENTIFY THE VARIABLES NEEDED
            use `base', clear
            levelsof id, local(needed)
            local needed date `needed'
            
            //    IDENTIFY ALL VARIABLES IN THE MERGE1 DATA
            des using `merge1', varlist
            local vbles `r(varlist)'
            //    AND PICK OUT THE ONES THAT ARE MENTIONED IN id IN THE BASE DATA
            local keepers: list needed & vbles
            //    MERGE, RETAINING ONLY THOSE VARIABLES
            merge m:1 date using `merge1', keepusing(`keepers') gen(merge1)
            
            //    DO THE SAME WITH THE MERGE2 DATA SET
            des using `merge2', varlist
            local vbles `r(varlist)'
            local keepers: list needed & vbles
            merge m:1 date using `merge2', keepusing(`keepers') gen(merge2)
            
            //    AND WITH MERGE3
            des using `merge3', varlist
            local vbles `r(varlist)'
            local keepers: list needed & vbles
            merge m:1 date using `merge3', keepusing(`keepers') gen(merge3)
            
            order merge*, last

            Comment

            Working...
            X