Merge specific columns

Nina Eastwood

Join Date: Sep 2017

Posts: 15
#1

Merge specific columns

14 Sep 2017, 16:06

Hi everyone,

I am facing a problem with merging specific columns from various stata files. My base file looks like the following:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float date str4 id 20068 "rd5" 20115 "rd6" 20116 "rd7" 20123 "rd8" 20124 "rd8" 20124 "rd7" 20124 "rd5" end format %td date

Now I want to merge that file with the two following files (excerpt):

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(date rd12 rd7 rd8) 20068 7 5 2 20115 2 6 3 20116 1 7 4 20123 6 8 5 20124 7 9 6 20124 7 9 6 20124 7 9 6 end format %td date

and

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(date rd5 rd6 rd99) 20068 5 1 9 20115 4 4 7 20116 3 5 6 20123 8 2 1 20124 3 7 2 20124 3 7 2 20124 3 7 2 end format %td date

What is important to me is that I only add the columns which have the same name as the observations in the varlist „id“ from the base file. The original files have more columns than Stata can handle. Hence, I need to merge only the specific columns; it is no option to merge all the columns and then to delete those that do not match.

My goal is to have the following file (leaving out the columns for rd12 and rd99 as these are not ids in the first file):

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float date str4 id float(rd5 rd6 rd8 rd7) 20068 "rd5" 5 1 2 5 20115 "rd6" 4 4 3 6 20116 "rd7" 3 5 4 7 20123 "rd8" 8 2 5 8 20124 "rd8" 3 7 6 9 20124 "rd7" 3 7 6 9 20124 "rd5" 3 7 6 9 end format %td date

I would really appreciate it if someone knew how to do this as I tried various Things which all did not work.

Thank you very much!

Nina

Last edited by Nina Eastwood; 14 Sep 2017, 17:04.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#2

14 Sep 2017, 16:20

The problem of restricting the merge to retaining just those variables that are mentioned in the variable id in the base data set is not hard to solve. But you have a much bigger problem. Your data are not suitable at all for merge-ing because you have no identifying merge key in them. The only variable to common all of these data sets is date. But the value 20124 (05feb2015) appears three times in each of them. Stata will have no way to know which of the three in the base set goes with which of the three in each of the other two data sets. So these data sets cannot be merged at all. They can be joined, which would mean pairing up each of the date = 20124 observations in base with all three observations in the first merging data set that have date = 20124 and then combining alll 9 of those combinations with each of the three observations with date = 20124 in the second merging data set, for a total of 27 observations with 20124. But I don't have the sense that is what you want. Based on your desired output, you have some way in mind of picking out which 20124 observation in each data set goes with which in the others, but it is not derivable from anything in the data. Perhaps there is some additional variable that disambiguates the date = 20124 observations that you have not shown?
Comment
Nina Eastwood

Join Date: Sep 2017

Posts: 15
#3

14 Sep 2017, 17:12

Thank you for your Response! I just realized that I copied the wrong sets. I edited it in the previous post and it should now make sense. In fact, I really wanted to merge the tables as you already assumed. When I tried it by myself I managed to merge the shorter versions of the tables completely, so that was possible with the data. The only problem I had was how to merge a limited number of columns.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29956

14 Sep 2017, 17:28

OK, so as far as I can see, the resolution you have made is that the values of the variables r5, r6, etc. in the two merge tables are actually the same in each observation that has the same date. So the solution to the problem will be to eliminate those duplicate rows. That is incorporated in the solution shown below.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float date str4 id
20068 "rd5"
20115 "rd6"
20116 "rd7"
20123 "rd8"
20124 "rd8"
20124 "rd7"
20124 "rd5"
end
format %td date
tempfile base
save `base'

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(date rd12 rd7 rd8)
20068 7 5 2
20115 2 6 3
20116 1 7 4
20123 6 8 5
20124 7 9 6
20124 7 9 6
20124 7 9 6
end
format %td date
//    GET RID OF DUPLICATE OBSERVATIONS
duplicates drop
tempfile merge1
save `merge1'

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(date rd5 rd6 rd99)
20068 5 1 9
20115 4 4 7
20116 3 5 6
20123 8 2 1
20124 3 7 2
20124 3 7 2
20124 3 7 2
end
format %td date
//    AGAIN, ELIMINATE DUPLICATE OBSERVATIONS
duplicates drop
tempfile merge2
save `merge2'

//    BRING IN THE BASE DATA AND IDENTIFY THE VARIABLES NEEDED
use `base', clear
levelsof id, local(needed)

//    IDENTIFY ALL VARIABLES IN THE MERGE1 DATA
des using `merge1', varlist
local vbles `r(varlist)'
//    AND PICK OUT THE ONES THAT ARE MENTIONED IN id IN THE BASE DATA
local keepers: list needed & vbles
//    MERGE, RETAINING ONLY THOSE VARIABLES
merge m:1 date using `merge1', keepusing(`keepers') gen(merge1)

//    DO THE SAME WITH THE MERGE2 DATA SET
des using `merge2', varlist
local vbles `r(varlist)'
local keepers: list needed & vbles
merge m:1 date using `merge2', keepusing(`keepers') gen(merge2)
order merge*, last

Note: The rd* variables end up in a different order than you have them in your example. If that matters for what you will be doing, you can always reorder them using the -order- command.

Comment

Nina Eastwood

Join Date: Sep 2017

Posts: 15
#5

14 Sep 2017, 19:17

Thank you, Clyde this is very helpful. Assuming I have a third file which does not produce any matching values for the "keepers" macro, Stata would then merge the entire columns right? Is there a way to state that no merge will take place at all?
I am asking because I might work with further tables which do not necessarily have columns for the ids. An exmaple would be the following table:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(date rd14 rd45 rd72)" 20068 9 3 8 20115 7 6 0 20116 2 1 6 20123 5 3 5 20124 4 1 2 20124 4 1 2 20124 4 1 2 end format %td date

As there is none of the ids from the base case having its own column there should also be no merge. I hope it is understandable what I mean
Thank you!
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29956

14 Sep 2017, 19:59

Yes, very clear. So the problem arises because when Stata sees -keepusing()- with an empty argument, it interprets that as meaning "all variables in the data set." That is in fact the way an empty varlist is interpreted in most contexts. I had not anticipated that you would have files with no variables to be retained.

So the trick is to make sure that `keepers' is never empty. A sure-fire way to do that is to include the date variable, because as the merge key it is part of every data set, and it is also never unwanted. So by adding it to the `needed' list, we can always be sure that `keepers' will contain, if nothing else, date.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float date str4 id
20068 "rd5"
20115 "rd6"
20116 "rd7"
20123 "rd8"
20124 "rd8"
20124 "rd7"
20124 "rd5"
end
format %td date
tempfile base
save `base'

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(date rd12 rd7 rd8)
20068 7 5 2
20115 2 6 3
20116 1 7 4
20123 6 8 5
20124 7 9 6
20124 7 9 6
20124 7 9 6
end
format %td date
//    GET RID OF DUPLICATE OBSERVATIONS
duplicates drop
tempfile merge1
save `merge1'

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(date rd5 rd6 rd99)
20068 5 1 9
20115 4 4 7
20116 3 5 6
20123 8 2 1
20124 3 7 2
20124 3 7 2
20124 3 7 2
end
format %td date
//    AGAIN, ELIMINATE DUPLICATE OBSERVATIONS
duplicates drop
tempfile merge2
save `merge2'

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(date rd14 rd45 rd72)
20068 9 3 8
20115 7 6 0
20116 2 1 6
20123 5 3 5
20124 4 1 2
20124 4 1 2
20124 4 1 2
end
format %td date
duplicates drop
tempfile merge3
save `merge3'

//    BRING IN THE BASE DATA AND IDENTIFY THE VARIABLES NEEDED
use `base', clear
levelsof id, local(needed)
local needed date `needed'

//    IDENTIFY ALL VARIABLES IN THE MERGE1 DATA
des using `merge1', varlist
local vbles `r(varlist)'
//    AND PICK OUT THE ONES THAT ARE MENTIONED IN id IN THE BASE DATA
local keepers: list needed & vbles
//    MERGE, RETAINING ONLY THOSE VARIABLES
merge m:1 date using `merge1', keepusing(`keepers') gen(merge1)

//    DO THE SAME WITH THE MERGE2 DATA SET
des using `merge2', varlist
local vbles `r(varlist)'
local keepers: list needed & vbles
merge m:1 date using `merge2', keepusing(`keepers') gen(merge2)

//    AND WITH MERGE3
des using `merge3', varlist
local vbles `r(varlist)'
local keepers: list needed & vbles
merge m:1 date using `merge3', keepusing(`keepers') gen(merge3)

order merge*, last

Announcement