Merging two datasets with many of the same variables

Clare McCarthy

Join Date: Sep 2019

Posts: 15
#1

Merging two datasets with many of the same variables

12 Jun 2020, 09:29

Hi,

I want to merge two Stata datasets (both with >300 variables), of which ~200+ of the variables have the same name and the same meaning. I want the variables that have the same name and meaning to combine in the merged dataset. I have gone through and identified the different variables (the ones that don't have a corresponding variable in the other dataset) and added a prefix so that I will be able to identify them after the merge. In order for the duplicate variables to combine, do they need to be in the same format? In other words, do I have to go through and reformat all the variables I would like to combine? More specifically, would a variable that is formatted " byte %10.g" combine with a variable that is formatted "int %10g?" Is it just string and numeric that won't combine?

Also, is there a fast way to output variable formats using "describe" to a Word or Excel document?? I have tried "asdoc describe" and it doesn't work (only outputs variables and variable labels).

Thanks!
Clare
Tags: None
Jeff Tree

Join Date: May 2019

Posts: 33
#2

12 Jun 2020, 09:48

Hello,

If I understand it correctly, you have two separate datasets that contain a similar variable akin to a unique ID? Further, this unique ID is formatted as byte in the first dataset, and int in the second dataset? (Providing a minimum working example would be helpful). Contingent upon your datasets it shouldn't be a problem. Here's a minimum working example:

Code:

clear input byte id int(X1 X2 X3) 1 1969 1971 1 2 1973 1980 1 3 1950 1970 1 4 1955 1970 1 5 1982 1992 1 end tempfile events save `events', replace clear input int id int(X4 X5) 1 1971 3 2 1971 4 3 1972 9 6 1973 0 7 1970 2 8 1971 0 9 1963 8 end merge m:1 id using `events', keep(match) nogenerate list

Code:

+-----------------------------------+ | id X4 X5 X1 X2 X3 | |-----------------------------------| 1. | 1 1971 3 1969 1971 1 | 2. | 2 1971 4 1973 1980 1 | 3. | 3 1972 9 1950 1970 1 | +-----------------------------------+

Assuming this is what you're trying to achieve, there is no need to manually sort through the data in advance to identify unique variables -- let Stata do the work for you! Simply use the keep() option.
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#3

12 Jun 2020, 09:54

Clare McCarthy ,

for describing to Excel you could use xlist.

Merging data with different formats e.g. "%10.0f" vs "%6.4f" should not be a problem. But data types could be an issue (string vs number).

You description is vague about what does it mean to "I want the variables that have the same name and meaning to combine in the merged dataset". Stata's merge has an UPDATE option, where the second dataset is used to patch values in the first one. If this is what you need, then no need for any suffixes - Stata will do it for you. If on the other hand you need e.g. to append the strings (such as lists of countries visited, books read, courses taken) then yes, use the indices and accumulate them later after the merge.

Best, Sergiy Radyakin
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

12 Jun 2020, 13:23

It is also possible that rather than merge the datasets, you want to append them. I think the lesson from this post and the two preceding it is that you need to do a better job of explaining your data. Even the best descriptions of data are no substitute for an actual example of the data.

Be sure to use the dataex command to do this. If you are running version 15.1 or later, or a fully updated version 14.2, dataex is already part of your official Stata installation. If not, run ssc install dataex to get it. Either way, run help dataex and read the simple instructions for using it. dataex will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

While you have hundreds of variables, the dataex command will accept a list of variable names, and will accept the if clause, so you can show a small sample of your data from each dataset, including the identifier (that you plan to merge by), two or three of the variables that are common to both datasets, and one that is not common, for a few values of the identifier, from each dataset.

When asking for help with code, always show example data. When showing example data, always use dataex.
Comment
Clare McCarthy

Join Date: Sep 2019

Posts: 15
#5

12 Jun 2020, 15:01

Hi, thank you both for your responses. Sergiy Radyakin , one question --- if I have two string variables that have the exact same name in the two different datasets, will they combine? Or do I need to manually append them?
Comment
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#6

12 Jun 2020, 15:28

I though I made it clear in the previous post:

...to append the strings (such as lists of countries visited, books read, courses taken) then ..., use the indices and accumulate them later after the merge.

The following will not give you "Sergiy Radyakin" if this is what you expected:

Code:

clear all input int pid str40 name int age 101 "Radyakin" 78 end list sort pid tempfile tmp save `tmp' clear input int pid str40 name double income 101 "Sergiy" 12345.67 end list sort pid merge pid using `"`tmp'"' list

Despite this, the files will be merged together.
Whether this counts as 'combining' for you - don't know. You don't define 'combine' and it is not a Stata term. Merge, append, join, cross are not just poetic variations of the same action, but denote different, often opposite data transformations.
1 like
Comment
Jeff Tree

Join Date: May 2019

Posts: 33
#7

15 Jun 2020, 12:58

Originally posted by Clare McCarthy View Post

Hi, thank you both for your responses. Sergiy Radyakin , one question --- if I have two string variables that have the exact same name in the two different datasets, will they combine? Or do I need to manually append them?

For matching on string variables, I would use matchit. Install via SSC,

Code:

ssc install matchit

Note that you may also need to install freqindex in order for matchit to function.

There are a couple helpful discussion posts here on how to use. See, for example, here and here.
Comment

Announcement

Merging two datasets with many of the same variables

Comment

Comment

Comment

Comment

Comment

Comment