How to merge two datasets and How to use keep

Edna Zapata

Join Date: Oct 2021

Posts: 1
#1

How to merge two datasets and How to use keep

17 Oct 2021, 14:35

Hello everyone!

I'm a very beginner using STATA so please I'd like to get some help with merge command,

My purpose is to add variables to the master dataset and keep only what is matched related to a particular observation, so I've been told I must use merge command (I'm using merge in STATA 17). My master dataset has more observations than my using dataset, but they both have in common two variables householdnumber and household_number_member. I was suggested to use all common variables when merging, not just one in common.

However, in the master dataset the householdnumber variable has repeated observations. In other words, appears multiple times the same household number referring to each member of the household, who will be identified by household_number_member variable. Thus, household_number_member is identifed as 1= marriage immigrant, 2= national spouse, 3= children, etc. On the other hand, the using dataset doesn't have repeated observations in the variable householdnumber, because this dataset only uses in the household_number_member variable the observation 1 (which means, 1= marriage immigrant) that's why it has less observations.

By using merge, I'm only interested in keeping all related to 1 (which means 1=marriage immigrant) and adding variables that appear in the using dataset.

To make myself clear, and to get proper help. I'm using dataex command as follows (the following code corresponds to my master dataset)

* Example generated by -dataex-. For more info, type help dataex
clear
input long householdnumber byte household_number_member
10000028 1
10000028 2
10000031 1
10000031 2
10000062 1
10000068 1
10000068 2
10000124 1
10000124 2
10000140 1
10000140 2
10000186 1
10000187 1
10000187 2
10000213 1
10000232 1
10000232 3
10000270 1
10000270 2
10000292 1
10000293 1
10000293 2
10000315 1
10000315 2
10000328 1
10000328 2
10000344 1
10000393 1
10000404 1
10000404 2
10000415 1
10000415 2
10000419 1
10000419 2
10000422 1
10000422 2
10000455 1
10000455 2
10000492 1
10000492 2
10000497 1
10000497 2
10000533 1
10000533 2
10000549 1
10000556 1
10000556 2
10000556 3
10000556 4
10000556 21
10000592 1
10000592 2
10000592 3
10000592 21
10000601 1
10000601 21
10000608 1
10000608 2
10000615 1
10000658 1
10000658 2
10000700 1
10000700 2
10000700 3
10000712 1
10000712 2
10000712 3
10000713 1
10000713 2
10000764 1
10000771 1
10000771 2
10000774 1
10000774 2
10000787 1
10000855 1
10000855 2
10000894 1
10000894 2
10000894 3
10000902 1
10000902 2
10000952 1
10000952 2
10000952 21
10000970 1
10000970 2
10001003 1
10001003 2
10001004 1
10001004 2
10001069 1
10001069 2
10001142 1
10001193 1
10001193 2
10001193 3
10001193 4
10001200 1
10001200 2
end

The following code corresponds to my using dataset:

* Example generated by -dataex-. For more info, type help dataex
clear
input long householdnumber byte household_number_member
10000028 1
10000031 1
10000062 1
10000068 1
10000124 1
10000140 1
10000186 1
10000187 1
10000213 1
10000232 1
10000270 1
10000292 1
10000293 1
10000315 1
10000328 1
10000344 1
10000393 1
10000404 1
10000415 1
10000419 1
10000422 1
10000455 1
10000492 1
10000497 1
10000533 1
10000549 1
10000556 1
10000592 1
10000601 1
10000608 1
10000658 1
10000700 1
10000712 1
10000713 1
10000764 1
10000771 1
10000774 1
10000787 1
10000855 1
10000894 1
10000902 1
10000952 1
10000970 1
10001003 1
10001004 1
10001069 1
10001142 1
10001193 1
10001200 1
10001214 1
10001292 1
10001332 1
10001334 1
10001353 1
10001425 1
10001426 1
10001434 1
10001445 1
10001482 1
10001523 1
10001609 1
10001690 1
10001720 1
10001775 1
10001782 1
10001827 1
10001865 1
10001883 1
10001924 1
10001959 1
10001969 1
10002006 1
10002084 1
10002150 1
10002174 1
10002193 1
10002227 1
10002311 1
10002325 1
10002403 1
10002409 1
10002437 1
10002498 1
10002511 1
10002568 1
10002580 1
10002592 1
10002595 1
10002600 1
10002620 1
10002714 1
10002730 1
10002732 1
10002749 1
10002816 1
10002840 1
10002870 1
10002941 1
10002970 1
10003085 1
end

I've been using the command as follows, but STATA issued an error message saying: factor-variable and time-series operators not allowed

merge 1:1 householdnumber household_number_member "filename", keep if household_member_number==1

Any help would be very much appreciated.
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 29956

17 Oct 2021, 15:07

Your description of what you want to end up with isn't entirely clear. The following code will leave you with a data set containing all observations from the master data set and, for those with household_member_number == 1 that also have a match in the using data set, their information from using.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long householdnumber byte household_number_member
10000028 1
10000028 2
10000031 1
10000031 2
10000062 1
10000068 1
10000068 2
10000124 1
10000124 2
10000140 1
10000140 2
10000186 1
10000187 1
10000187 2
10000213 1
10000232 1
10000232 3
10000270 1
10000270 2
10000292 1
10000293 1
10000293 2
10000315 1
10000315 2
10000328 1
10000328 2
10000344 1
10000393 1
10000404 1
10000404 2
10000415 1
10000415 2
10000419 1
10000419 2
10000422 1
10000422 2
10000455 1
10000455 2
10000492 1
10000492 2
10000497 1
10000497 2
10000533 1
10000533 2
10000549 1
10000556 1
10000556 2
10000556 3
10000556 4
10000556 21
10000592 1
10000592 2
10000592 3
10000592 21
10000601 1
10000601 21
10000608 1
10000608 2
10000615 1
10000658 1
10000658 2
10000700 1
10000700 2
10000700 3
10000712 1
10000712 2
10000712 3
10000713 1
10000713 2
10000764 1
10000771 1
10000771 2
10000774 1
10000774 2
10000787 1
10000855 1
10000855 2
10000894 1
10000894 2
10000894 3
10000902 1
10000902 2
10000952 1
10000952 2
10000952 21
10000970 1
10000970 2
10001003 1
10001003 2
10001004 1
10001004 2
10001069 1
10001069 2
10001142 1
10001193 1
10001193 2
10001193 3
10001193 4
10001200 1
10001200 2
end
tempfile master
save `master'


* Example generated by -dataex-. For more info, type help dataex
clear
input long householdnumber byte household_number_member
10000028 1
10000031 1
10000062 1
10000068 1
10000124 1
10000140 1
10000186 1
10000187 1
10000213 1
10000232 1
10000270 1
10000292 1
10000293 1
10000315 1
10000328 1
10000344 1
10000393 1
10000404 1
10000415 1
10000419 1
10000422 1
10000455 1
10000492 1
10000497 1
10000533 1
10000549 1
10000556 1
10000592 1
10000601 1
10000608 1
10000658 1
10000700 1
10000712 1
10000713 1
10000764 1
10000771 1
10000774 1
10000787 1
10000855 1
10000894 1
10000902 1
10000952 1
10000970 1
10001003 1
10001004 1
10001069 1
10001142 1
10001193 1
10001200 1
10001214 1
10001292 1
10001332 1
10001334 1
10001353 1
10001425 1
10001426 1
10001434 1
10001445 1
10001482 1
10001523 1
10001609 1
10001690 1
10001720 1
10001775 1
10001782 1
10001827 1
10001865 1
10001883 1
10001924 1
10001959 1
10001969 1
10002006 1
10002084 1
10002150 1
10002174 1
10002193 1
10002227 1
10002311 1
10002325 1
10002403 1
10002409 1
10002437 1
10002498 1
10002511 1
10002568 1
10002580 1
10002592 1
10002595 1
10002600 1
10002620 1
10002714 1
10002730 1
10002732 1
10002749 1
10002816 1
10002840 1
10002870 1
10002941 1
10002970 1
10003085 1
end
tempfile using
save `using'

use `master', clear
merge 1:1 householdnumber household_number_member using `using', keep(match master)

However, perhaps you don't want to keep those from the master data set who have no matching observation in the using data set (which entails also losing those in the master data set with household_member_number != 1). In that case, replace -keep(match master)- with -keep(match)-. Then you will be left only with those where household_member_number == 1 and a match in the using data set.

Note that the keyword using must appear in the -merge- command before the name of the using data set, even if the using data set itself is called using.

Note that the -merge- command does not allow -if- qualifiers. In your case, since the using data set has household_number_member == 1 in all of its observations, no such specification is needed anyway-- only observations in the master data with household_number_member == 1 will find a match. If it were not the case that household_number_member == 1 in all observations of the using data set, you could impose that restriction in a separate command after the -merge-, as -keep if household_number_member == 1-.

Thank you for using -dataex- on your very first post!

Added: And if you were in a situation where the using data set didn't mention household_number_member (which would, often as not, be true as it is common to eliminate variables that only take on a single value. Then you would code -merge m:1 household_number using `using', keep(match master)-, or perhaps with -keep(match)- if you only wanted matching observations.

Last edited by Clyde Schechter; 17 Oct 2021, 15:11.

Announcement

How to merge two datasets and How to use keep

Comment