Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to merge two datasets and How to use keep

    Hello everyone!

    I'm a very beginner using STATA so please I'd like to get some help with merge command,

    My purpose is to add variables to the master dataset and keep only what is matched related to a particular observation, so I've been told I must use merge command (I'm using merge in STATA 17). My master dataset has more observations than my using dataset, but they both have in common two variables householdnumber and household_number_member. I was suggested to use all common variables when merging, not just one in common.

    However, in the master dataset the householdnumber variable has repeated observations. In other words, appears multiple times the same household number referring to each member of the household, who will be identified by household_number_member variable. Thus, household_number_member is identifed as 1= marriage immigrant, 2= national spouse, 3= children, etc. On the other hand, the using dataset doesn't have repeated observations in the variable householdnumber, because this dataset only uses in the household_number_member variable the observation 1 (which means, 1= marriage immigrant) that's why it has less observations.

    By using merge, I'm only interested in keeping all related to 1 (which means 1=marriage immigrant) and adding variables that appear in the using dataset.

    To make myself clear, and to get proper help. I'm using dataex command as follows (the following code corresponds to my master dataset)

    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long householdnumber byte household_number_member
    10000028 1
    10000028 2
    10000031 1
    10000031 2
    10000062 1
    10000068 1
    10000068 2
    10000124 1
    10000124 2
    10000140 1
    10000140 2
    10000186 1
    10000187 1
    10000187 2
    10000213 1
    10000232 1
    10000232 3
    10000270 1
    10000270 2
    10000292 1
    10000293 1
    10000293 2
    10000315 1
    10000315 2
    10000328 1
    10000328 2
    10000344 1
    10000393 1
    10000404 1
    10000404 2
    10000415 1
    10000415 2
    10000419 1
    10000419 2
    10000422 1
    10000422 2
    10000455 1
    10000455 2
    10000492 1
    10000492 2
    10000497 1
    10000497 2
    10000533 1
    10000533 2
    10000549 1
    10000556 1
    10000556 2
    10000556 3
    10000556 4
    10000556 21
    10000592 1
    10000592 2
    10000592 3
    10000592 21
    10000601 1
    10000601 21
    10000608 1
    10000608 2
    10000615 1
    10000658 1
    10000658 2
    10000700 1
    10000700 2
    10000700 3
    10000712 1
    10000712 2
    10000712 3
    10000713 1
    10000713 2
    10000764 1
    10000771 1
    10000771 2
    10000774 1
    10000774 2
    10000787 1
    10000855 1
    10000855 2
    10000894 1
    10000894 2
    10000894 3
    10000902 1
    10000902 2
    10000952 1
    10000952 2
    10000952 21
    10000970 1
    10000970 2
    10001003 1
    10001003 2
    10001004 1
    10001004 2
    10001069 1
    10001069 2
    10001142 1
    10001193 1
    10001193 2
    10001193 3
    10001193 4
    10001200 1
    10001200 2
    end


    The following code corresponds to my using dataset:

    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long householdnumber byte household_number_member
    10000028 1
    10000031 1
    10000062 1
    10000068 1
    10000124 1
    10000140 1
    10000186 1
    10000187 1
    10000213 1
    10000232 1
    10000270 1
    10000292 1
    10000293 1
    10000315 1
    10000328 1
    10000344 1
    10000393 1
    10000404 1
    10000415 1
    10000419 1
    10000422 1
    10000455 1
    10000492 1
    10000497 1
    10000533 1
    10000549 1
    10000556 1
    10000592 1
    10000601 1
    10000608 1
    10000658 1
    10000700 1
    10000712 1
    10000713 1
    10000764 1
    10000771 1
    10000774 1
    10000787 1
    10000855 1
    10000894 1
    10000902 1
    10000952 1
    10000970 1
    10001003 1
    10001004 1
    10001069 1
    10001142 1
    10001193 1
    10001200 1
    10001214 1
    10001292 1
    10001332 1
    10001334 1
    10001353 1
    10001425 1
    10001426 1
    10001434 1
    10001445 1
    10001482 1
    10001523 1
    10001609 1
    10001690 1
    10001720 1
    10001775 1
    10001782 1
    10001827 1
    10001865 1
    10001883 1
    10001924 1
    10001959 1
    10001969 1
    10002006 1
    10002084 1
    10002150 1
    10002174 1
    10002193 1
    10002227 1
    10002311 1
    10002325 1
    10002403 1
    10002409 1
    10002437 1
    10002498 1
    10002511 1
    10002568 1
    10002580 1
    10002592 1
    10002595 1
    10002600 1
    10002620 1
    10002714 1
    10002730 1
    10002732 1
    10002749 1
    10002816 1
    10002840 1
    10002870 1
    10002941 1
    10002970 1
    10003085 1
    end

    I've been using the command as follows, but STATA issued an error message saying: factor-variable and time-series operators not allowed

    merge 1:1 householdnumber household_number_member "filename", keep if household_member_number==1


    Any help would be very much appreciated.

  • #2
    Your description of what you want to end up with isn't entirely clear. The following code will leave you with a data set containing all observations from the master data set and, for those with household_member_number == 1 that also have a match in the using data set, their information from using.
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long householdnumber byte household_number_member
    10000028 1
    10000028 2
    10000031 1
    10000031 2
    10000062 1
    10000068 1
    10000068 2
    10000124 1
    10000124 2
    10000140 1
    10000140 2
    10000186 1
    10000187 1
    10000187 2
    10000213 1
    10000232 1
    10000232 3
    10000270 1
    10000270 2
    10000292 1
    10000293 1
    10000293 2
    10000315 1
    10000315 2
    10000328 1
    10000328 2
    10000344 1
    10000393 1
    10000404 1
    10000404 2
    10000415 1
    10000415 2
    10000419 1
    10000419 2
    10000422 1
    10000422 2
    10000455 1
    10000455 2
    10000492 1
    10000492 2
    10000497 1
    10000497 2
    10000533 1
    10000533 2
    10000549 1
    10000556 1
    10000556 2
    10000556 3
    10000556 4
    10000556 21
    10000592 1
    10000592 2
    10000592 3
    10000592 21
    10000601 1
    10000601 21
    10000608 1
    10000608 2
    10000615 1
    10000658 1
    10000658 2
    10000700 1
    10000700 2
    10000700 3
    10000712 1
    10000712 2
    10000712 3
    10000713 1
    10000713 2
    10000764 1
    10000771 1
    10000771 2
    10000774 1
    10000774 2
    10000787 1
    10000855 1
    10000855 2
    10000894 1
    10000894 2
    10000894 3
    10000902 1
    10000902 2
    10000952 1
    10000952 2
    10000952 21
    10000970 1
    10000970 2
    10001003 1
    10001003 2
    10001004 1
    10001004 2
    10001069 1
    10001069 2
    10001142 1
    10001193 1
    10001193 2
    10001193 3
    10001193 4
    10001200 1
    10001200 2
    end
    tempfile master
    save `master'
    
    
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long householdnumber byte household_number_member
    10000028 1
    10000031 1
    10000062 1
    10000068 1
    10000124 1
    10000140 1
    10000186 1
    10000187 1
    10000213 1
    10000232 1
    10000270 1
    10000292 1
    10000293 1
    10000315 1
    10000328 1
    10000344 1
    10000393 1
    10000404 1
    10000415 1
    10000419 1
    10000422 1
    10000455 1
    10000492 1
    10000497 1
    10000533 1
    10000549 1
    10000556 1
    10000592 1
    10000601 1
    10000608 1
    10000658 1
    10000700 1
    10000712 1
    10000713 1
    10000764 1
    10000771 1
    10000774 1
    10000787 1
    10000855 1
    10000894 1
    10000902 1
    10000952 1
    10000970 1
    10001003 1
    10001004 1
    10001069 1
    10001142 1
    10001193 1
    10001200 1
    10001214 1
    10001292 1
    10001332 1
    10001334 1
    10001353 1
    10001425 1
    10001426 1
    10001434 1
    10001445 1
    10001482 1
    10001523 1
    10001609 1
    10001690 1
    10001720 1
    10001775 1
    10001782 1
    10001827 1
    10001865 1
    10001883 1
    10001924 1
    10001959 1
    10001969 1
    10002006 1
    10002084 1
    10002150 1
    10002174 1
    10002193 1
    10002227 1
    10002311 1
    10002325 1
    10002403 1
    10002409 1
    10002437 1
    10002498 1
    10002511 1
    10002568 1
    10002580 1
    10002592 1
    10002595 1
    10002600 1
    10002620 1
    10002714 1
    10002730 1
    10002732 1
    10002749 1
    10002816 1
    10002840 1
    10002870 1
    10002941 1
    10002970 1
    10003085 1
    end
    tempfile using
    save `using'
    
    use `master', clear
    merge 1:1 householdnumber household_number_member using `using', keep(match master)
    However, perhaps you don't want to keep those from the master data set who have no matching observation in the using data set (which entails also losing those in the master data set with household_member_number != 1). In that case, replace -keep(match master)- with -keep(match)-. Then you will be left only with those where household_member_number == 1 and a match in the using data set.

    Note that the keyword using must appear in the -merge- command before the name of the using data set, even if the using data set itself is called using.

    Note that the -merge- command does not allow -if- qualifiers. In your case, since the using data set has household_number_member == 1 in all of its observations, no such specification is needed anyway-- only observations in the master data with household_number_member == 1 will find a match. If it were not the case that household_number_member == 1 in all observations of the using data set, you could impose that restriction in a separate command after the -merge-, as -keep if household_number_member == 1-.

    Thank you for using -dataex- on your very first post!

    Added: And if you were in a situation where the using data set didn't mention household_number_member (which would, often as not, be true as it is common to eliminate variables that only take on a single value. Then you would code -merge m:1 household_number using `using', keep(match master)-, or perhaps with -keep(match)- if you only wanted matching observations.
    Last edited by Clyde Schechter; 17 Oct 2021, 15:11.

    Comment

    Working...
    X