Matching sample

Enny Cruz

Join Date: Jun 2016

Posts: 6
#1

Matching sample

14 Jun 2016, 09:47

I am doing a matched case control, I have my cases (27964) in one dataset and my possible controls (2900000) in another one.
I would like to match the cases based on date (dnoutcome) and place of birth (id). both files have these information and an unique identifier. I would need 10 matching controls for each case.
How can I find a matching sample? My controls sample have to be unique for each case.
Could I do it using Stata.
Thank you
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#2

14 Jun 2016, 10:34

Do you really have a place of birth variable named id, and then some other variable for a unique identifier? I'd lose my mind trying to keep that straight through a code file. So I'm going to use your dnoutcome variable, and call the place of birth variable birth_place. I'll call the unique identifier in the file of cases uid_case, and that in the file of possible controls uid_control. (They must have different names: if they currently don't, rename at least one of them.)

Code:

use cases, clear joinby dnoutcome birth_place using `controls', unmatched(master)

At this point, your data in memory will have each case matched up with every potential control that has the same values for dnoutcome and birth_place. Now you need to weed it down to 10 per case. Often people will have other criteria for selecting the "best 10" matches, such as nearest possible in age, or same gender, or something like that. If you want to restrict further in this way, you can add gender to the variable list in the -joinby- command, and you can restrict on age by following the code above with -keep if abs(age_control - age_case) < my_age_range_threshold-. If there are no such criteria, and you just want to pick them at random, you can do:

Code:

set seed my_lucky_number_here gen double shuffle = runiform() by case_id (shuffle), sort: keep if _n <= 10 drop shuffle

Notes:
1. It is possible that some cases will not have 10 satisfactory matches, and even possible that some will not have any.
2. This assignment is sampling with replacement, that is, different cases may have some (or all) the same controls matched to them if the cases agree on birth_place and dnoutcome. If you don't want to "re-use" controls in this way, it can be done but the code is much more complicated.
3. If your data sets are very large (several millions of observations), a single double-precision random number may be insufficient to determine a unique sort order in the second block of code. In that case, also do -gen double shuffle2 = runiform()- and change -by case_id (shuffle), sort:...- to -by case_id (shuffle shuffle2), sort:...-
1 like
Comment
Enny Cruz

Join Date: Jun 2016

Posts: 6
#3

14 Jun 2016, 10:53

Hi
Thank you very much for your support
I do not want to re-use my controls. Could you help me with the more complicated code?
Please
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29959

14 Jun 2016, 11:15

Try this:

Code:

use cases, clear
joinby dnoutcome birth_place using `controls', unmatched(master)
set seed my_lucky_number_here
gen double shuffle = runiform() 
by case_id (shuffle), sort: gen control_num = _n

// LOOP OVER CASES
levelsof case_id, local(cases)
gen byte surviving = 1
foreach c of local cases {
    // RETAIN FIRST 10 SURVIVING CONTROLS FOR THIS CASE
    replace surviving = 0 if case_id == `c' & control_num > 10
    
    // ELIMINATE THOSE CONTROLS FROM CONSIDERATION FOR LATER CASES
    levelsof control_id if case_id == `c' & surviving, local(to_kill)
    foreach tk of local to_kill {
        replace surviving = 0 if case_id != `c' & control_id == `tk'
    }
    drop if surviving == 0
    
    // RE-NUMBER REMAINING CONTROLS
    by case_id (shuffle), sort: replace control_num = _n
}

Note: I don't have a useful set of data available to test this on, so I may have left some typos in there. I hope not.

As before, it is possible that there will be some cases with fewer than 10 controls available, or even with none. In fact, because you are not re-using controls, this is even more likely than before.

Comment

Enny Cruz

Join Date: Jun 2016

Posts: 6
#5

15 Jun 2016, 07:00

I am trying the code, but it still running since yesterday.
Thank you very much.
I let you know as soon as finish.
regards
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

15 Jun 2016, 12:08

Here's an alternate solution that uses rangejoin (from SSC). To install it, type in Stata's Command window:

Code:

ssc install rangejoin

I assume that cases and controls are to be matched only if they have the same value for dnoutcome and place_id. Also, once a control is picked, it cannot be picked again. If both cases and controls are randomly ordered within dnoutcome place_id groups, then the first case in the group can be matched to the first 10 controls in the group. The second case can be matched to controls #11 to 20, and so on.

Here's a quick sketch of what it would look like. The following code runs in about 12 seconds on my computer:

Code:

set seed 1234321

* generate cases with 5 possible outcomes from about 300 places
clear
set obs 27964
gen case_id = _n
gen dnoutcome =  runiformint(1,5)
gen place_id = int(1 / runiform())
save "cases.dta", replace

* repeat for controls
clear
set obs 2900000
gen control_id = _n
gen dnoutcome =  runiformint(1,5)
gen place_id = int(1 / runiform())

* mix up the controls randomly within groups of dnoutcome place_id
* and assign a pick order
gen mix_it_up = runiform()
bysort dnoutcome place_id (mix_it_up): gen pick_order = _n
save "controls.dta", replace

* mix up cases randomly within groups of dnoutcome place_id and
* assign pick numbers using indexes (e.g. the first obs picks 1 to 10;
* the second obs picks 11 to 20, etc)
use "cases.dta", clear
gen mix_it_up = runiform()
bysort dnoutcome place_id (mix_it_up): gen low = (_n - 1) * 10 + 1
gen high = low + 10 - 1

* pair each case with the controls picked at random
rangejoin pick_order low high using "controls.dta", by(dnoutcome place_id)

sort case_id pick_order

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29959
#7

15 Jun 2016, 13:03

Wow, that is outstanding, Robert! A very clever application of -rangejoin-.
Comment
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#8

16 Jun 2016, 18:20

I agree! Bravo!

Last edited by Steve Samuels; 16 Jun 2016, 18:29.

Steve Samuels
Statistical Consulting
[email protected]

Stata 14.2
Comment
Enny Cruz

Join Date: Jun 2016

Posts: 6
#9

20 Jun 2016, 03:56

Thank you!!!! it worked! Thank you very much!
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#10

20 Jun 2016, 10:07

I was reading my solution in #6 and it dawned on me that the rangejoin part is not necessary. You can combine the cases and controls using a simple merge. All you need is to create a key merge variable in both datasets that follows the matching order, once the data has been jumbled in both dataset. In other words, within groups of dnoutcome place_id, the first random case is matched with the first 10 random controls, the second case with the next 20 controls, and so on.

Here's an example using the same fake data

Code:

set seed 1234321

* generate cases with 5 possible outcomes from about 300 places
clear
set obs 27964
gen long case_id = _n
gen dnoutcome =  runiformint(1,5)
gen place_id = int(1 / runiform())
gen case_var1 = runiform()
save "cases.dta", replace

* repeat for controls
clear
set obs 2900000
gen long control_id = _n
gen dnoutcome =  runiformint(1,5)
gen place_id = int(1 / runiform())
gen control_var1 = runiform()
save "controls.dta", replace

* mix up the controls randomly within groups of dnoutcome place_id
* and assign a pick order and a -pick- var to use with -merge-
gen double mix_it_up = runiform()
bysort dnoutcome place_id (mix_it_up): gen pick_order = _n
gen long pick = int((pick_order - 1) / 10) + 1
save "controls2pick.dta", replace

* mix up cases randomly within groups of dnoutcome place_id and
* assign a pick order
use "cases.dta", clear
gen double mix_it_up = runiform()
bysort dnoutcome place_id (mix_it_up): gen long pick = _n
merge 1:m dnoutcome place_id pick using "controls2pick.dta", keep(master match) nogen
sort case_id pick_order mix_it_up

save "results_merge.dta", replace

This merge solution requires just a little over 3 seconds on my computer to jumble the order and produce the results. The rangejoin solution adds an extra 13 seconds because of the extra work needed to find observations within the specified range.

Both solution produce exactly the same results if the observations are jumbled the same way. I noticed that the example in #6 is not deterministic because of duplicate values in the mix_it_up variable. The fix is to create the mix_it_up variables as doubles. Here's the same example using rangejoin and a check that results match the results from the merge example above.

Code:

set seed 1234321

* generate cases with 5 possible outcomes from about 300 places
clear
set obs 27964
gen long case_id = _n
gen dnoutcome =  runiformint(1,5)
gen place_id = int(1 / runiform())
gen case_var1 = runiform()
save "cases.dta", replace

* repeat for controls
clear
set obs 2900000
gen long control_id = _n
gen dnoutcome =  runiformint(1,5)
gen place_id = int(1 / runiform())
gen control_var1 = runiform()
save "controls.dta", replace

* mix up the controls randomly within groups of dnoutcome place_id
* and assign a pick order
gen double mix_it_up = runiform()
bysort dnoutcome place_id (mix_it_up): gen pick_order = _n
save "controls2pick.dta", replace

* mix up cases randomly within groups of dnoutcome place_id and
* assign pick numbers using indexes (e.g. the first obs picks 1 to 10;
* the second obs picks 11 to 20, etc)
use "cases.dta", clear
gen double mix_it_up = runiform()
bysort dnoutcome place_id (mix_it_up): gen long low = (_n - 1) * 10 + 1
gen long high = low + 10 - 1

* pair each case with the controls picked at random
rangejoin pick_order low high using "controls2pick.dta", by(dnoutcome place_id)

sort case_id pick_order mix_it_up
save "results_rangestat.dta", replace

cf dnoutcome place_id case_id case_var1 control_id control_var1 using "results_merge.dta", all

Comment

Enny Cruz

Join Date: Jun 2016

Posts: 6
#11

08 Jul 2016, 10:21

Hi
I just tested this last code in my file and I got 294 duplicates in the unique identifier for the controls.
I would like to know why.
And once more thank you for your help.
BW
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#12

09 Jul 2016, 07:50

I'm traveling so I don't have time to think much about this. It might have to do with precision. Make sure that all variables are double or long. I note that I forgot one case in the examples above. You should change:

Code:

bysort dnoutcome place_id (mix_it_up): gen pick_order = _n

to

Code:

bysort dnoutcome place_id (mix_it_up): gen long pick_order = _n
Comment
Papa Edu

Join Date: Sep 2016

Posts: 5
#13

06 Sep 2016, 19:42

Hi, Not sure if this works for me. But I have a data set on a set of households with durable goods. I am trying to match each household (if they own a durable good) to to the sum of the (amount of durable goods they own). i.e creating an index. After doing this, I am restricting n to 1. This is the code I have used so far

bys case_id: egen index1=sum( hh_l03 )
. br case_id hh_l01 index1
. bys case_id: gen n=_n
by case_id, sort: keep if _n<=1

What is the purpose of the latter part of the command above. I will post it on here just to make it easier
mix up the controls randomly within groups of dnoutcome place_id * and assign a pick order and a -pick- var to use with -merge- gen double mix_it_up = runiform() bysort dnoutcome place_id (mix_it_up): gen pick_order = _n gen long pick = int((pick_order - 1) / 10) + 1 save "controls2pick.dta", replace * mix up cases randomly within groups of dnoutcome place_id and * assign a pick order use "cases.dta", clear gen double mix_it_up = runiform() bysort dnoutcome place_id (mix_it_up): gen long pick = _n merge 1:m dnoutcome place_id pick using "controls2pick.dta", keep(master match) nogen sort case_id pick_order mix_it_up
Comment
Papa Edu

Join Date: Sep 2016

Posts: 5
#14

22 Sep 2016, 18:29

sorry this data looks rundown. very new to stata
Comment
Naia Ortelan

Join Date: Aug 2021

Posts: 1
#15

31 Aug 2021, 12:37

Hi! Enny was helping me to match a case control analysis, and I used the codes mentioned above by Robert (using rangejoin), but unfortunately the code error "r(198) invalid syntax" appeared. Would you have any idea what might have happened?

I have my cases (8,676) in one dataset and my controls (916,405) in another one. I would like to match the cases based on birth date (birth_date) and municipality of residence (mun_code). Both files have these information and an unique identifier. I also will match 10 controls for each case and my controls sample have to be unique for each case.

Code:

use "controls.dta, clear" gen double mix_it_up = runiform() bysort birth_date mun_code (mix_it_up): gen pick_order = _n save "controls_pick.dta" use "cases.dta, clear" gen double mix_it_up = runiform() bysort birth_date mun_code (mix_it_up): gen long low = (_n-1)*10+1 gen long high = low+10-1 *kept using "cases.dta": rangejoin pick_order low high using "controls_pick.dta", by(birth_date mun_code) (using rangestat version 1.1.1) invalid syntax r(198);
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment