Matching cases and controls based on age and gender

Priyanka Acharya

Join Date: Sep 2016

Posts: 28
#16

22 Sep 2016, 06:12

Greetings,

In the code above for rangejoin, what do I have to do if I want 1:4 matching (1 case and 4 controls)? I did the following to create 1:1 matching. If I change the second-to-last line to by id (shuffle), sort: keep if _n < 5, will that give me 1:4 matching? Thanks.

preserve
keep if Group == 0
tempfile controls
save `controls'
restore
keep if Group ==1
rangejoin age -10 10 using `controls', by(Sex)
set seed 1234
gen double shuffle = runiform()
by id (shuffle), sort: keep if _n == 1
drop shuffle
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#17

22 Sep 2016, 09:27

Yes it will. But you might want to do a bit better. That will simply select a random four out of all of the controls that agree with the case on Sex and come within 10 on age. But since I gather you are trying to match as tightly as possible, you might want to pick the four closest age matches rather than just four at random. So for that you could do:

Code:

gen delta_age = abs(age-U_age)
by id (delta_age), sort: keep if _n < 5
Priyanka Acharya

Join Date: Sep 2016

Posts: 28
#18

22 Sep 2016, 13:54

Thank you.

Also, is there a way to do either 1:1 or 1:4 matching such that controls are selected without replacement, based on age range and sex? That is, I want each patient to be counted only once, whether as a case or as a control.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#19

22 Sep 2016, 14:30

Yes, it is possible. A downside is that you may end up with more cases that go unmatched altogether.

Another downside is that the code to do this is somewhat complicated. Since there are very few analyses in which matching without replacement is statistically preferable to matching with replacement, I don't want to take the trouble to write it out unless you are 150% sure you really need it. It's a bigger deal in terms of code complexity, and also execution time. If you are being pressed by an advisor to do matching without replacement, another possibility to consider is matching without replacement on sex and age groups (not age range). The code for that is actually quite simple and runs very quickly compared to age-range matching (with or without replacement). Though again, you may find you have more unmatched cases in the end.

From the perspective of controlling confounder bias in your analysis, you are probably best off with what you already have: matching on sex and nearest age with replacement.

Added: Looking back at your earlier posts, it seems you have a very large separation between the cases and controls for the age distribution. This implies that, no matter what you do, many cases are going to fail to find any good match. Even if you got a perfect match for everybody that could be matched, the number of cases and controls who are simply excluded by the matching could be as big a problem for the interpretability of your study findings as the confounding bias due to the age difference. Unless age is a very sensitive predictor of your outcome variable (which, if it's blood loss during some procedures, would surprise me), I think you should err on the side of getting more matches even if they are not so close. That would, in turn, argue in favor of matching with replacement.

Last edited by Clyde Schechter; 22 Sep 2016, 14:45.
Priyanka Acharya

Join Date: Sep 2016

Posts: 28
#20

22 Sep 2016, 14:57

Thank you so much for that explanation. I agree on matching with replacement, but as you said, I do have a PI bugging me to do matching without replacement. Just for my knowledge and for future reference, what would the code look like if I use sex and age groups instead and try to do matching without replacement?

Once again I am very grateful for your insight.

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

#21

22 Sep 2016, 17:06

For 1:1 age-group sex matching without replacement:

Code:

//    READ IN DATA FILE OF COMBINED CASES & CONTROLS
use combined_cases_and_controls, clear
set seed 1234 // OR YOUR FAVORITE SEED

//    GENERATE AGE GROUPS (MODIFY LIMITS AS APPROPRIATE TO DATA)
gen byte age_group = 1 if inrange(age, 25, 39)
replace age_group = 2 if inrange(age, 40, 49)
replace age_group = 3 if inrange(age, 50, 54)
replace age_group = 4 if inrange(age, 55, 59)
replace age_group = 5 if inrange(age, 60, 64)
replace age_group = 6 if inrange(age, 65, 69)
replace age_group = 7 if inrange(age, 70, .)

gen double shuffle = runiform() // TO RANDOMIZE MATCH SELECTIONS

//    FORM A FILE OF CONTROLS ONLY
preserve
keep if Group == 0 // CONTROLS ARE CODED 0 IN #16; ADJUST TO YOUR CODING
//    ASSIGN A PRIORITY FOR MATCHING WITHIN EACH AGE_GROUP SEX COMBINATION
by age_group Sex (shuffle), sort: gen int priority = _n
drop shuffle
//    RENAME VARIABLES TO AVOID CLASH
rename * control_*
foreach x in age_group Sex priority {
    rename control_`x' `x'
}
tempfile controls
save `controls'

//    NOW MAKE A FILE OF CASES
restore
keep if Group == 1
//    AGAIN PRIORITIZE FOR MATCHING
by age_group Sex (shuffle), sort: gen int priority = _n
drop shuffle
//    MERGE WITH CONTROLS
merge 1:1 age_group Sex priority using `controls', keep(master match)

Note: you may need to use different limits to define your age groups so that you get decent numbers of matches in these categories. You need to look at the distributions of ages in both groups. For the range of ages that shows the most overlap between the groups, you can use narrower age bands, and for those ages with little overlap, use wide ones. This will give you a decent tradeoff between closeness of matching and getting matches at all.
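One way to look at the overlap described in the note is to count cases and controls in every Sex by age_group cell before matching. The following is only a sketch on mock data: the coding Group 1 = case and 0 = control, the simulated ages, and the variable names n_cases, n_controls and cell_tag are all assumptions made for illustration.

```stata
// Mock data for illustration only; replace with your own combined file.
// Assumed coding: Group 1 = case, 0 = control.
clear
set seed 1234
set obs 400
gen byte Group = runiform() < 0.25
gen byte Sex = runiform() < 0.5
gen byte age = 25 + floor(60*runiform())   // integer ages 25 to 84

// Same age bands as in the code above (cutpoints are placeholders)
gen byte age_group = 1 if inrange(age, 25, 39)
replace age_group = 2 if inrange(age, 40, 49)
replace age_group = 3 if inrange(age, 50, 54)
replace age_group = 4 if inrange(age, 55, 59)
replace age_group = 5 if inrange(age, 60, 64)
replace age_group = 6 if inrange(age, 65, 69)
replace age_group = 7 if inrange(age, 70, .)

// Cases and controls available in each Sex x age_group cell
bysort Sex age_group: egen int n_cases = total(Group == 1)
bysort Sex age_group: egen int n_controls = total(Group == 0)

// One row per cell; in 1:1 matching, cells with n_controls < n_cases
// will leave some cases unmatched
egen byte cell_tag = tag(Sex age_group)
list Sex age_group n_cases n_controls if cell_tag, sepby(Sex) noobs
```

Cells where controls are scarce are candidates for wider bands; cells with plenty of both can be split more finely.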

The above code is not tested: it may contain typos, punctuation errors, etc.

Now, if you want 1:4 matching on age-groups and sex without replacement, it's only a bit more complicated. The difference is that in the controls, instead of assigning a unique priority for matching to each observation, you do that in batches of four. And the final merge becomes 1:m instead of 1:1.

Code:

//    READ IN DATA FILE OF COMBINED CASES & CONTROLS
use combined_cases_and_controls, clear
set seed 1234 // OR YOUR FAVORITE SEED

//    GENERATE AGE GROUPS (MODIFY LIMITS AS APPROPRIATE TO DATA)
gen byte age_group = 1 if inrange(age, 25, 39)
replace age_group = 2 if inrange(age, 40, 49)
replace age_group = 3 if inrange(age, 50, 54)
replace age_group = 4 if inrange(age, 55, 59)
replace age_group = 5 if inrange(age, 60, 64)
replace age_group = 6 if inrange(age, 65, 69)
replace age_group = 7 if inrange(age, 70, .)

gen double shuffle = runiform() // TO RANDOMIZE MATCH SELECTIONS

//    FORM A FILE OF CONTROLS ONLY
preserve
keep if Group == 0 // CONTROLS ARE CODED 0 IN #16; ADJUST TO YOUR CODING
//    ASSIGN A PRIORITY FOR MATCHING WITHIN EACH AGE_GROUP SEX COMBINATION
//    IN BATCHES OF (UP TO) FOUR
by age_group Sex (shuffle), sort: gen int priority = floor((_n-1)/4) + 1
drop shuffle
//    RENAME VARIABLES TO AVOID CLASH
rename * control_*
foreach x in age_group Sex priority {
    rename control_`x' `x'
}
tempfile controls
save `controls'

//    NOW MAKE A FILE OF CASES
restore
keep if Group == 1
//    AGAIN PRIORITIZE FOR MATCHING
by age_group Sex (shuffle), sort: gen int priority = _n
drop shuffle
//    MERGE WITH CONTROLS
merge 1:m age_group Sex priority using `controls', keep(master match)
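After the merge it can be useful to check how many controls each case actually received and that no control was used twice. The sketch below runs on a tiny hand-made stand-in for the merged data. It assumes the case identifier is called id (as in #16) and that the control identifier therefore became control_id after the rename; n_controls, first and times_used are made-up names.

```stata
// Tiny mock of a merged 1:4 result, for illustration only.
// _merge == 3: case paired with a control; _merge == 1: unmatched case.
clear
input long id long control_id byte _merge
1 101 3
1 102 3
1 103 3
1 104 3
2 105 3
2 106 3
3 .   1
end

// Number of controls obtained by each case
bysort id: egen byte n_controls = total(_merge == 3)

// Distribution of match counts, one row per case
egen byte first = tag(id)
tabulate n_controls if first

// Without replacement, every control should appear exactly once
bysort control_id: gen long times_used = _N if !missing(control_id)
assert times_used == 1 if !missing(control_id)
```

Cases with fewer than four controls come from age_group and Sex cells where controls ran out; whether to keep them is an analysis decision.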


Priyanka Acharya

Join Date: Sep 2016

Posts: 28
#22

22 Sep 2016, 20:47

Thank you.

Rieza Soelaeman

Join Date: Dec 2015
Posts: 5

#23

14 Aug 2017, 11:00

Hi Clyde,
I'm reading through the code you posted in this thread on creating controls and comparing it to a response on the old Statalist, which I had used in the past for some analysis. I have reproduced the code below for your reference. The original post can be found here: http://www.stata.com/statalist/archi.../msg00326.html

Code:

clear
// mock up control data
set seed 846
set obs 500  // don't know how many controls you have
gen byte case = 0
gen byte age = 20 +ceil(65*runiform())  // broad age range assumed
tempfile controls
sort age
save `controls'
clear
// mock up cases
set obs 63
gen byte case = 1
gen byte age = 20 +ceil(65*runiform())
//
// The real stuff starts here; you have an existing control file you can append to your cases.
append using `controls'
gen rand = runiform()
sort age case rand
by age: egen ncases = sum(case)
keep if (ncases >=1) // age groups with no cases are irrelevant
//
// The following keeps the first 2 controls  for each case within each age group
by age: keep if (case ==1) | ((_n <= 2*ncases) & (case == 0))
tab2 age case
by age: egen ncontrols = sum(case == 0)
count if (ncontrols < 2*ncases)

I was wondering if you could comment on the differences between the method you posted earlier and the code I copied above as far as matching is concerned, or whether they are just two different ways of getting to the same thing. Thanks


Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#24

14 Aug 2017, 11:27

At the detailed level, the code you show matches only on age, whereas the code I wrote matches on age group and sex. But, from a broader perspective, either approach will give a matching on the specified variables, randomly selecting from the controls without replacement. In my code, each case will appear in several observations, in each case linked to one matched control. In the code in #23, the linkage of a case to its matched controls is implicit in the order of the observations, but is not explicit in the data; if the data sort order is changed, the linkage will be lost.
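If the approach in #23 is used, one possible way to make the linkage explicit is to store a matched-set identifier before the data are re-sorted. This is only a sketch, not part of either original method: it rebuilds a condensed version of the mock data from #23, and the variables seq, slot and match_set are hypothetical names introduced here.

```stata
// Condensed mock-up in the spirit of #23: 63 cases, 500 controls
clear
set seed 846
set obs 563
gen byte case = _n <= 63
gen byte age = 20 + ceil(65*runiform())
gen double rand = runiform()

// Selection step as in #23: up to 2 random controls per case within each age
sort age case rand
by age: egen ncases = total(case)
keep if ncases >= 1
by age: keep if case == 1 | (_n <= 2*ncases & case == 0)

// Number cases and controls separately within age, in random order
by age case (rand), sort: gen long seq = _n

// Case k takes slot k; controls 2k-1 and 2k share slot k
gen long slot = cond(case == 1, seq, ceil(seq/2))

// One identifier per matched set, stored in the data
egen long match_set = group(age slot)
```

With match_set in the data, the sets survive any later sort, and sets short of controls can be found by counting observations within match_set.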
Rieza Soelaeman

Join Date: Dec 2015

Posts: 5
#25

17 Aug 2017, 14:52

Thanks, Clyde
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#26

20 Aug 2017, 14:14

For 1:1 matches, you may wish to take a look at the program - ccmatch - , written by Daniel Cook.

Best regards,

Marcos
Shannon Lange

Join Date: Sep 2017

Posts: 18
#27

06 Sep 2017, 16:03

Hi Clyde and Priyanka,

I am a novice Stata user (I have used SAS for years), and have been using the code provided in this feed to link controls to my cases by age and sex. Here is my code:

preserve
keep if diagnosis==0
tempfile controls
save controls

restore
keep if diagnosis==3

rangejoin age -0.5 0.5 using controls, by(sex)

set seed 217
gen double shuffle = runiform()
by id_num (shuffle), sort: keep if _n == 1
drop shuffle

My issue right now is that I can't see whether the matching has worked. If I look at the "Data Editor" I am only seeing my cases. Does this mean it didn't work? Does this code produce a new dataset?

Thank you!

Last edited by Shannon Lange; 06 Sep 2017, 16:14.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#28

06 Sep 2017, 16:15

Well, for one thing, your tempfile isn't being referred to properly. See corrections in code below. But I don't think that's the issue, because what you have shown should still work, the difference being that your working directory will have a file called controls.dta that you didn't intend to create.

Code:

preserve
keep if diagnosis==0
tempfile controls
save `controls'

restore
keep if diagnosis==3

rangejoin age -0.5 0.5 using `controls', by(sex)

set seed 217
gen double shuffle = runiform()
by id_num (shuffle), sort: keep if _n == 1
drop shuffle

The code should leave you with the data in memory consisting of the original cases, and then with each case for which a match could be found there will be additional variables that are just like the variables in the cases, but prefixed with U_. These are the values of the variables for the matched control.

If you are not seeing that, then I think there may be a problem with your data. Are you sure that your starting data set has both cases and controls in it? Try -count-ing them at the beginning, or just -tab diagnosis-, and make sure that there really are 0's and 3's. Another possibility is that in your data a 0.5 year radius for age is too stringent and there just aren't any controls to be found that fulfill that criterion. (This would particularly be the case if your age variable is an integer.)
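The checks suggested above could look something like the following, run on the combined data before the preserve. This is a sketch on mock data: the simulated values and the helper variable n_ctrl are assumptions, while diagnosis, sex and age follow the names used in #27.

```stata
// Mock data for illustration only: 40 cases (diagnosis 3), 160 controls (0)
clear
set seed 217
set obs 200
gen byte diagnosis = cond(_n <= 40, 3, 0)
gen byte sex = runiform() < 0.5
gen byte age = 5 + floor(13*runiform())   // integer ages 5 to 17

// Are both cases and controls present?
tabulate diagnosis, missing

// Is age integer-valued? A count of 0 means yes, and then a window of
// +/- 0.5 can only pair a case with controls of exactly the same age.
count if age != floor(age)

// For integer ages: cases with no same-sex control of the same age
bysort sex age: egen int n_ctrl = total(diagnosis == 0)
count if diagnosis == 3 & n_ctrl == 0
```

If the last count is a large share of the cases, widening the age window is the first thing to try.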

If this advice doesn't help you solve the problem, then I think you should post back, using the -dataex- command (see FAQ #12 if you are not familiar with it) to show example data. Be sure your example includes some controls that you think should actually match to some of the cases in the example.
Shannon Lange

Join Date: Sep 2017

Posts: 18
#29

06 Sep 2017, 16:57

That worked! Thank you so much for your quick reply Clyde.
Shannon Lange

Join Date: Sep 2017

Posts: 18
#30

07 Sep 2017, 06:58

Another question...

Can you add another variable to this code that you also wish to match on? For instance, IQ scores within a range of 10. Can you add it to the line of code below as I have (which isn't quite working), or do you need an additional line of code?

rangejoin age -0.5 0.5 IQ -10 10 using `controls', by(sex)