Matching Case-Control Code Needed

Erin Feddema

Join Date: Mar 2019

Posts: 5
#1

Matching Case-Control Code Needed

26 Mar 2019, 15:22

Hello everyone! I'm wondering if someone could help me with code for matching a case-control population, or point me toward literature or existing code. We have 122 cases and 85 controls (recruited from a cancer clinic). We would like to assess two matching scenarios to find the best fit for the dataset.

Matching variables:
-gender
-cotinine (+/- 100 pg)
-years_smoked (+/- 5 years) - (desired but likely won't be able to match with the third variable due to sample size).

Scenarios:
1. N 1:1
2. N 1:3 with repeated cases in the pool. We would like to keep the ratio favoring controls because the number of matches will be low to keep the power in check.

Any help would be much appreciated!
Erin
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

26 Mar 2019, 15:47

There is a community-contributed program -calipmatch- from SSC that will at least get you the exact match on gender and one of the two range matches, from which you could then keep only those that also fall in range on the third variable. (Or maybe -calipmatch- can do multiple range matches. I'm not sure because I don't use it myself, and right now I can't check because the SSC website seems to be down as I write this.)

So here's how to do it with just official Stata commands:

Code:

use dataset, clear
preserve
keep if case
drop case
tempfile cases
save `cases'
restore
drop if case
drop case
ds gender, not
rename (`r(varlist)') =_ctrl
tempfile controls
save `controls'

use `cases'
joinby gender using `controls', unmatched(master)
keep if abs(cotinine - cotinine_ctrl) <= 100 ///
    & abs(years_smoked - years_smoked_ctrl) <= 5

set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
gen double shuffle = runiform()
duplicates drop
by id (shuffle), sort: keep if _n <= 3
drop shuffle

In addition to the matching variables mentioned, I assume the data set has an ID number of each patient, called id, and a variable case which is coded 1 for cases and 0 for controls.

At the end of this code, the data in memory will have up to three observations for each of the original cases. Each observation has the case paired with a control who meets the three matching criteria. The variables describing the control will all have _ctrl suffixed to their names. For the matchup with just 1 control, just retain the first control matched to each case.
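For the 1:1 scenario, the only change is the cutoff in the final -by- line. Here is a minimal, self-contained sketch on made-up paired data (the values and the id_ctrl variable are purely illustrative, not from any real data set):

```stata
* Toy data: one row per eligible case-control pair (illustrative values only)
clear
input id cotinine id_ctrl cotinine_ctrl
1 500 11 450
1 500 12 560
2 900 13 950
end

set seed 1234
gen double shuffle = runiform()

* Keep one randomly chosen control per case instead of up to three
by id (shuffle), sort: keep if _n == 1
drop shuffle

isid id    // each case should now appear exactly once
```

In the full code above, this amounts to replacing -keep if _n <= 3- with -keep if _n == 1-.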

Note: No example data was provided, so beware of typos or other errors as this code is untested. In the future, please provide example data when asking for help with code, and use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
Comment
Rich Goldstein

Join Date: Mar 2014

Posts: 4462
#3

26 Mar 2019, 19:15

-calipmatch- does allow calipers on more than one variable at a time; however, I believe it only matches without replacement. The user-written command -vmatch- also allows calipers on more than one variable (and yes, the calipers can differ) and matches with replacement. Each can be found, with installation instructions, by using the -search- command.
Comment
Erin Feddema

Join Date: Mar 2019

Posts: 5
#4

31 Mar 2019, 19:02

Thank you both for your responses. I greatly appreciate them! So far I've used the Stata commands to match, with success. I adjusted the variable names to match my dataset (I was using cleaner names). I want to try the -calipmatch- and -vmatch- commands next, to try matching with replacement. I am working with an MD's dataset for my MPH thesis, so this is a learning process for me. I may have more questions once I look into these further. Thank you again!

Here is the adjusted code I used and my -dataex-.
use tobacco_biomarkers, clear
preserve
keep if case
drop case
tempfile cases
save `cases'
restore
drop if case
drop case
ds gender, not
rename (`r(varlist)') =_ctrl
tempfile controls
save `controls'

use `cases'
joinby gender using `controls', unmatched(master)
keep if abs(total_urinary_cotinine - total_urinary_cotinine_ctrl) <= 100 & abs(smoke_duration - smoke_duration_ctrl) <= 5

set seed 1234
gen double shuffle = runiform()
duplicates drop
by record_id (shuffle), sort: keep if _n <= 3
drop shuffle

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str10 record_id byte case int total_urinary_cotinine byte gender float smoke_duration
"101" 1 3022 1 40
"102" 1 1818 1 20
"103" 1 3629 1 35
"104" 1 3343 1 10
"105" 1  802 1 60
"106" 1 7954 1  0
"107" 1    . 1 45
"108" 1    . 1 60
"109" 1 4780 1 40
"110" 1 4610 1  0
"111" 1 4227 2 22
"112" 1 1367 1 35
"113" 1 6540 1  0
"114" 1 5047 1 24
"115" 1 2672 1 44
"116" 1 2170 1  5
"117" 1   83 1 20
"118" 1 4128 2 28
"119" 1  889 2 50
"120" 1 1412 2  0
end
Comment
Erin Feddema

Join Date: Mar 2019

Posts: 5
#5

04 Jun 2019, 16:03

Hello again, I am wondering how you would rewrite the code to match only on cotinine (without first matching on gender). Is there an easy modification to the code you listed, or would it be best to use -calipmatch- or -vmatch-?

Thank you - I greatly appreciate your help!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#6

04 Jun 2019, 16:07

Replace -joinby gender using `controls', unmatched(master)- with -cross using `controls'-. The rest of the code would be unchanged.
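To illustrate what that substitution does, here is a small self-contained sketch with invented values (the IDs and cotinine levels are hypothetical, and only the cotinine caliper is applied). -cross- pairs every case with every control, and the -keep- then retains only the pairs within range:

```stata
* Toy controls file (illustrative values only)
clear
input id_ctrl cotinine_ctrl
11  450
12  960
13 1500
end
tempfile controls
save `controls'

* Toy cases
clear
input id cotinine
1 500
2 900
end

* All pairwise combinations: 2 cases x 3 controls = 6 rows
cross using `controls'

* Keep only pairs within the +/- 100 cotinine caliper
keep if abs(cotinine - cotinine_ctrl) <= 100
list, sepby(id)
```

Because -cross- forms every case-control pair, the intermediate data set has (number of cases) x (number of controls) observations. Run the sketch as a single do-file so the tempfile macro stays defined.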
Comment
Erin Feddema

Join Date: Mar 2019

Posts: 5
#7

05 Jun 2019, 07:44

Thank you so much!
Comment
Erin Feddema

Join Date: Mar 2019

Posts: 5
#8

17 Jun 2019, 22:10

Hello again, I am now wondering how I could do the matching by creating a new variable holding a match ID, instead of placing the matched data in the same row. I ask because I would like to run the ranksum test, and it seems I need one variable to test and one grouping variable (case/control). Any help is much appreciated!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#9

17 Jun 2019, 22:15

The -ranksum- test is not valid for paired data. The closest thing to that is the -signrank- test, which would use the data in wide layout as you currently have it.
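A minimal sketch of the -signrank- syntax in wide layout, assuming 1:1 matched data with one row per case-control pair. The outcome variable names biomarker and biomarker_ctrl are hypothetical, and the values are invented purely to show the call:

```stata
* Toy wide-layout data: one row per matched pair (illustrative values only)
clear
input id biomarker biomarker_ctrl
1 12.1 10.4
2  9.8 11.0
3 14.3 12.2
4 11.5 11.9
5 13.0  9.7
end

* Wilcoxon matched-pairs signed-rank test of case vs. matched control
signrank biomarker = biomarker_ctrl
```

Note that this treats each row as an independent pair, so it fits the 1:1 matchup; with several controls per case, the same case appears in more than one row.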
Comment
Jesper Eriksson

Join Date: Oct 2016

Posts: 98
#10

18 Mar 2025, 11:07

Hi, I know this is an old post but I have used the script provided in #2 several times before in different projects. Now my problem is the size of the dataset (millions of observations). I want to match cases with controls on the variables age and sex and then randomly choose 5 of the matching controls for each case.
The line

Code:

joinby gender using `controls', unmatched(master)

takes days to run, most likely because there are so many potential controls for each case (i.e., there are several matches on age and sex for each case).

Is there a solution that matches the controls randomly without joining all available matches, taking only the first five controls for each case? I hope my question is clear.

Best regards,

Jesper Eriksson
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#11

18 Mar 2025, 11:45

I'm actually surprised that the problem is that -joinby- is taking too long to run because the data set is so large. The usual problem in that circumstance is that the resulting intermediate data set would exceed available memory, so usually it just aborts with an "Op. sys. refuses to provide memory" message. I don't think I've ever seen your situation before.

Be that as it may, I don't know of any way to join only the first five matching controls. But, there is a way to do this faster than -joinby- will allow, by using, instead, the -rangejoin- command. It is written by Robert Picard and available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also from SSC.

Code:

use dataset, clear
preserve
keep if case
drop case
tempfile cases
save `cases'
restore
drop if case
drop case
ds gender, not
rename (`r(varlist)') =_ctrl
tempfile controls
save `controls'

use `cases'
gen lb = cotinine - 100
gen ub = cotinine + 100
rangejoin cotinine_ctrl lb ub using `controls', by(gender)
keep if abs(years_smoked - years_smoked_ctrl) <= 5

set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()
duplicates drop
by id (shuffle*), sort: keep if _n <= 5
drop shuffle*

Notes: I'm assuming here that getting the match on cotinine is harder than getting the match on years smoked, so that applying -rangejoin- on the cotinine match will produce a smaller resulting data set. But if the match on years smoked is easier, you should reverse the roles of those variables in the above code. -rangejoin- is faster than -joinby- and also produces an intermediate data set that includes only allowable matches on the cotinine variable. Note that I have also modified the random selection, using two random variables. The reason is that in a data set with several million observations, a single double-precision random variable may have some repeated values, so that the sorting, and hence the selection of controls, would be indeterminate and irreproducible. The use of two double-precision random variables overcomes this potential difficulty. It is only necessary to do this when working with very large data sets (several million observations or more).
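If you want to confirm that the sort order is fully determined before selecting controls, one option is an -isid- check on the case ID plus the two shuffle variables. The sketch below uses simulated data (100 hypothetical cases with 10 candidate controls each), so the numbers are illustrative only:

```stata
* Simulated pairs: 100 hypothetical cases x 10 candidate controls each
clear
set obs 1000
set seed 1234
gen long id = ceil(_n / 10)
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()

* Stops with an error if (id, shuffle1, shuffle2) does not uniquely identify rows,
* i.e. if the random sort order within a case would be ambiguous
isid id shuffle1 shuffle2

* Random selection of up to 5 controls per case
by id (shuffle1 shuffle2), sort: keep if _n <= 5
drop shuffle1 shuffle2
```

If -isid- ever fails, changing the seed or adding a third random variable would be one way to break the remaining ties.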
Comment
Jesper Eriksson

Join Date: Oct 2016

Posts: 98
#12

18 Mar 2025, 15:13

Thank you (again!) Clyde. Works like a charm, and testing on small parts of my dataset suggests a large reduction in run time. Thank you!
Comment
Jesper Eriksson

Join Date: Oct 2016

Posts: 98
#13

19 Mar 2025, 00:53

A final question: the code you provided in #11 results in matching with replacement. Is there a twist that would give matching without replacement?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#14

19 Mar 2025, 08:45

With the caveat that this is untested due to absence of a data example to work with:

Code:

use dataset, clear
preserve
keep if case
drop case
tempfile cases
save `cases'
restore
drop if case
drop case
ds gender, not
rename (`r(varlist)') =_ctrl
tempfile controls
save `controls'

use `cases'
gen lb = cotinine - 100
gen ub = cotinine + 100
rangejoin cotinine_ctrl lb ub using `controls', by(gender)
keep if abs(years_smoked - years_smoked_ctrl) <= 5

set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()
duplicates drop
sort id shuffle1 shuffle2

local cc_ratio 5
local i = 1
while `i' < _N {
    quietly count if id == id[`i']
    local npicks = min(`r(N)', `cc_ratio')
    local next = `i' + `npicks'
    // Remove the controls just assigned to this case from all later pairs
    forvalues ii = 0/`=`npicks'-1' {
        if `next' <= _N {
            drop if id_ctrl == id_ctrl[`i'+`ii'] in `next'/L
        }
    }
    // Remove this case's surplus pairs
    if `next' <= _N {
        drop if id == id[`i'] in `next'/L
    }
    local i = `next'
}

A couple of remarks: this is going to be very slow in a large data set because it crawls through the data set observation by observation and the commands inside the loop have -if- conditions that must be evaluated on every observation in the entire subset of observations that meet the -in- condition. In addition to the computational drawbacks, the inability to reuse control observations typically results in some cases not getting their full complement of matches, or even getting none at all. The exclusion of those unmatchable cases may then lead to a biased analytic sample because it is going to be cases with less common values of the match variables that are selectively removed. If there were some major statistical advantage to sampling without replacement, that might warrant its use. But there isn't. It just seems to satisfy some people's esthetic preferences. So I advise against doing this.
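To see how many cases fall short of their full complement after matching without replacement, one could tabulate the number of controls per case in the matched file. This is a sketch on invented pairs, with a hypothetical target ratio of 3. Cases that received no control at all are absent from the matched file, so they would have to be counted against the original cases file:

```stata
* Toy matched file: one row per case-control pair (illustrative values only)
clear
input id id_ctrl
1 11
1 12
1 13
2 14
3 15
3 16
end

local cc_ratio 3                    // hypothetical target ratio for this toy example

by id, sort: gen n_ctrl = _N        // number of controls each case received
egen byte tag = tag(id)             // flags one row per case

tabulate n_ctrl if tag
count if tag & n_ctrl < `cc_ratio'  // cases with fewer controls than the target
```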
Comment
