Matching cases and controls based on age and gender

Priyanka Acharya

Join Date: Sep 2016

Posts: 28
#1

20 Sep 2016, 20:16

Hello all,
I have 1510 observations: 195 cases and 1315 controls. I want to match each case with one control based on age and gender. My variables are Id, Group (1 = case, 2 = control), Sex (1 = F, 2 = M), age, and some outcome variables that I want to compare between the case and control groups. I am using Stata 14. What would be the best way to do this? Any insight is highly appreciated.

Clyde Schechter

Join Date: Apr 2014
Posts: 29953

20 Sep 2016, 21:15

So, something like this:

Code:

use my_data, clear

// SEPARATE CASES FROM CONTROLS
//    AND DISTINGUISH VARIABLE NAMES
preserve
keep if group == 2
rename * *_control
rename age_control age
rename sex_control sex
tempfile controls
save `controls'

restore
keep if group == 1
rename * *_case
rename age_case age
rename sex_case sex

//    NOW JOIN ON AGE AND SEX
joinby age sex using `controls'

//    RANDOMLY SELECT ONE MATCH IF THERE ARE MORE
set seed 1234 // OR WHATEVER RANDOM NUMBER SEED YOU LIKE
gen double shuffle = runiform()
//    id_case IS THE CASE IDENTIFIER AFTER THE RENAME ABOVE (ASSUMES IT WAS ORIGINALLY NAMED id)
by id_case (shuffle), sort: keep if _n == 1
drop shuffle

The above will provide exact matches on age and sex. Now, in most real world situations, you won't be able to get enough matches with exact age. So typically people set some window, maybe 5 years, and require that the match be at least that close, if not exact. The code would be largely the same:

Code:

use my_data, clear

// SEPARATE CASES FROM CONTROLS
//    AND DISTINGUISH VARIABLE NAMES
preserve
keep if group == 2
tempfile controls
save `controls'

restore
keep if group == 1

//    NOW JOIN ON AGE AND SEX
//    ALLOW WINDOW FROM 5 YEARS BELOW TO 5 YEARS ABOVE
rangejoin age -5 5 using `controls', by(sex)

//    RANDOMLY SELECT ONE MATCH IF THERE ARE MORE
set seed 1234 // OR WHATEVER RANDOM NUMBER SEED YOU LIKE
gen double shuffle = runiform()
//    id IS THE CASE IDENTIFIER (ASSUMED NAME); rangejoin LEAVES THE CASE VARIABLES UNRENAMED
by id (shuffle), sort: keep if _n == 1
drop shuffle

Evidently, if you want a narrower or wider window, you can just change the -5 and 5 in the -rangejoin- command to whatever you like.

Note that when using -rangejoin-, it is unnecessary to rename variables yourself: -rangejoin- automatically adds a suffix (_U by default) to the names of the variables that come from the using (control) dataset.

To run the second version, you need to have the -rangejoin- command installed. It was written by Robert Picard and is available from SSC: -ssc install rangejoin-. It also requires -rangestat-, likewise available from SSC: -ssc install rangestat-.
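
One thing to be aware of with either version: each case draws its control independently, so the same control can end up matched to more than one case. The sketch below runs the exact-match version on a made-up six-person dataset and then counts how often each control is used. The data and the variable n_uses are hypothetical, and the identifier is assumed to be named id.

```stata
clear
input int id byte(group sex age)
1 1 1 60
2 1 1 60
3 1 2 45
4 2 1 60
5 2 2 45
6 2 2 45
end

// Separate controls and give their variables distinct names
preserve
keep if group == 2
rename * *_control
rename age_control age
rename sex_control sex
tempfile controls
save `controls'
restore

// Keep cases, rename, and join on exact age and sex
keep if group == 1
rename * *_case
rename age_case age
rename sex_case sex
joinby age sex using `controls'

// Randomly keep one candidate control per case
set seed 1234
gen double shuffle = runiform()
by id_case (shuffle), sort: keep if _n == 1
drop shuffle

// Count how many cases each selected control is matched to
bysort id_control: gen n_uses = _N
list id_case id_control n_uses
```

In this toy data, cases 1 and 2 can only be matched to control 4, so that control is necessarily used twice. Whether such reuse is acceptable depends on the analysis; if it is not, the repeated controls would need to be resolved before proceeding.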


Priyanka Acharya

Join Date: Sep 2016

Posts: 28
#3

21 Sep 2016, 06:02

Good morning,
Thank you for the code. I am trying to run the second version, with rangejoin. It runs fine up to restore, but when I run rangejoin age -5 5 using 'controls', by (Sex) it says invalid syntax. Any idea how I can correct it? I am typing this message on my phone, which is why the symbols surrounding the word controls look like that; in the actual code I used the appropriate symbols, as shown above. Thank you for all your help.
Priyanka Acharya

Join Date: Sep 2016

Posts: 28
#4

21 Sep 2016, 06:52

This is what I have done so far and got stuck.

. ssc install rangejoin, replace
checking rangejoin consistency and verifying not already installed...
all files already exist and are up to date.

. preserve

. keep if Group == 2
(195 observations deleted)

. tempfile controls

. save `controls'
file C:\Users\ACPR02~1\AppData\Local\Temp\ST_0t000002.tmp saved

. restore

. keep if Group == 1
(1,315 observations deleted)

. rangejoin age -5 5 using `controls', by(Sex)
invalid syntax
r(198);

my data description is given below:

  obs:         1,510
 vars:            40                          21 Sep 2016 07:42
 size:       558,700
---------------------------------------------------------------------------------------------
                storage   display    value
variable name   type      format     label      variable label
---------------------------------------------------------------------------------------------
sn              int       %8.0g                 SN
Center          long      %8.0g      Center     Center
patientno       str12     %12s                  Patient No
id              float     %9.0g
Group           long      %8.0g      Group      Group
age             byte      %8.0g                 Age
Sex             long      %8.0g      Sex        Sex
race            str16     %16s                  Race
Race            long      %16.0g     Race       Race
comorbids       str69     %69s                  Co-morbids
irreversiblea~n str50     %50s                  Irreversible Anticoag medication
Anticoag        long      %50.0g     Anticoag   Irreversible Anticoag medication
Aspirin         float     %9.0g      Yes_No     Aspirin Only
transfus1       byte      %8.0g                 Pre-op No. of Transfusions
Product1        long      %8.0g      Product1   Pre-op Type of Blood Product
Volume1         byte      %8.0g                 Pre-op Vol of Each Type of Blood Product (in
                                                Units)
Method1         str3      %9s                   Pre-op Method to control bleeding? (Mechanical
                                                or Topical agent)
lengthOP        int       %8.0g                 Length of operation (min)
bloodloss       int       %8.0g                 Blood loss (in cc)
transfus2       byte      %8.0g                 Intra-op No. of Transfusions
Product2        long      %13.0g     Product2   Intra-op Type of Blood Product
Volume2         long      %8.0g      Volume2    Intra-op Vol of Each Type of Blood Product (in
                                                Units)
Method2         long      %31.0g     Method2    Intra-op Method to control bleeding?
                                                (Mechanical or Topical agent)
Reversal1       long      %9.0g      Reversal1  Intra-op Reversal agent given? If so, specify
                                                type and volume
transfus3       byte      %8.0g                 Post-op No. of Transfusions
Product3        long      %8.0g      Product3   Post-op Type of Blood Product
Volume3         long      %8.0g      Volume3    Post-op Vol of Each Type of Blood Prodct (in
                                                Units)
methodtocontr~a str16     %16s                  Method to control bleeding? (Mechanical or
                                                Topical agent)
Method3         long      %16.0g     Method3    Post-op Method to control bleeding? (Mechanical
                                                or Topical agent)
Reversal2       str3      %9s                   Post-op Reversal agent given? If so, specify
                                                type and volume
SSI             byte      %8.0g                 Surgical Site Infection (SSI) If Yes, how many
                                                times?
DSI             byte      %8.0g                 Deep Space Infection (DSI) If Yes, how many
                                                times?
OSI             byte      %8.0g                 Organ Space Infection (OSI) If Yes, how many
                                                times?
LOS             byte      %8.0g                 Length of stay (days)
Readmission     byte      %8.0g                 # of Readmissions within 30 days
causeofeach30~n str44     %44s                  Cause of each <30 day re-admission
Mortality       float     %9.0g      Yes_No     30 day Mortality
Cause           long      %44.0g     Cause      Cause of each <30 day re-admission
anycomplicati~s str72     %72s                  Any Complications?
complication    byte      %8.0g      Yes_No     Any Complication
---------------------------------------------------------------------------------------------
Sorted by: Group
Priyanka Acharya

Join Date: Sep 2016

Posts: 28
#5

21 Sep 2016, 09:12

Greetings,

The code for the 1:1 matching based on the joinby command worked (I got a pool of 168 patients matched to 168 controls). I haven't figured out where I made the mistake with the rangejoin command, and it would be nice to know for future reference as well.

I do have a question about random sampling and will be grateful if anyone can point me in the right direction.

Same scenario as before: I have 195 cases and 1315 controls. In the case group (N=195), the mean age is 62 years (SD=13; range 25-90 years), and there are 91 females (46.67%) and 104 males (53.33%). In the control group (N=1315), the mean age is 37 years (SD=14; range 18-87 years), and there are 489 females (37.19%) and 826 males (62.81%). Now I want to randomly select controls such that there is no statistically significant difference between the two groups on the variables age and Sex. That is the only criterion. We want to keep as many patients as possible and have the program randomly eliminate the outlying patients, so that in the remaining pool there is no statistically significant difference in age and Sex between cases and controls. With the current data, the p-value for the age comparison between the two groups is <0.01, and the p-value for the Sex comparison (chi-squared test) is 0.011. We want to randomly select the sample so that the p-value is >0.05 for both variables when we run the analysis.
Thank you for any insight.
Sincerely,
Priyanka
Priyanka Acharya

Join Date: Sep 2016

Posts: 28
#6

21 Sep 2016, 09:13

PS: I closed the parenthesis and it converted to the wink symbol
Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#7

21 Sep 2016, 09:43

Now I want to randomly select controls such that there is no statistically significant difference between the two groups on the variables age and Sex.

But you should not want to do that.

The purpose of matching is to create comparison groups that are as similar as you can get them on important variables that would otherwise confound whatever analyses you are going to do. Statistical significance has nothing whatsoever to do with that. What you should be aiming for is a close match. Let's focus on just one of your variables, say age. Ultimately you want to compare the cases and controls on some exposure variable (the outcome in your regression model, or perhaps you will be doing cross-tabulations, whatever.) The only reason to care about the ages being different in the two groups is that if age is, itself, related to exposure, then any difference or lack of difference in the case and control exposure rates may be attributable to the age difference. If the association of age with exposure is really strong, then you need the ages to be very closely matched in order to exclude that possibility. If the relationship between age and the exposure is very mild, then moderately large age differences still would not materially bias your comparison of the cases and controls.

So the point is that the closeness of matching needed is that which assures that the remaining difference could account for only a negligible part of the difference in the exposure comparison of the cases and controls. A non-statistically significant difference is neither a necessary nor a sufficient condition for that. You shouldn't even calculate p-values for the differences in age and sex between your groups, let alone use them as a criterion. They're completely irrelevant in this context.
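
One way to look at closeness directly, rather than through a p-value, is to summarize the within-pair age differences. What follows is only a minimal sketch, assuming the wide matched layout that -rangejoin- produces (case age in age, control age in age_U). The ten pairs are the ones listed in the matched-data sample in #10 below, and age_gap is a name made up for the illustration.

```stata
clear
input byte(age age_U)
65 64
82 87
62 65
44 45
82 84
70 62
61 59
28 29
60 66
90 80
end

// Absolute age difference within each matched pair
gen age_gap = abs(age - age_U)
summarize age_gap, detail
```

How large a gap is tolerable depends on how strongly age is related to the exposure, as described above.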

As for your getting a syntax error with -rangejoin-, the problem is that you are using the wrong opening quote mark around controls. It should be `controls', not 'controls'. This is the way Stata references local macros, something you will need to be very familiar with to make effective use of Stata. The opening quote character is found, on US keyboards, on the key immediately to the left of the 1! key. It is different from the character on the key to the right of the :; key. (I'm sorry if these landmarks are not applicable to your keyboard.)

Notwithstanding the above, written in response to one place where I saw you type 'controls', I see that in the code and response you posted, you actually did use `controls' correctly. So that is not the source of that error. I don't see anything wrong with that syntax, and I use that construction often in my work without difficulty. I have one hunch. For this sequence of commands, which involves local macros, you must run all of the commands in a single, uninterrupted run. If you were highlighting chunks of the code in the do-file editor and then running them separately, you will get this error. The key is that the local macro controls is defined in the -tempfile- statement. If you run the -tempfile- command and then stop before you get to the -rangejoin- command, the local macro controls becomes undefined. Then when you try to execute -rangejoin age -5 5 using `controls', by(sex)-, the undefined local macro controls translates to an empty string, so that Stata thinks you are asking for -rangejoin age -5 5 using, by(sex)-, which is, indeed, a syntax error. If you run all the commands from -tempfile- through -rangejoin- without any interruption, I think this problem will be resolved.
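
A small sketch of the mechanism just described. The second half clears the macro by hand purely to imitate what happens when the later line is run separately; in real use the macro simply is not there.

```stata
// Run as one block: the local macro holds the temporary file name
tempfile controls
display `"using [`controls']"'

// Imitate a separate run: an undefined local expands to nothing,
// so the command Stata sees has no file name after -using-
local controls
display `"using [`controls']"'
```

The first -display- shows a temporary file path between the brackets; the second shows empty brackets, which is what -rangejoin- would receive in place of a file name.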

Last edited by Clyde Schechter; 21 Sep 2016, 09:50.
Priyanka Acharya

Join Date: Sep 2016

Posts: 28
#8

21 Sep 2016, 09:59

Thank you for your input. I have been trying to explain the same concept on matching and selecting sample to the PI of this dataset but haven't been successful. They have been very adamant about selecting based on p-value and I was trying to double check in case I missed the whole concept and if there is any weight to the concept of selecting based on p-value. So I really want to thank you for the reiteration of the concept.

As for rangejoin, I used the symbol as you said, and it still doesn't work. I did the following:
. preserve

. keep if Group == 2
(195 observations deleted)

. tempfile controls
. save `controls'
file C:\Users\ACPR02~1\AppData\Local\Temp\ST_0t000002.tmp saved

. restore

. keep if Group == 1
(1,315 observations deleted)

. rangejoin age -5 5 using `controls', by(Sex)
invalid syntax
r(198);
Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#9

21 Sep 2016, 12:04

Please post a sample of your data using the -dataex- command (-ssc install dataex-) so I can try to troubleshoot this. Did you also follow my advice to be sure that you are running all of this code as a single block, not line by line? I have used this syntax repeatedly in my work and never encountered this problem.
Priyanka Acharya

Join Date: Sep 2016

Posts: 28
#10

21 Sep 2016, 13:23

It finally worked! I solved the earlier problem with the rangejoin command by running the code as a block, but I kept getting another error, and that was solved when I increased the range to -10 10. Thank you so much, Dr. Schechter.

Also, I have a very novice question regarding the analysis of variables. Suppose I have a variable blood loss (measured in cc) that I want to compare between cases and controls. In my previous dataset, where I had group = 0/1 all together, I would just use ranksum bloodloss, by(group). Now, in this new dataset, a single observation has bloodloss (for the case) and bloodloss_U (for the control). Do I recreate the dataset so that the controls are stacked and there is only one Group variable, one blood loss variable, and so on, or do I follow an entirely different analysis technique? If you can just give me a direction, I'll figure out the rest myself. Once again, I really appreciate your help.

Here is a sample of both datasets, before and after matching:

Original:
input float id long Group byte age long(Sex Race) int(lengthOP bloodloss)
39 1 52 1 3 60 15
1449 0 40 1 4 56 10
739 0 23 1 3 50 10
689 0 55 1 2 28 5
383 0 19 1 4 86 20
1067 0 33 1 4 43 15
1378 0 19 1 4 25 15
898 0 19 1 4 30 5
1001 0 39 1 4 74 10
1040 0 23 1 1 53 5
end

------------------ copy up to and including the previous line ------------------

Listed 10 out of 1510 observations

New Dataset
clear
input float id long Group byte age long(Sex Race) int(lengthOP bloodloss) float id_U long Group_U byte age_U long Race_U int(lengthOP_U bloodloss_U)
1 1 65 1 1 47 75 570 0 64 1 50 10
2 1 82 1 3 31 10 254 0 87 1 61 .
3 1 62 2 1 55 5 1069 0 65 3 50 15
4 1 44 2 4 30 10 1003 0 45 4 63 10
5 1 82 1 6 . 5 889 0 84 4 37 5
6 1 70 2 3 30 3 931 0 62 4 39 5
7 1 61 1 3 31 20 883 0 59 3 47 10
8 1 28 1 4 25 10 1254 0 29 3 38 20
9 1 60 1 4 50 20 1113 0 66 3 27 5
10 1 90 2 6 . 5 1347 0 80 . 57 30
end
------------------ copy up to and including the previous line ------------------

Listed 10 out of 195 observations

Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#12

21 Sep 2016, 13:41

While it is usually better to keep things in long layout for analysis in Stata, in this particular case, you are better off keeping the matched pairs wide. The command you are looking for is -signrank bloodloss = bloodloss_U-.
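
A minimal sketch of that command on the ten matched pairs posted in #10 (only the two blood-loss columns are re-entered here). Pairs with a missing value on either side are left out of the test.

```stata
clear
input int(bloodloss bloodloss_U)
75 10
10 .
5 15
10 10
5 5
3 5
20 10
10 20
20 5
5 30
end

// Wilcoxon matched-pairs signed-rank test: case vs. its matched control
signrank bloodloss = bloodloss_U
```

The wide layout is what makes this convenient: each row is one pair, so the test compares each case with its own matched control.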
Priyanka Acharya

Join Date: Sep 2016

Posts: 28
#13

21 Sep 2016, 14:59

Thank you.

I have a random question and it might not make any sense but please bear with me. A simple yes no answer will suffice and if the answer is yes then some direction on how I'd achieve that will be great.

Is it possible to give Stata a command that basically says: here is a pool of controls and here is a pool of cases, please select as many controls as you need to make the control pool similar to the cases on age and sex? That is, without specifying 1:1 or 1:n matching, exact matching on gender, and so on?

Thanks,
Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#14

21 Sep 2016, 15:15

I know of no such command, and I doubt one could even exist. The problem is that the number of potential ways of pairing cases and controls is astronomically large (in your situation it's 1315¹⁹⁵),* and you are asking to optimize over that huge candidate solution space without providing any constraints on the search. Conceivably there might be some algorithm that would have a high probability of producing something close to an optimum and still run in time spans on the order of a human life. But I don't know of any.

There is also the problem that the optimization criterion is not even fully defined. Suppose we could get an exact match on age for every person, but gender was way off, or the other way around. Which would be better? And how would those compare to one that is a moderate match on both?

*Actually, the space is somewhat larger than that if we allow for the possibility that some cases may be left unmatched.
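
To put a rough number on that, the size of 1315¹⁹⁵ can be worked out on a log scale, since the value itself is too large for Stata to display directly. This is only back-of-the-envelope arithmetic.

```stata
// log10 of 1315^195 equals 195 * log10(1315)
display 195 * log10(1315)
```

The result is a little over 608, so the count of candidate pairings is a number with about 609 digits.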
Priyanka Acharya

Join Date: Sep 2016

Posts: 28
#15

21 Sep 2016, 17:03

Yes! Thank you so much for your reply. What you describe is what I was taught: matching requires specifying certain criteria. But I wanted to make sure I was not missing some new innovation before I say I cannot do that.
Once again, thank you very much for all your responses. They have been immensely helpful.