Matching cases and controls 1:5

Joe Tuckles

Join Date: Jul 2018

Posts: 180
#1

22 Sep 2019, 20:23

Hi,

I would like to match cases and controls. I have a variable named case, where 1 = case and 0 = control. I have 168 cases and 860 controls and am hoping to match 1:5 on gender and age. Ages in my sample range from 7 years 8 months to 10 years 6 months.

I have tried the following code, matching on gender only, as I am not sure how to add matching on age as well. It runs without errors, but it does not seem to work: at the end my case variable has disappeared, and I am left with no variables identifying the matched cases and controls.

Code:

use "C:\Users\jtuckles\Downloads\dataset.dta", clear
preserve
keep if case
drop case
tempfile cases
save `cases'
restore
drop if case
drop case
ds cdgender, not
rename (`r(varlist)') =_ctrl
tempfile controls
save `controls'
use `cases'
joinby cdgender using `controls', unmatched(master)
set seed 8846
gen double shuffle = runiform()
duplicates drop
by uniqueid (shuffle), sort: keep if _n <= 5
drop shuffle
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#2

22 Sep 2019, 21:14

A StataList search on /case control match/ revealed a number of relevant previous threads, a particularly relevant one of which is here:

I would say though, that if you want matching without replacement, I don't think you'll be successful in getting a 5:1 match with only slightly more than 5X as many potential controls as cases.

(Re your attempted solution: Something in that direction might work, in regard to which note that -joinby- permits a varlist, e.g. -joinby gender age using ...-. Your case variable disappeared because you dropped it from both of your files. Rather than work with what you have, though, I'd go with one of the approaches in the link above.)
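To make the point about the size of the control pool concrete, here is a minimal sketch on simulated data. Only the variable names case and cdgender and the counts (168 cases, 860 controls) come from #1; the random gender assignment is an assumption for illustration. It counts cases and controls within each gender stratum, which is what limits 5:1 matching without replacement:

```stata
* toy data: 168 cases, 860 controls, gender assigned at random (assumption)
clear
set seed 1234
set obs 1028
gen byte case = _n <= 168
gen byte cdgender = runiformint(1, 2)

* overall, there are only about 5.1 potential controls per case
display 860/168

* within each matching stratum, compare the controls available with the cases
bysort cdgender: egen n_cases = total(case)
bysort cdgender: egen n_controls = total(!case)
gen double controls_per_case = n_controls / n_cases
tabstat n_cases n_controls controls_per_case, by(cdgender) statistics(mean)
```

Any stratum with fewer than 5 controls per case cannot give every case 5 distinct controls, and adding age to the matching criteria makes the strata smaller still.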

Clyde Schechter

Join Date: Apr 2014
Posts: 29953
#3

22 Sep 2019, 21:16

When I create a toy data set that follows the data structure you describe, I am unable to reproduce your difficulty. The code works just fine:

Code:

. clear

. set obs 1028
number of observations (_N) was 0, now 1,028

. gen byte case = _n <= 168

.
. set seed 1234

. gen byte cdgender = runiformint(1, 2)

. gen int uniqueid = _n

. gen other_variable = runiform()

.
. preserve

. keep if case
(860 observations deleted)

. drop case

. tempfile cases

. save `cases'
file C:\Users\CLYDES~1\AppData\Local\Temp\ST_55c0_000002.tmp saved

. restore

. drop if case
(168 observations deleted)

. drop case

. ds cdgender, not
uniqueid      other_vari~e

. rename (`r(varlist)') =_ctrl

. tempfile controls

. save `controls'
file C:\Users\CLYDES~1\AppData\Local\Temp\ST_55c0_000003.tmp saved

.
. use `cases'

. joinby cdgender using `controls', unmatched(master)

.
. set seed 8846

. gen double shuffle = runiform()

. duplicates drop

Duplicates in terms of all variables

(0 observations are duplicates)

. by uniqueid (shuffle), sort: keep if _n <= 5
(71,448 observations deleted)

. drop shuffle

.
. summ

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    cdgender |        840     1.47619    .4997303          1          2
    uniqueid |        840        84.5    48.52546          1        168
other_vari~e |        840    .5162136    .2924038   .0015615   .9930941
      _merge |        840           3           0          3          3
uniqueid_c~l |        840    601.4976    250.0266        170       1025
-------------+---------------------------------------------------------
other_vari~l |        840    .5054344    .2915343   .0001751   .9985999

Please repost including an example (use -dataex-) of your data that exhibits the problem, and also show the exact output you get from running your code, as well as the output from running -summ- at the very end.

Added: Crossed with #2. I will add to Mike Lacy's excellent insights that the disappearance of your case variable is not only expected with the code, but is in no way a problem. Within the final file, you will have two variables, uniqueid and uniqueid_ctrl, which identify the case and its matched control respectively. You are no longer in a layout where cases and controls are separate observations: each observation now contains a case and a control, and the variable names distinguish the information about them.
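The paired layout can be inspected with a short sketch along these lines. It rebuilds a cut-down version of the toy data above (only uniqueid and cdgender are kept, and the control ID is renamed by hand to uniqueid_ctrl, which is a simplification of the -rename- step in the original code). It then checks that every case has 5 rows and counts how many distinct controls were drawn:

```stata
* toy data as in the log above, reduced to the ID and matching variable
clear
set seed 1234
set obs 1028
gen byte case = _n <= 168
gen byte cdgender = runiformint(1, 2)
gen int uniqueid = _n

preserve
keep if !case
drop case
rename uniqueid uniqueid_ctrl
tempfile controls
save `controls'
restore
keep if case
drop case

* pair every case with every same-gender control, then keep 5 at random
joinby cdgender using `controls', unmatched(master)
set seed 8846
gen double shuffle = runiform()
by uniqueid (shuffle), sort: keep if _n <= 5
drop shuffle

* each row is one case-control pair; every case should appear 5 times
bysort uniqueid: assert _N == 5

* controls are drawn separately for each case, so one control may be reused
egen byte first_use = tag(uniqueid_ctrl)
count if first_use
display "distinct controls used: " r(N) " of " _N " matched pairs"
```

If the count of distinct controls is below the number of pairs, some controls serve more than one case, that is, the matching is with replacement.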

Last edited by Clyde Schechter; 22 Sep 2019, 21:18.


Joe Tuckles

Join Date: Jul 2018

Posts: 180
#4

22 Sep 2019, 22:15

Hi, thank you for your responses.

OK, there isn't an issue with the code; the issue was my lack of understanding. However, I now have 1,895 observations rather than the 840 in your example. Please can I clarify the following:

1. How do I bring the variable cdage into this? I want to match on age as well as gender.
2. How do I now test whether there is a difference between cases and controls? For example, I want to see whether cases have a higher mean BMI than controls. I now have two BMI variables, BMI and BMI_ctrl; how do I do a paired t-test? The new variables, and the lack of a single variable that defines cases and controls, have thrown me. It now looks like my cases have been duplicated 5 times.
3. Given my sample size, would it be better to match 2:1 or 3:1?

Last edited by Joe Tuckles; 22 Sep 2019, 22:17.

Joe Tuckles

Join Date: Jul 2018
Posts: 180
#5

23 Sep 2019, 19:21

Hi, I'm having problems trying to also match on age. The age range is small: 7 years 8 months to 10 years 6 months. Since age is recorded as years.months, no two participants have exactly the same age, so ideally I would like to match each case with controls whose age is within 6 months to 1 year of its own. However, I am getting an error message:

Code:

. use "C:\Users\jtuckles\Downloads\dataset.dta", clear


. preserve

.
. keep if case
(860 observations deleted)

.
. drop case

.
. tempfile cases

.
. save `cases'
file C:\Users\jtuckles\AppData\Local\Temp\ST_2a1c_00000v.tmp saved

.
. restore

.
. drop if case
(379 observations deleted)

.
. drop case

.
. ds cdgender, not
uniqueid     variables

.
. rename (`r(varlist)') =_ctrl

.
. tempfile controls

.
. save `controls'
file C:\Users\jtuckles\AppData\Local\Temp\ST_2a1c_00000w.tmp saved

.
. use `cases'


.
. rangejoin cdage -1 1 using `controls', by(cdgender)
  (using rangestat version 1.1.1)
invalid syntax
r(111);

.
. set seed 8846

.
. gen double shuffle = runiform()

.
. duplicates drop

Duplicates in terms of all variables

(0 observations are duplicates)

.
. by uniqueid (shuffle), sort: keep if _n <= 3
(0 observations deleted)

.
. drop shuffle

. clear

Last edited by Joe Tuckles; 23 Sep 2019, 19:50.


Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#6

26 Sep 2019, 09:54

Well, you can't just do a paired ttest on BMI and ctrl_BMI here. That would be fine with 1:1 matching. But you have the same cases being used with each of 5 observations, so the observations are not independent. You need to change the data layout so that cases and controls are separate observations, but linked by a variable that shows which case each control is attached to, and another variable identifying which are cases and which are controls. Then you can -xtset- with the caseid variable, and you can emulate the closest thing to a paired ttest using -xtreg, fe-.

There is no requirement that you have the same number of controls for each case. If you would like to have 5 controls per case but find that not all get completely matched, it is not a problem. If you have an aesthetic preference for the same number of controls per case and you are getting too many unmatched cases with 5:1 matching, then by all means try something less demanding like 3:1 or 2:1. There's no principled answer here. Just do what works.

Your description of your age variable is not complete. I cannot tell if you have two variables, one for years and the other for months (0-11) or a single variable that gives the age in months. In the code below, I assume the former, and from that I calculate cdage to be the age in months.

The error message you are getting from -rangejoin- is misleading. There is no syntax error in what you typed, but there is a problem. -joinby- does not rename the variables from the -using- data set, so for -joinby- the variables in the -`controls'- tempfile had to be renamed with the _ctrl suffix. But -rangejoin- does not want different names in the -using- file: it wants the same variable names, and it renames the -using- variables itself (by default it appends _U to names that clash, or you can specify a prefix or suffix as an option). The problem -rangejoin- is encountering is that there is no variable cdage in the -using- data set, because it was renamed to cdage_ctrl. As it happens, that mismatch in names ends up leading to a syntax error in a command deep inside -rangejoin-. Anyway, the solution is to not rename the variables in the -`controls'- tempfile and just let -rangejoin- supply the prefix. All in all, it looks like this:

Code:

clear
set obs 1028
gen byte case = _n <= 168
set seed 1234
gen byte cdgender = runiformint(1, 2)
gen int uniqueid = _n
gen byte age_yrs = runiformint(7, 10)
gen byte age_mos = runiformint(0, 11) if !inlist(age_yrs, 7, 10)
replace age_mos = runiformint(8, 11) if age_yrs == 7
replace age_mos = runiformint(0, 6) if age_yrs == 10
gen cdage = 12*age_yrs + age_mos
gen BMI = runiform()

preserve
keep if case
drop case
tempfile cases
save `cases'
restore
drop if case
drop case
// ds cdgender, not
// rename (`r(varlist)') =_ctrl
tempfile controls
save `controls'

use `cases'
rangejoin cdage -6 6 using `controls', by(cdgender) prefix(ctrl_)

set seed 8846
gen double shuffle = runiform()
duplicates drop
by uniqueid (shuffle), sort: keep if _n <= 5
drop shuffle

// GO TO LAYOUT WITH CASES IN SEPARATE OBSERVATIONS
clonevar caseid = uniqueid
preserve
ds caseid cdgender ctrl_*
keep `r(varlist)'
rename ctrl_* *
gen case = 0
save `"`controls'"', replace
restore
drop ctrl_*
// EACH CASE WAS REPEATED ONCE PER MATCHED CONTROL; KEEP ONE ROW PER CASE
duplicates drop
gen case = 1
assert uniqueid == caseid
append using `controls'
sort caseid case uniqueid
xtset caseid
xtreg BMI i.case, fe
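As a sanity check on the age window, a minimal sketch such as the following may help. It assumes -rangejoin- and -rangestat- are installed from SSC, uses a small toy data set with a hypothetical ID variable named id, and measures age in whole months (92 to 126, i.e. 7 years 8 months to 10 years 6 months). It confirms that every paired control lies within 6 months of its case and shows how many candidate controls each case has before any are sampled:

```stata
* requires: ssc install rangestat, and ssc install rangejoin
* toy data: 10 cases, 50 controls (assumption, for illustration only)
clear
set seed 1234
set obs 60
gen int id = _n
gen byte case = _n <= 10
gen byte cdgender = runiformint(1, 2)
gen int cdage = runiformint(92, 126)    // age in months

preserve
keep if !case
drop case
tempfile controls
save `controls'      // variable names left unchanged for rangejoin
restore
keep if case
drop case

* pair each case with same-gender controls aged within +/- 6 months
rangejoin cdage -6 6 using `controls', by(cdgender) prefix(ctrl_)

* every paired control must fall inside the window
assert abs(ctrl_cdage - cdage) <= 6 if !missing(ctrl_id)

* candidate controls per case (0 means the case found no match)
bysort id: egen n_candidates = count(ctrl_id)
tabulate n_candidates
```

Cases with fewer candidates than the target ratio will end up with fewer controls, which ties in with the point above that the number of controls need not be the same for every case.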

Last edited by Clyde Schechter; 26 Sep 2019, 09:58.

Joe Tuckles

Join Date: Jul 2018
Posts: 180
#7

29 Sep 2019, 20:10

Hi Clyde,

Thank you for your help. The variable for age is set in years.months, for example one participant has the age written: 8.528639, where they are 8 years old and 528639 months (I realise that sounds very odd). Do I simply multiply this variable by 12? I tried this and generated a new variable, which I have named age_months. However I am getting the following error:

Code:

. gen age_months = cdage*12

. preserve

. keep if case
(860 observations deleted)

. drop case

. tempfile cases

. save `cases'
file C:\Users\jtuckles\AppData\Local\Temp\ST_4790_00000t.tmp saved

. restore

. drop if case
(379 observations deleted)

. drop case

. ds cdgender, not
uniqueid      variables etc

. rename (`r(varlist)') =_ctrl

. tempfile controls

. save `controls'
file C:\Users\jtuckles\AppData\Local\Temp\ST_4790_00000u.tmp saved

. use `cases'

. rangejoin age_months -6 6 using `controls', by(cdgender) prefix(ctrl_)
  (using rangestat version 1.1.1)
invalid syntax
r(111);


Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#8

29 Sep 2019, 20:20

"However I am getting the following error:"

This issue was dealt with in #6. Please re-read that post. When doing -rangejoin- you must not add the _ctrl suffix to the variable names in the control file. That -rename (`r(varlist)') =_ctrl- command has to go. If you read the code in #6, you will notice that there this command is preceded by two forward slash characters. That means it is commented out: it is not executed. -rangejoin- itself will provide the ctrl_ prefix instead. The renaming in the control data set prevents -rangejoin- from finding the variable age_months there, and that, deep inside the code of -rangejoin-, causes a certain command to have a syntax error.

"The variable for age is set in years.months, for example one participant has the age written: 8.528639, where they are 8 years old and 528639 months (I realise that sounds very odd)."

That isn't possible. 528639 months is longer than any human lifespan has been. It's almost geologic time units. My interpretation would be that the .528639 is a fraction of a year. If you multiply 8.528639 by 12 you will get the age in months, in this case 102.34367. But that is very problematic for matching because it is not an integer. So I think I would take another step and either round it to the nearest integer or truncate it down before trying to use it for matching. (-help round()- or -help floor()-)

That said, this is peculiar enough that you should probably double-check the meaning of this variable with whoever provided you with the data. If they didn't provide documentation that explains it, then I suggest you contact them.
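The conversion described above can be sketched as follows, assuming (as suggested, and still to be confirmed with the data providers) that cdage is age in years with a decimal fraction. The first value is the one quoted in #7; the other two are made-up values at the ends of the stated age range:

```stata
clear
input double cdage
8.528639
7.666667
10.5
end

* assumption: the decimal part of cdage is a fraction of a year
gen double age_months_exact = 12*cdage
gen int age_months = round(age_months_exact)         // nearest whole month
gen int age_months_floor = floor(age_months_exact)   // completed months
list

* 12 * 8.528639 = 102.343668, so both versions give 102 for that participant
assert age_months == 102 in 1
assert age_months_floor == 102 in 1
```

Either integer version can then be used as the key variable in -rangejoin-; which one is appropriate depends on how the age was originally recorded.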
Joe Tuckles

Join Date: Jul 2018

Posts: 180
#9

29 Sep 2019, 20:26

Ah, apologies, I did not realise the purpose of the // and had edited them out.
Good idea, I will contact the researchers. I just realised that would make that person over 44,000 years old, ha!

Joe Tuckles

Join Date: Jul 2018
Posts: 180

#10

29 Sep 2019, 22:28

Hi,

I have completed the following coding:

Code:

. replace cdage = round(cdage, 0.1)
(1,239 real changes made)

. browse

. replace age_months = cdage*12
(1,239 real changes made)

Then followed your coding as above.

My uniqueid number contains letters in it so it is a string variable. As a result I have done the following coding and it has produced the following results, could you advise if this is correct:

Code:

egen panel = group( caseid)

. xtset panel
       panel variable:  panel (balanced)

. xtreg bmi i.case, fe

Fixed-effects (within) regression               Number of obs     =        861
Group variable: panel                           Number of groups  =        370

R-sq:                                           Obs per group:
     within  = 0.0305                                         min =          1
     between = 0.0160                                         avg =        2.3
     overall = 0.0198                                         max =          4

                                                F(1,490)          =      15.43
corr(u_i, Xb)  = 0.0076                         Prob > F          =     0.0001

------------------------------------------------------------------------------
         bmi |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      1.case |   2.436372   .6203375     3.93   0.000     1.217522    3.655221
       _cons |   20.91224   .2932383    71.31   0.000     20.33608     21.4884
-------------+----------------------------------------------------------------
     sigma_u |  5.9605618
     sigma_e |  6.5845452
         rho |  .45038357   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(369, 490) = 2.33                    Prob > F = 0.0000


Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#11

29 Sep 2019, 22:38

Looks good.
Joe Tuckles

Join Date: Jul 2018

Posts: 180
#12

29 Sep 2019, 22:40

Thank you. Apologies, but could you advise whether this shows that BMI is higher in cases or in controls? Is there additional coding I need to find that out? I can see it is significantly different.
Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#13

29 Sep 2019, 22:52

It is higher in the cases. The coefficient of 1.case is E(BMI | case == 1) - E(BMI | case == 0). In your results that's a positive number.

"I can see it is significantly different."

Yes you can, but you shouldn't look at it that way. The American Statistical Association has recommended that significance tests no longer be used. See https://www.tandfonline.com/doi/full...5.2019.1583913 for the "executive summary" and https://www.tandfonline.com/toc/utas20/73/sup1 for all 43 supporting articles. Or https://www.nature.com/articles/d41586-019-00857-9 for the tl;dr. Instead, report your results as the coefficient (expected difference in BMI) along with its standard error or the 95% CI. Show the p-value too if you wish. But don't dichotomize it into "statistically significant" or "not statistically significant." Rather, state whether you think a difference of about 2.4 units of BMI (kg/m²?) is clinically meaningful or meaningful from a public health perspective. And if so, would it still be meaningful in those terms if it were really only 1.2 (the lower 95% confidence limit)?
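As a small worked example of reporting the estimate with its interval, the sketch below recomputes the 95% confidence limits from the coefficient, standard error and residual degrees of freedom shown in the output in #10. The scalar names are arbitrary choices for illustration:

```stata
* numbers copied from the xtreg output in #10
scalar b  = 2.436372       // coefficient of 1.case
scalar se = 0.6203375      // its standard error
scalar df = 490            // residual degrees of freedom, from F(1,490)

* two-sided 95% interval from the t distribution
scalar tcrit = invttail(df, 0.025)
display "difference in mean BMI (cases - controls): " %5.2f b
display "95% CI: " %5.2f (b - tcrit*se) " to " %5.2f (b + tcrit*se)

* the limits should agree with those printed in the table in #10
assert abs((b - tcrit*se) - 1.217522) < 0.001
assert abs((b + tcrit*se) - 3.655221) < 0.001
```

Run directly after -xtreg-, the same quantities are available as _b[1.case], _se[1.case] and e(df_r), so nothing needs to be copied by hand.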
Joe Tuckles

Join Date: Jul 2018

Posts: 180
#14

29 Sep 2019, 22:55

Thank you, that's very helpful information. I'm sorry to ask another question but is it possible to control for a cluster variable (that is categorical) using this statistical test or would I need to move on to clogit?
I also have a categorical variable I wish to examine - smoking with 1= yes ever smoked and 0= no never smoked. Can this test be used for this variable? Or would it be better to just do a clogit with both variables included
Many thanks

Last edited by Joe Tuckles; 29 Sep 2019, 23:34.
Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#15

30 Sep 2019, 12:34

"Thank you, that's very helpful information. I'm sorry to ask another question but is it possible to control for a cluster variable (that is categorical) using this statistical test or would I need to move on to clogit?"

Yes, you can add other variables to the regression to adjust for their effects.

"I also have a categorical variable I wish to examine - smoking with 1= yes ever smoked and 0= no never smoked. Can this test be used for this variable? Or would it be better to just do a clogit with both variables included"

Do you mean "examine" it as an outcome variable? Or do you just want to include it as another predictor of BMI? The choice between a linear regression model and a logistic model really doesn't depend on the nature of the predictors, only the nature of the outcome.

Linear probability models can be very useful. They provide estimates of risk-differences rather than odds ratios. Generally for policy and public health purposes, risk-differences are more useful than odds ratios. The drawback of a linear probability model is that it can result in predicted probabilities outside the [0, 1] range. However, if the distributions of your variables, and the prevalence of the outcome, are supportive, this may not happen (or may be very rare) in a particular data set. The only way to know for sure is to try it and see.
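To illustrate the two options, here is a minimal sketch on simulated matched data. The variable smoker, the 1:3 matched sets and the smoking probabilities are all assumptions made up for the example. It fits a linear probability model with matched-set fixed effects, checks whether any fitted probabilities leave the [0, 1] range, and then fits the conditional logistic model mentioned in #14:

```stata
* toy matched data: 50 sets, each with 1 case and 3 controls (assumption)
clear
set seed 2019
set obs 200
gen int caseid = ceil(_n/4)
bysort caseid: gen byte case = _n == 1

* hypothetical binary variable: ever smoked, a bit more common among cases
gen byte smoker = runiform() < (0.3 + 0.1*case)

* linear probability model with matched-set fixed effects;
* the coefficient of 1.case estimates a risk difference
xtset caseid
xtreg smoker i.case, fe

* count fitted probabilities (including the set effect) outside [0, 1]
predict double phat, xbu
count if !inrange(phat, 0, 1)

* conditional logistic alternative: case status as the outcome,
* smoking as the exposure, reported as an odds ratio
clogit case i.smoker, group(caseid) or
```

The first model answers the question in terms of a difference in smoking prevalence, the second in terms of an odds ratio; which is preferable depends on what is to be reported, as discussed above.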
