Matching cases and controls 1:5

Joe Tuckles

Join Date: Jul 2018

Posts: 180
#16

30 Sep 2019, 19:41

So I just want to see whether cases have higher rates of multiple different variables compared to controls. So do cases have a higher BMI compared to controls? Do cases have higher rates of smoking compared to controls? etc. I have both continuous and categorical variables. I'm not sure whether to examine each variable individually as I have done above for BMI using xtreg, or put them all into a model (using xtreg or clogit?). Either way I will need to control for a categorical cluster variable and will probably need to control for variables such as age, gender and SES. Can you advise based on this what you think sounds most sensible?

I should also add that some of these predictors have about 20-30% missing data, mostly all from the same people (i.e. data is missing across every single variable for 20-30% of participants, except for the variable that defined whether they were cases or controls). I am not sure whether to remove these people entirely and redo the matching or keep them in as they can be identified as a case/control but there is just no data on their predictor variables.

Many thanks for your continued help.

Last edited by Joe Tuckles; 30 Sep 2019, 20:04.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#17

30 Sep 2019, 20:03

Well, it sounds like BMI, smoking, etc. are dependent variables. So you would want to do a separate regression for each one. By contrast, the variables you want to adjust for (you don't actually control for variables except in an experimental context where you actually do control them) are independent variables and can be entered together on the right hand side. So it will look something like this:

Code:

xtreg BMI i.case age i.sex i.ses, fe
xtlogit smoking i.case age i.sex i.ses, fe
// etc.

(Note: you might be able to do a linear probability model (-xtreg- instead of -xtlogit-) for smoking or some of your other outcome variables. But if you can't do it for all of them, it will probably be best to just use -xtlogit- for all of them so that the results will be easier to understand and require less detailed explanation of technical matters to audiences that probably aren't really interested in technical matters.)
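
For anyone who wants to try the two commands before running them on real data, here is a minimal sketch on simulated data. Everything in it is an assumption made up for illustration: the variable match_group (the matched-set identifier used as the panel variable), the 200 sets of 1 case and 5 controls, and all the numbers used to generate BMI and smoking.

```stata
* Toy data only: every name and number below is an assumption for illustration
clear
set seed 2019
set obs 200                              // 200 hypothetical matched sets
gen long match_group = _n
expand 6                                 // 1 case + 5 controls per set
bysort match_group: gen byte case = (_n == 1)
gen age = rnormal(50, 10)
gen byte sex = runiformint(1, 2)
gen byte ses = runiformint(1, 3)
gen BMI = 25 + 1.5*case + 0.05*age + rnormal(0, 3)
gen byte smoking = runiform() < invlogit(-1 + 0.5*case)

xtset match_group                        // the matched set is the panel variable
xtreg BMI i.case age i.sex i.ses, fe     // continuous outcome
xtlogit smoking i.case age i.sex i.ses, fe   // binary outcome
```

Note that -xtlogit, fe- will drop any matched set in which the outcome does not vary, so its estimation sample can be smaller than the one -xtreg- uses.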

In any kind of regression analysis, observations that have missing values for any of the variables that are mentioned in the regression command are automatically excluded from the estimation sample. So you don't have to specifically exclude those cases and controls with missing data. That will happen on its own. In fact, there is no way to keep them in!
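
A small sketch of that behavior, using simulated data (the variables x and y and the 25 missing values are assumptions for illustration only):

```stata
* Toy data only: variable names and values are assumptions for illustration
clear
set seed 2019
set obs 100
gen x = runiform()
gen y = 2*x + rnormal()
replace x = . in 1/25        // 25 observations lose their predictor value

regress y x
display e(N)                 // size of the estimation sample
count if e(sample)           // same observations, flagged by e(sample)
```

Both the -display- and the -count- should report 75, the observations with complete data on y and x. After any estimation command, e(sample) can be used the same way to see exactly which observations were used.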
Comment
Joe Tuckles

Join Date: Jul 2018

Posts: 180
#18

30 Sep 2019, 20:06

Fantastic thank you!
Comment

Paula Ramirez

Join Date: Feb 2021
Posts: 2

#19

09 Feb 2021, 18:42

Originally posted by Clyde Schechter

When I create a toy data set that follows the data structure you describe, I am unable to reproduce your difficulty. The code works just fine:

Code:

. clear

. set obs 1028
number of observations (_N) was 0, now 1,028

. gen byte case = _n <= 168

.
. set seed 1234

. gen byte cdgender = runiformint(1, 2)

. gen int uniqueid = _n

. gen other_variable = runiform()

.
. preserve

. keep if case
(860 observations deleted)

. drop case

. tempfile cases

. save `cases'
file C:\Users\CLYDES~1\AppData\Local\Temp\ST_55c0_000002.tmp saved

. restore

. drop if case
(168 observations deleted)

. drop case

. ds cdgender, not
uniqueid other_vari~e

. rename (`r(varlist)') =_ctrl

. tempfile controls

. save `controls'
file C:\Users\CLYDES~1\AppData\Local\Temp\ST_55c0_000003.tmp saved

.
. use `cases'

. joinby cdgender using `controls', unmatched(master)

.
. set seed 8846

. gen double shuffle = runiform()

. duplicates drop

Duplicates in terms of all variables

(0 observations are duplicates)

. by uniqueid (shuffle), sort: keep if _n <= 5
(71,448 observations deleted)

. drop shuffle

.
. summ

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
    cdgender |        840     1.47619    .4997303          1          2
    uniqueid |        840        84.5    48.52546          1        168
other_vari~e |        840    .5162136    .2924038   .0015615   .9930941
      _merge |        840           3           0          3          3
uniqueid_c~l |        840    601.4976    250.0266        170       1025
-------------+---------------------------------------------------------
other_vari~l |        840    .5054344    .2915343   .0001751   .9985999

Please repost including an example (use -dataex-) of your data that exhibits the problem, and also show the exact output you get from running your code, as well as the output from running -summ- at the very end.

Added: Crossed with #2. I will add to Mike Lacy's excellent insights that the disappearance of your case variable is not only expected with the code, but it is in no way a problem. Within the final file, you will have two variables: id and ctrl_id which will identify the id's of the case and the matched controls respectively. You are no longer in a layout where cases and controls are separate observations: each observation now contains a case and a control, and the variable names distinguish the information about them.

"

Hello, Clyde. I have a question about the topic already mentioned by user Crossed #3. In my case-control study, I used 1:4 matching, but after following the instructions, the variable that indicates case/control status (case 0, control 1) is left with only one category (only cases). Thus, it is impossible to run the panel data regression (xtreg). Any suggestions?

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#20

09 Feb 2021, 19:24

Yes, the code you show produces a layout that is not quite ready for panel data regression. That is because each case appears in 5 observations, each time paired with one of the matched controls. A bit more data management is required to get to panel data structure. The following additional commands get you there:

Code:

drop _merge
gen long pair_num = _n
gen match_group = uniqueid
rename (uniqueid other_variable) =_case
reshape long uniqueid other_variable, i(pair_num) j(_case_control) string
by match_group _case_control, sort: drop if _case_control == "_case" & _n > 1
xtset match_group
encode _case_control, gen(case_control)
drop _case_control

You now have a data set with match_group as the grouping variable, and case_control as an indicator of case vs control status.
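
If it helps, the resulting layout can be checked before running any regression. The sketch below does not rerun the commands above; it builds a toy data set directly in the final long layout (168 groups, each with 1 case and 5 controls, values made up) and then verifies the structure. The helper variables n_cases and n_in_group are hypothetical names introduced only for this check.

```stata
* Toy data in the final long layout; names follow the post above, values are made up
clear
set seed 2021
set obs 168                              // 168 matched groups
gen match_group = _n
expand 6                                 // 1 case + 5 controls per group
bysort match_group: gen byte case_control = cond(_n == 1, 1, 2)
label define case_control 1 "_case" 2 "_ctrl"
label values case_control case_control
gen other_variable = runiform()

* Each group should hold exactly one case and five controls
bysort match_group: egen n_cases = total(case_control == 1)
bysort match_group: gen n_in_group = _N
assert n_cases == 1
assert n_in_group == 6

xtset match_group
xtreg other_variable i.case_control, fe
```

On real data the same two -assert- lines (with 6 replaced by one plus the number of controls per case) will stop with an error if any group lost its case or some of its controls during matching.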
Comment
Paula Ramirez

Join Date: Feb 2021

Posts: 2
#21

10 Feb 2021, 05:28

Thanks a lot! We will go through your indications and let you know how it works.
Comment
javes omwansa

Join Date: May 2022

Posts: 2
#22

26 May 2022, 20:19

Dear Clyde Schechter, I have seen your code for matching a case to multiple controls, but each time I used it I got results that I could not explain. At the end of matching, my file seems to contain only cases, replicated, with no trace of the controls. My data set has a total of 569 obs (189 cases and 380 controls). I needed to match them 1:2 (case:control) by age and gender. However, I end up with around 251-259 total obs containing only replicated cases. I was expecting to do matching without replacement, since I have enough controls for each case, but I have not been successful. Could you give me some ideas, please (if possible, the command)? Also, what result should I expect from a successful matching? I assume that after matching I should have 189 sets of 1 case and 2 controls (189 sets, each containing 3 individuals). Thank you.
Comment
