How to find the controls from a subset of cases in an already case-control matched sample

Branu Ags

Join Date: Mar 2019

Posts: 11
#1


09 Mar 2019, 14:42

Hi, I am new to this forum but have been using Stata for a while for very basic statistical analysis. I would appreciate any help.
I have a dataset in which cases and controls are matched on age, gender, and number of years of data. So for every 18-year-old female case with 2 years of data, I have an 18-year-old female control with 2 years of data.

I now want to divide my cases into 2 groups, say case 1 and case 2. How do I separate the controls that match case 1 (and label them control 1) from those that match case 2 (and label them control 2), so that I can compare age-, gender-, and year-matched case 1 with age-, gender-, and year-matched control 1, and so on? Please let me know if you need more information to help with this question. Thanks.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

09 Mar 2019, 17:47

To get a useful answer you need to show some example data, and you need to explain what variable or variables enable you to know, for a given observation, which observation it is matched to.

Please use the -dataex- command to show your example data. If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Branu Ags

Join Date: Mar 2019

Posts: 11
#3

09 Mar 2019, 20:22

Thanks. Below is my data (sorry, I had to choose 20 random observations to show what I wanted):
clear
input double id byte case float case1 int dobyr str1 sex byte num_years
1541626301 0 . 1947 "1" 2
29323325602 0 . 1957 "1" 1
2052746701 0 . 1965 "2" 4
2845525701 0 . 1975 "2" 6
3357809801 0 . 1975 "2" 4
29109149401 0 . 1976 "2" 5
1294197101 0 . 1977 "1" 4
1865647901 0 . 1980 "1" 1
1271583601 0 . 1980 "2" 7
27340945701 0 . 1981 "2" 2
1622326701 0 . 1988 "1" 7
1571984805 0 . 1992 "1" 2
3633020704 0 . 1998 "2" 1
652730001 1 0 1976 "2" 7
1061536804 1 0 1989 "2" 5
1367677901 1 0 1982 "1" 7
2426020402 1 1 1959 "1" 5
2630152401 1 0 1968 "2" 5
30161875404 1 1 1994 "2" 1
31813642702 1 0 1990 "2" 2
end

'case' is 1 for a case and 0 for a control.
I have 480,000 total observations, with 240K cases and 240K controls derived from a larger dataset and matched on dobyr, sex, and num_years.

Below is what I want to do.
My cases are grouped into 2 subgroups (case1=1 if the case is severe and case1=0 if the case is mild).
Now I want to compare the severe cases (case1=1) with the controls that match them on age, gender, and num_years, and I want to do the same for the mild cases. Basically, if I remove my mild cases, I also want to remove all the controls that match those cases, if that makes sense.

Honestly I don't know where to start. But I tried the following which increased my observations to some crazy 42 million observations and for every observation case was changed to 1. So I am sure this is not the right thing to do.

preserve
keep if case==0
tempfile controls
save `controls'

restore
keep if case==1

joinby dobyr sex num_years using `controls'

set seed 217
gen double shuffle = runiform()
by id_num (shuffle), sort: keep if _n == 1
drop shuffle

I look forward to your suggestions.
Thanks
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

09 Mar 2019, 22:23

I don't understand. You said that you already had matched pair data. But the code you attempted is code to create matched pairs in the first place (and it is close to correct for that purpose, by the way.)

Please clarify. If the data really are matched pairs already, there should be some variable or variables in that data set that indicate, for each observation, what other observation it is paired to. You don't show any such variable(s). If they are not already matched pairs, then the task is simply to create them. In that case, the code you wrote just needs a little bit of modification:

Code:

preserve
keep if case==0
rename id control_id
tempfile controls
drop case
save `controls'
restore

keep if case==1
rename id case_id
drop case
joinby dobyr sex num_years using `controls'

set seed 217
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()
by case_id (shuffle1 shuffle2), sort: keep if _n == 1
drop shuffle*

You need two shuffle variables here because your data set is large enough that one might not be sufficient to give a unique sort order.
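As an illustration of why the intermediate row count can look alarming: -joinby- forms every pairwise combination of case and control within each dobyr/sex/num_years group, and only the final -keep if _n == 1- step trims that back to one randomly chosen control per case. Here is a minimal sketch on made-up toy data (the ids and values are invented purely for illustration, not taken from the real data set):

```stata
* Toy data (made up): 2 cases and 3 controls sharing one matching pattern
clear
input long id byte case int dobyr str1 sex byte num_years
1 1 1980 "1" 2
2 1 1980 "1" 2
3 0 1980 "1" 2
4 0 1980 "1" 2
5 0 1980 "1" 2
end

preserve
keep if case == 0
rename id control_id
drop case
tempfile controls
save `controls'
restore

keep if case == 1
rename id case_id
drop case

* Every case is paired with every control in the same group: 2 x 3 rows here
joinby dobyr sex num_years using `controls'
count

* Keep one randomly chosen control per case
set seed 217
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()
by case_id (shuffle1 shuffle2), sort: keep if _n == 1
drop shuffle*
count
```

Note that this selects controls with replacement, so the same control_id can end up paired with more than one case.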
Branu Ags

Join Date: Mar 2019

Posts: 11
#5

10 Mar 2019, 05:24

I am sorry, I don't think I did a fair job explaining this. The above code didn't give the desired result. Let me try again.

I have 480K observations: 240K cases and 240K controls, matched 1:1 on age, sex, and num_years.
I want to remove my mild cases, so I dropped 90K cases, which leaves me with 150K cases and 240K controls in the dataset.
Now, what can I do so that my dataset has the 150K controls that match those remaining 150K cases? In other words, how can I drop the 90K controls that are no longer matched after I have dropped the 90K cases?

For example, in the above dataset, if I drop all observations for which case1==0, I will be left with the case1==1 observations, which are 150K cases, but there are still 240K controls. I also want to delete the matched control for every case with case1==0.

I guess I don't need to do any matching, I just need to be able to find the already matched control in the dataset if I make any changes to the case sample.

I do not have any variable specifically in the data set that indicate, for each observation, what other observation it is paired to. Is there a way to find the matched pair still?

Thanks.

Last edited by Branu Ags; 10 Mar 2019, 05:36.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#6

10 Mar 2019, 16:57

Well, if each possible combination of age sex and num_years uniquely identified a single case and a single control, you would be able to reconstruct the match. Test it, it might work:

Code:

isid case dobyr sex num_years
* leading underscores give the reshaped variables the names id_case, id_control, etc.
label define case 0 "_control" 1 "_case"
label values case case
decode case, gen(_case)
drop case
reshape wide id case1, i(dobyr sex num_years) j(_case) string

If that condition of unique identification is met, the first command will execute with no output. If it is not met, you will get an error message and execution stops there--and then there is no way to reconstruct which case is matched with which control.
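If you would rather check the condition without halting a do-file, something along these lines may help. This is only a sketch on made-up toy data; the scalar name isid_rc and the variable n_same are my own hypothetical names:

```stata
* Toy data (made up): two cases and two controls share one pattern,
* so case/dobyr/sex/num_years does not uniquely identify observations
clear
input long id byte case int dobyr str1 sex byte num_years
1 1 1980 "1" 2
2 1 1980 "1" 2
3 0 1980 "1" 2
4 0 1980 "1" 2
5 1 1975 "2" 4
6 0 1975 "2" 4
end

* -capture- suppresses the error; _rc is 0 only if the key is unique
capture isid case dobyr sex num_years
scalar isid_rc = _rc
display "isid return code: " isid_rc

* How many observations share their case/pattern combination with another one?
bysort case dobyr sex num_years: gen long n_same = _N
count if n_same > 1
```

A nonzero return code, or a nonzero count, means the pairing cannot be reconstructed from these variables alone.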

If the condition is met, the remainder of the code will convert the data into a set of 240,000 paired observations. Each observation will be characterized by its dobyr, sex, and num_years, and you will have variables id_case and id_control. You will also have new variables case1_case and case1_control. At that point you can drop if case1_case == 0, and you will be left with the case1 == 1 cases and their associated controls.
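To make the end result concrete, here is a small sketch on made-up toy data in which the uniqueness condition does hold (one case and one control per pattern). The ids and values are invented for illustration only:

```stata
* Toy data (made up): each pattern has exactly one case and one control
clear
input long id byte case float case1 int dobyr str1 sex byte num_years
1 1 1 1980 "1" 2
2 0 . 1980 "1" 2
3 1 0 1975 "2" 4
4 0 . 1975 "2" 4
end

isid case dobyr sex num_years
label define case 0 "_control" 1 "_case"
label values case case
decode case, gen(_case)
drop case
reshape wide id case1, i(dobyr sex num_years) j(_case) string

* Drop the pairs whose case is mild; severe cases and their controls remain
drop if case1_case == 0
list id_case id_control dobyr sex num_years
```

In this toy example only the 1980 pair survives, with the severe case and its control side by side in one observation.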

That said, it would surprise me if in a data set this large there were only a single case and a single control corresponding to every combination of dobyr, sex, and num_years. So if the above approach fails, then I think the task you have set out to do is not possible with this data set: the information about which case is associated with which control has simply been obliterated and there is no way to recover it. In this situation, I would go back to the source of the data set and ask them to re-generate it in such a way that one of the variables is the paired_id (the id number of the case or control that is paired with the current observation), or to recreate the data set to look like the one the code shown here attempts to create, where each observation contains a pair of ids, one case and one control.
Branu Ags

Join Date: Mar 2019

Posts: 11
#7

10 Mar 2019, 22:01

Thanks. The code didn’t work, for the reasons you already explained. I have asked the source of the original data for what you suggested.

Thanks again. This was so helpful.
Paul Dickman

Join Date: Apr 2014

Posts: 294
#8

11 Mar 2019, 10:34

Branu, you've received great advice on the Stata solution to your question. I suggest you also talk to an epidemiologist as it's possible you should consider different questions.

If you have a standard (i.e., not nested) case-control study then I don't think you need to identify the individual matched pairs. Although your matching is done at an individual level, what you end up with is a "frequency matched" case-control study. For each combination of the matching variables you will have multiple cases and controls, and the controls are interchangeable. Such a study can be analysed with unconditional logistic regression; you don't need to condition on the matched sets as you do in an individually matched case-control study. When you partition on severity, you can use all controls that match to a case. In short, if all covariate patterns occur among both the severe and the non-severe cases, then you can use all 240K controls for both analyses. Your data reduction exercise becomes simpler: within each value of severity, keep all controls that have the same matching variables as at least one of the cases.
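The data reduction described above could be sketched roughly as follows. This is a toy illustration on made-up data, not tested on the real data set; has_severe is a hypothetical variable name, and it assumes case1 == 1 marks severe cases as in the example data:

```stata
* Toy data (made up): one severe case, one mild case, four controls
clear
input long id byte case float case1 int dobyr str1 sex byte num_years
1 1 1 1980 "1" 2
2 1 0 1975 "2" 4
3 0 . 1980 "1" 2
4 0 . 1980 "1" 2
5 0 . 1975 "2" 4
6 0 . 1990 "2" 1
end

* Flag every matching pattern that contains at least one severe case
* (case1 is missing for controls, so the expression is 0 for them)
bysort dobyr sex num_years: egen byte has_severe = max(case == 1 & case1 == 1)

* Severe analysis set: severe cases plus all controls sharing a pattern with one
keep if (case == 1 & case1 == 1) | (case == 0 & has_severe == 1)
list id case case1 dobyr sex num_years
```

The same idea with case1 == 0 would give the analysis set for the mild cases; a control may legitimately appear in both sets.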

I'm not sure what 'years of data' is. If it is follow-up time, then you probably have a nested case-control study and things may be more complicated (although not a lot). If you do have a nested case-control study, I would be more concerned about whether the matching was done correctly since programmers often get it wrong. If the person who provided the data didn't provide you with an ID to identify the matched pairs (which would be good practice in general and vital in epidemiology because one conditions on the pair ID in the analysis) then I would check to make sure they have correctly identified the risk sets. For example, programmers not familiar with epidemiology will sometimes make erroneous restrictions such as only allowing individuals to be a control to one case or not allowing individuals to be controls if they later become a case. If you do have a nested case-control study then you can easily check this assuming you have unique individual ID numbers (which you appear to have). Just check that you have an appropriate number of individuals who are selected multiple times as controls and that you have controls who later become cases.
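One possible way to run that check is sketched below on made-up toy data. It assumes that the same id value is reused whenever an individual is sampled more than once; n_control, ever_case, and first are hypothetical names of my own:

```stata
* Toy data (made up): id 102 is sampled twice as a control,
* id 103 is a control who also appears as a case
clear
input long id byte case
101 1
102 0
102 0
103 0
103 1
104 0
end

* Number of times each individual appears as a control
bysort id: egen int n_control = total(case == 0)
* Whether the individual ever appears as a case
bysort id: egen byte ever_case = max(case == 1)

* Tag one observation per individual so that people, not rows, are counted
egen byte first = tag(id)
* Individuals selected more than once as a control
count if first & n_control > 1
* Individuals who are both a control and a case
count if first & n_control > 0 & ever_case == 1
```

If both counts are zero in a large nested case-control data set, that would be a reason to ask how the risk sets were constructed.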
Branu Ags

Join Date: Mar 2019

Posts: 11
#9

14 Mar 2019, 22:39

Makes so much sense. I will definitely take this into account. I have a standard case-control study.