help with data management for analysing timing of event occurrence (matched cohort design)

Orit Schieir

Join Date: May 2015

Posts: 18
#1

26 May 2015, 17:51

Hi,

I would greatly appreciate some guidance with Stata code that would help prepare my dataset for an analysis of the timing of event occurrence.

Data structure: longitudinal panel data from a population-based survey with 16 years of follow-up. Information on chronic conditions and lifestyle factors is updated every 2 years, with up to 8 discrete waves of data collection for each participant.

Research questions:

Do persons with arthritis have a higher prevalence of heart disease and major risk factors prior to onset of arthritis (index date) compared to the general population? (Separating time into the 2 years prior to arthritis diagnosis, and any time prior to arthritis diagnosis.)

Is arthritis an independent risk factor for incident heart disease and major risk factors for heart disease in the general population (excluding persons with these prevalent conditions)?

What I would like to do ideally is create a matched cohort design, where the first panel wave where a participant reports having arthritis is considered the "index date" for matching purposes and then a ratio of controls (could be 3 or 2:1) are selected for each arthritis case matching on age, sex and time in study.

Thanking you in advance

Orit
Clyde Schechter

Join Date: Apr 2014

Posts: 29964
#2

26 May 2015, 18:03

Well, you don't tell us much about the variables in the data set, such as how you would tell if a patient has a diagnosis of arthritis at a given time. For illustrative purposes, I'll assume there is a variable, arthritis, that takes on 0/1 values to indicate that (and may also be missing). I also assume you have a date variable, called date, which is never missing and is a numeric variable encoding a Stata date, not a string variable. I assume there is a unique identifier for each person that is consistent over the waves of the data set, call it id. Again, lacking any description from you, I will assume that the data are in long layout, that is, there is a separate observation in the data set for each wave within each person. Then:

Code:

egen date_first_arthritis_dx = min(cond(arthritis==1, date, .)), by(id)

Note: For participants who never have arthritis = 1, this code will set date_first_arthritis_dx to missing. For all others it will show the earliest date at which arthritis = 1 for that participant.

The -egen- functions are indispensable for data management in Stata. Familiarizing yourself with them in the manual will amply repay your time and effort.
Orit Schieir

Join Date: May 2015

Posts: 18
#3

27 May 2015, 07:36

Hi,

Thanks, Clyde, for your quick reply. The assumptions you make above are correct; sorry if the information in my original post was insufficient. I have actually already coded the first occurrences in the panel data using commands similar to what you describe above. Where I really need help is on the matching aspect for the first question.

How would I tell Stata to match a ratio of controls to each case that I have already identified? For example, a person is followed forward starting at time 0 and reports arthritis for the first time at the 4th visit: how can I select a ratio of controls for that case, matching on age, sex and time? I would then want to estimate whether persons with arthritis have a higher prevalence of specific chronic conditions in the 2 years prior to diagnosis (so really just the cycle preceding the first cycle in which the condition is reported to be present) as well as any history of the chronic conditions prior to the diagnosis cycle, respectively.

Thanks!

Orit
Orit Schieir

Join Date: May 2015

Posts: 18
#4

27 May 2015, 07:37

clarification: I would then want to estimate if persons with arthritis have a higher prevalence of specific chronic conditions in the 2 years prior to diagnosis (so really just the cycle preceding the first cycle the condition is reported to be present for the first time) as well as any history of the chronic conditions prior to diagnosis cycle, respectively, relative to controls.
Clyde Schechter

Join Date: Apr 2014

Posts: 29964
#5

27 May 2015, 09:06

When you say match on age, sex, and time in study, I assume you want an exact match on sex. But how close a match in age is acceptable? And how close a match on time in study?
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#6

27 May 2015, 09:38

You refer to "date of first arthritis diagnosis". Do you have the exact date? Or do you have only the presence at the date of one of the two-year followups?

Terminology: you don't appear to have a "matched-cohort" design, but rather a matched case-control design nested in a cohort study. See, in contrast, Cummings & McKnight, 2004, or de Kraker, 2010, who retrospectively constructed two matched cohorts.

Reference:
Cummings, P., & McKnight, B. (2004). Analysis of matched cohort data. Stata Journal, 4, 274-281. http://ageconsearch.umn.edu/bitstream/116248/2/sjart_st0070.pdf

de Kraker, M. (2010). Parallel Matched Cohort Design Has Advantages over a Standard Cohort Design when Estimating the Burden of Methicillin Resistant S. aureus Blood Stream Infections. https://shea.confex.com/shea/2010/we...Paper1818.html

Last edited by Steve Samuels; 27 May 2015, 10:22.

Steve Samuels
Statistical Consulting

Stata 14.2
Orit Schieir

Join Date: May 2015

Posts: 18
#7

28 May 2015, 09:42

Hi Clyde and Steve,

My data is longitudinal panel data with information updated every 2 years at discrete time points, so I do not have an exact date of diagnosis; rather, the first wave at which the participant reports a diagnosis of arthritis is assumed to be the "index date" for the purposes of my analysis. I would be interested in matching exactly on time in study, but age in 5-year age bands would be fine. I assume I could create a new categorical age variable in 5-year bands and use that for the matching. I thought about it and think I will forgo matching on sex, since I think it would be interesting to perform a stratified analysis by sex to see if there are gender differences in the effects of arthritis on comorbidity.

Steve, you are quite right that my first question is a nested case-control analysis and the second is a cohort analysis. For the case-control part, does Stata have functions for incidence density sampling or matching? Would sttocc be appropriate?

Just to summarize: using study time, I would want to match on index date, i.e. min(cond(arthritis==1, time in waves, .)), and on age in 5-year age bands.

How do I create an outcome variable representing prevalence of a specific chronic condition in the 2 year period prior to index date?

My ideal would be to define separate risk periods based on time to estimate heart disease comorbidity. Please excuse my crude diagram below

Dotted line (follow up)

x person reports arthritis for the first time

0 a control with no record of arthritis in the entire study period from 1994/95 to 2010/11 selected to be a control at the same time as arthritis case is identified matched on age (5 year age bands)

CC - first occurrence of chronic comorbidity

<--> risk periods to estimate odds of prevalent comorbidity in arthritis vs. controls (separately for 2 years prior to arthritis index date, and any time prior to arthritis index date) as well as estimate the hazard ratio for a first incident diagnosis of CC after the arthritis index date in those at risk (no history of CC before index date).

<----------------------------------------------------->
CC<---------->

-------------------------------------------------------01

-------------------------------------------------------x1

-------------------------------------------------------0<---------------------------------->CC

-------------------------------------------------------x<---------------------------------- >CC

T1       T2       T3       T4       T5       T6       T7       T8       T9
1994/95  1996/97  1998/99  2000/01  2002/03  2004/05  2006/07  2008/09  2010/11

I'm at a loss for how to get the data in the right format to do this.

Thanks!

Orit
Orit Schieir

Join Date: May 2015

Posts: 18
#8

28 May 2015, 09:49

Sorry diagram got pushed over once posted. Here it is again

<------------------------------------------------>
............................................... <------->

--------------------------------------CC--------0

---------------------------------------------------x

---------------------------------------0<-------------------------------------->

---------------------------------------x<------------------------------------- >CC

T1       T2       T3       T4       T5       T6       T7       T8       T9
1994/95  1996/97  1998/99  2000/01  2002/03  2004/05  2006/07  2008/09  2010/11
Clyde Schechter

Join Date: Apr 2014

Posts: 29964
#9

28 May 2015, 12:35

"How do I create an outcome variable representing prevalence of a specific chronic condition in the 2 year period prior to index date?"

You already have your index date, and since the survey waves are two years apart, I would do it like this:

Code:

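* number the waves within each participant in date order
by id (date), sort: gen int wave = _n
* wave number at which the index date falls (missing if the participant has no index date)
egen int index_wave = min(cond(date == index_date, wave, .)), by(id)
egen byte condition_before_index_wave = max(cond(wave == index_wave - 1, condition, .)), by(id)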

Notes:
1. If a participant missed the wave immediately preceding his/her index wave, you will get a missing value here.
2. I have used the variable "condition" to denote a 0/1 variable which shows the presence of the condition(s) you are focusing on. I leave it to you to create such a variable from whatever information about these chronic conditions you have.
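
For the second window asked about (any history of the condition before the index date), a minimal sketch along the same lines follows. It assumes the same illustrative variable names (id, date, index_date, condition) and uses a small made-up data set in which plain year numbers stand in for Stata dates; none of this comes from the real survey.

```stata
* Toy data: 2 participants, 4 waves each (names and values are illustrative assumptions)
clear
input id date condition index_date
1 1994 0 1998
1 1996 1 1998
1 1998 0 1998
1 2000 1 1998
2 1994 0 2000
2 1996 0 2000
2 1998 0 2000
2 2000 1 2000
end

* 1 if the condition was reported at any wave strictly before the index date, 0 if never;
* missing if no pre-index wave has a nonmissing value of condition
egen byte condition_ever_before_index = max(cond(date < index_date, condition, .)), by(id)
```

Here participant 1 reported the condition in 1996, before the 1998 index date, while participant 2 reported it only at the index date itself, so it does not count as prior history.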

Regarding matching, I think it is not a good idea to use age-bands for matching. Suppose you have age bands 35-39 and 40-44. Then you are willing to match a person who has just turned 40 with somebody who is 44 years 364 days old, but not with somebody who is 39 years 364 days old. It is usually better to match based on a maximum difference in age. The matching code I show below will work that way. I assume you are starting from a data set that contains, among other things, variables id, age, index_date, and that you have another variable, cc, identifying cases (1) and controls (0). You want an exact match on index_date, and an age match with a maximum difference of 5 years, and 3 controls per case. Bear in mind that there may not actually be three acceptable controls in your data for each case, so the end result will match each case with up to 3 controls.

Code:

keep id age index_date cc
isid id, sort
preserve
keep if cc == 1
rename id case_id
rename age case_age
tempfile cases
save `cases'
restore
keep if cc == 0
rename id control_id
rename age control_age
tempfile controls
save `controls'
use `cases', clear
joinby index_date using `controls'
gen delta_age = abs(case_age - control_age)
keep if delta_age < 5

At this point, every case is now paired with every control it might acceptably match with. There are several ways to proceed from here. You can simply select 3 controls at random for each case from among these acceptable case-control pairs. Or you can select, for each case, the three closest matching pairs (with ties on closeness of matching broken at random). I will assume you want to do the latter. There is another choice you can make: you can allow the same control to be matched to more than one case, or you can require that every case have distinct controls. The latter approach risks having a larger number of cases for whom no match can be found at all.

Code:

set seed your_lucky_number_here
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()
/* IF YOU WILL NOT ALLOW A CONTROL TO MATCH TO MORE THAN ONE CASE,
THEN INSERT THE FOLLOWING CODE, WHICH RETAINS ONLY THE BEST MATCHING
CASE FOR ANY CONTROL (BREAKING TIES AT RANDOM)
by control_id (delta_age shuffle1 shuffle2), sort: keep if _n == 1
*/
by case_id (delta_age shuffle1 shuffle2), sort: keep if _n <= 3
drop delta_age shuffle1 shuffle2
save matching_index, replace

The file matching_index.dta will now contain up to 3 controls for each case. Each control will have the same index_date as the case, and will be within 5 years of age of the case. You can then -merge- this file with your other data to bring in the variables you need for your nested case-control study.

Note: The use of two double-precision random numbers to break ties is probably overkill, but you don't say how large your data set is, and if it is really huge one might really need that much in order to assure unique tie-breaking.
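
As a rough sketch of the -merge- step mentioned above: matching_index.dta holds one observation per case-control pair, so one way to proceed is to stack it into one observation per person per matched set and then bring in person-level variables. The set identifier set_id, the variable female, and the toy values below are all illustrative assumptions, and the small in-memory data sets only stand in for matching_index.dta and for your own person-level file.

```stata
* Toy stand-in for a person-level file with analysis variables (illustrative only)
clear
input id female
1 0
2 1
11 1
12 0
end
tempfile person_vars
save `person_vars'

* Toy stand-in for matching_index.dta: one row per case-control pair
clear
input case_id control_id index_date
1 11 14000
1 12 14000
2 11 14730
end
gen long set_id = case_id        // matched-set identifier: the case's id

* Stack the pairs into one row per person per matched set
preserve
keep set_id case_id index_date
duplicates drop                  // each case should appear once per set
rename case_id id
gen byte cc = 1
tempfile case_rows
save `case_rows'
restore
keep set_id control_id index_date
rename control_id id
gen byte cc = 0
append using `case_rows'

* Bring in person-level variables; a control reused in several sets gets the same values in each
merge m:1 id using `person_vars', keep(match master) nogenerate
sort set_id cc id
```

Variables that depend on the index date, such as the condition status in the wave before it, would have to be computed after this step, because a control takes its index date from the case it is matched to.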
Orit Schieir

Join Date: May 2015

Posts: 18
#10

28 May 2015, 12:55

Wow. Thanks, Clyde, I can't thank you enough!

I will try this out in the next few days and let you know how it goes.

Thanks again

Orit
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#11

28 May 2015, 15:39

sttocc will form risk groups from which you can draw matched control samples. Unfortunately, with such heavily grouped data, I don't know that it's possible to estimate incidence-density ratios. You are better off with cloglog or logit models to predict arthritis at each year of interview.
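
A minimal sketch of what such a discrete-time model could look like, run on simulated person-wave data. Every variable name (id, wave, age, female, arthritis), the 5% per-wave reporting probability, and the covariate choice are assumptions made purely for illustration, not features of the actual survey.

```stata
* Simulated person-wave data: 500 people, 9 waves each (illustrative assumptions throughout)
clear
set seed 12345
set obs 500
gen long id = _n
gen byte female = runiform() < 0.5
gen int age0 = 30 + floor(40 * runiform())
expand 9
by id, sort: gen byte wave = _n
gen int age = age0 + 2 * (wave - 1)
gen byte arthritis = runiform() < 0.05
* once reported, arthritis stays reported at later waves
by id (wave), sort: replace arthritis = 1 if _n > 1 & arthritis[_n-1] == 1

* Keep only person-waves still at risk: the first wave, or waves where the previous report was 0
by id (wave): gen byte at_risk = (_n == 1) | (arthritis[_n-1] == 0)
keep if at_risk

* Discrete-time hazard of a first arthritis report, with one baseline term per wave
cloglog arthritis i.wave c.age i.female, vce(cluster id)
```

Replacing cloglog with logit fits the corresponding discrete-time logistic model on the same at-risk person-waves.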

Steve Samuels
Statistical Consulting

Stata 14.2