Case-cohort analysis

Scott Adams

Join Date: Sep 2014

Posts: 51
#1


05 Sep 2014, 16:11

Hi,
I'm looking for a little help analyzing data from a case-cohort study. For reference, a case-cohort study is not a case-control study and not quite a normal cohort study; it is a special study design of its own, but I don't want to use up a lot of space describing it here. A good article about it is Barlow (here http://www.ncbi.nlm.nih.gov/pubmed/10580779 ) except note that the methods there don't quite apply to my data, as described below.

Another statistical package has a few tools for this design, but there are fewer in Stata, which I prefer.

In some detail, I have age-stratified case-cohort data that includes about 1000 sub-cohort members and 500 cases. The sub-cohort is an age-stratified sample of the larger cohort of about 11,000 people. This falls under the "confounder stratified" design described in Langholz and Jiao, Computational Statistics and Data Analysis, 51:3737-3748 (2007).
[http://www.sciencedirect.com/science...7947306005068]
(The stratified design makes the Barlow article not quite fit.)

I know that there are user-made packages called STCASCOH and STSELPRE for case-cohort data.

So I was wondering:
1. Is there anything more recent for analysis of case-cohort data?
2. It seems like STCASCOH and STSELPRE are a bit limited and do not handle, e.g., robust vce or stratified data. Is that true?
3. On the other hand, just using STCOX stratified on the age group, with robust vce and an offset, seems to get me very close to the example results in Langholz and Jiao (Table 2(B), using the very helpful model data supplied in that paper's supplementary material). But I am worried that (as an epidemiologist and not a statistician) I am missing something and will be doing bad things when I use this on my real dataset. The model dataset is small and the differences between some of the example results are pretty small, so it is hard to say whether I am just close by luck. However, STSELPRE does not produce the published results with any set of options I have found.

Does anyone know what would be a correct way to analyze these data with Stata? Can you help me justify the use of STCOX with the options correctly specified based on the method and theory laid out by Langholz & Jiao 2007?

I am happy to share more details of how close I get to L&J 2007 as needed, if someone has an idea what is going on.

Thanks,
Scott
Murray Weeks

Join Date: Jul 2016

Posts: 1
#2

28 Jul 2016, 08:39

Hi Scott,
I was wondering the same thing... whether STCOX would be adequate or whether one of these user-written programs would be needed. I would be very interested to know if you had found an answer to this question elsewhere.

Cheers,
Murray
Scott Adams

Join Date: Sep 2014

Posts: 51
#3

12 Oct 2016, 15:53

Hi Murray, Sorry not to have seen this earlier.
I did find a resolution; below I have copied an earlier post or message I sent to another user. I'm not sure why it does not appear here with this post. I'm not good with this forum.
Unfortunately I have since changed employers and it would be a little tricky for me to get the original code. Hope the below helps...
*****

Hi Johanna,
As a matter of fact, I think I did work through it and I am happy to share. (The results were accepted for publication so ...) I will send you what I worked out, but it will take me a couple of days: I am not at my desk for the next couple of days and may not remember all the details. However, I think that I used the -stcascoh- command with the work-around described here,
http://www.stata.com/statalist/archi.../msg01199.html
in order to set up the data for case-cohort sampling.

The article from Langholz and Jiao,
http://www.sciencedirect.com/science...67947306005068
is also helpful because it provides some sample data -- so you can see what set of options in Stata are equivalent to the analyses they describe.
Using these two sources I was able to convince myself, at least, that I could do stratified case-cohort analysis. As I said, I am away from the office at the moment, so I am not 100% sure what final set of options I thought made the most sense (and it may depend on your data and what you want to do). But I will get back to you.
In the meantime I would suggest looking at that article from Langholz and Jiao and reproducing their results from their sample data, to get an idea how to make it work in Stata.
Best,
Scott

Johanna,
I do highly recommend working through the L&J paper. For me, working through an example and getting the "known" answer from my code is extremely reassuring.

Here is my code for setting up the data for stratified (by age) case-cohort analysis. Age in this analysis is a confounder. (As I am sure you know, `whatever' is a macro; in this case the macros are filenames. This helps me make sure I always use the right file version throughout a very long do-file...) ER+ refers to estrogen receptor positive breast cancer, but in general I think it could be any case subtype.
At the very bottom you can see the -stcox- option I use to run the model. In the end I stratified on the age sampling and included an offset(logw) where logw is the log of the sampling weight. As a practical matter in my cohort, it did not change results appreciably (i.e., HR changed by ~0.001 or something).
Best of luck,
Scott

----------------------------------------

use "`datafile0'", clear

count

//stset additional-case only file (cases not in subcohort)
keep if subcoho==0 & case==1
save "`datafilecases'", replace
count
stset time, id(id) fail(case==1)
stcascoh , alpha(0)
save "`datacaseST'", replace

//stset subcohort file (includes overlap cases)
use "`datafile0'", clear
keep if subcoho==1
count
save "`datafilesub'", replace

stset time, id(id) fail(case==1)
stcascoh , alpha(0.99999999999)
save "`datasubcohST'", replace

//append datasets after the STCASCOH
append using "`datacaseST'"

// change Barlow weights to correct values for subcohort (N=1050)
// out of full cohort (N=9905)
// these are non-age-stratified weights, so not very useful.
replace _wBarlow=ln(1050/9905) if _wBarlow>0

// 2-6-15
// change logw to the correct value for the second record (the failure) of
// cases in the sub-cohort
replace logw=0 if _subco==1 & _t0!=0 // this record enters after t=0, so it is the failure
// set logw=0 for cases NOT in the subcohort
replace logw=0 if _subco==2

// some proofing of set-up
tab _subco
tab _subco censorstat ,m
// 1. cases in the sub-cohort (N=50) should have two records. The first enters
// at enrollment and exits just before failure, with log(w)=log(n/m)
// The second enters only just before failure and exits at failure, with log(w)=0.
sort id
list id _subco _t0 _t logw if _subco==1 in 1/100

// 2. Cases not in the subcohort
//    - enter just prior to failure
//    - have offset of log(w)=0
list id _subco _t0 _t logw if _subco==2 in 1/10

// 3. Subcohort members (non-cases) should have t0=0 and some exit time,
//    and logw=log(n/m)
//    (list only a subset)
list id _subco _t0 _t agesample n m logw if _subco==0 in 1/20
save "`datafinal'", replace
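
// Optional automated version of the proofing above (a sketch only, not part
// of the original analysis). It assumes the _subco coding used in this
// do-file (0 = subcohort non-case, 1 = case in subcohort, 2 = case outside
// subcohort) and one original record per id before -stcascoh-.
// nrec_chk is a hypothetical scratch variable; -assert- stops the do-file
// if a check fails.
bysort id: gen long nrec_chk = _N
assert nrec_chk == 2 if _subco == 1           // subcohort cases: two records each
assert logw == 0 if _subco == 1 & _t0 != 0    // their failure record has log(w)=0
assert logw == 0 if _subco == 2               // non-subcohort cases have log(w)=0
assert _t0 == 0 if _subco == 0                // subcohort non-cases enter at t=0
drop nrec_chk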

// 4. Take care with cases also included in the sub-cohort.
//    This needs some clarity, for example, when doing sub-analyses restricted to
//    ER+ cases (or any characteristic of cases). In that circumstance, you want to
//    drop the case record for ER- but keep the subcohort record.

sort id
duplicates report id
duplicates tag id, gen(case_and_sub)
label var case_and_sub "Case also in subcohort"
order case_and_sub, after(subcoho)

save "`datafinal'", replace

// ... some other stuff not relevant ...

stcox i.treatment `adj0' , strata(agesample) vce(robust) offset(logw)

// note: `adj0' is a macro containing all the adjustment variables; again, I use this just for consistency
// e.g., adj0 = "i.bmi i.haircolor" or whatever variables you want to adjust for
// at this moment I don't quite recall how I decided on this offset, which is the log of the sampling weights; probably something L&J say.
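
For readers wondering how an offset like logw could be built: -stcox- adds the offset() variable to the linear predictor with its coefficient fixed at 1, so a record with log weight logw contributes w*exp(xb) to the risk-set sum. The sketch below computes a per-stratum log weight, assuming the weight is the inverse of the stratum-specific sampling fraction. The variable names N_cohort, n_subcoh and w are hypothetical, and the stratum counts are invented (only their totals, 9905 and 1050, come from the code above). Whether this matches the n and m variables in the listing above is an assumption.

```stata
* Hypothetical stratum counts: the totals match the N=9905 and N=1050 quoted
* in the do-file above, but the split across age strata is made up.
clear
input byte agesample long N_cohort long n_subcoh
1 4000 300
2 3500 350
3 2405 400
end

* Assumed weight: inverse of the stratum-specific sampling fraction
gen double w    = N_cohort / n_subcoh
gen double logw = ln(w)

list agesample N_cohort n_subcoh w logw, noobs
```

In a real analysis these stratum-level values would be merged onto the person-level records by agesample, and then set to 0 on the failure records as in the do-file above.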