
  • Fine and Gray model with categorical time varying covariate

    Hi everyone,

    I am working on a project that investigates the relationship between obesity and the incidence of cancer in dialysis patients. By incorporating both baseline obesity status and chronological changes in BMI (which can happen multiple times) as time-varying covariates, the study aims to determine whether obesity impacts the risk of developing cancer and cancer-related mortality, while accounting for competing risks such as getting a transplant and non-cancer death. The analysis uses a Fine and Gray competing risks model to assess these associations.

    I have attached an example of what my data look like after I formatted each person into multiple lines according to their chronological BMI changes.


    This is done using
    Code:
      stset bmi_time, failure(failuretype==1) id(id) origin(startdate) scale(365.25)
    bmi_date is the date on which the obesity status was documented
    bmi_time is the date on which that obesity status ended
    failuretype == 1 is the event of interest
    failuretype == 2 and 3 are the competing risks

    I then use
    Code:
     stcrreg i.obese, compete(failuretype==2 3)
    to fit the Fine and Gray model. After this I will add other potential confounders to the model.
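    For example, the adjusted model might look something like this (the covariates shown are only illustrative placeholders; I have not decided on the final set of confounders yet):
    Code:
     stcrreg i.obese i.agecat_dx i.male i.diabetes, compete(failuretype==2 3)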

    My questions are:
    1. Is my approach correct? This is my first time working with time-varying covariates with multiple changes. I manually split the data without using the "split at failure times" method, which is often recommended. Interestingly, my Fine and Gray model shows a reduced risk for the obese cohort, while the Kaplan-Meier curve suggested an elevated risk. Could this discrepancy be due to my approach?
    2. The model runs very slowly, and I need to fit it multiple times to test different covariates. Is this typical, and are there any ways to improve the speed?
    3. For an upcoming analysis, I will have both transplant status and obesity status as time-varying covariates. Can I apply the same approach that I used here?
    Many thanks in advance for your help!

    Kind Regards,
    Bree

  • #2
    Sorry, there is no discrepancy between the Kaplan-Meier and F-G model output.



    • #3
      I have attached an example of what my data look like after I formatted each person into multiple lines according to their chronological BMI changes.
      No, you haven't. There is some glyph in your post that follows that sentence, but there is no actual attachment.

      On top of that, as the FAQ reminds us all, the helpful way to show data is not with attachments (which some of us will not download, for safety reasons or because it's just too much trouble), but with the -dataex- command. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
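      For example, something along these lines would produce a postable example of your data (the -in 1/10- range is just illustrative; choose whatever subset shows the problem):
      Code:
       ssc install dataex    // only needed if -dataex- is not already part of your installation
       dataex in 1/10        // lists the first 10 observations in a form others can paste into Stata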

      Since the answer to your question depends entirely on how your data are organized, I don't think anyone can answer until you actually show it.



      • #4
        Hi Clyde,

        Thank you for your reply. I have copied the code below.

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str9 id int startdate float(enddate endstatus failuretype obese_vary bmi_date bmi_time last_record pid) byte(_st _d) int _origin double(_t _t0)
        "P01" 14610 15194 0 2 0 14610 15194 1 1 1 0 14610 1.5989048596851472                  0
        "P02" 14610 18547 0 0 1 14610 14700 0 2 1 0 14610  .2464065708418891                  0
        "P02" 14610 18547 1 1 0 14700 18547 1 2 1 1 14610 10.778918548939084  .2464065708418891
        "P03" 14610 15448 0 0 1 14610 15065 0 3 1 0 14610 1.2457221081451062                  0
        "P03" 14610 15448 0 0 0 15065 15248 0 3 1 0 14610 1.7467488021902806 1.2457221081451062
        "P03" 14610 15448 0 0 1 15248 15430 0 3 1 0 14610  2.245037645448323 1.7467488021902806
        "P03" 14610 15448 0 2 0 15430 15448 1 3 1 0 14610  2.294318959616701  2.245037645448323
        "P04" 14610 17049 0 0 1 14610 14700 0 4 1 0 14610  .2464065708418891                  0
        "P04" 14610 17049 0 2 0 14700 17049 1 4 1 0 14610  6.677618069815195  .2464065708418891
        "P05" 14610 14789 0 0 1 14610 14700 0 5 1 0 14610  .2464065708418891                  0
        end
        format %td startdate
        format %td enddate
        format %td bmi_date
        format %td bmi_time
        format %td _origin
        label values endstatus es
        label def es 0 "Censored (alive/transplant/lost to fu)", modify
        label def es 1 "De novo cancer (cancer/cancer death/withdrawal death - malignancy)", modify
        label values obese_vary obese_vary
        label def obese_vary 0 "Not obese", modify
        label def obese_vary 1 "Obese", modify

        Many Thanks,
        Bree
        Last edited by Bree Shi; 05 Oct 2024, 20:47.



        • #5
          This looks good to me.

          As a matter of style, I would not use bmi_date and bmi_time as the names of the variables that mark the interval for obese_vary. That's because, in Stata, date and time variables are different things that have to be manipulated with different functions, yet both of these variables are what Stata would call date variables. And even in a different system that doesn't have separate data types for dates and times, if you had to come back and look at this code 6 months after you last saw it, would you know from the names what these variables represent? I don't think so. So I would probably call them bmi_from and bmi_to, or something similar where the name actually tells you what the variable is about. But, as I said, this is a matter of coding style--it won't affect the way your code works, just the way you or other readers/coders will relate to it.
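          Just as a sketch of what I mean (purely optional, of course):
          Code:
           rename bmi_date bmi_from
           rename bmi_time bmi_to
          You would then refer to bmi_to rather than bmi_time in your -stset- command.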



          • #6
            Hi Clyde and all,

            Thanks so much for the help so far! I'm encountering a significant issue with my current model. I have around 90,000 observations for 70,000 subjects and 4,000 events, and it's taking over an hour to run the following command alone:
            Code:
            stcrreg i.obese_vary, compete(failuretype==2 3)
            I then tried:
            Code:
            stepwise, pr(0.05): stcrreg i.obese_vary i.agecat_dx i.male i.ethnicity  i.lasttreatment i.prd i.lung i.cv i.diabetes i.smoking , compete(failuretype==2 3)
            This has been running for over 24 hours without any output. I’m using an iMac with a 3.6 GHz 8-Core Intel Core i9 processor.

            My questions are:
            1. Is there anything I can do to optimize performance, or do I need a more powerful machine?
            2. My next step is to add transplant status as a time-varying covariate, alongside BMI, which will make the dataset even larger. Are there any strategies for handling models of this size more efficiently?
            Any advice would be greatly appreciated!

            Thanks,
            Bree



            • #7
              These run times do not surprise me; if anything, given the size of the data set, I would have expected them to take even longer. -stcrreg- is much slower than the corresponding -stcox- commands. You might be able to speed it up by doing the following:

              Code:
              stset, clear
              contract bmi_time failuretype id startdate obese_vary
              
              stset bmi_time [fweight = _freq], failure(failuretype==1) id(id) origin(startdate) scale(365.25)
              stcrreg i.obese_vary, compete(failuretype==2 3)
              What this does is aggregate observations that are identical and then run the analysis on the aggregated data with frequency weights. By eliminating redundant information and reducing the size of the data set, the estimation may be sped up. Just how much of a gain this provides depends on the extent of the redundancy in your data set. At one extreme, if there are no observations with identical values on all of the model variables, then there will be no size reduction and no speedup. At the other extreme, if there are only, say, 1,000 truly distinct observations in the data set, you will see a dramatic improvement in performance.
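              If you want a sense of how much redundancy there is before trying this, something like the following might help (just a sketch, using the same variable list as the -contract- command above):
              Code:
              duplicates report bmi_time failuretype id startdate obese_vary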


              As for the -stepwise- analysis, I am the wrong person to ask for advice about this. In my opinion, -stepwise- estimation is an entirely bogus procedure. I am not aware of any use case for which it produces valid results. I would never advise anybody to use it at all, let alone how to make it run faster.



              • #8
                Thank you Clyde.

                I am trying to plot the unadjusted CIFs based on the data. However, the CIF curves returned look very wrong, and I don't know what I have done wrong here. Would this be because my treatment group is a time-varying covariate (i.e. obesity status)? I have attached the code and graph below.


                Code:
                stcompet CI = ci hilim = hi lowlim = lo, compet1(2) compet2(3) by(obese_vary)    
                gen CI_no=CI if obese_vary==0
                label var CI_no "Not obese"
                gen CI_ob=CI if obese_vary==1
                label var CI_ob "Obese"
                gen hi_no=hilim if obese_vary==0
                gen lo_no=lowlim if obese_vary==0
                gen hi_ob =hilim if obese_vary==1
                gen lo_ob=lowlim if obese_vary==1
                twoway line CI_* _t if failuretype, sort yti("Cumulative Incidence") xti("time") by(failuretype, style(compact) rows(1) yrescale legend(pos(12)) note("")) subtitle(, ring(0) pos(11) nobexpand) clp("-##" solid) xla(0(5)20) xsca(range(20))
                Attached file: Graph.gph



                • #9
                  I don't see anything wrong with the code you show. As for obesity status being a time-varying covariate, it probably is not. It may vary over time, but, in survival analysis, that is not what "time varying covariate" means. A time-varying covariate, in survival analysis, means a variable whose effect on survival varies over time. In any other regression context this would be referred to as an effect modifier and it would be reflected in modeling by the presence of an interaction with the time variable. (Equivalently, in a Cox regression it would be a variable for which the proportional hazards assumption is violated.) For reasons unknown to me, this came to be known as a "time varying covariate" in survival analysis, despite the fact that the face meaning of that term is something quite different.

                  Even if obesity actually is a time varying covariate within the survival-analysis meaning of that term, that would have no impact on the analysis you have done. This is because you have not done any kind of covariate-adjustment with it: instead you have conditioned analyses on the separate values it takes on, and you are calculating separate CIF curves for the obese and non-obese. So that's just not the issue. It could be an issue if you were running -stcrreg-, wherein you would be using it as a covariate to adjust the estimation of a hazard ratio, and if there is a violation of proportional hazards you would get incorrect results without incorporating an interaction with time.
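                  For completeness: if the effect of obesity did turn out to vary with time and you needed to accommodate that in -stcrreg-, my understanding is that the -tvc()- and -texp()- options are the usual way to introduce the interaction with time, along the lines of the following untested sketch:
                  Code:
                  stcrreg i.obese_vary, compete(failuretype==2 3) tvc(obese_vary) texp(ln(_t))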

                  Not knowing what the population you are modeling is, I can't really offer an opinion about whether the results you are seeing are plausible. The one thing that clearly stands out as inherently implausible, however, is the abrupt downward drop shown at time ~15 in the graft failure panel. I don't know what to make of that.

                  Anyway, I don't see anything wrong with your modeling and I suspect that the problem is in the data itself, though I have no specific advice about what particular kind of data error you should prioritize in your search.

                  Added: I also can't rule out the possibility that there is a bug in -stcompet-; after all, the cumulative incidence should be a non-decreasing function of time, no matter what input was provided.

                  Further Added: Have you tried generating the CIF curves using -stcrreg- and -stcurve- instead? If you get implausible results there, then I would say you have a data problem. If you get reasonable results from that, then I would say there is probably a bug in -stcompet-.
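                  Concretely, what I have in mind is something like this rough sketch (reusing your earlier model specification):
                  Code:
                  stcrreg i.obese_vary, compete(failuretype==2 3)
                  stcurve, cif at1(obese_vary=0) at2(obese_vary=1)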
                  Last edited by Clyde Schechter; 03 Jan 2025, 10:49.

