
  • Adding multiply imputed data using Rubin's rules into registered multiple imputation variables

    Hello,

    I am currently performing a survival analysis project for melanoma (a form of skin cancer). I am reasonably new to Stata, having only started using it in the past 4 months.
    I have been using a Cox proportional hazards model thus far in my analyses.
    Within the dataset of approximately 3,600 observations, up to 20% of values are missing.
    I have explored exclusion and other missing-data methods; however, too many of my failures would be lost for my analysis (currently 400 failures in total, all melanoma-specific deaths).
    I have ended up choosing multiple imputation using chained equations (MICE), given that some of the key prognostic variables are heavily skewed rather than normally distributed.
    To begin with, I have selected the key prognostic variables for melanoma recorded within the dataset: Breslow thickness of melanoma (continuous), ulceration status (binary) and mitotic rate (classified as an ordinal categorical variable). As predictors in the imputation model I have selected variables where data are complete (no missing observations): age, melanoma subtype, sex and subsite location, as well as the outcome indicator and the survival hazard function.

    Below is my code thus far for imputation. I am fairly happy with it, as the mi estimate coefficients very closely mirror the coefficients estimated from the non-imputed dataset.
    My question to the forum is: what would be the appropriate process/syntax to incorporate the imputed values into the incomplete/missing datapoints, so that I can continue my survival analysis models with a 'complete' dataset? (Apologies if I have not worded this correctly or if this is a basic question. I have trawled through the Statalist forums and other useful sites such as UCLA, various MI lectures and the Stata manual, but could not find this process described; I have also found the MI menu interface tricky to follow.)

    Code:
    mi set mlong
    mi stset timem, failure(censor2==1) scale(1)
    mi register imputed breslow ulcer mitosescat4
    mi impute chained (regress) breslow (logit) ulcer (ologit) mitosescat4 = agecat2 subtype sex subsitecat4 matthews_haz censor2, add(10)
    mi estimate: regress breslow i.ulcer i.mitosescat4
    Many thanks in advance,
    Last edited by Matthew Howard; 14 Jul 2018, 05:11.

  • #2
    Matthew:
    welcome to this forum.
    Via -mi- you obtain a number of complete datasets (if I'm not mistaken, various contributions advise something like 5-50 complete datasets), and -mi- allows you to re-run your regression model on them, taking point estimates and within- and between-imputation variances into account (as per Rubin's rules, as you mention).
    If, after -mi-, we want a unique dataset (if I got you correctly, you mean something like a mix of original and imputed data), we would probably have to consider something like -append- and then -collapse- with the -mean- function over the complete datasets (by the way, I do not really know whether Stata allows this procedure) and re-run the regression model on this made-up dataset. However, this approach, if feasible, would lose part of the variance that -mi- creates. Hence, even if the above were technically feasible, the regression results would probably be flawed.
    At the risk of being late to the party, I would recommend the following article, which, in my opinion, gives one of the best examples of dealing with missing values via multiple imputation in biostatistics: https://www.ncbi.nlm.nih.gov/pubmed/12589867.
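    For reference, a sketch of the pooling step known as Rubin's rules. The symbols are conventional textbook notation rather than anything defined in this thread: \hat{Q}_m is the estimate from completed dataset m, U_m is its estimated variance, and M is the number of imputations.

    ```latex
    \bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m, \qquad
    W = \frac{1}{M}\sum_{m=1}^{M} U_m, \qquad
    B = \frac{1}{M-1}\sum_{m=1}^{M}\bigl(\hat{Q}_m-\bar{Q}\bigr)^2, \qquad
    T = W + \Bigl(1+\frac{1}{M}\Bigr)B
    ```

    Here \bar{Q} is the pooled point estimate, W the within-imputation variance, B the between-imputation variance and T the total variance. A regression on a single averaged dataset has no B term, which is why its standard errors would understate the uncertainty due to the missing values.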
    Last edited by Carlo Lazzaro; 14 Jul 2018, 05:39.
    Kind regards,
    Carlo
    (StataNow 18.5)



    • #3
      Thanks for your time Carlo
      That article was very helpful to read, certainly what you have mentioned makes sense to me.
      If I read it correctly, it suggests re-running my initial survival models (in my case Cox PH regression models) on each of the imputed datasets and then averaging the results?
      This sounds rather tricky to do in Stata. Have you had any experience with converting this type of theory into practical code?



      • #4
        Matthew:
        I meant that you should follow the -mi estimate- approach after multiple imputation.
        That is:
        Code:
        mi estimate: stcox <indepvars>
        See also example #3, -mi estimate- entry, Stata .pdf manual.
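        As a minimal sketch of what that could look like with the variables named in post #1, assuming the data have already been mi set, mi stset and imputed as shown there. Treating agecat2, subtype, sex and subsitecat4 as categorical (the i. prefix) is an assumption about their coding, and the choice of covariates is purely illustrative:

        ```stata
        * Assumes the mi set / mi stset / mi impute chained steps from post #1 have been run.
        * The i. prefixes on agecat2, subtype, sex and subsitecat4 are assumptions about coding.
        * hr reports hazard ratios; dots shows progress across the imputations.
        mi estimate, hr dots: stcox breslow i.ulcer i.mitosescat4 i.agecat2 i.subtype i.sex i.subsitecat4
        ```

        -mi estimate- fits the Cox model in each imputed dataset and pools the results, so there is no need to build a single filled-in dataset first.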
        Kind regards,
        Carlo
        (StataNow 18.5)



        • #5
          Many thanks again Carlo,
          I had somehow assumed that the mi estimate command was only for diagnostic purposes rather than for obtaining post-imputation estimates. This certainly makes the analysis much more efficient and straightforward!

          Best wishes



          • #6
            Matthew:
            What you mean can easily be checked with the following toy example:
            Code:
            . webuse mheart1s20
            (Fictional heart attack data; bmi missing)
            
            . mi describe
            
              Style:  mlong
                      last mi update 20jan2017 14:52:04, 216 days ago
            
              Obs.:   complete          132
                      incomplete         22  (M = 20 imputations)
                      ---------------------
                      total             154
            
              Vars.:  imputed:  1; bmi(22)
            
                      passive:  0
            
                      regular:  5; attack smokes age female hsgrad
            
                      system:   3; _mi_m _mi_id _mi_miss
            
                     (there are no unregistered variables)
            
            . * this is the outcome of -logit- after -mi- (20 complete datasets created)
            . mi estimate, dots: logit attack smokes age bmi hsgrad female
            Imputations (20):
              .........10.........20 done
            
            Multiple-imputation estimates                   Imputations       =         20
            Logistic regression                             Number of obs     =        154
                                                            Average RVI       =     0.0312
                                                            Largest FMI       =     0.1355
            DF adjustment:   Large sample                   DF:     min       =   1,060.38
                                                                    avg       = 223,362.56
                                                                    max       = 493,335.88
            Model F test:       Equal FMI                   F(   5,71379.3)   =       3.59
            Within VCE type:          OIM                   Prob > F          =     0.0030
            
            ------------------------------------------------------------------------------
                  attack |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                  smokes |   1.198595   .3578195     3.35   0.001     .4972789    1.899911
                     age |   .0360159   .0154399     2.33   0.020     .0057541    .0662776
                     bmi |   .1039416   .0476136     2.18   0.029      .010514    .1973692
                  hsgrad |   .1578992   .4049257     0.39   0.697    -.6357464    .9515449
                  female |  -.1067433   .4164735    -0.26   0.798    -.9230191    .7095326
                   _cons |  -5.478143   1.685075    -3.25   0.001    -8.782394   -2.173892
            ------------------------------------------------------------------------------
            
            . * this is the outcome of -logit- when Stata applies listwise deletion
            . logit attack smokes age bmi hsgrad female if _mi_m==0
            
            Iteration 0:   log likelihood = -91.359017 
            Iteration 1:   log likelihood = -79.374749 
            Iteration 2:   log likelihood = -79.342218 
            Iteration 3:   log likelihood =  -79.34221 
            
            Logistic regression                             Number of obs     =        132
                                                            LR chi2(5)        =      24.03
                                                            Prob > chi2       =     0.0002
            Log likelihood =  -79.34221                     Pseudo R2         =     0.1315
            
            ------------------------------------------------------------------------------
                  attack |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                  smokes |   1.544053   .3998329     3.86   0.000     .7603945    2.327711
                     age |    .026112    .017042     1.53   0.125    -.0072898    .0595137
                     bmi |   .1129938   .0500061     2.26   0.024     .0149837     .211004
                  hsgrad |   .4048251   .4446019     0.91   0.363    -.4665786    1.276229
                  female |   .2255301   .4527558     0.50   0.618    -.6618549    1.112915
                   _cons |  -5.408398   1.810603    -2.99   0.003    -8.957115    -1.85968
            ------------------------------------------------------------------------------
            
            .
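            On the original question of where the imputed values live: they are already stored in the mi dataset, one copy per imputation. A short sketch, assuming the same example dataset, of how they could be inspected, and how one completed dataset could be pulled out as an ordinary dataset if that is really needed:

            ```stata
            webuse mheart1s20, clear

            * m = 0 is the original data (bmi has missing values);
            * m = 1 and m = 2 are the first two completed copies.
            mi xeq 0 1 2: summarize bmi

            * Optional: replace the data in memory with completed dataset 1 only.
            * A single imputation ignores between-imputation variance,
            * so this is for inspection rather than for final estimates.
            mi extract 1, clear
            summarize bmi
            ```

            For the analysis itself, -mi estimate- remains the appropriate route, since it uses all imputations rather than one.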
            Kind regards,
            Carlo
            (StataNow 18.5)
