  • How Stata deals with multicollinearity

    Hello

    I am investigating how to model diagnostic delay in patients using a time-to-event model. To begin with, I am using a fake dataset that I made up. I have hospital names (just A, B, C, etc.) and hospital type (district, base and tertiary, in order of size and presumed capability), and there is some correlation between these variables, as you would expect (although the correlation coefficient is only about 0.3).

    Depending on how I compose the data, Stata will drop 2 of the hospital names because of collinearity. For example, if I have (there are other independent variables, but dataex won't display them all)

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int pid byte age str1(sex hosp_name) str8 hosp_type
    100 88 "F" "H" "district"
    101 92 "F" "H" "district"
    102 83 "M" "H" "district"
    103 22 "F" "H" "district"
    104 36 "F" "H" "district"
    105 23 "F" "H" "district"
    106 54 "M" "H" "district"
    107 22 "F" "H" "district"
    108 24 "F" "H" "district"
    109 40 "F" "H" "district"
    110 35 "F" "H" "district"
    111 54 "M" "I" "tertiary"
    112 38 "M" "I" "tertiary"
    113 69 "F" "I" "tertiary"
    114 44 "F" "I" "tertiary"
    115 78 "F" "I" "tertiary"
    116 22 "M" "I" "tertiary"
    117 18 "F" "I" "tertiary"
    118 54 "M" "I" "tertiary"
    119 78 "M" "I" "tertiary"
    120 82 "M" "J" "base"    
    121 75 "M" "J" "base"    
    122 29 "F" "J" "base"    
    123 33 "F" "J" "base"    
    124  9 "M" "J" "base"    
    125 28 "F" "J" "base"    
    126  5 "F" "J" "base"    
    127 34 "F" "J" "base"    
    128 67 "F" "J" "base"    
    129 82 "M" "J" "base"    
    130 76 "F" "J" "base"    
    131 14 "M" "K" "tertiary"
    132 52 "F" "L" "district"
    end
    then I use

    Code:
    stcox age interventionyn  work_diagmade ib1.work_diagnostician ib1.time_pres ib1.hosp_type_cat ib5.hosp_name_cat ib6.class_cat ib1.hosp_dept_cat diag_delay_cat
    I get

    Code:
       failure _d:  finaldxmadeevent == 1
       analysis time _t:  timetofdhrs
    
    note: 10.hosp_name_cat omitted because of collinearity
    note: 11.hosp_name_cat omitted because of collinearity
    Iteration 0:   log likelihood = -508.85853
    Iteration 1:   log likelihood =  -455.5346
    Iteration 2:   log likelihood = -445.76463
    Iteration 3:   log likelihood = -445.36341
    Iteration 4:   log likelihood = -445.36065
    Iteration 5:   log likelihood = -445.36065
    Refining estimates:
    Iteration 0:   log likelihood = -445.36065
    
    Cox regression -- Breslow method for ties
    
    No. of subjects =          132                  Number of obs    =         132
    No. of failures =          127
    Time at risk    =         6599
                                                    LR chi2(33)      =      127.00
    Log likelihood  =   -445.36065                  Prob > chi2      =      0.0000
    
    ------------------------------------------------------------------------------------
                    _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------------+----------------------------------------------------------------
                   age |   1.003524    .004595     0.77   0.442     .9945586    1.012571
        interventionyn |   16.86043   22.57084     2.11   0.035     1.222859    232.4668
         work_diagmade |   1.284512    .537458     0.60   0.550     .5656964    2.916708
                       |
    work_diagnostician |
                    2  |   2.314053   .8398244     2.31   0.021     1.136193    4.712967
                    3  |    .704744   .3231542    -0.76   0.445     .2868933    1.731181
                    4  |   9.244435   11.13115     1.85   0.065      .872881     97.9052
                       |
             time_pres |
                    2  |   .5770304   .1574091    -2.02   0.044     .3380632    .9849165
                    3  |   .7103709   .2494742    -0.97   0.330     .3569052    1.413896
                       |
         hosp_type_cat |
             district  |   86.27767   161.0894     2.39   0.017     2.221346    3351.048
             tertiary  |   1.685938   2.812441     0.31   0.754     .0641045    44.33994
                       |
         hosp_name_cat |
                    A  |   18.94269   32.88749     1.69   0.090     .6304086    569.1953
                    B  |   2.255655   4.541459     0.40   0.686     .0436006    116.6952
                    C  |   1.240626   1.678445     0.16   0.873     .0875082    17.58867
                    D  |    5.45929     6.9921     1.33   0.185     .4435492    67.19399
                    F  |   17.18861   28.00398     1.75   0.081     .7054219    418.8251
                    G  |    30.2716   54.84421     1.88   0.060     .8687224    1054.848
                    H  |   .1174483   .0947504    -2.65   0.008     .0241628     .570882
                    I  |   1.086235   1.284395     0.07   0.944     .1070138    11.02575
                    J  |          1  (omitted)
                    K  |          1  (omitted)
                    L  |   .8346748   .9255692    -0.16   0.871     .0949777    7.335218
                       |
             class_cat |
                 CARD  |   1.139817   .8413863     0.18   0.859     .2682239    4.843645
                 DERM  |    .089733   .1292684    -1.67   0.094     .0053299    1.510722
                 ENDO  |   5.394387    5.95128     1.53   0.127     .6206777    46.88329
                  ENT  |   .7240182   .6892986    -0.34   0.734     .1120383    4.678778
                  GIT  |   .5346554   .4234097    -0.79   0.429     .1132353    2.524446
                NEURO  |   .1985368   .1604488    -2.00   0.045     .0407321    .9677098
                  O&G  |   2.595949   2.143298     1.16   0.248     .5146561    13.09408
                 OPTH  |   .0492302   .0735976    -2.01   0.044     .0026285    .9220434
                 ORTH  |   .4876306   .3422696    -1.02   0.306     .1232054    1.929977
                 RESP  |   .4791978    .358318    -0.98   0.325     .1106707    2.074899
                 RHEU  |   .2413165   .2718251    -1.26   0.207     .0265321    2.194837
                       |
         hosp_dept_cat |
                   OP  |   .0904296   .0515044    -4.22   0.000     .0296147    .2761303
                 WARD  |   .3969466   .2390002    -1.53   0.125     .1219626    1.291926
                       |
        diag_delay_cat |    .125937   .0572148    -4.56   0.000     .0516942    .3068071
    ------------------------------------------------------------------------------------
    with hospitals J and K omitted.

    However, if I use a slightly different dataset (removing the single entries at the end)

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int pid byte age str1(sex hosp_name) str8 hosp_type
    100 88 "F" "H" "district"
    101 92 "F" "H" "district"
    102 83 "M" "H" "district"
    103 22 "F" "H" "district"
    104 36 "F" "H" "district"
    105 23 "F" "H" "district"
    106 54 "M" "H" "district"
    107 22 "F" "H" "district"
    108 24 "F" "H" "district"
    109 40 "F" "H" "district"
    110 35 "F" "H" "district"
    111 54 "M" "I" "tertiary"
    112 38 "M" "I" "tertiary"
    113 69 "F" "I" "tertiary"
    114 44 "F" "I" "tertiary"
    115 78 "F" "I" "tertiary"
    116 22 "M" "I" "tertiary"
    117 18 "F" "I" "tertiary"
    118 54 "M" "I" "tertiary"
    119 78 "M" "I" "tertiary"
    120 82 "M" "J" "base"    
    121 75 "M" "J" "base"    
    122 29 "F" "J" "base"    
    123 33 "F" "J" "base"    
    124  9 "M" "J" "base"    
    125 28 "F" "J" "base"    
    126  5 "F" "J" "base"    
    127 34 "F" "J" "base"    
    128 67 "F" "J" "base"    
    129 82 "M" "J" "base"    
    130 76 "F" "J" "base"    
    end
    and then do the Cox analysis, I get

    Code:
     failure _d:  finaldxmadeevent == 1
       analysis time _t:  timetofdhrs
    
    note: 9.hosp_name_cat omitted because of collinearity
    note: 10.hosp_name_cat omitted because of collinearity
    Iteration 0:   log likelihood = -498.92338
    Iteration 1:   log likelihood = -446.34793
    Iteration 2:   log likelihood = -436.90194
    Iteration 3:   log likelihood = -436.51654
    Iteration 4:   log likelihood = -436.51395
    Iteration 5:   log likelihood = -436.51395
    Refining estimates:
    Iteration 0:   log likelihood = -436.51395
    
    Cox regression -- Breslow method for ties
    
    No. of subjects =          130                  Number of obs    =         130
    No. of failures =          125
    Time at risk    =         6543
                                                    LR chi2(31)      =      124.82
    Log likelihood  =   -436.51395                  Prob > chi2      =      0.0000
    
    ------------------------------------------------------------------------------------
                    _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------------+----------------------------------------------------------------
                   age |   1.003466   .0045866     0.76   0.449     .9945163    1.012496
        interventionyn |   16.63677    22.2671     2.10   0.036     1.207253    229.2661
         work_diagmade |   1.278301   .5334783     0.59   0.556     .5641544    2.896466
                       |
    work_diagnostician |
                    2  |   2.299247   .8330164     2.30   0.022     1.130305    4.677089
                    3  |   .7099638   .3245912    -0.75   0.454     .2897824    1.739404
                    4  |   9.025128    10.8631     1.83   0.068     .8529121     95.4998
                       |
             time_pres |
                    2  |    .576285   .1569979    -2.02   0.043     .3378654    .9829489
                    3  |   .7086565   .2483589    -0.98   0.326     .3565496    1.408483
                       |
         hosp_type_cat |
             district  |   86.06734   160.6045     2.39   0.017      2.22059    3335.865
             tertiary  |   1.861716   2.431047     0.48   0.634     .1440145    24.06693
                       |
         hosp_name_cat |
                    A  |   17.32781   25.69319     1.92   0.054     .9475577    316.8704
                    B  |   2.090338     3.6887     0.42   0.676     .0657885     66.4176
                    C  |   1.260545   1.703829     0.17   0.864     .0891297    17.82765
                    D  |     4.9162    3.42367     2.29   0.022      1.25559    19.24913
                    F  |   15.52633   20.80612     2.05   0.041     1.123087    214.6468
                    G  |   30.46847    55.1695     1.89   0.059     .8761396    1059.566
                    H  |   .1188044   .0956162    -2.65   0.008      .024534    .5753029
                    I  |          1  (omitted)
                    J  |          1  (omitted)
                       |
             class_cat |
                 CARD  |   1.140103   .8399422     0.18   0.859      .269056    4.831094
                 DERM  |   .0914465   .1316415    -1.66   0.097     .0054428    1.536424
                 ENDO  |   5.369706   5.922601     1.52   0.128      .618165    46.64409
                  ENT  |   .7334168    .697303    -0.33   0.744     .1137792    4.727579
                  GIT  |   .5348655   .4225605    -0.79   0.428     .1137022    2.516057
                NEURO  |   .2021763   .1629472    -1.98   0.047     .0416573    .9812273
                  O&G  |   2.583215   2.129421     1.15   0.250       .51344    12.99665
                 OPTH  |    .050093   .0748447    -2.00   0.045     .0026791    .9366326
                 ORTH  |   .4921213   .3442852    -1.01   0.311     .1249041    1.938955
                 RESP  |   .4831219   .3605901    -0.97   0.330     .1118771    2.086279
                 RHEU  |   .2495317   .2809191    -1.23   0.218     .0274698    2.266709
                       |
         hosp_dept_cat |
                   OP  |   .0925485   .0525888    -4.19   0.000     .0303873     .281869
                 WARD  |   .4003935   .2408648    -1.52   0.128     .1231486    1.301801
                       |
        diag_delay_cat |   .1281343   .0579422    -4.54   0.000     .0528144    .3108694
    ------------------------------------------------------------------------------------
    with I and J omitted.

    I'm curious how/why Stata chooses these different hospital names to omit. I guess this is just the way the algorithms go but it always chooses two categories to omit no matter how I change the data.

    If you leave hospital type out of the model, no hospital names are dropped.

    Thanks and regards

    Chris
    Last edited by Chris Dalton; 11 Jan 2024, 23:50.

  • #2
    Chris:
    it probably depends on how many variables contribute to the multicollinearity, as in the following toy example:
    Code:
    . use "C:\Program Files\Stata18\ado\base\a\auto.dta"
    (1978 automobile data)
    
    . sysuse auto.dta
    (1978 automobile data)
    
    . reg price i.foreign foreign
    note: foreign omitted because of collinearity.
    
          Source |       SS           df       MS      Number of obs   =        74
    -------------+----------------------------------   F(1, 72)        =      0.17
           Model |  1507382.66         1  1507382.66   Prob > F        =    0.6802
        Residual |   633558013        72  8799416.85   R-squared       =    0.0024
    -------------+----------------------------------   Adj R-squared   =   -0.0115
           Total |   635065396        73  8699525.97   Root MSE        =    2966.4
    
    ------------------------------------------------------------------------------
           price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
         foreign |
        Foreign  |   312.2587   754.4488     0.41   0.680    -1191.708    1816.225
         foreign |          0  (omitted)
           _cons |   6072.423    411.363    14.76   0.000     5252.386     6892.46
    ------------------------------------------------------------------------------
    
    . gen inv_foreign=0 if foreign==1
    (52 missing values generated)
    
    . replace inv_foreign=1 if inv_foreign==.
    (52 real changes made)
    
    . reg price i.foreign foreign inv_foreign
    note: foreign omitted because of collinearity.
    note: inv_foreign omitted because of collinearity.
    
          Source |       SS           df       MS      Number of obs   =        74
    -------------+----------------------------------   F(1, 72)        =      0.17
           Model |  1507382.66         1  1507382.66   Prob > F        =    0.6802
        Residual |   633558013        72  8799416.85   R-squared       =    0.0024
    -------------+----------------------------------   Adj R-squared   =   -0.0115
           Total |   635065396        73  8699525.97   Root MSE        =    2966.4
    
    ------------------------------------------------------------------------------
           price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
         foreign |
        Foreign  |   312.2587   754.4488     0.41   0.680    -1191.708    1816.225
         foreign |          0  (omitted)
     inv_foreign |          0  (omitted)
           _cons |   6072.423    411.363    14.76   0.000     5252.386     6892.46
    ------------------------------------------------------------------------------
    
    .
    Kind regards,
    Carlo
    (StataNow 18.5)


    • #3
      This problem arises because you have both hospital type, which is an invariant characteristic of a given hospital, and hospital name in the data. So in addition to dropping one category of hospital name as the base category, there is a collinearity among the indicators for hospital type and hospital name, which necessitates omitting two more variables in order to identify the model. Mathematically, it could be done by removing two hospital names, or two of the hospital types, or one of each, or by adding some constraints on these variables. Those would produce the same predictive model, though the coefficients would be different among the collinear variables.
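
      A minimal sketch of the nesting described above, using made-up data modelled on the -dataex- excerpt in #1. The five hospitals, their types, and the two rows per hospital are assumptions for illustration only, not the poster's actual data:

      ```stata
      * Hypothetical toy data: each hospital has exactly one type
      clear
      input str1 hosp_name str8 hosp_type
      "H" "district"
      "H" "district"
      "I" "tertiary"
      "I" "tertiary"
      "J" "base"
      "J" "base"
      "K" "tertiary"
      "K" "tertiary"
      "L" "district"
      "L" "district"
      end

      * Cross-tabulation: every hospital row has entries in only one type column
      tabulate hosp_name hosp_type

      * Formal check: after sorting by type within hospital, the first and last
      * values agree, so type never varies within a hospital
      bysort hosp_name (hosp_type): assert hosp_type[1] == hosp_type[_N]
      ```

      If the -assert- passes silently, each hospital type indicator is just the sum of the indicators for the hospitals of that type, which is the source of the collinearity.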

      As for how Stata chooses which of these options to take, that is internal to the algorithms and is not disclosed. Generally speaking it seems to choose variables near the end of the variable list in the estimation command. But in any case that behavior is not guaranteed, and even if we knew what it is, it would be hazardous to rely on that--it could change with the next update! You can, of course, control this yourself by using the io. operator. (-help fvvarlist- if unfamiliar with this)

      But better still is not to include collinear variables in your model, since the resulting coefficients are meaningless. If this were my project, I would omit the hospital type variable--you cannot estimate its effects while also including indicators for each hospital--it is mathematically impossible.
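
      As a hedged illustration of why nothing is lost by dropping hospital type, the sketch below uses simulated data and -regress- rather than -stcox-, purely to keep it self-contained. The variable names mirror those in #1, but the hospital-to-type mapping, the outcome y, and the seed are all invented:

      ```stata
      * Simulated data: 5 hypothetical hospitals, type fixed within hospital
      clear
      set seed 2024
      set obs 100
      generate hosp_name_cat = runiformint(1, 5)
      generate hosp_type_cat = cond(inlist(hosp_name_cat, 1, 5), 1, ///
                               cond(hosp_name_cat == 3, 2, 3))
      generate y = rnormal() + 0.5 * hosp_name_cat

      * Both factors: Stata should note indicators omitted because of collinearity
      regress y i.hosp_type_cat i.hosp_name_cat
      predict double yhat_both

      * Hospital name only: nothing beyond the base level needs to be omitted
      regress y i.hosp_name_cat
      predict double yhat_name

      * The two specifications span the same column space, so the fitted
      * values agree up to rounding error
      assert reldif(yhat_both, yhat_name) < 1e-8
      ```

      The same reasoning should carry over to the Cox model: the hospital indicators already absorb everything hospital type could explain.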


      • #4
        Thank you both for your helpful explanations here.
        I suspected it would require using just one of these predictors, not both.
        Regards
        Chris
