How Stata deals with multicollinearity

Chris Dalton

Join Date: Mar 2018
Posts: 31

11 Jan 2024, 22:44

Hello

I am investigating how to model diagnostic delay in patients using a time-to-event model. For now I am using a fake dataset that I made up myself. I have hospital names (just A, B, C, etc.) and hospital type (district, base and tertiary, in order of size and presumed capability), and there is some correlation between these variables, as you would expect (although the correlation coefficient is only about 0.3).

Depending on how I compose the data, Stata will drop 2 of the hospital names because of collinearity. For example, if I have (there are other independent variables, but dataex won't display them all)

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int pid byte age str1(sex hosp_name) str8 hosp_type
100 88 "F" "H" "district"
101 92 "F" "H" "district"
102 83 "M" "H" "district"
103 22 "F" "H" "district"
104 36 "F" "H" "district"
105 23 "F" "H" "district"
106 54 "M" "H" "district"
107 22 "F" "H" "district"
108 24 "F" "H" "district"
109 40 "F" "H" "district"
110 35 "F" "H" "district"
111 54 "M" "I" "tertiary"
112 38 "M" "I" "tertiary"
113 69 "F" "I" "tertiary"
114 44 "F" "I" "tertiary"
115 78 "F" "I" "tertiary"
116 22 "M" "I" "tertiary"
117 18 "F" "I" "tertiary"
118 54 "M" "I" "tertiary"
119 78 "M" "I" "tertiary"
120 82 "M" "J" "base"    
121 75 "M" "J" "base"    
122 29 "F" "J" "base"    
123 33 "F" "J" "base"    
124  9 "M" "J" "base"    
125 28 "F" "J" "base"    
126  5 "F" "J" "base"    
127 34 "F" "J" "base"    
128 67 "F" "J" "base"    
129 82 "M" "J" "base"    
130 76 "F" "J" "base"    
131 14 "M" "K" "tertiary"
132 52 "F" "L" "district"
end

then I use

Code:

stcox age interventionyn  work_diagmade ib1.work_diagnostician ib1.time_pres ib1.hosp_type_cat ib5.hosp_name_cat ib6.class_cat ib1.hosp_dept_cat diag_delay_cat

I get

Code:

   failure _d:  finaldxmadeevent == 1
   analysis time _t:  timetofdhrs

note: 10.hosp_name_cat omitted because of collinearity
note: 11.hosp_name_cat omitted because of collinearity
Iteration 0:   log likelihood = -508.85853
Iteration 1:   log likelihood =  -455.5346
Iteration 2:   log likelihood = -445.76463
Iteration 3:   log likelihood = -445.36341
Iteration 4:   log likelihood = -445.36065
Iteration 5:   log likelihood = -445.36065
Refining estimates:
Iteration 0:   log likelihood = -445.36065

Cox regression -- Breslow method for ties

No. of subjects =          132                  Number of obs    =         132
No. of failures =          127
Time at risk    =         6599
                                                LR chi2(33)      =      127.00
Log likelihood  =   -445.36065                  Prob > chi2      =      0.0000

------------------------------------------------------------------------------------
                _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
               age |   1.003524    .004595     0.77   0.442     .9945586    1.012571
    interventionyn |   16.86043   22.57084     2.11   0.035     1.222859    232.4668
     work_diagmade |   1.284512    .537458     0.60   0.550     .5656964    2.916708
                   |
work_diagnostician |
                2  |   2.314053   .8398244     2.31   0.021     1.136193    4.712967
                3  |    .704744   .3231542    -0.76   0.445     .2868933    1.731181
                4  |   9.244435   11.13115     1.85   0.065      .872881     97.9052
                   |
         time_pres |
                2  |   .5770304   .1574091    -2.02   0.044     .3380632    .9849165
                3  |   .7103709   .2494742    -0.97   0.330     .3569052    1.413896
                   |
     hosp_type_cat |
         district  |   86.27767   161.0894     2.39   0.017     2.221346    3351.048
         tertiary  |   1.685938   2.812441     0.31   0.754     .0641045    44.33994
                   |
     hosp_name_cat |
                A  |   18.94269   32.88749     1.69   0.090     .6304086    569.1953
                B  |   2.255655   4.541459     0.40   0.686     .0436006    116.6952
                C  |   1.240626   1.678445     0.16   0.873     .0875082    17.58867
                D  |    5.45929     6.9921     1.33   0.185     .4435492    67.19399
                F  |   17.18861   28.00398     1.75   0.081     .7054219    418.8251
                G  |    30.2716   54.84421     1.88   0.060     .8687224    1054.848
                H  |   .1174483   .0947504    -2.65   0.008     .0241628     .570882
                I  |   1.086235   1.284395     0.07   0.944     .1070138    11.02575
                J  |          1  (omitted)
                K  |          1  (omitted)
                L  |   .8346748   .9255692    -0.16   0.871     .0949777    7.335218
                   |
         class_cat |
             CARD  |   1.139817   .8413863     0.18   0.859     .2682239    4.843645
             DERM  |    .089733   .1292684    -1.67   0.094     .0053299    1.510722
             ENDO  |   5.394387    5.95128     1.53   0.127     .6206777    46.88329
              ENT  |   .7240182   .6892986    -0.34   0.734     .1120383    4.678778
              GIT  |   .5346554   .4234097    -0.79   0.429     .1132353    2.524446
            NEURO  |   .1985368   .1604488    -2.00   0.045     .0407321    .9677098
              O&G  |   2.595949   2.143298     1.16   0.248     .5146561    13.09408
             OPTH  |   .0492302   .0735976    -2.01   0.044     .0026285    .9220434
             ORTH  |   .4876306   .3422696    -1.02   0.306     .1232054    1.929977
             RESP  |   .4791978    .358318    -0.98   0.325     .1106707    2.074899
             RHEU  |   .2413165   .2718251    -1.26   0.207     .0265321    2.194837
                   |
     hosp_dept_cat |
               OP  |   .0904296   .0515044    -4.22   0.000     .0296147    .2761303
             WARD  |   .3969466   .2390002    -1.53   0.125     .1219626    1.291926
                   |
    diag_delay_cat |    .125937   .0572148    -4.56   0.000     .0516942    .3068071
------------------------------------------------------------------------------------

with hospitals J and K omitted.

However, if I use a slightly different dataset (removing the single entries at the end)

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int pid byte age str1(sex hosp_name) str8 hosp_type
100 88 "F" "H" "district"
101 92 "F" "H" "district"
102 83 "M" "H" "district"
103 22 "F" "H" "district"
104 36 "F" "H" "district"
105 23 "F" "H" "district"
106 54 "M" "H" "district"
107 22 "F" "H" "district"
108 24 "F" "H" "district"
109 40 "F" "H" "district"
110 35 "F" "H" "district"
111 54 "M" "I" "tertiary"
112 38 "M" "I" "tertiary"
113 69 "F" "I" "tertiary"
114 44 "F" "I" "tertiary"
115 78 "F" "I" "tertiary"
116 22 "M" "I" "tertiary"
117 18 "F" "I" "tertiary"
118 54 "M" "I" "tertiary"
119 78 "M" "I" "tertiary"
120 82 "M" "J" "base"    
121 75 "M" "J" "base"    
122 29 "F" "J" "base"    
123 33 "F" "J" "base"    
124  9 "M" "J" "base"    
125 28 "F" "J" "base"    
126  5 "F" "J" "base"    
127 34 "F" "J" "base"    
128 67 "F" "J" "base"    
129 82 "M" "J" "base"    
130 76 "F" "J" "base"    
end

and then run the Cox analysis, I get

Code:

 failure _d:  finaldxmadeevent == 1
   analysis time _t:  timetofdhrs

note: 9.hosp_name_cat omitted because of collinearity
note: 10.hosp_name_cat omitted because of collinearity
Iteration 0:   log likelihood = -498.92338
Iteration 1:   log likelihood = -446.34793
Iteration 2:   log likelihood = -436.90194
Iteration 3:   log likelihood = -436.51654
Iteration 4:   log likelihood = -436.51395
Iteration 5:   log likelihood = -436.51395
Refining estimates:
Iteration 0:   log likelihood = -436.51395

Cox regression -- Breslow method for ties

No. of subjects =          130                  Number of obs    =         130
No. of failures =          125
Time at risk    =         6543
                                                LR chi2(31)      =      124.82
Log likelihood  =   -436.51395                  Prob > chi2      =      0.0000

------------------------------------------------------------------------------------
                _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
               age |   1.003466   .0045866     0.76   0.449     .9945163    1.012496
    interventionyn |   16.63677    22.2671     2.10   0.036     1.207253    229.2661
     work_diagmade |   1.278301   .5334783     0.59   0.556     .5641544    2.896466
                   |
work_diagnostician |
                2  |   2.299247   .8330164     2.30   0.022     1.130305    4.677089
                3  |   .7099638   .3245912    -0.75   0.454     .2897824    1.739404
                4  |   9.025128    10.8631     1.83   0.068     .8529121     95.4998
                   |
         time_pres |
                2  |    .576285   .1569979    -2.02   0.043     .3378654    .9829489
                3  |   .7086565   .2483589    -0.98   0.326     .3565496    1.408483
                   |
     hosp_type_cat |
         district  |   86.06734   160.6045     2.39   0.017      2.22059    3335.865
         tertiary  |   1.861716   2.431047     0.48   0.634     .1440145    24.06693
                   |
     hosp_name_cat |
                A  |   17.32781   25.69319     1.92   0.054     .9475577    316.8704
                B  |   2.090338     3.6887     0.42   0.676     .0657885     66.4176
                C  |   1.260545   1.703829     0.17   0.864     .0891297    17.82765
                D  |     4.9162    3.42367     2.29   0.022      1.25559    19.24913
                F  |   15.52633   20.80612     2.05   0.041     1.123087    214.6468
                G  |   30.46847    55.1695     1.89   0.059     .8761396    1059.566
                H  |   .1188044   .0956162    -2.65   0.008      .024534    .5753029
                I  |          1  (omitted)
                J  |          1  (omitted)
                   |
         class_cat |
             CARD  |   1.140103   .8399422     0.18   0.859      .269056    4.831094
             DERM  |   .0914465   .1316415    -1.66   0.097     .0054428    1.536424
             ENDO  |   5.369706   5.922601     1.52   0.128      .618165    46.64409
              ENT  |   .7334168    .697303    -0.33   0.744     .1137792    4.727579
              GIT  |   .5348655   .4225605    -0.79   0.428     .1137022    2.516057
            NEURO  |   .2021763   .1629472    -1.98   0.047     .0416573    .9812273
              O&G  |   2.583215   2.129421     1.15   0.250       .51344    12.99665
             OPTH  |    .050093   .0748447    -2.00   0.045     .0026791    .9366326
             ORTH  |   .4921213   .3442852    -1.01   0.311     .1249041    1.938955
             RESP  |   .4831219   .3605901    -0.97   0.330     .1118771    2.086279
             RHEU  |   .2495317   .2809191    -1.23   0.218     .0274698    2.266709
                   |
     hosp_dept_cat |
               OP  |   .0925485   .0525888    -4.19   0.000     .0303873     .281869
             WARD  |   .4003935   .2408648    -1.52   0.128     .1231486    1.301801
                   |
    diag_delay_cat |   .1281343   .0579422    -4.54   0.000     .0528144    .3108694
------------------------------------------------------------------------------------

with I and J omitted.

I'm curious how and why Stata chooses these different hospital names to omit. I guess this is just the way the algorithms go, but it always omits two categories no matter how I change the data.

If you leave hospital type out of the model, no hospital names are dropped.

Thanks and regards

Chris

Last edited by Chris Dalton; 11 Jan 2024, 22:50.

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17708

12 Jan 2024, 00:51

Chris:
it probably depends on how many variables contribute to the multicollinearity, as in the following toy example:

Code:

. sysuse auto.dta
(1978 automobile data)

. reg price i.foreign foreign
note: foreign omitted because of collinearity.

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(1, 72)        =      0.17
       Model |  1507382.66         1  1507382.66   Prob > F        =    0.6802
    Residual |   633558013        72  8799416.85   R-squared       =    0.0024
-------------+----------------------------------   Adj R-squared   =   -0.0115
       Total |   635065396        73  8699525.97   Root MSE        =    2966.4

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     foreign |
    Foreign  |   312.2587   754.4488     0.41   0.680    -1191.708    1816.225
     foreign |          0  (omitted)
       _cons |   6072.423    411.363    14.76   0.000     5252.386     6892.46
------------------------------------------------------------------------------

. gen inv_foreign=0 if foreign==1
(52 missing values generated)

. replace inv_foreign=1 if inv_foreign==.
(52 real changes made)

. reg price i.foreign foreign inv_foreign
note: foreign omitted because of collinearity.
note: inv_foreign omitted because of collinearity.

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(1, 72)        =      0.17
       Model |  1507382.66         1  1507382.66   Prob > F        =    0.6802
    Residual |   633558013        72  8799416.85   R-squared       =    0.0024
-------------+----------------------------------   Adj R-squared   =   -0.0115
       Total |   635065396        73  8699525.97   Root MSE        =    2966.4

------------------------------------------------------------------------------
       price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     foreign |
    Foreign  |   312.2587   754.4488     0.41   0.680    -1191.708    1816.225
     foreign |          0  (omitted)
 inv_foreign |          0  (omitted)
       _cons |   6072.423    411.363    14.76   0.000     5252.386     6892.46
------------------------------------------------------------------------------

.

Kind regards,
Carlo
(Stata 19.0)

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#3

12 Jan 2024, 09:02

This problem arises because your data contain both hospital name and hospital type, and hospital type is an invariant characteristic of a given hospital. So in addition to dropping one category of hospital name as the base category, there is a collinearity among the indicators for hospital type and hospital name which necessitates omitting two more variables in order to identify the model. Mathematically, this could be done by removing two hospital names, or two of the hospital types, or one of each, or by adding some constraints on these variables. All of these would produce the same predictive model, though the coefficients would differ among the collinear variables.
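To see the nesting directly, something like the following would do. It is only a sketch on a cut-down version of the -dataex- example above, and the names is_district and name_H_or_L are made up for illustration.

```stata
* Cut-down version of the -dataex- example: a few rows per hospital
clear
input str1 hosp_name str8 hosp_type
"H" "district"
"H" "district"
"I" "tertiary"
"I" "tertiary"
"J" "base"
"J" "base"
"K" "tertiary"
"L" "district"
end

* Each hospital name should appear in exactly one column of this table
tabulate hosp_name hosp_type

* Formal check: hospital type never varies within a hospital name
bysort hosp_name (hosp_type): assert hosp_type[1] == hosp_type[_N]

* The "district" indicator is therefore just the sum of the indicators
* for hospitals H and L (illustrative variable names, not from the thread)
generate byte is_district = (hosp_type == "district")
generate byte name_H_or_L = inlist(hosp_name, "H", "L")
assert is_district == name_H_or_L
```

With three hospital types there are two non-base type indicators, and each is an exact linear combination of hospital-name indicators, which is consistent with two terms being omitted however the data are rearranged. Note that a modest pairwise correlation (such as 0.3) does not rule this out: exact collinearity concerns linear combinations of several indicators, not any single pair.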

As for how Stata chooses which of these options to take, that is internal to the algorithms and is not disclosed. Generally speaking it seems to choose variables near the end of the variable list in the estimation command. But in any case that behavior is not guaranteed, and even if we knew what it is, it would be hazardous to rely on that--it could change with the next update! You can, of course, control this yourself by using the io. operator. (-help fvvarlist- if unfamiliar with this)

But better still is not to include collinear variables in your model, since the resulting coefficients are meaningless. If this were my project, I would omit the hospital type variable--you cannot estimate its effects while also including indicators for each hospital--it is mathematically impossible.
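As a rough illustration that nothing is lost by leaving the coarser variable out, here is a sketch using the auto data rather than the hospital dataset. rep_group is a hypothetical grouping, made up here, that is fully determined by rep78 in the same way that hospital type is determined by hospital name. It uses -regress- rather than -stcox- purely to keep the example self-contained.

```stata
sysuse auto, clear

* Hypothetical coarse grouping, fully determined by rep78
* (plays the role of hospital type; rep78 plays the role of hospital name)
generate byte rep_group = cond(rep78 <= 2, 1, cond(rep78 == 3, 2, 3)) if !missing(rep78)

* Both variables: Stata omits some indicators because of collinearity
regress price i.rep_group i.rep78
predict double xb_both, xb

* Finer variable only: nothing is omitted beyond the base level
regress price i.rep78
predict double xb_fine, xb

* The two models give the same fitted values on the estimation sample
assert reldif(xb_both, xb_fine) < 1e-8 if e(sample)
```

The coefficients reported for the individual indicators differ between the two fits, but the fitted values agree, so the coarser variable adds no information once every level of the finer one is in the model.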
Chris Dalton

Join Date: Mar 2018

Posts: 31
#4

12 Jan 2024, 19:40

Thank you both for your helpful explanations here.
I suspected it would require using just one of these predictors, not both.
Regards
Chris