Hello
I am investigating modelling diagnostic delay in patients using a time-to-event model. To begin with, I am just using a fake dataset that I made up. I have hospital names (just A, B, C, etc.) and hospital type (district, base and tertiary, in order of size and presumed capability), and there is some correlation between these variables, as you would expect (although the correlation coefficient is only about 0.3).
Depending on how I compose the data, Stata will drop two of the hospital names because of collinearity. For example, if I have (there are other independent variables, but dataex won't display them all)
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input int pid byte age str1(sex hosp_name) str8 hosp_type
100 88 "F" "H" "district"
101 92 "F" "H" "district"
102 83 "M" "H" "district"
103 22 "F" "H" "district"
104 36 "F" "H" "district"
105 23 "F" "H" "district"
106 54 "M" "H" "district"
107 22 "F" "H" "district"
108 24 "F" "H" "district"
109 40 "F" "H" "district"
110 35 "F" "H" "district"
111 54 "M" "I" "tertiary"
112 38 "M" "I" "tertiary"
113 69 "F" "I" "tertiary"
114 44 "F" "I" "tertiary"
115 78 "F" "I" "tertiary"
116 22 "M" "I" "tertiary"
117 18 "F" "I" "tertiary"
118 54 "M" "I" "tertiary"
119 78 "M" "I" "tertiary"
120 82 "M" "J" "base"
121 75 "M" "J" "base"
122 29 "F" "J" "base"
123 33 "F" "J" "base"
124  9 "M" "J" "base"
125 28 "F" "J" "base"
126  5 "F" "J" "base"
127 34 "F" "J" "base"
128 67 "F" "J" "base"
129 82 "M" "J" "base"
130 76 "F" "J" "base"
131 14 "M" "K" "tertiary"
132 52 "F" "L" "district"
end
then I use
Code:
stcox age interventionyn work_diagmade ib1.work_diagnostician ib1.time_pres ib1.hosp_type_cat ib5.hosp_name_cat ib6.class_cat ib1.hosp_dept_cat diag_delay_cat
I get
Code:
         failure _d:  finaldxmadeevent == 1
   analysis time _t:  timetofdhrs
note: 10.hosp_name_cat omitted because of collinearity
note: 11.hosp_name_cat omitted because of collinearity

Iteration 0:   log likelihood = -508.85853
Iteration 1:   log likelihood =  -455.5346
Iteration 2:   log likelihood = -445.76463
Iteration 3:   log likelihood = -445.36341
Iteration 4:   log likelihood = -445.36065
Iteration 5:   log likelihood = -445.36065
Refining estimates:
Iteration 0:   log likelihood = -445.36065

Cox regression -- Breslow method for ties

No. of subjects =          132                  Number of obs    =         132
No. of failures =          127
Time at risk    =         6599
                                                LR chi2(33)      =      127.00
Log likelihood  =   -445.36065                  Prob > chi2      =      0.0000

------------------------------------------------------------------------------------
                _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
               age |   1.003524    .004595     0.77   0.442     .9945586    1.012571
    interventionyn |   16.86043   22.57084     2.11   0.035     1.222859    232.4668
     work_diagmade |   1.284512    .537458     0.60   0.550     .5656964    2.916708
                   |
work_diagnostician |
                 2 |   2.314053   .8398244     2.31   0.021     1.136193    4.712967
                 3 |    .704744   .3231542    -0.76   0.445     .2868933    1.731181
                 4 |   9.244435   11.13115     1.85   0.065      .872881     97.9052
                   |
         time_pres |
                 2 |   .5770304   .1574091    -2.02   0.044     .3380632    .9849165
                 3 |   .7103709   .2494742    -0.97   0.330     .3569052    1.413896
                   |
     hosp_type_cat |
          district |   86.27767   161.0894     2.39   0.017     2.221346    3351.048
          tertiary |   1.685938   2.812441     0.31   0.754     .0641045    44.33994
                   |
     hosp_name_cat |
                 A |   18.94269   32.88749     1.69   0.090     .6304086    569.1953
                 B |   2.255655   4.541459     0.40   0.686     .0436006    116.6952
                 C |   1.240626   1.678445     0.16   0.873     .0875082    17.58867
                 D |    5.45929     6.9921     1.33   0.185     .4435492    67.19399
                 F |   17.18861   28.00398     1.75   0.081     .7054219    418.8251
                 G |    30.2716   54.84421     1.88   0.060     .8687224    1054.848
                 H |   .1174483   .0947504    -2.65   0.008     .0241628     .570882
                 I |   1.086235   1.284395     0.07   0.944     .1070138    11.02575
                 J |          1  (omitted)
                 K |          1  (omitted)
                 L |   .8346748   .9255692    -0.16   0.871     .0949777    7.335218
                   |
         class_cat |
              CARD |   1.139817   .8413863     0.18   0.859     .2682239    4.843645
              DERM |    .089733   .1292684    -1.67   0.094     .0053299    1.510722
              ENDO |   5.394387    5.95128     1.53   0.127     .6206777    46.88329
               ENT |   .7240182   .6892986    -0.34   0.734     .1120383    4.678778
               GIT |   .5346554   .4234097    -0.79   0.429     .1132353    2.524446
             NEURO |   .1985368   .1604488    -2.00   0.045     .0407321    .9677098
               O&G |   2.595949   2.143298     1.16   0.248     .5146561    13.09408
              OPTH |   .0492302   .0735976    -2.01   0.044     .0026285    .9220434
              ORTH |   .4876306   .3422696    -1.02   0.306     .1232054    1.929977
              RESP |   .4791978    .358318    -0.98   0.325     .1106707    2.074899
              RHEU |   .2413165   .2718251    -1.26   0.207     .0265321    2.194837
                   |
     hosp_dept_cat |
                OP |   .0904296   .0515044    -4.22   0.000     .0296147    .2761303
              WARD |   .3969466   .2390002    -1.53   0.125     .1219626    1.291926
                   |
    diag_delay_cat |    .125937   .0572148    -4.56   0.000     .0516942    .3068071
------------------------------------------------------------------------------------
with hospitals J and K omitted.
However, if I use a slightly different dataset (removing the single entries at the end)
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input int pid byte age str1(sex hosp_name) str8 hosp_type
100 88 "F" "H" "district"
101 92 "F" "H" "district"
102 83 "M" "H" "district"
103 22 "F" "H" "district"
104 36 "F" "H" "district"
105 23 "F" "H" "district"
106 54 "M" "H" "district"
107 22 "F" "H" "district"
108 24 "F" "H" "district"
109 40 "F" "H" "district"
110 35 "F" "H" "district"
111 54 "M" "I" "tertiary"
112 38 "M" "I" "tertiary"
113 69 "F" "I" "tertiary"
114 44 "F" "I" "tertiary"
115 78 "F" "I" "tertiary"
116 22 "M" "I" "tertiary"
117 18 "F" "I" "tertiary"
118 54 "M" "I" "tertiary"
119 78 "M" "I" "tertiary"
120 82 "M" "J" "base"
121 75 "M" "J" "base"
122 29 "F" "J" "base"
123 33 "F" "J" "base"
124  9 "M" "J" "base"
125 28 "F" "J" "base"
126  5 "F" "J" "base"
127 34 "F" "J" "base"
128 67 "F" "J" "base"
129 82 "M" "J" "base"
130 76 "F" "J" "base"
end
then do the Cox analysis, I get
Code:
         failure _d:  finaldxmadeevent == 1
   analysis time _t:  timetofdhrs
note: 9.hosp_name_cat omitted because of collinearity
note: 10.hosp_name_cat omitted because of collinearity

Iteration 0:   log likelihood = -498.92338
Iteration 1:   log likelihood = -446.34793
Iteration 2:   log likelihood = -436.90194
Iteration 3:   log likelihood = -436.51654
Iteration 4:   log likelihood = -436.51395
Iteration 5:   log likelihood = -436.51395
Refining estimates:
Iteration 0:   log likelihood = -436.51395

Cox regression -- Breslow method for ties

No. of subjects =          130                  Number of obs    =         130
No. of failures =          125
Time at risk    =         6543
                                                LR chi2(31)      =      124.82
Log likelihood  =   -436.51395                  Prob > chi2      =      0.0000

------------------------------------------------------------------------------------
                _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------------+----------------------------------------------------------------
               age |   1.003466   .0045866     0.76   0.449     .9945163    1.012496
    interventionyn |   16.63677    22.2671     2.10   0.036     1.207253    229.2661
     work_diagmade |   1.278301   .5334783     0.59   0.556     .5641544    2.896466
                   |
work_diagnostician |
                 2 |   2.299247   .8330164     2.30   0.022     1.130305    4.677089
                 3 |   .7099638   .3245912    -0.75   0.454     .2897824    1.739404
                 4 |   9.025128    10.8631     1.83   0.068     .8529121     95.4998
                   |
         time_pres |
                 2 |    .576285   .1569979    -2.02   0.043     .3378654    .9829489
                 3 |   .7086565   .2483589    -0.98   0.326     .3565496    1.408483
                   |
     hosp_type_cat |
          district |   86.06734   160.6045     2.39   0.017      2.22059    3335.865
          tertiary |   1.861716   2.431047     0.48   0.634     .1440145    24.06693
                   |
     hosp_name_cat |
                 A |   17.32781   25.69319     1.92   0.054     .9475577    316.8704
                 B |   2.090338     3.6887     0.42   0.676     .0657885     66.4176
                 C |   1.260545   1.703829     0.17   0.864     .0891297    17.82765
                 D |     4.9162    3.42367     2.29   0.022      1.25559    19.24913
                 F |   15.52633   20.80612     2.05   0.041     1.123087    214.6468
                 G |   30.46847    55.1695     1.89   0.059     .8761396    1059.566
                 H |   .1188044   .0956162    -2.65   0.008      .024534    .5753029
                 I |          1  (omitted)
                 J |          1  (omitted)
                   |
         class_cat |
              CARD |   1.140103   .8399422     0.18   0.859      .269056    4.831094
              DERM |   .0914465   .1316415    -1.66   0.097     .0054428    1.536424
              ENDO |   5.369706   5.922601     1.52   0.128      .618165    46.64409
               ENT |   .7334168    .697303    -0.33   0.744     .1137792    4.727579
               GIT |   .5348655   .4225605    -0.79   0.428     .1137022    2.516057
             NEURO |   .2021763   .1629472    -1.98   0.047     .0416573    .9812273
               O&G |   2.583215   2.129421     1.15   0.250       .51344    12.99665
              OPTH |    .050093   .0748447    -2.00   0.045     .0026791    .9366326
              ORTH |   .4921213   .3442852    -1.01   0.311     .1249041    1.938955
              RESP |   .4831219   .3605901    -0.97   0.330     .1118771    2.086279
              RHEU |   .2495317   .2809191    -1.23   0.218     .0274698    2.266709
                   |
     hosp_dept_cat |
                OP |   .0925485   .0525888    -4.19   0.000     .0303873     .281869
              WARD |   .4003935   .2408648    -1.52   0.128     .1231486    1.301801
                   |
    diag_delay_cat |   .1281343   .0579422    -4.54   0.000     .0528144    .3108694
------------------------------------------------------------------------------------
with I and J omitted.
I'm curious how and why Stata chooses these different hospital names to omit. I guess this is just the way the algorithm works, but it always chooses two categories to omit no matter how I change the data.
If you leave hospital type out of the model, no hospital names are dropped.
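A possible way to check where the count of two comes from is sketched below on a cut-down copy of the fake data (hospitals H to L only, one type each). The outcome `y` is random noise and `regress` stands in for `stcox` purely to trigger the collinearity check; both are assumptions for illustration, not part of the model above. If every hospital name sits inside exactly one hospital type, each non-base type indicator is a sum of name indicators (given the constant), so with three types there would be two redundant columns.

```stata
* Cut-down copy of the fake data: one row per hospital
clear
input str1 hosp_name str8 hosp_type
"H" "district"
"I" "tertiary"
"J" "base"
"K" "tertiary"
"L" "district"
end

* Replicate each hospital 10 times so a model can be fitted
expand 10

* Numeric versions of the string variables (levels assigned alphabetically)
encode hosp_name, generate(hosp_name_cat)
encode hosp_type, generate(hosp_type_cat)

* If names are nested in types, each row has exactly one non-zero cell
tabulate hosp_name_cat hosp_type_cat

* Placeholder outcome (random noise), hypothetical and for illustration only
set seed 12345
generate y = rnormal()

* regress is a stand-in for stcox; it also drops collinear factor levels
regress y i.hosp_type_cat i.hosp_name_cat
```

On this toy data the design has seven non-base columns (constant, two types, four names) but only five distinct hospitals, so two terms should be flagged as omitted. Which two are flagged depends on the order of the levels and on the base categories, which would fit with the omitted names changing when rows are added or removed.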
Thanks and regards
Chris