  • Dealing with multicollinearity in Cox regression

    I am running a Cox regression in Stata. One of the things I would like to show with the data is how adoptions and abandonments of Zoom in different branches of a multinational company affect adoption. The diagnostic statistics indicate that two variables of interest are highly collinear (correlation 0.72), which I check with
    Code:
    estat vce, corr
    I tried adding each variable to the model separately, but the results do not match those from the fully saturated model.

    Specifically, if I run:
    Code:
    stcox Age Size NoAdoption NoAbandonment i.date, strata(company)  vce(cluster company)
    The coefficient on NoAdoption is positive and the coefficient on NoAbandonment is negative.

    If I run:
    Code:
    stcox Age Size NoAdoption i.date, strata(company)  vce(cluster company)
    The coefficient on NoAdoption is positive, which is consistent with the full model above.

    If I run:
    Code:
    stcox Age Size NoAbandonment i.date, strata(company)  vce(cluster company)
    The coefficient on NoAbandonment is positive, which is inconsistent with the full model above and with what the theory would predict.

    Is there a way to deal with multicollinearity in Cox models (e.g., L1 or L2 regularization)?
    I thought about orthogonalization but learned that it is not considered good practice in my field (or perhaps in econometrics in general).
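For scale, a pairwise correlation of 0.72 implies only a modest variance inflation, which can be computed directly. A quick illustration in Python (not Stata output):

```python
# Variance inflation factor (VIF) implied by a pairwise correlation r
# between two regressors; standard errors scale roughly with sqrt(VIF).
def vif_from_corr(r):
    return 1.0 / (1.0 - r ** 2)

r = 0.72
vif = vif_from_corr(r)
print(f"VIF = {vif:.2f}, SE inflation factor = {vif ** 0.5:.2f}")
```

With r = 0.72 the VIF is about 2.08, so standard errors are only about 1.44 times larger than they would be with orthogonal regressors — far from pathological.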





  • #2
    Vera:
    the main issue here is whether the interaction gives a fair(er) and true(r) view (vs. no interaction) of the data-generating process you're investigating, on the grounds of the existing literature.
    That said, it is expected that interactions show high collinearity.
    I would check whether the standard errors of the conditional main effects and interactions look weird vs. some standard that is common in your research field.

    Kind regards,
    Carlo
    (StataNow 18.5)



    • #3
      Vera: Although you haven't shown us your output (please read the FAQ), I can't see a situation where you'd be justified in dropping one of those variables. It appears there are three Zoom states, with two being described by NoAdoption and NoAbandonment, and there must be an omitted category. If so, you need to include both of those variables in the model. That they happen to have a correlation around 0.72 is neither here nor there. Did someone tell you not to include both if the correlation is above a certain level? That's not good advice.

      If you show the Stata output I can comment more. It sounds like when you include both you get results that seem sensible. Dropping one of these variables is a bad idea. How precise are the estimates? If the confidence intervals are sufficiently narrow I wouldn't even look at the correlations among regressors.
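The point about needing both variables can be made concrete with simulated data. An OLS analogue in Python (purely illustrative, not the poster's model or data): when two correlated regressors both truly belong in the model, dropping one can even flip the sign on the other, while including both recovers the true effects despite the correlation.

```python
import numpy as np

# Illustrative simulation: x1 and x2 correlated ~0.7, and BOTH truly
# affect y (true effects +1.0 and -0.5). The full model recovers the
# true signs; regressing on x2 alone flips its sign, because x2
# proxies for the omitted x1.
rng = np.random.default_rng(0)
n = 5000
x1 = rng.standard_normal(n)
x2 = 0.7 * x1 + np.sqrt(1 - 0.7 ** 2) * rng.standard_normal(n)
y = 1.0 * x1 - 0.5 * x2 + rng.standard_normal(n)

X_full = np.column_stack([np.ones(n), x1, x2])
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

X_short = np.column_stack([np.ones(n), x2])
b_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)

print("x2 coefficient, both regressors:", round(b_full[2], 2))  # near -0.5
print("x2 coefficient, x1 dropped:", round(b_short[1], 2))      # positive
```

This mirrors the sign flip on NoAbandonment in the original post: the single-regressor model loads the omitted variable's effect onto the one that remains.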



      • #4
        Dear Professor Lazzaro and Professor Wooldridge, apologies for not being clear. NoAdoption and NoAbandonment are actually count variables: NoAdoption is the number of branches adopting Zoom, whereas NoAbandonment is the number of branches abandoning Zoom. I read the paper by Kalnins (2018), where he recommends adding variables to the model sequentially if there is multicollinearity: Kalnins, A. (2018). Multicollinearity: How common factors cause Type 1 errors in multivariate regression. Strategic Management Journal, 39(8), 2362-2385.
        I am also attaching the new results.
        [Attachment: resutls.png]



        • #5
          You shouldn't just blindly follow rules, because there are no universal rules for doing convincing empirical work. The results in column (1) are clearly preferred. Notice that the standard error on LnNoAdoption is actually smaller when LnNoAbandon is included! So the correlation between them isn't even causing the precision of the estimates to drop. That LnNoAbandon is insignificant could be used to justify dropping it, I guess, but there's no need. NoAdoption has a positive (you need to compute how big the effect is) and marginally statistically significant effect. NoAbandon's effect is not close to being statistically significant.

          Nothing about the correlation between those two variables would change the results I would focus on.
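The observation that the standard error can be *smaller* with the correlated variable included is easy to reproduce in a simulation. An OLS analogue in Python (illustrative only, not the poster's data): the correlated regressor inflates the variance through collinearity but also soaks up residual variance, and the second effect can dominate.

```python
import numpy as np

# Illustrative simulation: the SE on x1 can be SMALLER with the
# correlated x2 included, because x2 also removes residual variance.
# A correlation of ~0.7 is not fatal to precision.
rng = np.random.default_rng(1)
n = 2000
x1 = rng.standard_normal(n)
x2 = 0.7 * x1 + np.sqrt(1 - 0.7 ** 2) * rng.standard_normal(n)
y = 1.0 * x1 - 2.0 * x2 + rng.standard_normal(n)

def ols_se(X, y):
    """Homoskedastic OLS standard errors for each coefficient."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

se_with = ols_se(np.column_stack([np.ones(n), x1, x2]), y)[1]
se_without = ols_se(np.column_stack([np.ones(n), x1]), y)[1]
print(f"SE on x1 with x2 in the model: {se_with:.4f}")
print(f"SE on x1 with x2 dropped:      {se_without:.4f}")
```

Here the SE on x1 is smaller in the full model, matching the pattern in column (1) of the attached results.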



          • #6
            Dear Professor Wooldridge, thank you very much for your reply. Is there a paper of yours that makes a similar point, or is there another source other than the Statalist forum that I might cite?
