Cox proportional hazard model output

Peter BX

Join Date: Dec 2024

Posts: 2
#1

20 Dec 2024, 21:59

I am using the stcox command to fit a Cox proportional hazards (CPH) regression in Stata. The code looks like this:

stset fyear, failure(adopt == 1) id(state)

stcox state_lnGDP state_lnpop personal_income_pc , vce(cluster state)

stcox state_lnGDP state_lnpop personal_income_pc unemployment_rate , vce(cluster state)

stcox state_lnGDP state_lnpop personal_income_pc unemployment_rate capx_growth_state, vce(cluster state)

My question is: why do I get different hazard ratios (or coefficients) for each independent variable every time I open and re-run the do-file? The significance of each variable also changes. How can I obtain consistent results?
Tags: Cox proportional hazard
Clyde Schechter

Join Date: Apr 2014

Posts: 29587
#2

20 Dec 2024, 22:36

You do not show any example data, nor do you disclose the commands that precede the -stset- command. Without that information nothing specific can be said. What can be said is that this kind of behavior is usually the result of an indeterminate sort of the data. Be aware that if you run -sort variable_list- and the variables in that list do not identify unique observations in the data, the order of the resulting sort is incompletely specified. In situations like this, Stata randomizes the sort order (within the order determined by the variable list you specified). So, for example a command like:

Code:

bysort x y: keep if _n == 1

can, if the pair x y does not determine unique observations in the data set, result in different sort orders each time the command is run. Then when only the first observation in each x y group is kept, it will be a different observation each time the code is executed. So the data set is, from that point on, indeterminate and may give different results for all subsequent calculations each time the code runs.
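As a minimal sketch of how to check for this, using the same placeholder variables x and y as above: -duplicates report- and -isid- both show whether the pair identifies unique observations. The tie-breaking variable z is hypothetical; it stands for any variable that gives a meaningful, unique ordering within each x y group.

```stata
* How many observations share the same values of x and y?
duplicates report x y

* -isid- exits with an error if x and y do not uniquely identify observations
capture noisily isid x y

* If they do not, sort within groups on a further variable so that the
* "first" observation is well defined (z is a hypothetical variable)
isid x y z
bysort x y (z): keep if _n == 1
```

The -isid x y z- line acts as a guard: the do-file stops there if even the extended variable list leaves ties, instead of silently keeping an arbitrary observation.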

So review all the code that precedes your -stset- command and check for any that sort the data. Carefully scrutinize the sorting variables. If you have any such commands that -sort- with a variable list that does not completely specify unique observations, you have the potential for this kind of problem. Note that this can also be subtle: some commands may -sort- the data "behind the scenes" without containing the word sort directly.
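One way to test whether an indeterminate sort is behind the changing results is sketched here, under the assumption that your Stata is recent enough to have -set sortseed-: fix the seed that -sort- uses to break ties, at the top of the do-file. As far as I know, -set seed- alone does not do this, because it governs the random-number functions and not the tie-breaking in -sort-. The seed value below is arbitrary.

```stata
* Diagnostic only: make the tie-breaking in -sort- reproducible
set sortseed 20241220

* ... rest of the do-file follows unchanged ...
```

If the results stop changing from run to run, an indeterminate sort is the culprit. This does not make the selection of observations any less arbitrary, so treat it as a way to find the problem, not to fix it.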

If you cannot identify the source of your problem with this advice, please post back showing all the commands that precede your -stset- command and also example data. Use the -dataex- command to show the example data. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Last edited by Clyde Schechter; 20 Dec 2024, 22:39.
Peter BX

Join Date: Dec 2024

Posts: 2
#3

21 Dec 2024, 08:39

Hi Clyde, thank you for your reply!

Here is my complete code; please take a look.

use "/Users/apple/Desktop/ Firm-level data.dta"

// Generate the state-year level variables of interest from the firm-level data
bysort state fyear: egen state_IS1= mean(IS1_w)
bysort state fyear: egen state_IS2= mean(IS2_w)

bysort state fyear: gen dup = _n
drop if dup>1

// Merge with state-level controls
merge m:1 fyear state using "/Users/apple/Desktop/state-level data.dta"
keep if _merge ==3
drop _merge

gen state_lnGDP = ln(gdp)
gen state_lnpop = ln(pop)

// Define the failure event variable, which indicates whether a state has adopted the law by a given year (adoption is staggered across states)

gen Event = 0
replace Event = 1 if (state == "CA" & fyear >= 2004) | ///
(state == "NJ" & fyear >= 2009) | ///
(state == "RI" & fyear >= 2014) | ///
(state == "NY" & fyear >= 2018) | ///
(state == "DC" & fyear >= 2020) | ///
(state == "WA" & fyear >= 2020) | ///
(state == "MA" & fyear >= 2021)

stset fyear, failure(Event ==1) id(state)

set seed 12345

// Raw Cox regression
stcox state_lnGDP state_lnpop personal_income_pc unemployment_rate capx_growth_state, vce(cluster state)

// Cox regression with key variable state_IS1
stcox state_IS1 state_lnGDP state_lnpop personal_income_pc unemployment_rate capx_growth_state, vce(cluster state)

// Cox regression with key variable state_IS2
stcox state_IS2 state_lnGDP state_lnpop personal_income_pc unemployment_rate capx_growth_state , vce(cluster state)

My goal is to show that all independent variables are insignificant in the three regression models above. The issue is that the hazard ratios and coefficients still change when I exit Stata and re-run the whole do-file: sometimes all variables are insignificant, but sometimes one or two of them become significant.

Last edited by Peter BX; 21 Dec 2024, 08:44.
Clyde Schechter

Join Date: Apr 2014

Posts: 29587
#4

21 Dec 2024, 08:59

OK. You have a whole bunch of -bysort- commands at the beginning. The ones involving -egen, mean()- are not problematic, because the mean does not depend on the order of the data. But there is a potential problem with:

Code:

bysort state fyear: gen dup = _n
drop if dup>1

If there can be more than one observation with the same values of state and fyear, the sort is indeterminate. The next command drops all but the first of those observations. That means that you are keeping a different observation for at least some combinations of state and fyear each time you run the code. Everything from that point on is irreproducible.
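To confirm this, a sketch along the following lines may help. It counts the surplus observations before the -drop-, and then compares the data sets produced by two separate runs of the do-file. The file name run1 is hypothetical.

```stata
* Before the -drop-: how many observations share a state-fyear pair?
duplicates report state fyear

* First run, after the -drop- (state and fyear now identify observations):
sort state fyear
save run1, replace

* Second run, in a fresh Stata session, at the same point in the do-file
* (do not re-run the -save- above in this session):
sort state fyear
cf _all using run1, verbose
```

-cf- compares the data in memory with the saved file observation by observation and reports the variables whose values differ, so any mismatch here shows that the two runs kept different observations.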

The conclusion I draw is that you have multiple observations for at least some combinations of state and fyear, and that those multiple observations are not pure duplicates on all variables. They have conflicting values on some of the other variables, and your code is randomly picking a different one each time you run it, leading to different results from your Cox regressions.
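A rough sketch for finding out which variables carry the conflicting values, to be run just before the -drop-. It assumes nothing beyond the variables state and fyear. Firm identifiers and other firm-level variables will of course show up; what matters is whether any variable used in the regressions is on the list.

```stata
* List variables that take conflicting values within at least one
* state-fyear group
quietly ds state fyear, not
foreach v in `r(varlist)' {
    tempvar differs
    * Sort within group on the variable itself, then compare first and last
    quietly bysort state fyear (`v'): gen byte `differs' = (`v'[1] != `v'[_N])
    quietly count if `differs'
    if r(N) > 0 {
        display "`v' varies within state-fyear: " r(N) " observations affected"
    }
    drop `differs'
}
```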

What to do? It really depends on the nature of the data. Why are there multiple observations for the same value of state and fyear? Do they represent data errors? If so, you should review all of the data management that created that data set and fix it so that only one observation for each state-fyear combination is in the data, and it is the correct one!

If they are not data errors but are really supposed to be there, then your approach to the problem is wrong: selecting a single arbitrary one from the group is not sensible. So you need to rethink what your intention was for the code called out above. Perhaps the Cox regressions should be done with all of the observations. Or perhaps some deletions are needed, but based on something that systematically selects the correct ones, not random ones. For example, perhaps in addition to fyear your data is at the level of the quarter, and you need to run your regressions only on first quarter data. In that case instead of the called out code you would need something like -keep if quarter == 1-. Or perhaps the multiple entries for state and fyear are due to multiple cities and the regressions are to be done only including the capital cities. Or something like that.
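If, for instance, the only things needed from the firm-level file are the state-year means of IS1_w and IS2_w, one possible sketch is to let -collapse- build the state-year data set, so that no observation has to be picked arbitrarily. This assumes that every other variable in the regressions comes from the state-level file, which I cannot tell from what has been posted. The paths are the ones shown in #3.

```stata
use "/Users/apple/Desktop/ Firm-level data.dta", clear

* One observation per state-fyear, holding the means of the firm-level variables
collapse (mean) state_IS1 = IS1_w state_IS2 = IS2_w, by(state fyear)

* state and fyear now uniquely identify observations, so the merge is 1:1
isid state fyear
merge 1:1 fyear state using "/Users/apple/Desktop/state-level data.dta", keep(match) nogenerate
```

Because -collapse- produces the same result regardless of the order of the observations, everything downstream of it is reproducible.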