Hi!
Apologies if my question seems too easy, but I am using Stata for my bachelor thesis and couldn't find any help in the manuals.
I am using discrete-time proportional hazards models to estimate the effect of firm age on the takeover hazard: pgmhaz and hshaz.
Takeover is my dependent variable and gvkey is my id variable. Age is computed as years elapsed since first listing, plus one.
My regression will also include a set of time-varying control variables such as sales and total assets. For illustration, here is a fictitious extract from my sample (the sample period begins in 1978):
input gvkey takeover age Year X1 X2
1 0 1 1990 14.5 25
1 0 2 1991 23.2 26
1 1 3 1992 72.2 29.1
2 0 5 1978 19 17
2 1 6 1980 19 18.2
3 0 1 2000 12.3 9.4
3 0 4 2003 12 10
3 0 5 2004 12 11
3 0 7 2006 34 23.3
end
As I am assuming that every firm is at risk of takeover from listing onwards, age is my duration/sequence variable.
There are gaps of several years between some observations (e.g. gvkey = 3: no observations between age 1 and age 4), and because I have time-varying variables I didn't want to fill in the missing values with the previous values. That's why I ran the regressions without stsplitting my data.
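In case it helps, this is roughly how the gaps can be listed. It is only a sketch: it assumes the lower-case variable names age and gvkey that I use in the estimation commands, and gap is just a helper name I made up for this check.

```stata
* Sketch: flag gaps in the age sequence within each firm.
* Assumes the variables are named gvkey and age (lower case);
* gap is a hypothetical helper variable, not part of my real data.
bysort gvkey (age): generate gap = age - age[_n-1] if _n > 1

* Rows where more than one year has passed since the firm's previous record
list gvkey age gap if gap > 1 & !missing(gap)
```

For the extract above, this should flag the age 4 and age 7 records of gvkey 3.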
I assume there should be no problem with left truncation, since there are no observations for the age periods prior to entry into the study (e.g. gvkey = 2).
For the baseline hazard and duration dependence I created ln_age = ln(age) and ran the regressions with the following variables:
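The log-age variable was created along these lines (a minimal sketch, again assuming the variable is named age in lower case):

```stata
* Natural log of firm age. Age starts at 1 (years since listing plus one),
* so ln(age) is defined for every record and equals 0 in the first year.
generate ln_age = ln(age)
```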
pgmhaz8 ln_age X1 X2, id(gvkey) seq(age) dead(takeover) nolog
hshaz ln_age X1 X2, id(gvkey) seq(age) dead(takeover) nolog
Both regressions run without problems, but the results are inconsistent with what I expected: I get a large positive coefficient on ln_age. I am following an analysis outlined in the previous literature. I know that my results may differ for various reasons other than the regression specification itself, so my question is: did I make a mistake in the steps described above, or is the fact that I did not stsplit my data influencing my results? Should I create a separate duration variable and use it to describe duration dependence instead of using age?
Thank you in advance!