  • Discrete time proportional hazard analysis: pgmhaz and hshaz with age as duration variable

    Hi!
    Apologies if my question seems too easy, but I am using Stata for my bachelor thesis and couldn't find any help in the manuals.
    I am using discrete-time proportional hazard models (pgmhaz and hshaz) to estimate the effect of firm age on the takeover hazard.
    Takeover is my dependent variable and gvkey my id variable. Age is computed as the years elapsed since first listing, plus one.
    My regression will also include a set of time-varying control variables such as sales and total assets. For illustration, here is a fictitious extract from my sample (the sample period begins in 1978):

    input gvkey Takeover Age Year X1 X2
    1 0 1 1990 14.5 25
    1 0 2 1991 23.2 26
    1 1 3 1992 72.2 29.1
    2 0 5 1978 19 17
    2 1 6 1980 19 18.2
    3 0 1 2000 12.3 9.4
    3 0 4 2003 12 10
    3 0 5 2004 12 11
    3 0 7 2006 34 23.3
    end

    As I am assuming that every firm is at risk of takeover from listing onwards, age is my duration/sequence variable.
    There are sometimes gaps of several years between observations (e.g. gvkey = 3: no observations between age 1 and age 4), and I have time-varying variables, so I didn't want to fill in the missing values with the previous values. That's why I ran the regressions without stsplitting my data.
    I assume there should be no problem with left truncation, since there are no observations for the age periods prior to entry into the study (e.g. gvkey = 2).
    For the baseline hazard and duration dependence I created ln_age = ln(age) and ran the regressions with these variables:

    pgmhaz8 ln_age X1 X2, id(gvkey) seq(age) dead(takeover) nolog
    hshaz ln_age X1 X2, id(gvkey) seq(age) dead(takeover) nolog

    Both regressions run without problems, but the results are inconsistent with what I expected: I'm getting a large positive coefficient for ln_age. I'm following an analysis outlined in the previous literature. I know that my results may differ for various reasons other than the regression specification itself, so my question is: did I make a mistake in the steps I described above, or is the fact that I did not stsplit my data influencing my results? Should I create a separate duration variable and use it to describe duration dependence instead of using age?

    Thank you in advance!

  • #2
    Lisa: welcome to Statalist. Please take a few minutes to read the Forum FAQ and digest the tips that will help interested readers to help you. Note in particular the FAQ's recommendation to use CODE delimiters to report Stata input and output (to improve legibility and to facilitate cut/paste) and the mention of using dataex (on SSC) to derive sample extracts to supplement your question. Also, you should say where user-written programs come from. In this case, pgmhaz8 and hshaz are on SSC.

    Your sample data are hard to decipher, but I suspect that you have not set them up correctly. You need to have one row in your dataset corresponding to each year that each of your firms is at risk of bankruptcy from the year first at risk. (You do not need stsplit to create this data set; expand can be used.) What appears in the seq() option should be a variable with values for each firm that are consecutive integers and you do not appear to have this in your "age" variable. (The integers start at 1 if the data are not left-truncated; and potentially at some larger integer if left-truncated.) More about all this in the materials at http://www.iser.essex.ac.uk/survival-analysis.
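
    One way to check the seq() requirement, as a minimal sketch: flag the rows where the sequence variable jumps by more than 1 within a firm. The toy rows below and the helper variable gap are illustrative assumptions, not taken from the poster's real data.

```stata
* Sketch: does the seq() variable rise by exactly 1 within each firm?
* gvkey and age follow the post; gap is a hypothetical helper variable.
clear
input gvkey takeover age
1 0 1
1 0 2
1 1 3
3 0 1
3 0 4
3 0 5
end
* gap is missing in each firm's first row, 1 where at least one year is skipped
bysort gvkey (age): gen byte gap = age - age[_n-1] > 1 if _n > 1
list gvkey age gap, sepby(gvkey)
* number of rows that follow a gap
count if gap == 1
```

    A non-zero count means the data are not yet in the one-row-per-period form that seq() expects.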


    • #3
      Thank you, Stephen, for your comments. As you suggested, I used dataex to create the example:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input double(gvkey fyear sale at) float(age takeover)
      1003 1983   13.793    8.529  1 0
      1003 1986   36.308   14.586  4 0
      1003 1987   37.356   16.042  5 0
      1003 1988   32.808    16.28  6 0
      1005 1978    8.185    7.164  7 0
      1005 1979   13.622    9.251  8 0
      1005 1980   23.382   15.504  9 0
      1005 1981   35.921    24.48 10 1
      end
      expand age
      Just as an explanation: gvkey is the firm id. So gvkey = 1003 refers to one firm, and for firm 1003 I have 4 rows of data, from the fiscal years 1983, 1986, 1987 and 1988, with the corresponding time-varying values of sale and at. Firm 1003 entered the study in 1983, when it became listed and therefore at risk of takeover, and left the study in 1988, when it delisted, which corresponds to 6 years of study time/age.

      If I expand my data on age, I will, for example, get a total of 16 data rows for firm 1003: Stata will create 3 additional rows for age = 4, 4 additional rows for age = 5 and 5 for age = 6. But my understanding was that I just need to fill in the missing "age rows", i.e. for gvkey = 1003, one row for age = 2 and one for age = 3, so that age becomes a consecutive number. For gvkey = 1005 the structure should already be fine, since age is consecutive: it starts at age 7 because my data are left-truncated, and ends at age 10, when the firm was taken over.
      Am I wrong in my assumption? Is the structure created by the "expand age" command the right one?

      Thank you for the reference to the materials! I had already come across them when I first started my thesis, and they were very helpful. However, the lectures and the exercise do-files focus on datasets that have only one row of data per person and then use expand to create multiple rows depending on study time, so I'm unsure how to handle my data, which already have multiple records/rows per firm.


      • #4
        Thanks for the more legible dataex output! The most important thing in what I said was:
        You need to have one row in your dataset corresponding to each year that each of your firms is at risk of bankruptcy from the year first at risk.
        So, it now seems that, yes, you don't have to expand your data; what you need to do is "fill in" the data set with additional observations on year (where there are currently gaps) and then ensure the age variable is correct in all rows for each firm.
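
        A minimal sketch of that "fill in" step, using the dataex extract from #3. It assumes that age equals fyear minus the first listing year plus one, and that a firm observed again in a later year was not taken over in the skipped years; listyear is a hypothetical helper variable.

```stata
* Sketch under the assumptions stated above; listyear is a hypothetical helper.
clear
input double(gvkey fyear sale at) float(age takeover)
1003 1983   13.793    8.529  1 0
1003 1986   36.308   14.586  4 0
1003 1987   37.356   16.042  5 0
1003 1988   32.808    16.28  6 0
1005 1978    8.185    7.164  7 0
1005 1979   13.622    9.251  8 0
1005 1980   23.382   15.504  9 0
1005 1981   35.921    24.48 10 1
end
* declare the panel, then add one row for every skipped year within each firm
tsset gvkey fyear
tsfill
* recover the listing year from the observed rows (new rows have missing age)
bysort gvkey: egen listyear = min(fyear - age + 1)
replace age = fyear - listyear + 1 if missing(age)
* interior gap years precede a later observation, so no takeover happened then
replace takeover = 0 if missing(takeover)
list gvkey fyear age takeover sale, sepby(gvkey)
```

        With these data, firm 1003 gains rows for 1984 and 1985 (age 2 and 3), and firm 1005 is unchanged. The covariates sale and at remain missing in the added rows.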

        The more substantive question is: what do you know about the firms in those years that you don't currently list? Put differently, what we've discussed so far can help get your survival time and elapsed duration counter correctly set up ... but if you fill in rows for firms, what are the values of "sale" in those currently-not-included years? The data might be available (it's hard to tell from what you've told us) or perhaps they are missing altogether, in which case you might have to impute some values somehow.
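
        If the values really are unavailable, one possibility among several is linear interpolation between the observed years. This is only a sketch: whether interpolation is defensible for your covariates is a substantive judgment, and sale_imputed and sale_ip are hypothetical variable names.

```stata
* Sketch: linear interpolation is one assumption among many, not a recommendation.
* sale_imputed and sale_ip are hypothetical helper variables.
clear
input double(gvkey fyear sale)
1003 1983 13.793
1003 1986 36.308
1003 1987 37.356
end
tsset gvkey fyear
tsfill
* keep a flag so that imputed rows can be identified or dropped later
gen byte sale_imputed = missing(sale)
* interpolate sale linearly on fyear, separately for each firm
bysort gvkey: ipolate sale fyear, generate(sale_ip)
list gvkey fyear sale sale_ip sale_imputed, sepby(gvkey)
count if sale_imputed
```

        Keeping the flag makes it easy to re-run the models with and without the imputed rows and see how sensitive the estimates are.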
