  • Question about streg

    Hello All:


    I realize that this is a statistics question rather than a Stata question, and I hope this is OK. Here is my situation. I have some event history data. I have approximately 100 cases, each of which is an organization. Each line of data is an organization-year. My dependent variable is a dummy that takes a value of 0 if the group is alive in the given year, and a 1 the year that the group dies. As soon as a group gets a 1 on the dependent variable, it exits the data.
    I am using the streg command to estimate a model that seeks to determine what sorts of things affect group death. One of the variables I would like to include in the model is AGE. So, for the first year a group is in the data it would have an AGE value of 1, and so on. I realize that estimating a Cox model with an AGE variable is impossible. However, is using the streg command and including an AGE variable in this manner a reasonable thing to do? I appreciate any feedback you can give me.

  • #2
    Hello Tony,

    Maybe you could get more help if you present an example of the display of your data, the commands you already used and the output.

    This being said, I wish to comment briefly on four points:

    First, and theoretically, I think Cox regression poses no obstacles for inserting "age" as a variable.
    Second, - streg - may be an interesting choice when estimating the baseline hazard is a matter of concern or, among other possibilities, when you want results in the accelerated failure-time metric. You presented little information about your data but, apparently, a parametric survival analysis does not seem to be what you need.
    Third, you said each line represents an organization, but you also mentioned "groups". That was not clear to me. Are both the same thing, or does "group" mean "group of organizations"?
    Finally, excuse me if I have misunderstood your query, but I got the impression that what you mean by "age" could well be the "time variable" already specified when you ran the - stset - command.

    Hopefully that helps.

    Best,

    Marcos


    • #3
      Hi Marcos:
      Thanks so much for responding. Basically, I have time series cross-sectional data for around 100 organizations, and my goal is to find out what sorts of things cause groups to die. (Yes, in the other post I used the terms “organizations” and “groups” interchangeably. Sorry!) Here is what the data look like for a hypothetical case of an organization that was founded in 1989 and died in 1996. The lines of data would look like this:
      Group Name   Year   Dead   IV1   IV2   Time (age)
      Group A      1989   0      55    0     1
      Group A      1990   0      53    1     2
      Group A      1991   0      58    1     3
      Group A      1992   0      36    1     4
      Group A      1993   0      43    0     5
      Group A      1994   0      67    0     6
      Group A      1995   0      45    1     7
      Group A      1996   1      23    1     8
      Dead is my dependent variable, and I have a few other independent variables, here denoted by IV1 and IV2. (In this hypothetical example, these are meaningless). Then, I have an age variable, which as you point out, is indeed the same as a time variable (which is indeed created when I stset the data).

      I hope this makes sense. When I estimate a Cox model using stcox, I get a standard error for the time estimate that is so huge as to be nonsensical. When I use streg instead, however, I get a nice estimate that is even significant. Perhaps I am in over my head here… Perhaps relogit is a better choice? I am just not sure. I thought streg made sense because all I really want to do is determine which of my variables, including time (age), increase or decrease the probability of death.
      As always, I would appreciate any and all guidance, and thanks for your patience with a neophyte… Tony



      • #4
        Tony:
        - Echoing one of Marcos' helpful recommendations in particular, I would encourage you to post what you typed and what Stata gave you back (as per the FAQ).
        - -streg- offers plenty of parametrization choices (obviously, it is up to you to select the one that makes sense in your research field, perhaps following the strategy that others have used on the same research topic), and these work differently from semi-parametric Cox regression. Hence, it is no wonder that you found wide differences between the two approaches.
        Kind regards,
        Carlo
        (StataNow 18.5)



        • #5
          stcox explicitly does not estimate an effect of time; it just adjusts for it in a non-parametric manner. This is a strength and a weakness. It is a strength in the sense that you cannot make an error in something you do not estimate. It is a weakness in the sense that you cannot interpret something you do not estimate. Since you seem to be interested in the effect of time (age), a Cox model is not for you; it is a great model, but it does not answer your question. Instead, you can look at the parametric survival models, or you could look at stpm2 (type -findit stpm2- in Stata and follow the instructions) for a more flexible alternative.
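
          A minimal sketch of the two routes mentioned above, assuming the data are already -stset-. The covariates x1 and x2 are hypothetical placeholders, not variables from this thread, and df(3) is only an illustrative choice for the number of spline degrees of freedom. stpm2 is user-written, so it has to be installed first; it may also ask for the rcsgen package.

```stata
* Parametric route: the Weibull shape parameter p summarizes
* how the hazard changes with analysis time
streg x1 x2, distribution(weibull)

* Flexible parametric route (user-written command, installed from SSC)
ssc install stpm2
ssc install rcsgen      // assumption: needed by recent stpm2 versions, see its help file
stpm2 x1 x2, scale(hazard) df(3) eform
```

          In the Weibull output, p > 1 indicates a hazard that rises with time and p < 1 a hazard that falls with time, which is the "effect of time" that stcox leaves unestimated.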
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------



          • #6
            Hello All... Thanks so much for your help thus far!

            1. I began with stset.


            Then I got this:

            stset duration, id(groupid) failure(died)
            id: groupid
            failure event: died != 0 & died < .
            obs. time interval: (duration[_n-1], duration]
            exit on or before: failure
            ------------------------------------------------------------------------------
            2559 total obs.
            0 exclusions
            ------------------------------------------------------------------------------
            2559 obs. remaining, representing
            137 subjects
            38 failures in single failure-per-subject data
            2559 total analysis time at risk, at risk from t = 0
            earliest observed entry t = 0
            last observed exit t = 55

            As this output suggests, my cases are individual groups (each of which has an ID number), and the event is died. Died is equal to 0 for every year in which a group was alive, and 1 for the year in which the group died. After Died turns to 1 (that is, the group dies), the group exits the dataset.

            2. Next, I run this model:

            streg DensityLag DensityLagSq Salience CRMood CumHaz recent_comp distant_comp duration, distribution(weibull)

            The output looks like this:

            failure _d: died
            analysis time _t: duration
            id: groupid

            Fitting constant-only model:

            Iteration 0: log likelihood = -109.22803
            Iteration 1: log likelihood = -108.60662
            Iteration 2: log likelihood = -108.60451
            Iteration 3: log likelihood = -108.60451

            Fitting full model:

            Iteration 0: log likelihood = -108.60451
            Iteration 1: log likelihood = -102.48844
            Iteration 2: log likelihood = -97.284123 (not concave)
            Iteration 3: log likelihood = -96.765768
            Iteration 4: log likelihood = -95.48229
            Iteration 5: log likelihood = -94.352831
            Iteration 6: log likelihood = -94.118324
            Iteration 7: log likelihood = -94.106964
            Iteration 8: log likelihood = -94.106906
            Iteration 9: log likelihood = -94.106906

            Weibull regression -- log relative-hazard form

            No. of subjects = 137 Number of obs = 2558
            No. of failures = 38
            Time at risk = 2558
            LR chi2(8) = 29.00
            Log likelihood = -94.106906 Prob > chi2 = 0.0003

            ------------------------------------------------------------------------------
                      _t | Haz. Ratio  Std. Err.        z   P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
              DensityLag |   1.022021   .0358664     0.62   0.535     .9540877    1.094792
            DensityLagSq |   1.000424   .0002903     1.46   0.144     .9998549    1.000993
                Salience |   5.441066   13.14681     0.70   0.483     .0477522     619.975
                  CRMood |   1.254148   .1358021     2.09   0.036     1.014329    1.550669
                  CumHaz |   .0084495   .0617747    -0.65   0.514     5.05e-09     14125.5
             recent_comp |   .9631279   .0120582    -3.00   0.003     .9397819    .9870538
            distant_comp |   1.015353   .0059239     2.61   0.009     1.003808     1.02703
                duration |   .5766607   .1142344    -2.78   0.005     .3911112    .8502377
            -------------+----------------------------------------------------------------
                   /ln_p |   1.945866   .2583868     7.53   0.000     1.439437    2.452295
            -------------+----------------------------------------------------------------
                       p |   6.999692   1.808628                      4.218322    11.61497
                     1/p |   .1428634    .036914                      .0860958    .2370611
            ------------------------------------------------------------------------------

            I am most interested in two variables—recent_comp and distant_comp. Theory suggests that the first should be negatively associated with the dependent variable (that is, as it goes up, the probability of death should go down), and the second should be positively associated with the dependent variable (as it goes up, the probability of death should go up too). It appears to me that this is precisely what these results show. Moreover, duration—which is essentially the age of the group—is significant as well, which also fits the theory.

            But here is where things get difficult for me. When I run this instead (which is the same model without the duration term)….

            3. streg DensityLag DensityLagSq Salience CRMood CumHaz recent_comp distant_comp, distribution(weibull)

            The results are much, much different. Here they are:
            failure _d: died
            analysis time _t: duration
            id: groupid

            Fitting constant-only model:

            Iteration 0: log likelihood = -109.22803
            Iteration 1: log likelihood = -108.60662
            Iteration 2: log likelihood = -108.60451
            Iteration 3: log likelihood = -108.60451

            Fitting full model:

            Iteration 0: log likelihood = -108.60451
            Iteration 1: log likelihood = -102.41646
            Iteration 2: log likelihood = -100.65135
            Iteration 3: log likelihood = -100.59864
            Iteration 4: log likelihood = -100.59857
            Iteration 5: log likelihood = -100.59857

            Weibull regression -- log relative-hazard form

            No. of subjects = 137 Number of obs = 2558
            No. of failures = 38
            Time at risk = 2558
            LR chi2(7) = 16.01
            Log likelihood = -100.59857 Prob > chi2 = 0.0250



            ------------------------------------------------------------------------------
                      _t | Haz. Ratio  Std. Err.        z   P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
              DensityLag |   .9606186   .0268427    -1.44   0.150     .9094227    1.014697
            DensityLagSq |    1.00036   .0002732     1.32   0.188     .9998246    1.000896
                Salience |   6.249215   15.17448     0.75   0.450     .0535697    729.0065
                  CRMood |    1.12403   .1125045     1.17   0.243     .9238061     1.36765
                  CumHaz |   .0192749   .1414929    -0.54   0.591     1.09e-08    34157.23
             recent_comp |   .9963076   .0058868    -0.63   0.531     .9848363    1.007912
            distant_comp |    .998507   .0021714    -0.69   0.492     .9942603    1.002772
            -------------+----------------------------------------------------------------
                   /ln_p |   .7730337   .1852955     4.17   0.000     .4098612    1.136206
            -------------+----------------------------------------------------------------
                       p |   2.166328    .401411                      1.506609    3.114929
                     1/p |   .4616105   .0855344                      .3210346    .6637424
            ------------------------------------------------------------------------------

            As you can see, now the recent_comp and distant_comp variables do not come close to statistical significance. Moreover, distant_comp changes signs!
            I am not sure what to do! Just for comparison purposes, here is what happens when I use relogit instead:

            4. relogit died DensityLag DensityLagSq Salience CRMood CumHaz recent_comp distant_comp duration

            (1 missing value generated)
            Corrected logit estimates Number of obs = 2558
            ------------------------------------------------------------------------------
                         |               Robust
                    died |      Coef.  Std. Err.        z   P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
              DensityLag |   -.055275   .0280083    -1.97   0.048    -.1101703   -.0003797
            DensityLagSq |   .0002711   .0002421     1.12   0.263    -.0002033    .0007456
                Salience |   1.909248   2.244534     0.85   0.395    -2.489957    6.308453
                  CRMood |   .0845947   .0897526     0.94   0.346    -.0913173    .2605066
                  CumHaz |  -1.465713   6.270209    -0.23   0.815     -13.7551    10.82367
             recent_comp |   .0055305   .0046471     1.19   0.234    -.0035777    .0146386
            distant_comp |   -.004173   .0020091    -2.08   0.038    -.0081108   -.0002352
                duration |   .0585928   .0263321     2.23   0.026     .0069828    .1102028
                   _cons |  -7.541998   4.463431    -1.69   0.091    -16.29016    1.206166
            ------------------------------------------------------------------------------

            These results are not so good for me… Again distant_comp takes on a sign different from it had when I ran streg, AND it contradicts the theory to boot!

            So, in the end, I am looking for guidance. Again, I know this is more a statistical consulting question than a Stata question, and for that I am sorry. I hope it is not inappropriate! I just appreciate so much the knowledge of all you statalisters that I thought I would go ahead and tap your brains for some advice. Thanks so much for all your help thus far!



            • #7
              Hello Tony,

              I'm not sure if I can help you much with your model, mostly because (apart from being from a different field) I still don't know the characteristics of your variables and therefore it becomes tough to understand the rationale of your preferences.

              But I'll try anyway.

              This being said (and that was implied in my last message), "age" was already inside your model (under - stset -), since "age" is not really the age of the organization but, according to what you mentioned, just a name you chose for the survival time. Therefore, I believe you can avoid inserting it in the regression.

              I also noticed you have (only) 38 "failures" in just 137 organizations over a (long) time span of 50 years... That deserves thorough reflection. Please reconsider the predictors you decided to include and, by the way, check whether they are in line with previous research in this field.

              Furthermore, you may consider dropping the squared terms, since it's best to start "simpler" and you have few events anyway. Besides, they seemed to be nonsignificant in your tentative models. By the way, since we are talking about "simplifications", have you checked the Kaplan-Meier curves for the whole sample as well as by dichotomized versions of the variables you consider important to "explain" the failures?
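
              A rough sketch of that Kaplan-Meier check, assuming the data are already -stset- as shown earlier in the thread. The median split and the variable name high_recent are illustrative choices only; any theoretically meaningful cut-point could be used instead.

```stata
* Kaplan-Meier survivor curve for all organizations together
sts graph

* Hypothetical median split of one predictor (high_recent is an illustrative name)
summarize recent_comp, detail
generate byte high_recent = recent_comp > r(p50) if !missing(recent_comp)

* Survivor curves and log-rank test by the dichotomized predictor
sts graph, by(high_recent)
sts test high_recent
```

              If the predictor changes from year to year, an organization can move between the two curves over time, so the comparison is descriptive rather than a clean two-group contrast.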

              To end, I didn't understand what the variable named "CumHaz" means in your model. Is it the cumulative hazard from the very same model? If so, you may also think about getting rid of it.

              Hopefully that might be of help.

              Best,

              Marcos



              • #8
                Marcos and others. So I suppose this is what it comes down to... I really cannot include an Age variable in this model. Is this basically correct? Again, I would like to determine what effect Age has on the death of a group, but as I understand your answer, I simply cannot do it after I --stset-- my data. That is unfortunate, but I suppose I am glad I know it now.

                As for small number of failures...only 38...Yes, that makes things difficult. All the predictors I include are theory-driven, but the small number of failures make this analysis difficult at best, and perhaps silly at worst!

                I think I shall take your advice and dump the squared terms. This makes sense. And yes, CumHaz is the cumulative hazard. I will dump that too, and see what happens.
                Finally, the big problem for me is this: when I DO include the age term in the model (which, as you say, seems to make no sense), the results are good for me. Without it, the results are not so good. So I suppose I should just accept that maybe my results are null.

                At any rate, I appreciate all your input! It has been very helpful!

                Tony



                • #9
                  Originally posted by Tony Silva:
                  So I suppose this is what it comes down to... I really cannot include an Age variable in this model. Is this basically correct?
                  That is not correct. If you fit a parametric survival model, you already let the hazard change over time, so your variable age is already in your model even if you don't include it. You can see how time affects the hazard using stcurve. However, if you also manually include age in your model, this no longer works correctly, and it becomes very hard to interpret the effect of age because there are now two competing effects of time in one model. You can either estimate your Weibull model without age and look carefully at stcurve, or estimate an exponential model, which assumes the hazard does not change over time, and include some function of time. That way you don't have two competing effects of time in the same model, and it becomes feasible to interpret the effect of time. A popular choice is a set of indicator variables (dummies) for different age groups, the so-called piecewise constant model.
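
                  The two options could be sketched roughly as follows, assuming the data are -stset- as in post #6. The covariate list is shortened for illustration, the cut-points at 5, 10 and 20 years are hypothetical, and agegrp is an illustrative variable name.

```stata
* Option 1: Weibull model without the age term; time enters through
* the shape parameter p, and stcurve plots the implied hazard
streg recent_comp distant_comp, distribution(weibull)
stcurve, hazard

* Option 2: piecewise-constant exponential model
* agegrp = 0 for age <= 5, 1 for 5 < age <= 10, 2 for 10 < age <= 20, 3 for age > 20
generate byte agegrp = irecode(_t, 5, 10, 20)
streg recent_comp distant_comp i.agegrp, distribution(exponential)
```

                  Coding the age groups directly from _t works here only because each record already covers a single year and the cut-points are whole years; with longer records, -stsplit- would be needed to split episodes at the cut-points first.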
                  Last edited by Maarten Buis; 23 Mar 2015, 15:13.



                  • #10
                    Maarten: That is perfect... thanks so much. I really appreciate this advice. Tony



                    • #11
                      Hi Maarten. Would such an age (time) variable as described above already be in the stcrreg competing-risks model as well? In other words, would it be incorrect to include such an age variable in a stcrreg model? From what I can gather, stcrreg is also a parametric model. Thanks! Best, Erik
