Question about streg

Tony Silva

Join Date: Mar 2015

Posts: 5
#1


22 Mar 2015, 17:51

Hello All:

I realize that this is a statistics question rather than a Stata question, and I hope this is OK. Here is my situation. I have some event history data. I have approximately 100 cases, each of which is an organization. Each line of data is an organization-year. My dependent variable is a dummy that takes a value of 0 if the group is alive in the given year, and a 1 the year that the group dies. As soon as a group gets a 1 on the dependent variable, it exits the data.
I am using the streg command to estimate a model that seeks to determine what sorts of things affect group death. One of the variables I would like to include in the model is AGE. So, for the first year a group is in the data it would have an AGE value of 1, and so on. I realize that estimating a Cox model with an AGE variable is impossible. However, is using the streg command and including an AGE variable in this manner a reasonable thing to do? I appreciate any feedback you can give me.
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

22 Mar 2015, 18:38

Hello Tony,

You may get more help if you show an example of your data, the commands you have already used, and the output.

This being said, I wish to comment briefly on four points:

First, and theoretically, I think Cox regression poses no obstacle to including "age" as a variable.
Second, -streg- may be an interesting choice when estimating the baseline hazard is a matter of concern or, among other possibilities, when you want results in an accelerated failure-time metric. You presented little information about your data, but a parametric survival analysis does not seem to be what you need.
Third, you said each line represents an organization, but you also mentioned "groups". That was not clear to me. Are both the same thing, or does "group" mean "group of organizations"?
Finally, excuse me if I have misunderstood your query, but I got the impression that what you mean by "age" could well be the "time variable" already specified when you ran the -stset- command.

Hopefully that helps.

Best,

Marcos

Tony Silva

Join Date: Mar 2015

Posts: 5
#3

22 Mar 2015, 19:26

Hi Marcos:
Thanks so much for responding. Basically, I have time series cross-sectional data for around 100 organizations, and my goal is to find out what sorts of things cause groups to die. (Yes, in the other post I used the terms “organizations” and “groups” interchangeably. Sorry!) Here is what the data look like for a hypothetical case of an organization that was founded in 1989 and died in 1996. The lines of data would look like this:
Group Name   Year   Dead   IV1   IV2   Time (age)
Group A      1989      0    55     0            1
Group A      1990      0    53     1            2
Group A      1991      0    58     1            3
Group A      1992      0    36     1            4
Group A      1993      0    43     0            5
Group A      1994      0    67     0            6
Group A      1995      0    45     1            7
Group A      1996      1    23     1            8
Dead is my dependent variable, and I have a few other independent variables, here denoted by IV1 and IV2. (In this hypothetical example, these are meaningless). Then, I have an age variable, which as you point out, is indeed the same as a time variable (which is indeed created when I stset the data).
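To make the layout concrete, here is a minimal sketch that enters the hypothetical Group A rows above and declares them as survival-time data. The variable names (groupid, year, dead, iv1, iv2, age) are placeholders chosen for illustration, not the names in the real dataset.

```stata
* Hypothetical organization-year data: one group, founded 1989, died 1996
clear
input byte groupid int year byte dead byte iv1 byte iv2 byte age
1 1989 0 55 0 1
1 1990 0 53 1 2
1 1991 0 58 1 3
1 1992 0 36 1 4
1 1993 0 43 0 5
1 1994 0 67 0 6
1 1995 0 45 1 7
1 1996 1 23 1 8
end

* Declare survival-time data: age is the analysis time, dead flags the failure
stset age, id(groupid) failure(dead)

* _t0 and _t bracket each yearly record; _d marks the failure record
list groupid year age _t0 _t _d, sepby(groupid) noobs

* The analysis time created by stset is identical to the age counter
assert _t == age
```

If the final assert passes, the age counter and the analysis time _t carry exactly the same information, which is the point raised above about age already being the time variable.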

I hope this makes sense. When I estimate a Cox model using stcox, I get a standard error for the time estimate that is so huge as to be nonsensical. When I use streg instead, however, I get a nice estimate that is even significant. Perhaps I am in over my head here… Perhaps relogit is a better choice? I am just not sure. I thought streg made sense because all I really want to do is determine which of my variables, including time (age), increase or decrease the probability of death.
As always, I would appreciate any and all guidance, and thanks for your patience with a neophyte… Tony
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17676
#4

23 Mar 2015, 03:11

Tony:
- Echoing one of Marcos' helpful recommendations in particular, I would encourage you to post what you typed and what Stata gave you back (as per the FAQ).
- -streg- offers plenty of parametrization choices (obviously, it is up to you to select the one that makes sense in your research field, perhaps following the strategy others have used for the same research topic), and these work differently from semi-parametric Cox regression. Hence, it is no wonder that you found wide differences between the two approaches.

Kind regards,
Carlo
(Stata 19.0)
Maarten Buis

Join Date: Mar 2014

Posts: 3426
#5

23 Mar 2015, 03:41

stcox explicitly does not estimate an effect of time; it just adjusts for it in a non-parametric manner. This is a strength and a weakness. It is a strength in the sense that you cannot make an error in something you do not estimate. It is a weakness in the sense that you cannot interpret something you do not estimate. Since you seem to be interested in the effect of time (age), a Cox model is not for you; it is a great model, but it does not answer your question. Instead you can look at the parametric survival models, or at stpm2 (type findit stpm2 in Stata and follow the instructions) for a more flexible alternative.
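A short derivation may help show why a covariate that equals analysis time cannot be estimated in a Cox model. This is a sketch in standard textbook notation (ties ignored); the symbols h_0, beta, gamma and R(t_j) are introduced here for illustration and do not come from any output in this thread.

```latex
% Cox model with covariates x_i(t) and an extra term in analysis time t
h_i(t) = h_0(t)\,\exp\!\big(x_i(t)'\beta + \gamma\, t\big)

% Partial-likelihood contribution at failure time t_j, with risk set R(t_j).
% Every subject in the risk set has the same value of t (namely t_j),
% so exp(gamma * t_j) cancels from numerator and denominator.
L_j(\beta,\gamma)
  = \frac{\exp\!\big(x_j(t_j)'\beta + \gamma\, t_j\big)}
         {\sum_{k \in R(t_j)} \exp\!\big(x_k(t_j)'\beta + \gamma\, t_j\big)}
  = \frac{\exp\!\big(x_j(t_j)'\beta\big)}
         {\sum_{k \in R(t_j)} \exp\!\big(x_k(t_j)'\beta\big)}
```

Because gamma drops out of every contribution, the partial likelihood carries no information about it. The time effect is absorbed into the unspecified baseline hazard h_0(t), which may be why the standard error reported for the time variable under stcox was nonsensically large.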

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Tony Silva

Join Date: Mar 2015

Posts: 5
#6

23 Mar 2015, 08:29

Hello All... Thanks so much for your help thus far!

1. I began with stset.

Then I got this:

. stset duration, id(groupid) failure(died)

                id:  groupid
     failure event:  died != 0 & died < .
obs. time interval:  (duration[_n-1], duration]
 exit on or before:  failure

------------------------------------------------------------------------------
       2559  total obs.
          0  exclusions
------------------------------------------------------------------------------
       2559  obs. remaining, representing
        137  subjects
         38  failures in single failure-per-subject data
       2559  total analysis time at risk, at risk from t =         0
                             earliest observed entry t =         0
                                  last observed exit t =        55

As this output suggests, my cases are individual groups (each of which has an ID number), and the event is died. Died is equal to 0 for every year in which a group was alive, and 1 for the year in which the group died. After Died turns to 1 (that is, the group dies), the group exits the dataset.

2. Next, I run this model:

streg DensityLag DensityLagSq Salience CRMood CumHaz recent_comp distant_comp duration, distribution(weibull)

The output looks like this:

failure _d: died
analysis time _t: duration
id: groupid

Fitting constant-only model:

Iteration 0: log likelihood = -109.22803
Iteration 1: log likelihood = -108.60662
Iteration 2: log likelihood = -108.60451
Iteration 3: log likelihood = -108.60451

Fitting full model:

Iteration 0: log likelihood = -108.60451
Iteration 1: log likelihood = -102.48844
Iteration 2: log likelihood = -97.284123 (not concave)
Iteration 3: log likelihood = -96.765768
Iteration 4: log likelihood = -95.48229
Iteration 5: log likelihood = -94.352831
Iteration 6: log likelihood = -94.118324
Iteration 7: log likelihood = -94.106964
Iteration 8: log likelihood = -94.106906
Iteration 9: log likelihood = -94.106906

Weibull regression -- log relative-hazard form

No. of subjects =          137                     Number of obs   =      2558
No. of failures =           38
Time at risk    =         2558
                                                   LR chi2(8)      =     29.00
Log likelihood  =   -94.106906                     Prob > chi2     =    0.0003

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  DensityLag |   1.022021   .0358664     0.62   0.535     .9540877    1.094792
DensityLagSq |   1.000424   .0002903     1.46   0.144     .9998549    1.000993
    Salience |   5.441066   13.14681     0.70   0.483     .0477522     619.975
      CRMood |   1.254148   .1358021     2.09   0.036     1.014329    1.550669
      CumHaz |   .0084495   .0617747    -0.65   0.514     5.05e-09     14125.5
 recent_comp |   .9631279   .0120582    -3.00   0.003     .9397819    .9870538
distant_comp |   1.015353   .0059239     2.61   0.009     1.003808     1.02703
    duration |   .5766607   .1142344    -2.78   0.005     .3911112    .8502377
-------------+----------------------------------------------------------------
       /ln_p |   1.945866   .2583868     7.53   0.000     1.439437    2.452295
-------------+----------------------------------------------------------------
           p |   6.999692   1.808628                      4.218322    11.61497
         1/p |   .1428634    .036914                      .0860958    .2370611
------------------------------------------------------------------------------

I am most interested in two variables—recent_comp and distant_comp. Theory suggests that the first should be negatively associated with the dependent variable (that is, as it goes up, the probability of death should go down), and the second should be positively associated with the dependent variable (as it goes up, the probability of death should go up too). It appears to me that this is precisely what these results show. Moreover, duration—which is essentially the age of the group—is significant as well, which also fits the theory.

But here is where things get difficult for me. When I run this instead (which is the same model without the duration term)….

3. streg DensityLag DensityLagSq Salience CRMood CumHaz recent_comp distant_comp, distribution(weibull)

The results are much, much different. Here they are:
failure _d: died
analysis time _t: duration
id: groupid

Fitting constant-only model:

Iteration 0: log likelihood = -109.22803
Iteration 1: log likelihood = -108.60662
Iteration 2: log likelihood = -108.60451
Iteration 3: log likelihood = -108.60451

Fitting full model:

Iteration 0: log likelihood = -108.60451
Iteration 1: log likelihood = -102.41646
Iteration 2: log likelihood = -100.65135
Iteration 3: log likelihood = -100.59864
Iteration 4: log likelihood = -100.59857
Iteration 5: log likelihood = -100.59857

Weibull regression -- log relative-hazard form

No. of subjects =          137                     Number of obs   =      2558
No. of failures =           38
Time at risk    =         2558
                                                   LR chi2(7)      =     16.01
Log likelihood  =   -100.59857                     Prob > chi2     =    0.0250

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  DensityLag |   .9606186   .0268427    -1.44   0.150     .9094227    1.014697
DensityLagSq |    1.00036   .0002732     1.32   0.188     .9998246    1.000896
    Salience |   6.249215   15.17448     0.75   0.450     .0535697    729.0065
      CRMood |    1.12403   .1125045     1.17   0.243     .9238061     1.36765
      CumHaz |   .0192749   .1414929    -0.54   0.591     1.09e-08    34157.23
 recent_comp |   .9963076   .0058868    -0.63   0.531     .9848363    1.007912
distant_comp |    .998507   .0021714    -0.69   0.492     .9942603    1.002772
-------------+----------------------------------------------------------------
       /ln_p |   .7730337   .1852955     4.17   0.000     .4098612    1.136206
-------------+----------------------------------------------------------------
           p |   2.166328    .401411                      1.506609    3.114929
         1/p |   .4616105   .0855344                      .3210346    .6637424
------------------------------------------------------------------------------

As you can see, now the recent_comp and distant_comp variables do not come close to statistical significance. Moreover, distant_comp changes signs!
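For anyone wanting to quantify how much the duration term changes the fit, the two log likelihoods printed above can be compared by hand. This is only a sketch: the scalar names are made up, and the numbers are copied from the two Weibull outputs in this post.

```stata
* Log likelihoods copied from the two Weibull fits above
scalar ll_with    = -94.106906    // model including duration
scalar ll_without = -100.59857    // model excluding duration

* Likelihood-ratio statistic for the single extra coefficient (1 df)
scalar lr = 2*(ll_with - ll_without)
display "LR chi2(1)  = " %6.2f lr

* Upper-tail chi-squared probability for that statistic
display "Prob > chi2 = " %6.4f chi2tail(1, lr)
```

The statistic works out to about 12.98, which agrees up to rounding with the difference between the two reported LR chi2 values (29.00 and 16.01). A better fit does not by itself make the duration coefficient interpretable, though, because the Weibull baseline already lets the hazard depend on time.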
I am not sure what to do! Just for comparison purposes, here is what happens when I use relogit instead:

4. relogit died DensityLag DensityLagSq Salience CRMood CumHaz recent_comp distant_comp duration

(1 missing value generated)
Corrected logit estimates                          Number of obs   =      2558

------------------------------------------------------------------------------
             |               Robust
        died |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
  DensityLag |   -.055275   .0280083    -1.97   0.048    -.1101703   -.0003797
DensityLagSq |   .0002711   .0002421     1.12   0.263    -.0002033    .0007456
    Salience |   1.909248   2.244534     0.85   0.395    -2.489957    6.308453
      CRMood |   .0845947   .0897526     0.94   0.346    -.0913173    .2605066
      CumHaz |  -1.465713   6.270209    -0.23   0.815     -13.7551    10.82367
 recent_comp |   .0055305   .0046471     1.19   0.234    -.0035777    .0146386
distant_comp |   -.004173   .0020091    -2.08   0.038    -.0081108   -.0002352
    duration |   .0585928   .0263321     2.23   0.026     .0069828    .1102028
       _cons |  -7.541998   4.463431    -1.69   0.091    -16.29016    1.206166
------------------------------------------------------------------------------

These results are not so good for me… Again distant_comp takes on a sign different from the one it had when I ran streg, AND it contradicts the theory to boot!

So, in the end, I am looking for guidance. Again, I know this is more a statistical consulting question than a Stata question, and for that I am sorry. I hope it is not inappropriate! I just appreciate so much the knowledge of all you statalisters that I thought I would go ahead and tap your brains for some advice. Thanks so much for all your help thus far!
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#7

23 Mar 2015, 09:39

Hello Tony,

I'm not sure if I can help you much with your model, mostly because (apart from being from a different field) I still don't know the characteristics of your variables and therefore it becomes tough to understand the rationale of your preferences.

But I'll try anyway.

This being said (and as was implied in my last message), "age" was already inside your model (through -stset-), since "age" is not really the age of the organization but, according to what you mentioned, just a name you chose for the survival time. Therefore, I believe you should avoid including it in the regression.

I also noticed you have (only) 38 "failures" in just 137 organizations over a (long) time span of 50 years... That deserves careful thought. Please reconsider the predictors you decided to include and, by the way, check whether they are in line with previous research in this field.

Furthermore, you may consider dropping the squared terms, since it is best to start "simpler" and you have few events anyway. Besides, they were nonsignificant in your tentative models. By the way, since we are talking about "simplifications", have you checked the Kaplan-Meier curves for the whole sample, as well as by dichotomized versions of the variables you consider important to "explain" the failures?
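A minimal sketch of the Kaplan-Meier check suggested above, assuming the data are already -stset- as shown earlier in the thread. The variable high_recent and the median split are illustrative assumptions, not something taken from the original analysis.

```stata
* Overall Kaplan-Meier survivor function for all organizations
sts graph

* Dichotomize a covariate at its median (high_recent is a made-up name)
summarize recent_comp, detail
generate byte high_recent = recent_comp > r(p50) if !missing(recent_comp)

* Kaplan-Meier curves and a log-rank test by the dichotomized covariate
sts graph, by(high_recent)
sts test high_recent
```

If recent_comp varies from year to year within an organization, a given organization can contribute records to both curves, so the plot is best read as a rough descriptive check rather than a formal comparison.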

Finally, I do not understand what the variable named "CumHaz" means in your model. Would it be the cumulative hazard from the very same model? If so, you may also think about getting rid of it.

Hopefully that might be of help.

Best,

Marcos

Tony Silva

Join Date: Mar 2015

Posts: 5
#8

23 Mar 2015, 09:58

Marcos and others. So I suppose this is what it comes down to... I really cannot include an Age variable in this model. Is this basically correct? Again, I would like to determine what effect Age has on the death of a group, but as I understand your answer, I simply cannot do it after I -stset- my data. That is unfortunate, but I suppose I am glad I know it now.

As for the small number of failures...only 38...Yes, that makes things difficult. All the predictors I include are theory-driven, but the small number of failures makes this analysis difficult at best, and perhaps silly at worst!

I think I shall take your advice and dump the squared terms. This makes sense. And yes, CumHaz is the cumulative hazard. I will dump that too, and see what happens.
Finally, the big problem for me is this: when I DO include the age term in the model--which, as you say, seems to make no sense--the results are good for me. Without it, the results are not so good. So I suppose I should just accept that maybe my results are null.

At any rate, I appreciate all your input! It has been very helpful!

Tony
Maarten Buis

Join Date: Mar 2014

Posts: 3426
#9

23 Mar 2015, 14:10

Originally posted by Tony Silva:

So I suppose this is what it comes down to... I really cannot include an Age variable in this model. Is this basically correct?

That is not correct. If you fit a parametric survival model, you already let the hazard change over time, so your variable age is already in your model even if you don't include it. You can see how time affects the hazard using stcurve. However, if you also manually include age in your model, this no longer works correctly, and it becomes very hard to interpret the effect of age because there are now two competing effects of time in one model. You can do one of two things. Either estimate your Weibull model without age and interpret the time effect by looking carefully at stcurve, or estimate an exponential model, which assumes the hazard does not change over time, and include some function of time as a covariate. That way you don't have two competing effects of time in the same model, and it becomes feasible to interpret the effect of time. A popular choice is a set of indicator variables (dummies) for different age groups, the so-called piecewise constant model.
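The two options can be sketched as follows, assuming the data are already -stset- with duration as the analysis time. The cut points (5, 10, 20), the variable name agegrp, and the shortened covariate list are illustrative assumptions only. In a Weibull model with duration added as a covariate, the hazard has the form p*t^(p-1)*exp(xb + g*t), so both p and g describe how the hazard moves with time; each option below keeps only one of them.

```stata
* Option 1: Weibull model without duration as a covariate,
* then plot the fitted hazard against analysis time
streg recent_comp distant_comp, distribution(weibull)
stcurve, hazard

* Option 2: piecewise-constant exponential model.
* Split each record at the chosen ages; agegrp holds the lower bound
* of the age band that each (split) record falls into.
stsplit agegrp, at(5 10 20)

* The hazard is constant within each age band; the band indicators
* carry the effect of time
streg recent_comp distant_comp i.agegrp, distribution(exponential)
```

With only 38 failures, bands should be wide enough that each one contains several events; a band with no failures would not give a usable estimate for its indicator.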

Last edited by Maarten Buis; 23 Mar 2015, 14:13.

Tony Silva

Join Date: Mar 2015

Posts: 5
#10

24 Mar 2015, 06:44

Maarten: That is perfect... thanks so much. I really appreciate this advice. Tony
Erik Aadland

Join Date: Jul 2014

Posts: 64
#11

09 Mar 2024, 11:44

Hi Maarten. Would such an age (time) variable as described above already be in the stcrreg competing-risks model as well? In other words, would it be incorrect to include such an age variable in a stcrreg model? From what I can gather, stcrreg is also a parametric model. Thanks! Best, Erik