Cox model with two time varying independent variables: not sure how to approach it

Eduard López

Join Date: Dec 2014

Posts: 48
#1

Cox model with two time varying independent variables: not sure how to approach it

19 Oct 2023, 11:08

I want to perform a proportional hazards model to study if job contracts were affected and became shorter/dismissal was more likely to happen after two labour reforms took place.

I understand that the reforms are my independent variables, but I think that they would be time-varying variables since they only start affecting contracts once they are enforced. My time variable would be the duration of the job contract and the failure event would be contract ending. I am not sure how to approach this with Stata.

A bit more info:

I have a database that spans from the late 70s to 2019 and where the units of observation are job contracts (so each “subject” . I want to know if two labour reforms that were passed in 2010 and 2012, respectively, and that reduced protection against unfair dismissal, had any effect on the duration of job contracts. The database has data on when each contract started, when it finished and the reason for it to end (ie. dismissal, retirement etc.).

I am thus interested in performing survival analysis on the duration of job contracts. I think that my independent variables, the two labour reforms, are two time varying exposure variables, since before 2010 or 2012 no contract was “exposed” to the effects of any of the reforms, then between 2010-2012 contracts are exposed to the 2010 reform and 2012 they are exposed to the 2012 reform.

I don’t have any control group, as all job contracts that were ongoing after the date the reforms I mentioned were enforced were “exposed” to them (ie. if someone was working after the labour reform was passed, the reform affected him or her, a priori). Job contracts kept starting and finishing during the time period studied (so a contract might start before the labour reform was passed and then be exposed to the reform once it is passed, while another might already start after the reform was enforced or another started and ended before the reform was passed). In addition, some contracts were ongoing when data were gathered, so they would be right-censored (they are coded as “finishing” in 2099).

So in my mind, the two labour reforms of 2010 and 2012 would be time varying variables. My units of observation, job contracts, would have a value of 0 for the variable of labour reform 2010 for most of the time, but when it started to be enforced that value would change to 1. Then, when the 2012 labour reform started to be enforced, the 2010 variable would go back to 0 value and another variable for the 2012 reform would assume a value of 1.

Besides the two labour reforms, I also suspect that the economic crisis that hit the country between 2008 and 2014 might also logically affect contracts becoming shorter or ending, so I thought I would also need to treat this as a time varying covariate that only has a value of 1 for contracts during the timespan of 2008 and 2014.

I already have a variable for the duration of job contracts, and another categorical variable if failure happened (if the contract finished and it wasn’t right censored). I thought about creating two variables for “exposure” to the labour reforms: I created a variale that was the substraction of the date the contract ended and the date the reform started to be enforced. But I don’t know what to do after that. Would really appreciate any help.
Tags: cox model, survival analysis
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

19 Oct 2023, 11:25

There are two different ways in which the term "time varying covariates" is used in survival analysis, and you may be confusing them.

From your description I am inferring that what you have is a data set whose unit of analysis is the contract, and their duration is your outcome variable. Some of these contracts exist partially or totally in time periods where no, one, or both of these policy interventions was in effect. In particular, a contract that starts before 2010 and endures until 2014 would begin in an era with no policy in effect. Between 2010 and 2012 one policy would be in effect, and after 2012 both policies would be in effect. This is time varying covariates in the sense of explanatory/predictor variables that are not constant over time. Nevertheless, I am imagining that you expect that whatever the effect of these policies may be, the policy effect is the same in, say, 2014 as it is in 2020. So the value of the predictor variable changes over time, but the effect of that policy on the outcome is the same across all times when it is in effect.

The way to handle this situation is to use the -stsplit- command to break up your data, with each contract having (up to) 3 observations: one observation reflecting the period prior to 2010, the second representing the period between 2010 and 2012 (assuming the contract did not terminate before 2010), and the third representing the period from 2012 until termination or end of study data (again, assuming the contract did not terminate before 2012). You would set your two policy indicator variables both to zero in the first observation, the one for the 2010 policy to 1 and the one for the 2012 policy to 0 in the second observation, and both of them to 1 in the third observation.

Now, if you think that on top of this, the effects of the policies change over time while they are in effect, that is, in ordinary regression language, an interaction between the policy variable and _t (the analysis time variable that Stata creates when you -stset- your data). Confusingly, this is also called time-varying covariates. In fact, this is what is usually meant when the phrase "time-varying covariates" is used. But from your description of the situation I see nothing that suggests that this is your situation. If it is, then there are other complications that come into play and have to be dealt with. If this is where you find yourself, then post back for further advice.

Last edited by Clyde Schechter; 19 Oct 2023, 11:29.
Comment
Eduard López

Join Date: Dec 2014

Posts: 48
#3

20 Oct 2023, 02:31

You are absolutely right Clyde (if I may), the idea for my study design is the one you described: policies only have an effect during some time interval, once they become enforced laws, they do not change over time (I can imagine that, in theory, their enforcement might vary over time, legal precedent might change how they are applied etc. but that is beyond the scope of my study and in any case would be difficult to operativize).

I am going to read into the -stsplit- command to break up the data in different periods. Would it be possible to include another "date based" covariate such as economic crisis so that I would, in effect, have those three observations you mentioned for a) before 2010 b) 2010-2012 c) after 2012 but also a fourth one that overlaps with them for the economic crisis that would be something akin to d) 2008-2014? So that the values would look something like this:

Period variable Value before 2008 Value 2008-2010 Value 2010-2012 Value 2012-2014 Value After 2014

2010 policy 0 0 1 0 0

2012 policy 0 0 0 1 1

Economic crisis 0 1 1 1 0

I was thinking about doing so in order to avoid possible confounding effects of the economic crisis on my predictor variables (since they both began to be enforced during the crisis).

From what I understand, -stsplit- splits the data at fixed intervals of analysis time (ie. 5 units of time after the start of analysis time) and not by calendar dates. But I'm reading this thread where you also were most helpful and explained how to implement calendar dates (https://www.statalist.org/forums/for...interview-data), so hopefully I'll be able to figure it out.

Thank you so much for your help.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

20 Oct 2023, 09:36

Adding the financial crisis variable may or may not prove problematic. If you regress the economic crisis variable on the 2010 and 2012 policy variables, you get an R² of 0.83. So it is not going to be easy to disentangle their overlapping effects. If your data set is sufficiently large, this multi-collinearity will not be a problem. But if it isn't, you could end up with very wide confidence intervals around the coefficients of those effects and an inconclusive study. I think the only way to know is to try it and see what happens. But I wouldn't even try to go there with a small data set.
Comment
Eduard López

Join Date: Dec 2014

Posts: 48
#5

23 Oct 2023, 08:08

The database has several million observations, as it is an administrative registry of labour contracts. Only some of these fall into the period of the Great Recession, but still, I think it's so large that statistical significance won't pose a problem (fingers crossed).

Thank you for your help Clyde. I'm going to sit down and learn about -stsplit-, I'll try to find some tutorials and examples to replicate and then try to adapt them to my case.
Comment
Eduard López

Join Date: Dec 2014

Posts: 48
#6

26 Oct 2023, 13:08

Hello Clyde and friends,

I've been trying my hand at -stsplit-, (reading the Stata manual and following this example from Stata Journal step by step: https://journals.sagepub.com/doi/pdf...867X0400400212) but I feel a bit lost about it.

If I understand correctly, I need to tell -stsplit- to split the data at intervals according to the dates where my predictor variable (whether a policy is in effect or not, let's leave the economic crisis aside for now) changes.
This is an example of the "contract start" variable from my database (the date each contract began):

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long F_ALTA 19881103 19760501 19790808 19761102 19990901 19950101 19971001 19860414 19900110 19920528 end

1) Now, if I'm not wrong, Stata calculates dates based on days from 1st January, 1960. First question: should I generate a new variable that contains the dates cited above as days ellapsed since 1st January 1960? So, for example, the first value (3rd November 1988) would be 10534 in the new variable. I'm asking this because in the thread where you Clyde helped another user make use of -stsplit- with date variables (https://www.statalist.org/forums/for...interview-data), you showed an example where the date variables seem to me to have values of days ellapsed since 1st January 1960. So I wasn't sure if I'm experiencing problems because I'm not doing the same. How do you generate a new variable that calculates dates like that? (For instance, a value of 19600102 in yyymmdd would be a value of 2 in days ellapsed since 1st January 1960).

2) I tried using -stset-, -stsplit-, and then -sts test-, -stsgraph-, -stcox- with the following syntax.

Code:

stset dcom, failure(fincont==1) id(id) stsplit policy, at(18430 19003) sts test (policy), logrank sts graph, by(policy) stcox i.policy

My first attempt at using -stset- did containt the failure even, but no id variable and I was requested to include one when using stsplit later. I have never used the id option at -stset-, what is the reason to use it? Incidentally, I do have an id variable (which identifies if two observations, that is, two contracts, belong to the same person) so I used it.

My -stsplit- included the option at with two values, the date each policy started to be enforced, calculated as days ellapsed since January 1st 1960. So 18430 for 17 June 2010 and 19003 for 12 January 2012 .

However, none of this worked.
-sts test- came up with these results:

Code:

Failure _d: fincont==1 Analysis time _t: dcom ID variable: id Equality of survivor functions Log-rank test | Observed Expected policy | events events -------+------------------------- 0 | 783020 783020.00 18430 | 0 0.00 19003 | 0 0.00 -------+------------------------- Total | 783020 783020.00 chi2(2) = 0.00 Pr>chi2 = 1.0000

-stsgraph- gave me a weird survival curve that looks like this: https://www.dropbox.com/scl/fi/inhfc...wt6aenjy0&dl=0
Before I used -stsplit- I wasn't splitting episodes for my time varying covariate (I was just using a variable for the time period if the contract finished within two year windows) but the curves looked like this: https://www.dropbox.com/scl/fi/ldeff...5yb8fhh7n&dl=0

Finally, -stcox- simple gives me HR of 1:

Code:

Failure _d: fincont==1 Analysis time _t: dcom ID variable: id Iteration 0: log likelihood = -10270777 Refining estimates: Iteration 0: log likelihood = -10270777 Cox regression with Breslow method for ties No. of subjects = 1,016,383 Number of obs = 2,331,011 No. of failures = 783,020 Time at risk = 2847575167 LR chi2(0) = 0.00 Log likelihood = -10270777 Prob > chi2 = . ------------------------------------------------------------------------------ _t | Haz. ratio Std. err. z P>|z| [95% conf. interval] -------------+---------------------------------------------------------------- policy | 18430 | 1 (omitted) 19003 | 1 (omitted) ------------------------------------------------------------------------------ . sts graph, by(policy) Failure _d: fincont==1 Analysis time _t: dcom ID variable: id

I apologize if I'm making very basic mistakes (which I'm sure I am), but I'm new to the use of stsplit and really struggling to understand how should I try to use it. What should I try to get on the right track?
Thanks for your patience and help.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#7

26 Oct 2023, 13:56

1) Now, if I'm not wrong, Stata calculates dates based on days from 1st January, 1960. First question: should I generate a new variable that contains the dates cited above as days ellapsed since 1st January 1960? So, for example, the first value (3rd November 1988) would be 10534 in the new variable. I'm asking this because in the thread where you Clyde helped another user make use of -stsplit- with date variables (https://www.statalist.org/forums/for...interview-data), you showed an example where the date variables seem to me to have values of days ellapsed since 1st January 1960. So I wasn't sure if I'm experiencing problems because I'm not doing the same. How do you generate a new variable that calculates dates like that? (For instance, a value of 19600102 in yyymmdd would be a value of 2 in days ellapsed since 1st January 1960).

Yes, you absolutely must use proper Stata internal format date variables, which are, in fact, calculated as the elapsed days from 1 Jan 1960. Attempting to use numbers like 19881103 as dates in Stata calculations will always fail. So you must convert these. I think the easiest way to do that with your data is to use the -tostring- command and then the -daily()- function.

My first attempt at using -stset- did containt the failure even, but no id variable and I was requested to include one when using stsplit later. I have never used the id option at -stset-, what is the reason to use it?

In your original data set, I suppose, each observation characterized a single person. At least that should have been true if you had no id variable--you would get wrong results without an id variable if the same person could have more than one observation in the data set. It's the same issue as with anything else: observations are presumed to be independent unless some indication to the contrary is given. If you have multiple observations for the same person, those observations are not independent, and Stata needs to know that they should be grouped together. -stsplit- creates multiple observations for the same person, and the way -stset- expects to learn about such a situation is with an id variable.

Concerning the other problems you are having, if you post back using -dataex- to show an example of your data set, as it was before -stsplit-, containing all of the relevant variables and a modest number of observations, I'll try to get this to work for you.
Comment

Eduard López

Join Date: Dec 2014
Posts: 48

27 Oct 2023, 08:48

In your original data set, I suppose, each observation characterized a single person. At least that should have been true if you had no id variable--you would get wrong results without an id variable if the same person could have more than one observation in the data set. It's the same issue as with anything else: observations are presumed to be independent unless some indication to the contrary is given. If you have multiple observations for the same person, those observations are not independent, and Stata needs to know that they should be grouped together. -stsplit- creates multiple observations for the same person, and the way -stset- expects to learn about such a situation is with an id variable.

Yes, the database has an "id" variable, because the data itself are a bit peculiar: instead of each observation corresponding to a single worker, each observatino corresponds to a single job contract: the rationale for this being that a single person might have different contracts throughout his lifetime, but also that the same person might have multiple (usually part-time) contracts at the same time. So the "id" variable is used to identify if multiple contracts belong to the same person.

So, for instance, if we had only three observations, that is, three contracts, and two of them with durations of 200 and 500 days respectively corresponded to the same person, it would look something like:

Variable	Sex	Duration of contract (days)	Id
Contract#1	Male	200	1
Contract#2	Male	500	1
Contract#3	Female	1970	2

Would then that "id" variable be useful for the -stset- command in order to anticipate -stsplit-?

Concerning the other problems you are having, if you post back using -dataex- to show an example of your data set, as it was before -stsplit-, containing all of the relevant variables and a modest number of observations, I'll try to get this to work for you.

Thanks a lot Clyde. Here's the dataex example. I have changed the name of the variables, as the original names were in Spanish, so that they were more self-descriptive in English. Let me know if I should change labels too. I have also tried to generate Stata internal format date variables from my date variables. So you will see that, for instance, the value of 19881103 in "startcont" corresponds to 10534 in "startcontdate".

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long(startcont endcont) float(startcontdate endcontdate) int yearbir byte monthbir float typecont byte causeend float(termcont duration) long id
19881103 19890621 10534 10764 1960 6 3 54 1   230 32
19760501 19761030  5965  6147 1960 6 3 51 .   182 32
19790808 19801115  7159  7624 1960 6 3 74 .   465 32
19761102 19770727  6150  6417 1960 6 3 51 .   267 32
19990901 20991231 14488 51134 1960 6 1  0 . 36646 32
19950101 19970228 12784 13573 1960 6 3 54 1   789 32
19971001 19990831 13788 14487 1960 6 3 54 1   699 32
19860414 19880125  9600 10251 1960 6 3 51 .   651 32
19900110 19920526 10967 11834 1960 6 3 74 .   867 32
19920528 19941231 11836 12783 1960 6 3 54 1   947 32
19970301 19970930 13574 13787 1960 6 3 54 1   213 32
19880314 19880914 10300 10484 1960 6 3 54 1   184 32
19801020 19801120  7598  7629 1955 9 3 54 1    31 34
19810421 19810721  7781  7872 1955 9 3 54 1    91 34
20140902 20190531 19968 21700 1956 6 3 51 .  1732 41
20050601 20091031 16588 18201 1956 6 1 54 1  1613 41
19941004 20050531 12695 16587 1956 6 1 54 1  3892 41
20091101 20130731 18202 19570 1956 6 1 91 .  1368 41
20140902 20190627 19968 21727 1956 6 1 58 .  1759 41
19811210 19821130  8014  8369 1956 6 3 54 1   355 41
19821201 19900430  8370 11077 1956 6 3 51 .  2707 41
19810331 19811130  7760  8004 1956 6 3 51 .   244 41
19900502 19900930 11079 11230 1956 6 3 54 1   151 41
19901001 19930910 11231 12306 1956 6 3 54 1  1075 41
19931213 19940930 12400 12691 1956 6 3 51 .   291 41
19910408 19910711 11420 11514 1970 4 3 51 .    94 48
19990419 19990625 14353 14420 1970 4 . 54 1    67 48
20020515 20020723 15475 15544 1970 4 2 54 1    69 48
19880707 19880922 10415 10492 1970 4 3 51 .    77 48
19960801 19960930 13362 13422 1970 4 . 54 1    60 48
end
format %td startcontdate
format %td endcontdate
label values typecont tipcontlabel
label def tipcontlabel 1 "Indefinidos", modify
label def tipcontlabel 2 "Temporales", modify
label def tipcontlabel 3 "No consta", modify
label values causeend labels3a2
label def labels3a2 0 "NO CONSTA", modify
label def labels3a2 51 "DIMISIÓN/BAJA VOLUNTARIA DEL TRABAJADOR", modify
label def labels3a2 54 "BAJA NO VOLUNTARIA", modify
label def labels3a2 58 "BAJA POR PASE A LA SITUACIÓN DE PENSIONISTA", modify
label def labels3a2 74 "BAJA POR SUSPENSIÓN DEL CONTRATO", modify
label def labels3a2 91 "BAJA. CAUSAS OBJ.EMPRESA Y OTRAS RELAT.EMPRESA", modify

Thank you so much, really appreciate your help.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#9

27 Oct 2023, 11:20

OK. In the earlier posts in this thread, you used fincont as your failure variable. That variable does not appear in your current example data. I see that it has a variable termcont. For the purposes of writing code, I will assume that termcont is the failure variable: it takes on 0/1 values and term and fin are both Latinate roots meaning end. In any case, substitute the actual failure indicator for termcont in this code:

Code:

gen `c(obs_t)' cont_id = _n stset endcontdate, failure(termcont = 1) origin(startcontdate) id(cont_id) scale(365.25) stsplit era, at(`=td(17jun2010)' `=td(11jan2012)')

Comments:
Your unit of analysis here is the contract, not the ID. So I have created an id variable, cont_id, for it. This is the one you should use in your -stset- command because you want -stsplit- to split up the single observations on each contract into the observations for the eras before, between, and after policy changes.

In fact, until your most recent post, I wasn't aware that the same people could appear with multiple contracts in your data. That adds another level of complexity to your analysis. If you are going to use -stcox-, you should specify -shared(id)- so that Stata can account for repeated observations within id (yes, id, not cont_id). If you are going to use a parametric survival analysis model, use -mestreg-, not -streg- for this.

There is nothing wrong, from Stata's perspective, with using 18430 and 19003 in the -at()- option of -stsplit-. But the code also needs to communicate to humans, and magic numbers interfere with that. So better to make Stata do a little extra work evaluating the internal format date equivalents of 17 Jun 2010 and 11 Jan 2012, leaving more understandable code.

Given that your time variables are date variables, whose unit of measurement is days, but the contracts have pretty long durations, I have added the -scale(365.25)- option to the -stset- command so that results will be denominated in years rather than days. Again, from a computations perspective this changes nothing, but it makes the results easier for humans to grasp. Few people have a handle on the meaning of 1927 days, but most are very comfortable with the equivalent 5.28 years..
Comment
Eduard López

Join Date: Dec 2014

Posts: 48
#10

28 Oct 2023, 12:36

Professor Schechter, firs of all I want to thank you for taking your time to help me. I really appreciate it.

OK. In the earlier posts in this thread, you used fincont as your failure variable. That variable does not appear in your current example data. I see that it has a variable termcont. For the purposes of writing code, I will assume that termcont is the failure variable: it takes on 0/1 values and term and fin are both Latinate roots meaning end. In any case, substitute the actual failure indicator for termcont in this code:

You are right. I am sorry, I tried to translate the names of the variables but I should have explained it better. "fincont" (that is, "fin de contrato", or "end of contract" meaning that the contract does end, as opposed to those that were still ongoing when data was gathered and thus are right-censored) I named "termcont" for "termination of contract" in order to not confuse it with "endcontdate", which is the date at which the contract ended. All contracts have a "fincont"/"termcont" value of 0 if right censored or 1 if they did end before data was gathered. All contracts have a time value for "endcontdate", but those which were right-censored, the database assigned them a YYYYMMDD value of 20990101. Incidentally, "fincont"/"termcont" was a variable I created myself based on "endcontdate" by simply assigning a value of 0 to it if "endcontdate" had a the forementioned value of 20990101.

Your unit of analysis here is the contract, not the ID. So I have created an id variable, cont_id, for it. This is the one you should use in your -stset- command because you want -stsplit- to split up the single observations on each contract into the observations for the eras before, between, and after policy changes.

If I understand correctly, "my" id variable is an id of the worker in order to know who each job contrat belongs to, but we also need to "identify" each contract so that -stset- can work properly. Makes absolute sense.

In fact, until your most recent post, I wasn't aware that the same people could appear with multiple contracts in your data. That adds another level of complexity to your analysis. If you are going to use -stcox-, you should specify -shared(id)- so that Stata can account for repeated observations within id (yes, id, not cont_id). If you are going to use a parametric survival analysis model, use -mestreg-, not -streg- for this.

I was indeed going to use -stcox-, so I'll add that option (see below).

There is nothing wrong, from Stata's perspective, with using 18430 and 19003 in the -at()- option of -stsplit-. But the code also needs to communicate to humans, and magic numbers interfere with that. So better to make Stata do a little extra work evaluating the internal format date equivalents of 17 Jun 2010 and 11 Jan 2012, leaving more understandable code.

Given that your time variables are date variables, whose unit of measurement is days, but the contracts have pretty long durations, I have added the -scale(365.25)- option to the -stset- command so that results will be denominated in years rather than days. Again, from a computations perspective this changes nothing, but it makes the results easier for humans to grasp. Few people have a handle on the meaning of 1927 days, but most are very comfortable with the equivalent 5.28 years..

Thank you, this will make results more intuitive to grasp and represent.

I have tried running the following syntax based on your suggestion (I am here using the "translated" variable names so that it's easier to follow):

Code:

gen `c(obs_t)' cont_id = _n stset endcontdate, failure(termcont==1) origin(startcontdate) id(cont_id) scale(365.25) stsplit era, at(`=td(17jun2010)' `=td(11jan2012)') sts test (era), logrank sts graph, failure by(era) tmax(1460) stcox i.era, shared(id)

Unfortunately, I'm surely making a mistake or missing something, because after the -stsplit- command I get the following message:

Code:

(no new episodes generated)

Subsequent commands yield no useful results then. -sts test- shows that "era" variable has only a single category, that of value 0.

Code:

Failure _d: fincont==1 Analysis time _t: (F_BAJA_date-origin)/365.25 Origin: time F_ALTA_date ID variable: cont_id Equality of survivor functions Log-rank test | Observed Expected era | events events ------+------------------------- 0 | 9729812 9729812.00 ------+------------------------- Total | 9729812 9729812.00 chi2(0) = 0.00 Pr>chi2 = .

-sts graph- shows a single survival curve, not stratified by observation eras: https://www.dropbox.com/scl/fi/qfw3f...kqq35gy58&dl=0

-stcox- is even more interesting, as it gives me the following message:

Code:

unable to allocate matrix; You have attempted to create a matrix with too many rows or columns or attempted to fit a model with too many variables. You are using Stata/MP which supports matrices with up to 65534 rows or columns. This is the maximum matrix size. If you are using factor variables and included an interaction that has lots of missing cells, try set emptycells drop to reduce the required matrix size; see help set emptycells. If you are using factor variables, you might have accidentally treated a continuous variable as a categorical, resulting in lots of categories. Use the c. operator on such variables. r(915);

Before I read your suggestion to use -shared(id)-, I ran:

Code:

stcox i.era

Which gave me collinearity.

I tried substituting the date time for days ellapsed, in the following fashion:

Code:

stsplit era, at(18430 19003 )

But to no avail. I'm not sure what I'm missing.
Again, thank you so much for helping me. If you need to clarify anything else or I have missed making anything clear please let me know.

Last edited by Eduard López; 28 Oct 2023, 12:49.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#11

28 Oct 2023, 13:23

OK. Sorry, there were problems both with my code and with the way your data is set up to start.

Code:

gen `c(obs_t)' cont_id = _n local today = td(28oct2023) replace endcontdate = `today' if endcontdate == td(31dec2099) stset endcontdate, failure(termcont==1) origin(startcontdate) id(cont_id) /// exit(time `today') local interval1 = 1 local interval2 = datediff_frac(td(16jun2010), td(11jan2012), "d") gen era1_start_t = datediff_frac(startcontdate, td(16jun2010), "d") stsplit era, at(`interval1' `interval2') after(_t = era1_start_t) by cont_id (era), sort: replace era = _n-1 sts test (era), logrank sts graph, failure by(era) stcox i.era, shared(id)

The bold face type fixes your data, but you will have to modify this fix. You coded endcontdate as 31dec2099 for those contracts that were still in effect. You can't do that: it distorts the survival analysis because Stata will interpret that to mean that the contract really continued until 31dec2099. It does not know that that date is in the future! You have to put in there an actual date, the last date at which the contract was known to still be in effect. So I did that with today's date. But, in fact, your data was collected some time ago. So you have to replace my value of local today by the actual date on which your data collection ended: that is the actual last date at which the contract was known to still be in effect.

The rest is revision to my code. There are enough changes from before that rather than highlighting them, I suggest you just discard the earlier code and use this. For simplicity of coding, I have eliminated the rescaling from days to years. If you would like to restore the scaling to years, that can be done as follows:

Code:

gen `c(obs_t)' cont_id = _n local today = td(28oct2023) replace endcontdate = `today' if endcontdate == td(31dec2099) stset endcontdate, failure(termcont==1) origin(startcontdate) id(cont_id) /// exit(time `today') scale(365.25) local interval1 = 1/365.25 local interval2 = datediff_frac(td(16jun2010), td(11jan2012), "y") gen era1_start_t = datediff_frac(startcontdate, td(16jun2010), "y") stsplit era, at(`interval1' `interval2') after(_t = era1_start_t) by cont_id (era), sort: replace era = _n-1 sts test (era), logrank sts graph, failure by(era) stcox i.era, shared(id)

Changes shown in italics.
Comment
Eduard López

Join Date: Dec 2014

Posts: 48
#12

29 Oct 2023, 05:35

Dear professor Schechter, thank you so much. It seems everything is working now. -stsplit- splits the episodes in the three "eras" I'm interested in: before policy 2010, after policy 2010 but before policy 2012 and after policy 2012.

If I understand correctly I don't need to make the "eras" the same length in survival analysis in order to compare them, even though data before 2010 go back to the 70s, after policy 2010 we only have 2 years and after policy 2012 we have 7 years (data was gathered in 2019), right?

Log-rank now shows results: with eras:

Code:

Failure _d: fincont==1 Analysis time _t: (F_BAJA_date-origin)/365.25 Origin: time F_ALTA_date Exit on or before: time 23311 ID variable: cont_id Equality of survivor functions Log-rank test | Observed Expected era | events events ------+------------------------- 0 | 9414950 9430567.31 1 | 253408 217518.14 2 | 61454 81726.55 ------+------------------------- Total | 9729812 9729812.00 chi2(2) = 12371.90 Pr>chi2 = 0.0000

Stratified survival curves now look right: https://www.dropbox.com/scl/fi/3hezx...ki38zf268&dl=0
My question with the curve would be about the time scale. Your syntax for -stset- specified a scale of 365.25, however, do curves show this scale, or what is the meaning of time units in the x axis, days/years? (as it cuts off at 150 but I didn't use the tmax option for -stsgraph-

Now I need to solve two things:
1) When using the shared(id) option in -stcox- I still get the following message:

Code:

unable to allocate matrix; You have attempted to create a matrix with too many rows or columns or attempted to fit a model with too many variables. You are using Stata/MP which supports matrices with up to 65534 rows or columns. This is the maximum matrix size. If you are using factor variables and included an interaction that has lots of missing cells, try set emptycells drop to reduce the required matrix size; see help set emptycells. If you are using factor variables, you might have accidentally treated a continuous variable as a categorical, resulting in lots of categories. Use the c. operator on such variables. r(915);

I have tried searching over Statalist for anyone getting this message with -stcox- but no dice. Reading about the -shared- option has introduced me to the concept of Cox regression with shared frailty, which I didn't know about before. I'll need to read into it and see if there's a workaround for the issue.

2) Is it possible to introduce another time varying "era" for the economic crisis as mentioned in #1, #2, #3 and #4 in this thread? I'm not sure if it is possible so split in "overlapping" eras, since the economic crisis overlaps with the 2010 policy effect era and with part of the 2012 policy effect era and with the "no policy" effect era (it spans 2008-2014 in the country), as per #3. Or am I going to be forced to create multiple eras such as all the permutations of the table in #3?

Thanks a lot professor Schechter, I think I'm learning more in a couple months working no this than I did in many classes back in my Master's. Appreciate your time and effort.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#13

29 Oct 2023, 12:42

If I understand correctly I don't need to make the "eras" the same length in survival analysis in order to compare them, even though data before 2010 go back to the 70s, after policy 2010 we only have 2 years and after policy 2012 we have 7 years (data was gathered in 2019), right?

Correct.

My question with the curve would be about the time scale. Your syntax for -stset- specified a scale of 365.25, however, do curves show this scale, or what is the meaning of time units in the x axis, days/years? (as it cuts off at 150 but I didn't use the tmax option for -stsgraph-

If you are using the version of the code that has -scale(365.25)- then the time on the horizontal axis is in years. If the graph is going out to 150 years, then there is something wrong in your data: some observation(s) with an endcontdate far in the future. That needs to be corrected.

1) When using the shared(id) option in -stcox- I still get the following message:

Code:

unable to allocate matrix; You have attempted to create a matrix with too many rows or columns or attempted to fit a model with too many variables. You are using Stata/MP which supports matrices with up to 65534 rows or columns. This is the maximum matrix size. If you are using factor variables and included an interaction that has lots of missing cells, try set emptycells drop to reduce the required matrix size; see help set emptycells. If you are using factor variables, you might have accidentally treated a continuous variable as a categorical, resulting in lots of categories. Use the c. operator on such variables. r(915);

This is going to be difficult to work around. Your data set is enormous, with a huge number of contracts and persons. There is a limit on the size of matrices that Stata will allocate (and -shared()- requires the creation of a large matrix). You won't be able to do this. The only viable option I can think of is to select a random sample from your data set that has a small enough number of contracts and people that you don't get this message. Alternatively, don't use the -shared()- option, though it means you would be choosing to fit an incorrect model: your standard errors, test statistics, and confidence intervals will be incorrect. But in a data set this large, that isn't necessarily a problem. Almost anything would be "statistically significant" in a data set this large, and the confidence intervals are all going to be extremely narrow anyway. The whole framework of statistical inference kind of falls apart with data sets this large.

2) Is it possible to introduce another time varying "era" for the economic crisis as mentioned in #1, #2, #3 and #4 in this thread? I'm not sure if it is possible so split in "overlapping" eras, since the economic crisis overlaps with the 2010 policy effect era and with part of the 2012 policy effect era and with the "no policy" effect era (it spans 2008-2014 in the country), as per #3. Or am I going to be forced to create multiple eras such as all the permutations of the table in #3?

Though I have never needed to do it myself, it is my understanding that you can just apply -stsplit- again to accommodate the new era variable. That is, you can -stsplit- a data set that has already been -stsplit- for other events. I'd say give it a try.
Comment
Eduard López

Join Date: Dec 2014

Posts: 48
#14

30 Oct 2023, 13:24

If you are using the version of the code that has -scale(365.25)- then the time on the horizontal axis is in years. If the graph is going out to 150 years, then there is something wrong in your data: some observation(s) with an endcontdate far in the future. That needs to be corrected.

Thank you so much Professor Schechter. I took a look at the database after using the syntax. There were indeed some contracts which had wonky start dates which were simply impossible (starting in 1910 or so, when the database didn't exist and wasn't collecting data) or unreasonable (giving me way longer tenures than what is possible under national law; while it could be possible for some contracts to extend over retirement age in some cases, they would have been located in industries where this is possible such as academia etc. and they weren't). I cleaned up those observations and the timescale went back to something reasonable (less than 50 years).

This is going to be difficult to work around. Your data set is enormous, with a huge number of contracts and persons. There is a limit on the size of matrices that Stata will allocate (and -shared()- requires the creation of a large matrix). You won't be able to do this. The only viable option I can think of is to select a random sample from your data set that has a small enough number of contracts and people that you don't get this message. Alternatively, don't use the -shared()- option, though it means you would be choosing to fit an incorrect model: your standard errors, test statistics, and confidence intervals will be incorrect. But in a data set this large, that isn't necessarily a problem. Almost anything would be "statistically significant" in a data set this large, and the confidence intervals are all going to be extremely narrow anyway. The whole framework of statistical inference kind of falls apart with data sets this large.

Being a technical limitation, I feel this issue is less worrying. I'll think about taking either of the two options you suggest.

There is however a bizarre detail I have noticed. When plotting the K-M curve, I noticed that era 2 (starting after 2012, when policy 2 is enforced) has a probability of failure (in this case the event, contract termination) of 0 for approximately a time span of almost 2 years. Looking into the data I noticed that the type of contract distribution is very unbalanced between eras, with era 0 having a high percentage of temporary contracts and eras 1 and 2 having a very low percentage of them. This is the curve: https://www.dropbox.com/scl/fi/ak2cf...6q61wk6ey&dl=0

This is the tabulation of temporary contracts by eras:

Code:

Observation | interval | Freq. Percent Cum. ------------+----------------------------------- 0 | 12,223,355 98.01 98.01 1 | 218,413 1.75 99.76 2 | 30,185 0.24 100.00 ------------+----------------------------------- Total | 12,471,953 100.00

And this is the tabulation of types of contracts in era 2:

Code:

Tipo de | contrato | Freq. Percent Cum. ----------------+----------------------------------- Indefinidos | 244,764 77.73 77.73 Temporales | 30,185 9.59 87.31 No consta | 39,956 12.69 100.00 ----------------+----------------------------------- Total | 314,905 100.00

Now, the database itself shows an overall higher percentage of temporary contracts than indefinite ones (since, although the people who have temporary contracts are fewer than the ones who have indefinite contracts, since we are observing the distribution of contracts, there have been many more shorter temporary contracts over time than indefinite contracts, which, logically, last longer and thus are fewer), so I don't understand why the drastic change in era 1 and 2. While it is true that temporary contracts saw a drop in number after the Great Recession started (it was easier to just let them finish than to fire someone who had an indefinite contract and thus would have had the right to receive the severance package established by law), I don't think that alone can explain the change, especially since the recession started in 2008 so it should have affected era 1 too (era 1 is affected, but not to the extent as era 2).
I'm not sure what's the reason for this, and I'm reviewing the syntax step by step to try to determine what's going on. I will be sure to report back.
Comment
Eduard López

Join Date: Dec 2014

Posts: 48
#15

06 Nov 2023, 09:35

I have been trying (unsuccesfully, I'm afraid) to find out why there is a drop in the number of temporary contracts for the observation interval 2 (-era- variable, created by -stsplit- in the syntax at #11) that I posted in #14.

I ran a different syntax to try to generate a variable for the different "eras" only when a contract finishes during the periods I mentioned (before 2010, between 2010 and 2012, and after 2012, up until 2019 which is the last year data was gathered).

This took the following syntax (it's a bit changed since in my syntax I labelled it and periods had values of 1, 2 and 3, but made them 0, 1 and 2 for easier comparison with the -era- variable):

Code:

gen period=. replace period=0 if F_BAJA_date<= date("20100617","YMD") replace period=1 if F_BAJA_date>= date("20100617","YMD") replace period=2 if F_BAJA_date>= date("20120211","YMD")

-tab period- gives me the following results:

Code:

Periodos de fin de | contrato | Freq. Percent Cum. ----------------------+----------------------------------- 0 | 10,107,879 55.76 55.76 1 | 1,914,566 10.56 66.32 2 | 6,105,059 33.68 100.00 ----------------------+----------------------------------- Total | 18,127,504 100.00

(This is the number of all contracts by period).

Now, I know that that is a far cry from the split in the distribution of episodes that -stsplit- achieves (that is, I know the numbers are not going to be similar). But even if they are not similar numbers, what strikes me is the huge difference between them. Even when trying to fine tune the syntax and make -termcont- (the failure variable) account for all the temporary contract terminations, there is this noticeable difference in the number of temporary contracts between -era- (the variable created with -stsplit-) and -period- (the one created by me with the syntax from above):

Table 1:

Code:

Observation | interval | Freq. Percent Cum. ------------+----------------------------------- 0 | 9,682,303 97.50 97.50 1 | 218,413 2.20 99.70 2 | 30,185 0.30 100.00 ------------+----------------------------------- Total | 9,930,901 100.00

Table 2:

Code:

Periodos de fin de | contrato | Freq. Percent Cum. ----------------------+----------------------------------- 0 | 4,347,409 43.78 43.78 1 | 983,723 9.91 53.68 2 | 4,599,769 46.32 100.00 ----------------------+----------------------------------- Total | 9,930,901 100.00

(This is the number of temporary contracts by period).

The smaller number of temporary contracts from Table 2 in -period- 1 makes sense. This is period 2010-2012, so the number of temporary contracts which finished during those years is smaller than before or after because we are only taking into account two years (plus, incidentally, I posit that most temporary contracts were the first to be terminated when the recession started in 2008). However Table 1 makes no sense to me: why are there so few temporary contracts, especially in -era- 2?

The K-M curve also keeps showing the weird "gap" for observation interval 2, with an event probability of 0 for the first few years: https://www.dropbox.com/scl/fi/lvtav...dk2x86cse&dl=0

The syntax I'm using, based on #11 is:

Code:

gen `c(obs_t)' cont_id = _n local today = td(01dec2019) replace endcontdate = `today' if endcontdate == td(31dec2099) stset endcontdate, failure(termcont==1) origin(startcontdate) id(cont_id) exit(time `today') scale(365.25) local interval1 = 1/365.25 local interval2 = datediff_frac(td(16jun2010), td(11jan2012), "y") gen era1_start_t = datediff_frac(startcontdate, td(16jun2010), "y") stsplit era, at(`interval1' `interval2') after(_t = era1_start_t) by cont_id (era), sort: replace era = _n-1 tab era if tipcont==2 sts test (era), logrank sts graph, failure by(era) stcox i.era, shared(id)

Perhaps what I'm struggling to understand are these specific lines of the code:

Code:

local interval1 = 1/365.25 local interval2 = datediff_frac(td(16jun2010), td(11jan2012), "y") gen era1_start_t = datediff_frac(startcontdate, td(16jun2010), "y") stsplit era, at(`interval1' `interval2') after(_t = era1_start_t) by cont_id (era), sort: replace era = _n-1

Could I get an explanation of what they're telling Stata to do? I'm sure I'm missing something that's in front of me. I know -local- is creating macros and -stsplit- is the command to create the split in episodes for the time varying covariates, but I'm missing the details of why they are taking that form.

Thanks a lot for your help.

Last edited by Eduard López; 06 Nov 2023, 09:39.
Comment

Period variable	Value before 2008	Value 2008-2010	Value 2010-2012	Value 2012-2014	Value After 2014
2010 policy	0	0	1	0	0
2012 policy	0	0	0	1	1
Economic crisis	0	1	1	1	0

Announcement