
  • #16
    In any survival analysis data set there are two ways of looking at time. One is time as we measure it on a calendar or watch. Let's call that calendar time. And the other is analysis time. Analysis time is calculated in -stset- and stored in variable _t. For each person, _t = 0 at the time when the person first comes under observation in the study. This time is specified in -stset-'s -origin()- option. It then runs at a rate determined by -stset-'s -scale()- option.

    So, in your situation, calendar time has different starting points for each contract, and it runs in units of days. Analysis time starts at 0 on the start date of each contract and runs in units of years. The code for which you are asking explanations is used to enable us to tell -stsplit- what to do using analysis time, which is what -stsplit- understands most easily.
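
    As a sketch of that setup, using the translated variable names from later in this thread (startcontdate, endcontdate, termcont) -- the exact options here are illustrative, not necessarily your actual command:
    Code:
    stset endcontdate, failure(termcont==1) origin(startcontdate) scale(365.25)
    With this, _t = 0 at each contract's startcontdate, and _t runs in years because of scale(365.25).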

    Now, the -stsplit- command looks like this: -stsplit era, at(`interval1' `interval2') after(_t = era1_start_t)-. So we have to understand what interval1, interval2, and era1_start_t are. Let's work through it from right to left.

    -gen era1_start_t = datediff_frac(startcontdate, td(16jun2010), "y")- creates a new variable, era1_start_t, defined as the time interval in years from the start date of the contract until 16jun2010--the latter being the day before the start of era 1. (The reason why I didn't use 17jun2010, the actual start date, will become clear shortly.) In other words, it is the value of _t, for the particular contract, on 16jun2010.

    -local interval2 = datediff_frac(td(16jun2010), td(11jan2012), "y")- defines a local macro containing the time interval in years from 16jun2010 to 11jan2012, the latter being the start date of era 2.

    -local interval1 = 1/365.25- defines local macro interval1 to be 1/365.25, which is the duration of 1 day measured in years.

    So when we run -stsplit era, at(`interval1' `interval2') after(_t = era1_start_t)-, we are asking Stata to split each observation into eras, the first of which ends 1 day after 16jun2010, i.e. on 17jun2010. It might have been clearer to set `interval1' to 0 and base era1_start_t on 17jun2010 itself--but unfortunately, the -at()- option requires positive numbers.

    The next era begins `interval2' years after era1_start_t. Since interval2 is defined as the number of years between 16jun2010 and 11jan2012, and era1_start_t is the number of years from the contract start date to 16jun2010, this means that the next era begins on 11jan2012.

    So the net effect of this has been to provide -stsplit- with the times at which to split the observations in the analysis time metric.
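
    Assembled in order, then, the whole splitting sequence discussed above is:
    Code:
    gen era1_start_t = datediff_frac(startcontdate, td(16jun2010), "y")
    local interval1 = 1/365.25
    local interval2 = datediff_frac(td(16jun2010), td(11jan2012), "y")
    stsplit era, at(`interval1' `interval2') after(_t = era1_start_t)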

    Finally we have -by cont_id (era), sort: replace era = _n-1-. -stsplit- defines the variable era to contain the analysis-time values at which it created the new eras. But what you need for your analysis is not a variable whose values are counts and fractions of years, but a discrete 0, 1, 2 variable, and this command makes that change: the observations are sorted in chronological order within cont_id, and then the first has era replaced by 0, the second by 1, and the third by 2.

    I must say I am puzzled by the results you are getting, and I do not understand where things are going wrong.




    • #17
      Thank you for the explanation, Professor Schechter. It makes perfect sense.
      I have been trying to retrace every step in hopes of finding out where things go wrong. This is very tentative, as I haven't been able to devote the necessary time, but I found out that after:
      Code:
      by cont_id (era), sort: replace era = _n-1
      


      There is (what looks like?) a change in the distribution of contracts.

      Before that command, these are the results of -tab era-:
      Code:
      Observation |
         interval |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                0 | 14,267,438       56.90       56.90
         .0027379 |  2,193,372        8.75       65.64
         1.571038 |  8,615,309       34.36      100.00
      ------------+-----------------------------------
            Total | 25,076,119      100.00
      It does, at first sight, look "closer" to the distribution that would be expected. A -tab era if tipcont==2- (that is, a tabulation of -era- only for temporary contracts) yields the following:

      Code:
      Observation |
         interval |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                0 |  4,461,029       45.18       45.18
         .0027379 |    836,218        8.47       53.65
         1.571038 |  4,577,181       46.35      100.00
      ------------+-----------------------------------
            Total |  9,874,428      100.00
      Which is also in the same vein.


      However, after the -replace- command, it looks like this:
      Code:
      Observation |
         interval |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                0 | 31,178,035       95.32       95.32
                1 |  1,110,567        3.40       98.72
                2 |    420,240        1.28      100.00
      ------------+-----------------------------------
            Total | 32,708,842      100.00
      And, as mentioned in #15, a tabulation only for temporary contracts looks like this:
      Code:
      Observation |
         interval |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                0 | 12,223,355       98.01       98.01
                1 |    218,413        1.75       99.76
                2 |     30,185        0.24      100.00
      ------------+-----------------------------------
            Total | 12,471,953      100.00
      I suspect that something happens during the -by cont_id (era), sort: replace era = _n-1- command, so I'm going to need to study it in detail.



      • #18
        Yes, that is where things are getting messed up. I see it now! The problem arises with contracts that begin on or after 17jun2010. They end in era 2 (as we intend), but they are never part of era 0 or era 1. So after -stsplit- they have only one observation, as the split points never fall in the interval during which the contract runs. Consequently, when we "recode" era in that command, it gets recoded to 0, which is very, very wrong. Basically my code is only correct for contracts that extend over all three eras.

        So, we can fix this. It is tempting to replace that one line with -replace era = 1 if era == .0027379- and -replace era = 2 if era == 1.571038-, based on the values of era that -stsplit- provides. But this can easily go wrong, as exact floating-point comparison can be subverted by precision errors. So, to be bullet-proof, and also to write code that won't have to be modified if other changes are made, replace that bad line of code with:
        Code:
        frame put era, into(eras)
        frame eras {
            duplicates drop
            isid era, sort
            gen int era2 = _n-1
        }
        frlink m:1 era, frame(eras)
        replace era = frval(eras, era2)
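
        The floating-point pitfall itself is easy to demonstrate; this is just an illustrative sketch (the variable x is hypothetical), not part of the fix:
        Code:
        gen float x = .0027379
        display x[1] == .0027379          // typically 0: float storage vs. double literal
        display x[1] == float(.0027379)   // 1: compare at float precision instead
        The frames approach above sidesteps this entirely, because -frlink- matches on the stored values of era themselves rather than on retyped literals.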
        I apologize for the earlier incorrect code and the extra hours it caused you to put into this, as well as perhaps extra anxiety.



        • #19
          Please Professor Schechter, don't apologize at all. You have helped me out of good will and generosity, and that is something for which I'm immensely grateful. If anything, spending some time trying to find out where the issue was has helped me to better understand the whole process and to sharpen my Stata skills a bit (I didn't know about macros before reading your code, for instance!).

          I have tried running the code you provided in #18 instead of the one I mentioned in #17, but unfortunately I get some error messages.
          After -isid era, sort- I get the following error message:
          Code:
          variable era should never be missing
          Then, after -replace era = frval(eras, era2)- I get the following message:
          Code:
          variable era2 not found
          The logrank test and K-M curves run without errors (although the K-M curves are stratified by the -era- variable, which retains its 0, .0027379, and 1.571038 values instead of having been recoded to 0, 1, and 2), and indeed the K-M curves no longer show the weird "gap" at the initial duration of one of them as they did before. -stcox-, however, throws another error message, which I guess derives from the unsuccessful recoding of -era-, as follows:
          Code:
          era:  factor variables may not contain noninteger values

          Would there be any way to fix the -isid- and -replace- lines of code?
          Again, I am really grateful for your time and effort.


          PS: In case it helps to see the whole log, it looks like this (some variable names might look different, since I translated them when I first posted in this thread: F_ALTA_date I translated as startcontdate, F_BAJA_date as endcontdate, and fincont as termcont).

          Code:
          . frame put era, into(eras)
          
          .
          . frame eras {
          .
          .     duplicates drop
          
          Duplicates in terms of all variables
          
          (18,127,500 observations deleted)
          .
          .     isid era, sort
          variable era should never be missing
          r(459);
          .
          .     gen int era2 = _n-1
          .
          . }
          r(459);
          
          .
          . frlink m:1 era, frame(eras)
            (all observations in frame default matched)
          
          .
          . replace era = frval(eras, era2)
          variable era2 not found
          r(111);
          
          . sts test (era), logrank
          
                   Failure _d: fincont==1
             Analysis time _t: (F_BAJA_date-origin)/365.25
                       Origin: time F_ALTA_date
            Exit on or before: time 21884
                  ID variable: cont_id
          
          Equality of survivor functions
          Log-rank test
          
                   |  Observed       Expected
          era      |    events         events
          ---------+-------------------------
                 0 |   9145416     9875731.38
          .0027379 |    840436      706815.28
          1.571038 |   4660039     4063344.34
          ---------+-------------------------
             Total |  14645891       1.46e+07
          
                          chi2(2) = 170563.55
                          Pr>chi2 =    0.0000
          
          .
          . sts graph, failure by(era)
          
                   Failure _d: fincont==1
             Analysis time _t: (F_BAJA_date-origin)/365.25
                       Origin: time F_ALTA_date
            Exit on or before: time 21884
                  ID variable: cont_id
          
          .
          .
          .
          . stcox i.era
          
                   Failure _d: fincont==1
             Analysis time _t: (F_BAJA_date-origin)/365.25
                       Origin: time F_ALTA_date
            Exit on or before: time 21884
                  ID variable: cont_id
          era:  factor variables may not contain noninteger values
          r(452);
          Last edited by Eduard López; 07 Nov 2023, 09:10.



          • #20
            So, the whole thing here begins with the error message warning that era should never be missing. And, indeed, it shouldn't. This means that something went wrong with -stsplit-, which creates that variable. Everything that you see later in the code is just a consequence of the fact that era had a missing value somewhere. That led to it not being recoded, which led to it not being updated in the default frame, which led to it having non-integer values (because the original values were not integers).

            So the real question is why there was a missing value of era after -stsplit- ran. The example data you have posted in this thread does not produce this problem. Please post back with a -dataex- example of data that does cause this problem so I can try to troubleshoot it. To identify some data that will do that, you can re-run your code, and after it breaks at the point where it says era should never be missing, do this:
            Code:
            frame eras: list id cont_id if missing(era)
            That will give you a list of offending contracts. So you can select a data example that includes several of those (or all of them if they are not very numerous).

            By the way, as a general rule, when Stata stops with an error message, you should not then try to run the rest of the code. Those error messages are there because Stata has tried to verify some condition that is necessary for the subsequent code to run correctly (or even run at all). So if you try to run the rest of the code, the best you can hope for is wrong answers that look plausible; more often you just get a cascade of more error messages (as here).

            Sorry, it should just be:
            Code:
            list id cont_id if missing(era)
            Run this immediately after -stsplit-.
            Last edited by Clyde Schechter; 07 Nov 2023, 09:55.



            • #21
              I ran the -list id cont_id if missing(era)- command right after -stsplit- and got a large list of observations. I'm pasting just a fraction of them (if it's too few, let me know and I will add more). I apologize for not using -dataex-; I simply copy-pasted, as I'm not sure how to use -dataex- for this. The only way I've used it is to produce data examples of a list of variables (I don't know how to use it together with -list-).

              Code:
              18125679. |  765915    6452556 |
              18125680. | 3715233   16345565 |
              18125681. | 3709137   16336926 |
              18125682. | 3705755   16328739 |
                        |--------------------|
              18125683. | 3703150   16325317 |
              18125684. | 2905957   14741828 |
              18125685. | 1282471   11013734 |
              18125686. | 1007691    8491254 |
              18125687. | 1336795   11506501 |
                        |--------------------|
              18125688. | 3701625   16318634 |
              18125689. | 3437221   15816251 |
              18125690. | 3700005   16314446 |
              18125691. | 3690566   16299419 |
              18125692. | 3690558   16299369 |
                        |--------------------|
              18125693. | 4244112   16993439 |
              18125694. | 3684372   16288167 |
              18125695. | 3680176   16285363 |
              18125696. | 2862796   14624388 |
              18125697. |  104089     921491 |
                        |--------------------|
              18125698. | 3678761   16277761 |
              18125699. | 3555243   16045763 |
              18125700. | 3755995   16429865 |
              18125701. | 3662139   16241797 |
              18125702. | 3659287   16235989 |
                        |--------------------|
              18125703. | 3659191   16235317 |
              18125704. | 3635846   16224292 |
              18125705. | 1642152   13205330 |
              18125706. | 3614445   16174906 |
              18125707. | 3607804   16158815 |
                        |--------------------|
              18125708. | 3962355   16718700 |
              18125709. | 1074476    9072249 |
              18125710. | 3831862   16537554 |
              18125711. | 3577934   16093546 |
              18125712. | 3574658   16086423 |
                        |--------------------|
              18125713. |    2806      31241 |
              18125714. | 2799739   14444098 |
              18125715. | 3570421   16077669 |
              18125716. | 3265396   15536498 |
              18125717. | 3564123   16066389 |
                        |--------------------|
              18125718. | 3556309   16053590 |
              18125719. | 3730524   16374669 |
              18125720. | 1044556    8807825 |
              18125721. | 3545318   16023509 |
              18125722. | 3542733   16016773 |
                        |--------------------|
              18125723. | 3585728   16113588 |
              18125724. |  312145    2517360 |
              18125725. | 4212568   16970729 |
              18125726. | 3505389   15973290 |
              18125727. | 3499129   15960351 |
                        |--------------------|
              18125728. | 3494907   15951656 |
              18125729. | 2993739   14934123 |
              18125730. | 3486239   15928280 |
              18125731. | 1168098   10081539 |
              18125732. | 4274621   17017772 |
                        |--------------------|
              18125733. | 3864679   16594308 |
              18125734. | 3465593   15881762 |
              18125735. |   56298     472348 |
              18125736. | 3766490   16454250 |
              18125737. | 3462660   15875231 |
                        |--------------------|
              18125738. |  528620    4500363 |
              18125739. | 3460189   15870351 |
              18125740. | 3459510   15867270 |
              18125741. | 3458774   15864228 |
              18125742. |  530586    4511194 |
                        |--------------------|
              18125743. | 3458428   15862895 |
              18125744. | 3457673   15860817 |
              18125745. | 3428779   15798517 |
              18125746. | 3425393   15790162 |
              18125747. | 3386114   15764537 |
                        |--------------------|
              18125748. | 3251822   15499531 |
              18125749. | 3378444   15749029 |
              18125750. | 3375566   15744166 |
              18125751. | 1007853    8492923 |
              18125752. | 3362963   15717129 |
                        |--------------------|
              18125753. | 3359155   15707606 |
              18125754. | 3626179   16199191 |
              18125755. | 3346693   15680551 |
              18125756. |  530249    4505975 |
              18125757. | 3316229   15615803 |
                        |--------------------|
              18125758. | 4087974   16859700 |
              18125759. |  588936    5127382 |
              18125760. | 2952666   14821792 |
              18125761. | 3458383   15862792 |
              18125762. | 2636991   14029621 |
                        |--------------------|
              18125763. | 4214129   16972133 |
              18125764. |  518762    4373121 |
              18125765. | 3264835   15535541 |
              18125766. | 3261760   15528120 |
              18125767. | 2615248   13947697 |
                        |--------------------|
              18125768. | 3144349   15279181 |
              18125769. | 1487155   12534653 |
              18125770. |  750521    6272654 |
              18125771. | 1254929   10798027 |
              18125772. | 3023397   15011563 |
                        |--------------------|
              18125773. |  611209    5391241 |
              18125774. | 1133174    9777717 |
              18125775. | 3226246   15440711 |
              18125776. | 2529554   13731304 |
              18125777. |  421219    3524695 |
                        |--------------------|
              18125778. | 3195810   15372323 |
              18125779. | 3894966   16649945 |
              18125780. | 3195262   15370141 |
              18125781. | 1513927   12665443 |
              18125782. | 3169863   15342750 |
                        |--------------------|
              18125783. | 3157678   15311145 |
              18125784. | 1539449   12760391 |
              18125785. |  693021    5789309 |
              18125786. | 3126284   15234737 |
              18125787. | 3119452   15222999 |
              Thanks a lot.



              • #22
                No, sorry if I wasn't clear. The entire list of id's and cont_id's after -stsplit- is not useful. You need to run:
                Code:
                list id cont_id if missing(era)
                immediately after -stsplit-.

                Then you need to create a new -dataex-, with all of the variables used in your previous -dataex-'s, but making sure that the sample includes several of the id cont_id observations that showed up in the list. Remember that, like many Stata commands, -dataex- allows -if- and -in- conditions, so you can specifically tell -dataex- which observations you want it to have in the example.

                To be completely clear, the id's and cont_id's by themselves are not helpful. They are just a step towards finding the right observations to include in the new -dataex- that will reproduce the problem of missing values for the variable era so that I can try to troubleshoot that problem. It is that problem we must fix, and it does not appear in any of the example data that you have posted up until now.



                • #23
                  So, if I understand correctly, I need to use -stsplit-, then

                  Code:
                  list id cont_id if missing(era)
                  And then -dataex- with all my covariates of interest including an if condition for missing era? As in, for example:
                  Code:
                   dataex startcontdate endcontdate termcont cont_id id tipcont if era==., count 10



                  • #24
                    That's pretty much it but not quite. Here's a simpler way. Forget about listing the id's and cont_id's. Please run the following code immediately after -stsplit-

                    Code:
                    by id cont_id, sort: egen to_view = max(missing(era))
                    dataex startcontdate endcontdate termcont cont_id id tipcont _* era if to_view



                    • #25




                      Got it. Here's the result:
                      (F_ALTA_date is startcontdate, F_BAJA_date is endcontdate, fincont is termcont).


                      Code:
                      * Example generated by -dataex-. For more info, type help dataex
                      clear
                      input float(F_ALTA_date F_BAJA_date fincont) long(cont_id id) float tipcont byte(_merge _st _d) int _origin double(_t _t0) float era
                      21889 21920 1   294   71 2 3 0 . 21889 . . .
                      21899 21884 .  2534  444 2 3 0 . 21899 . . .
                      21889 21921 1  3790  591 2 3 0 . 21889 . . .
                      21900 21902 1  5575  733 2 3 0 . 21900 . . .
                      21908 21934 1  5576  733 2 3 0 . 21908 . . .
                      21884 21945 1  6532  785 2 3 0 . 21884 . . .
                      21885 21914 1  6886  804 2 3 0 . 21885 . . .
                      21885 21887 1  6911  806 2 3 0 . 21885 . . .
                      21903 21884 .  7515  859 2 3 0 . 21903 . . .
                      21888 21906 1  8578  930 2 3 0 . 21888 . . .
                      21892 21898 .  9095  976 1 3 0 . 21892 . . .
                      21885 21929 1  9208  977 2 3 0 . 21885 . . .
                      21889 21891 1 10488 1103 2 3 0 . 21889 . . .
                      21910 21912 1 10489 1103 2 3 0 . 21910 . . .
                      21893 21897 1 11249 1149 2 3 0 . 21893 . . .
                      21907 21916 1 11250 1149 2 3 0 . 21907 . . .
                      21884 21884 . 12009 1207 2 3 0 . 21884 . . .
                      21892 21912 . 13503 1331 2 3 0 . 21892 . . .
                      21885 21884 . 13674 1339 2 3 0 . 21885 . . .
                      21885 21899 1 13821 1361 2 3 0 . 21885 . . .
                      21900 21913 1 13822 1361 2 3 0 . 21900 . . .
                      21885 21914 1 15793 1566 2 3 0 . 21885 . . .
                      21899 21884 . 16018 1582 2 3 0 . 21899 . . .
                      21892 21905 1 16296 1609 2 3 0 . 21892 . . .
                      21906 21923 1 16297 1609 2 3 0 . 21906 . . .
                      21899 21912 1 16637 1632 2 3 0 . 21899 . . .
                      21913 21919 1 16638 1632 2 3 0 . 21913 . . .
                      21896 21897 1 17700 1729 2 3 0 . 21896 . . .
                      21901 21884 . 19150 1824 3 3 0 . 21901 . . .
                      21910 22011 1 20237 1924 2 3 0 . 21910 . . .
                      21885 21884 . 22148 2116 2 3 0 . 21885 . . .
                      21904 21905 1 23709 2217 2 3 0 . 21904 . . .
                      21907 21914 1 25101 2317 2 3 0 . 21907 . . .
                      21910 21911 1 28305 2576 2 3 0 . 21910 . . .
                      21889 21890 1 28812 2593 2 3 0 . 21889 . . .
                      21897 21898 1 28813 2593 2 3 0 . 21897 . . .
                      21904 21905 1 28814 2593 2 3 0 . 21904 . . .
                      21888 21897 1 30089 2705 2 3 0 . 21888 . . .
                      21885 21887 1 30422 2730 2 3 0 . 21885 . . .
                      21895 21897 1 30423 2730 2 3 0 . 21895 . . .
                      21906 21907 1 30424 2730 2 3 0 . 21906 . . .
                      21885 21886 1 31240 2806 2 3 0 . 21885 . . .
                      21892 21893 1 31241 2806 2 3 0 . 21892 . . .
                      21894 21884 . 31242 2806 2 3 0 . 21894 . . .
                      21897 21898 1 31674 2827 2 3 0 . 21897 . . .
                      21896 21897 1 32810 2934 2 3 0 . 21896 . . .
                      21909 21884 . 33279 2965 2 3 0 . 21909 . . .
                      21902 21917 1 33285 2965 2 3 0 . 21902 . . .
                      21914 21884 . 36495 3428 2 3 0 . 21914 . . .
                      21906 21884 . 37234 3524 2 3 0 . 21906 . . .
                      21885 21903 1 40387 4161 2 3 0 . 21885 . . .
                      21885 21920 1 42665 4337 2 3 0 . 21885 . . .
                      21884 21914 1 44509 4398 2 3 0 . 21884 . . .
                      21885 21884 . 44546 4400 2 3 0 . 21885 . . .
                      21895 21884 . 44870 4411 1 3 0 . 21895 . . .
                      21892 21898 . 45060 4413 1 3 0 . 21892 . . .
                      21885 21884 . 46896 4503 3 3 0 . 21885 . . .
                      21894 21909 1 50680 4783 2 3 0 . 21894 . . .
                      21912 21914 1 50681 4783 2 3 0 . 21912 . . .
                      21904 21923 1 50689 4783 3 3 0 . 21904 . . .
                      21888 21893 1 51701 4861 2 3 0 . 21888 . . .
                      21896 21900 1 51702 4861 2 3 0 . 21896 . . .
                      21902 21906 1 51703 4861 2 3 0 . 21902 . . .
                      21911 21914 1 51704 4861 2 3 0 . 21911 . . .
                      21893 21935 . 51766 4866 2 3 0 . 21893 . . .
                      21886 21884 . 52487 4936 3 3 0 . 21886 . . .
                      21907 21917 1 52931 4969 2 3 0 . 21907 . . .
                      21906 21910 1 53259 4986 2 3 0 . 21906 . . .
                      21913 21915 1 53260 4986 2 3 0 . 21913 . . .
                      21899 21901 1 53455 5001 2 3 0 . 21899 . . .
                      21896 21897 1 53550 5003 2 3 0 . 21896 . . .
                      21903 21904 1 53551 5003 2 3 0 . 21903 . . .
                      21906 21907 1 53552 5003 2 3 0 . 21906 . . .
                      21910 21914 1 53553 5003 2 3 0 . 21910 . . .
                      21892 21894 1 53564 5003 2 3 0 . 21892 . . .
                      21884 21887 1 54081 5027 2 3 0 . 21884 . . .
                      21893 21898 1 54091 5027 2 3 0 . 21893 . . .
                      21900 21901 1 54092 5027 2 3 0 . 21900 . . .
                      21899 21907 1 54207 5038 2 3 0 . 21899 . . .
                      21889 21891 1 56330 5199 2 3 0 . 21889 . . .
                      21897 21898 1 56331 5199 2 3 0 . 21897 . . .
                      21904 21905 1 56332 5199 2 3 0 . 21904 . . .
                      21885 21914 1 56649 5211 2 3 0 . 21885 . . .
                      21889 21884 . 57429 5263 2 3 0 . 21889 . . .
                      21900 21884 . 60009 5748 1 3 0 . 21900 . . .
                      21904 21914 1 60029 5750 2 3 0 . 21904 . . .
                      21900 21903 1 61135 5828 2 3 0 . 21900 . . .
                      21909 21914 1 61136 5828 2 3 0 . 21909 . . .
                      21907 21911 1 61310 5836 2 3 0 . 21907 . . .
                      21905 21915 1 61980 5880 2 3 0 . 21905 . . .
                      21899 21884 . 61986 5880 2 3 0 . 21899 . . .
                      21887 21945 1 62397 5898 2 3 0 . 21887 . . .
                      21885 21898 1 64141 6016 2 3 0 . 21885 . . .
                      21906 21920 1 64142 6016 2 3 0 . 21906 . . .
                      21885 21884 . 64171 6019 2 3 0 . 21885 . . .
                      21892 21897 1 65621 6131 2 3 0 . 21892 . . .
                      21897 21933 1 65990 6148 2 3 0 . 21897 . . .
                      21884 21945 1 66030 6149 2 3 0 . 21884 . . .
                      21888 21982 1 66847 6188 2 3 0 . 21888 . . .
                      21907 21915 1 68631 6279 2 3 0 . 21907 . . .
                      end
                      format %td F_ALTA_date
                      format %td F_BAJA_date
                      label values tipcont tipcontlabel
                      label def tipcontlabel 1 "Indefinidos", modify
                      label def tipcontlabel 2 "Temporales", modify
                      label def tipcontlabel 3 "No consta", modify
                      label values _merge _merge
                      label def _merge 3 "Matched (3)", modify






                      • #26
                        Well, in one sense there is a simple fix for this. But I fear that what I'm seeing suggests a still deeper and more puzzling problem further back in the code.

                        The observations that are causing the problem (missing value for era) are all observations with _st == 0, which means that when the data was -stset-, Stata decided that these observations were invalid and cannot be included in the survival analysis. I assume that your -stset- command looks like
                        Code:
                        stset F_BAJA_date, failure(fincont==1) origin(F_ALTA_date) scale(365.25)
                        Please show me the output from the Results window that Stata gave you with the -stset- command: I want to try to understand why Stata rejected these observations. (For a subset of them I can see the reason: F_ALTA_date > F_BAJA_date is not allowable--you can't have a contract that expires before it begins. But the rest look OK to me. So I want to see what Stata said about them.)

                        So I want to pursue that, because the exclusion of (most of) these observations seems to be inappropriate and I want to get to the root of it.

                        However, with that said, the missingness of era follows naturally from their exclusion by -stset-. I just didn't realize when I wrote the code that there would be excluded observations. So the fix for that is simple:
                        Code:
                        frame put era if _st == 1, into(eras)
                        frame eras {
                            duplicates drop
                            isid era, sort
                            gen int era2 = _n-1
                        }
                        frlink m:1 era, frame(eras)
                        replace era = frval(eras, era2)
                        This will prevent excluded observations from contributing to the list of era values, which in turn will eliminate the missing values, which will allow -isid era, sort- to run without error, which in turn will lead to the creation of era2, which in turn will lead to era being correctly replaced, which will eliminate the non-integer values of era and allow you to use i.era in your analyses.

But, as I said further up, it isn't at all clear to me why some of these observations were excluded by -stset-, and I really want to see the Stata output from your -stset- command, because I think Stata is telling us there is something seriously wrong in the data, but I cannot see what it is. I have some other questions related to this. I notice that your -dataex- output shows a variable _merge, which I assume came as a result of some -merge- or -joinby- command. Did that command take place before or after -stset-? And, if after, what variables were brought into the data set by that command?

                        Comment


                        • #27
                          The observations that are causing the problem (missing value for era) are all observations with _st == 0, which means that when the data was -stset-, Stata decided that these observations were invalid and cannot be included in the survival analysis. I assume that your -stset- command looks like
Code:
stset F_BAJA_date, failure(fincont==1) origin(F_ALTA_date) scale(365.25)
                          Please show me the output from the Results window that Stata gave you with the -stset- command: I want to try to understand why Stata rejected these observations. (For a subset of them I can see the reason: F_ALTA_date > F_BAJA_date is not allowable--you can't have a contract that expires before it begins. But the rest look OK to me. So I want to see what Stata said about them.)
                          Yes, very close, my -stset- command is as follows:
                          Code:
                          stset F_BAJA_date, failure(fincont==1) origin(F_ALTA_date) id(cont_id) exit(time `today') scale(365.25)
                          This includes the -id- option with the variable -cont_id- which you helped me create. It also includes the -exit- option with the macro of today which I set at -td(01dec2019)-, for the last day of data collection for the database.


                          Here's the output of -stset-:
                          Code:
                          Survival-time data settings
                          
                                     ID variable: cont_id
                                   Failure event: fincont==1
                          Observed time interval: (F_BAJA_date[_n-1], F_BAJA_date]
                               Exit on or before: time 21884
                               Time for analysis: (time-origin)/365.25
                                          Origin: time F_ALTA_date
                          
                          --------------------------------------------------------------------------
                             17038233  total observations
                               11,216  observations end on or before enter()
                               53,111  observations begin on or after exit
                          --------------------------------------------------------------------------
                             16973906  observations remaining, representing
                             16973906  subjects
                             14645891  failures in single-failure-per-subject data
                           16754529.5  total analysis time at risk and under observation
                                                                          At risk from t =         0
                                                               Earliest observed entry t =         0
                                                                    Last observed exit t =  49.66735
But, as I said further up, it isn't at all clear to me why some of these observations were excluded by -stset- and I really want to see what the Stata output from your -stset- command was because I think Stata is telling us there is something seriously wrong in the data, but I cannot see what it is. I have some other questions related to this. I notice that your -dataex- output shows a variable _merge, which I assume came as a result of some -merge- or -joinby- command. Did that command take place before or after -stset-?
The database is indeed the result of merging and appending several smaller files. The original database is modular, with different files containing different groups of observations and sets of variables. So you have the personal data in the "personal data module" (which is a single file), but the data of the "job contract data module" is spread across several files, with each file (job1.dta, job2.dta, etc.) containing a number of observations. In order to work with all the "job contract data" you need to merge each of the "job contract data module" files with the "personal data module" file (so you first merge "job1.dta" with "person.dta", then you merge "job2.dta" with "person.dta", etc.). Then each of those merged files needs to be appended together. If you also wanted to work with pensions or fiscal information, you would likewise need to merge the other files from the "social security contribution", "pensions" and "fiscal information" modules. In this, I'm following a (modified version of the) syntax that another researcher who has worked with the same database kindly shared with me.
All observations are linked through a variable (-v1- in the original file, recoded to -id- in my syntax for easier identification) with a 15-digit value that identifies the person to whom each observation (a job contract, a social security contribution, etc.) belongs.
                          I'm only working with the personal data module and the job contract data module. In order to do so I first had to merge and append the files as described above.
                          So I used:
                          Code:
                          use "filelocation of job1.dta", clear
                          sort id
                          merge m:1 id using "file location of person.dta"
                          tab _merge
                          drop if _merge==1
                          save, replace
for each "job contract data" file, merging each one with the "personal data module" file. Then I used:
                          Code:
use "file location of job1.dta", clear
append using "file location of job2.dta"
append using "file location of job3.dta"
append using "file location of job4.dta"
save "jobperson.dta"

Hopefully this makes sense; I didn't remember it being this confusing (I worked on this almost a year ago).
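For what it's worth, the whole merge-and-append sequence could be condensed into a loop (just a sketch, assuming the files are named job1.dta through job4.dta, sit alongside person.dta, and are linked on id):
Code:
* sketch only: merge each job file with the person file, then append
clear
tempfile combined
save `combined', emptyok
forvalues i = 1/4 {
    use "job`i'.dta", clear
    merge m:1 id using "person.dta"
    drop if _merge == 1
    drop _merge
    append using `combined'
    save `combined', replace
}
save "jobperson.dta", replace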


                          And, if after, what variables were brought into the data set by that command?
                          I got -_merge- (which was used, as seen above, to drop the unmatched observations). Before the whole merging and appending process I had also used the -flag- variable to locate and remove duplicates (by using -duplicates tag id, gen(flag)-).




                          Last edited by Eduard López; 08 Nov 2023, 10:36.

                          Comment


                          • #28
                            OK. This is great. The change to the code I suggested in #26 should solve the problem you have been having.

                            I have two worries about the data that I suggest you look into before proceeding.
                            • You have observations where F_ALTA_date >= F_BAJA_date. That is, these contracts ostensibly end before or on the same day when they begin. Clearly these are data errors. It is easy enough to find them and omit them, but this may be a symptom of a deeper problem: why does your data have such observations? Did you do something wrong in the data management that created them? Maybe the original dates were in the other order and something you did switched them around? Or perhaps one of the source data sets you used to put your data set together is, itself, erroneous. I would put some serious effort into looking into this, because in the course of figuring this out you may find that there are also other problems with the data, and then you should fix those. The existence of these observations with obviously impossible dates casts doubt on the validity of the rest of the data.
                            • You have contracts that begin after 1 DEC 2019. Since you defined that date as the end of your study, these observations also should not be there. Now, this isn't necessarily a serious problem. I could imagine, for instance, that your intent was to study contracts up to 1 DEC 2019 and no farther for good reasons, but that the data sets you had available for the purpose were created by others (or by you at another time) with a different use in mind, so they contain some contracts that begin later than 1 DEC 2019. In that case, this is just a minor issue. But I would urge you to review the data management with a view to making sure that these contracts that begin after 1 DEC 2019 really are innocently included in the data set and did not arise due to a data management error.
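Both situations are easy to count before re-running -stset- (a sketch; I'm assuming the date variables and the cut-off date from your posts):
Code:
* contracts that ostensibly end before or on the day they begin
count if F_ALTA_date >= F_BAJA_date
* contracts that begin after the intended end of the study
count if F_ALTA_date > td(01dec2019)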
                            Once you clear up those issues, I do believe you will be good to go, at last.

                            Comment


                            • #29
                              Originally posted by Clyde Schechter View Post
                              OK. This is great. The change to the code I suggested in #26 should solve the problem you have been having.

                              I have two worries about the data that I suggest you look into before proceeding.
                              • You have observations where F_ALTA_date >= F_BAJA_date. That is, these contracts ostensibly end before or on the same day when they begin. Clearly these are data errors. It is easy enough to find them and omit them, but this may be a symptom of a deeper problem: why does your data have such observations? Did you do something wrong in the data management that created them? Maybe the original dates were in the other order and something you did switched them around? Or perhaps one of the source data sets you used to put your data set together is, itself, erroneous. I would put some serious effort into looking into this, because in the course of figuring this out you may find that there are also other problems with the data, and then you should fix those. The existence of these observations with obviously impossible dates casts doubt on the validity of the rest of the data.
                              • You have contracts that begin after 1 DEC 2019. Since you defined that date as the end of your study, these observations also should not be there. Now, this isn't necessarily a serious problem. I could imagine, for instance, that your intent was to study contracts up to 1 DEC 2019 and no farther for good reasons, but that the data sets you had available for the purpose were created by others (or by you at another time) with a different use in mind, so they contain some contracts that begin later than 1 DEC 2019. In that case, this is just a minor issue. But I would urge you to review the data management with a view to making sure that these contracts that begin after 1 DEC 2019 really are innocently included in the data set and did not arise due to a data management error.

                              Once you clear up those issues, I do believe you will be good to go, at last.
Thank you professor Schechter. I think I managed to solve these two issues and also to incorporate the Great Recession as a time-dependent covariate using -stsplit- with a variation of the syntax you posted in #9. The two issues you mentioned stem from a simple confusion on my part: I thought the last date of data gathering was 1 dec 2019, but it was actually 31 dec 2019.
Some very short contracts apparently started after 1 dec 2019 and hadn't ended by the time data was collected, so they were assigned the value 31 dec 2099 for -F_BAJA_date-, which in this database is the placeholder for a contract that hadn't finished when the data was gathered. As I was changing that date to what I (mistakenly) thought was the last date of data gathering (1 dec 2019), I ended up with contracts that could, for example, have a value for -F_ALTA_date- of 15 dec 2019 but a value for -F_BAJA_date- of 1 dec 2019. Clearly wrong! This was easily solved: I changed -today- to 1 jan 2020, as no more data was gathered from that date onward.
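For clarity, the fix amounts to something like this (a sketch, not my exact syntax; the dates are as described above):
Code:
* 31dec2099 is the placeholder for contracts still open at data collection;
* recode it to the true last day of data gathering
replace F_BAJA_date = td(31dec2019) if F_BAJA_date == td(31dec2099)
* and set the exit date to the first day with no data
local today = td(01jan2020)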

I truncated the start date of the period studied. I had previously included all job contracts from 1980 onward, but since I was planning to include the Great Recession as a time-dependent covariate, keeping them would have biased the results, given that I wasn't going to include all the previous economic crises the country had experienced since 1980. So I was advised to truncate the data so that the follow-up period starts on 1 jan 2000. I am now working on getting the output of the extended Cox model into a spreadsheet, and then on studying it and writing the results section. At first glance, the categories for the different labour reforms do indeed have a statistically significantly higher HR than the reference category of no labour reform. Both the Great Recession and the period after it ended also have higher HRs than the period before it.
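The truncation itself was simple (again a sketch, assuming contracts are selected by their start date):
Code:
* keep only contracts starting within the new study window
drop if F_ALTA_date < td(01jan2000)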

                              Again, many thanks for all your help, you have been very kind. I will report back when I work a bit more on the conclusions I can get from the data.

                              Comment
