  • Interpreting Output of sts list Command

    Hello Statalist,
    I am having trouble with Kaplan-Meier estimates produced after the stset command. In my study I want to understand rates of first marriage at each age among three kinds of migrants (variable mig_type_19). I observe individuals multiple times throughout the study, and the variable person_id uniquely identifies individuals. Individuals can enter the study at age 15, 16, or 17, so entry is delayed for some of them (left truncation). Individuals leave the study after age 24 or upon getting married.

    I have a number of questions about the output of the sts list command and about whether I am using the stset command correctly. First, when I run the stset command and then sts list by my three migration types, I get the following:

    Code:
    stset age, id(person_id) failure(mar_ind_r)
    
    Survival-time data settings
    
               ID variable: person_id
             Failure event: mar_ind_r!=0 & mar_ind_r<.
    Observed time interval: (age[_n-1], age]
         Exit on or before: failure
    
    --------------------------------------------------------------------------
         16,692  total observations
          1,053  observations begin on or after (first) failure
    --------------------------------------------------------------------------
         15,639  observations remaining, representing
          3,700  subjects
            713  failures in single-failure-per-subject data
         80,222  total analysis time at risk and under observation
                                                    At risk from t =         0
                                         Earliest observed entry t =         0
                                              Last observed exit t =        24
    
    . sts list, f by(mig_type_19)
    
            Failure _d: mar_ind_r
      Analysis time _t: age
           ID variable: person_id
    
    Kaplan–Meier failure function
    By variable: mig_type_19
    
                 At           Net     Failure      Std.
      Time     risk   Fail   lost    function     error     [95% conf. int.]
    ------------------------------------------------------------------------
    Rural Stayer
        15      493      3      1      0.0061    0.0035     0.0020    0.0187
        16      489      3      9      0.0122    0.0049     0.0055    0.0269
        17      477      5     29      0.0225    0.0067     0.0125    0.0403
        18      443     16     33      0.0578    0.0108     0.0400    0.0833
        19      394     22     32      0.1104    0.0149     0.0846    0.1436
        20      340     20     23      0.1628    0.0181     0.1307    0.2018
        21      297     29     15      0.2445    0.0218     0.2049    0.2903
        22      253     21     37      0.3072    0.0239     0.2631    0.3568
        23      195     18     71      0.3712    0.0260     0.3226    0.4245
        24      106     23     83      0.5076    0.0324     0.4460    0.5725
    Rural Mover
        15     2748      2      6      0.0007    0.0005     0.0002    0.0029
        16     2740      3     21      0.0018    0.0008     0.0008    0.0044
        17     2716     23    161      0.0103    0.0019     0.0071    0.0148
        18     2532     35    197      0.0240    0.0030     0.0188    0.0306
        19     2300     37    164      0.0397    0.0039     0.0327    0.0481
        20     2099     54    106      0.0644    0.0050     0.0552    0.0750
        21     1939     48    103      0.0875    0.0059     0.0766    0.0999
        22     1788     65    163      0.1207    0.0070     0.1077    0.1351
        23     1560     88    682      0.1703    0.0084     0.1546    0.1874
        24      790     72    718      0.2459    0.0114     0.2244    0.2691
    Urban Stayer
        15      459      2      0      0.0044    0.0031     0.0011    0.0173
        16      457      1      2      0.0065    0.0038     0.0021    0.0201
        17      454     12      6      0.0328    0.0083     0.0199    0.0538
        18      436     15     13      0.0661    0.0117     0.0467    0.0931
        19      408     10     13      0.0890    0.0134     0.0660    0.1193
        20      385     17     13      0.1292    0.0160     0.1011    0.1643
        21      355     14     13      0.1635    0.0178     0.1318    0.2019
        22      328     17     22      0.2069    0.0197     0.1712    0.2488
        23      289     18    119      0.2563    0.0217     0.2166    0.3017
        24      152     20    132      0.3541    0.0278     0.3027    0.4114
    ------------------------------------------------------------------------
    Note: Net lost equals the number lost minus the number who entered.


    However, if I tabulate the proportion married by age over the same three groups, I get the following:


    Code:
    tab age mig_type_19,
    
               |      Migration Type in 2019
           Age | Rural Sta  Rural Mov  Urban Sta |     Total
    -----------+---------------------------------+----------
            15 |       269      1,521        267 |     2,057
            16 |       276      1,420        260 |     1,956
            17 |       254      1,526        272 |     2,052
            18 |       262      1,453        261 |     1,976
            19 |       217      1,240        242 |     1,699
            20 |       217      1,060        194 |     1,471
            21 |       184        994        205 |     1,383
            22 |       198        986        187 |     1,371
            23 |       167      1,016        211 |     1,394
            24 |       174        956        203 |     1,333
    -----------+---------------------------------+----------
         Total |     2,218     12,172      2,302 |    16,692
    
    . tab age mig_type_19, sum(mar_ind_r) nost nofreq
    
                              Means of First Marriage
    
               |    Migration Type in 2019
           Age | Rural Sta  Rural Mov  Urban Sta |     Total
    -----------+---------------------------------+----------
            15 | .01115242  .00131492  .00749064 | .00340301
            16 | .01811594   .0028169  .01153846 | .00613497
            17 | .03543307  .01703801  .05514706 | .02436647
            18 | .08778626  .03234687  .09578544 | .04807692
            19 | .15668203  .05806452  .12809917 | .08063567
            20 | .23041475  .09339623  .19587629 | .12712441
            21 | .32065217  .12977867  .25365854 | .17353579
            22 | .38888889    .163286  .30481283 | .21517141
            23 | .43113772   .1988189  .32701422 | .24605452
            24 | .52298851  .24895397  .34975369 | .30007502
    -----------+---------------------------------+----------
         Total | .19071235  .08051265  .15768897 | .10579919
    What puzzles me is that the failure function and the percent married differ. For example, among rural stayers (the first group) at time/age 15, the failure function is .0061, while the proportion married at age 15 is .0112. This is a large difference and can’t be explained by censoring, since this is the first year. What explains it?

    Second, there are only 269 rural stayers at age 15, but the corresponding at-risk population is 493. Why is there a difference between these two outputs? What does the “at risk” population refer to?

    Which is correct, the sts list output or the cross-tabs? Based on the information I have provided, are there any additional options that I need to include in the stset command?

    Thank you
    Last edited by Matt Brooks; 18 Jan 2022, 15:05.

  • #2
    The discrepancy you see is, I believe, accounted for by the fact that the survival analysis excludes 1,053 observations that begin on or after (first) failure, whereas your -tab- commands do not.

    As for the "at risk" column, recall how a life table is built: you start out with a cohort of people (or whatever the entities of the analysis are) who have not experienced the failure event. They are all "at risk" of failing at the beginning. Then, at each subsequent stage, the number at risk declines, because anyone who failed or was censored in the interval since the preceding observation is removed. Put otherwise, at any given time the at risk column gives the number of entities that are still under observation in the study and have not experienced the failure event.
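
    As a rough check of that bookkeeping, the first rows of the Rural Stayer panel in #1 can be reproduced by hand. This is only a sketch using the numbers printed in the sts list output (at risk, fail, net lost); nothing here touches the actual data.

    Code:
    * At risk at age 16 = at risk at 15 - failures at 15 - net lost at 15
    display 493 - 3 - 1
    * At risk at age 17, same rule applied to the age-16 row
    display 489 - 3 - 9
    * Kaplan-Meier failure function at age 16:
    * 1 - product over event times of (1 - failures/at risk)
    display %6.4f 1 - (1 - 3/493)*(1 - 3/489)

    The first two lines give 489 and 477, the "At risk" entries for ages 16 and 17, and the last gives 0.0122, the failure function reported at age 16.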


    • #3
      Thank you so much, Clyde. For anyone who comes across this in the future: the 1,053 observations excluded by stset are observations that occur after an individual gets married. I didn't want to drop these observations preemptively, since they don't affect the results of any subsequent Kaplan-Meier curves that I make, and later in my study I look at a different outcome and didn't want to have to reload the dataset.
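
      One way to see which records are excluded, without dropping anything, is the _st marker that stset creates (1 for records used in the analysis, 0 otherwise). The lines below are an untested sketch that assumes the data are still stset as in #1.

      Code:
      * Records stset leaves out of the analysis; this should equal the
      * 1,053 reported in the stset output
      count if _st == 0
      * Repeat the cross-tab using only the records stset actually uses
      tab age mig_type_19 if _st == 1, sum(mar_ind_r) nost nofreq

      Restricting to _st == 1 only removes the post-marriage records from the cross-tab. It does not change the denominators that sts list uses, so the two outputs need not agree even after this step.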

      Also, the difference between the "at risk" population and the cross-tabs arises because the "at risk" population represents anyone who at any time within the study is a rural stayer, a rural mover, etc. I found this out by running "distinct person_id if mig_type_19 == 1" and seeing that there are 493 distinct individuals who at any point are rural stayers (whether they start the study at age 15, 16, or 17). To explain further: if, for example, an individual enters the study at age 16 unmarried, the stset command assumes that the individual was also present and unmarried at age 15, and calculates accordingly.
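
      The age-15 numbers for rural stayers are consistent with this: both outputs contain the same 3 failures and differ only in the denominator (493 subjects treated as at risk from t = 0, versus 269 records actually observed at age 15). If the goal is to count a person as at risk only from the age at which they are first observed, one possible approach is to give stset an explicit interval start with time0(). The sketch below is untested and assumes each record describes the year ending at age; age0 is a hypothetical helper variable that is not in the original data.

      Code:
      * Same 3 failures, two different denominators
      display %6.4f 3/493
      display %6.4f 3/269
      * Hypothetical helper: start of the interval covered by each record
      generate age0 = age - 1
      * Declare each record as covering (age0, age] instead of (age[_n-1], age]
      stset age, id(person_id) failure(mar_ind_r) time0(age0)
      sts list, failure by(mig_type_19)

      The two display lines give 0.0061 and 0.0112, the sts list and cross-tab values at age 15. Under the stated assumption, the time0() version would put only subjects with a record at a given age into that age's risk set, so the at-risk counts would no longer include people who enter the study later. Whether that is the right risk set depends on how the ages were recorded.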

      Thanks again
