  • Help interpreting the hazard ratio of a variable with positive and negative values in a Cox regression model

    Hi Statalist.

    I created a variable that measures 'age difference by gender' within male-female couples, as I want to test whether gender has an effect with respect to age differences in relationships. The code I used is below ('hgsex' and 'hgage' are the sex and age of the respondent; 'p_hgsex' and 'p_hgage' are the sex and age of their partner):
    Code:
    gen wanted = cond(hgsex == 1 , hgage - p_hgage, cond(hgsex == 2, p_hgage - hgage, .))
    As such, this variable contains both positive values (the number of years that the male is older than the female) and negative values (the number of years that the female is older than the male). I added it to my Cox regression model as an indicator variable (i.agediff); however, Stata noted that a factor variable cannot contain negative values, so I changed it to a continuous variable (c.agediff) and received the following output:

    Code:
          t   Haz. Ratio   Std. Err.      z    P>|z|    [95% Conf. Interval]
      agediff   .9566857   .0552811     -0.77  0.443    .8542469    1.071408
    Can I include such a variable in a Cox regression and if so how do I interpret the result? (This has been reposted from https://www.statalist.org/forums/for...er-stata/page2, as it was at odds with the original thread topic.)
    Last edited by Chris Boulis; 20 Aug 2020, 05:28.

  • #2
    Can I include such a variable in a Cox regression and if so how do I interpret the result?
    With caution (details later), yes, you can include such a variable in your model. The (standard) interpretation is that for each 1 year increase in the underlying variable (agediff), the hazard is multiplied by 0.9567. That some values are negative doesn't matter. You could add any constant to agediff and the model fit would be unchanged.

    That is, the event rate is lower in couples where the male is older than his partner. It's assumed that the effect of agediff is linear, along with the other assumptions inherent in a Cox model. If the male is 5 years older, then the HR (relative to a same-age couple) is 0.9566857^5, or about 0.80.
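
    As a rough illustration of the point that adding a constant to agediff leaves the fit unchanged, here is a minimal sketch on simulated data. Everything in it (the seed, the sample size, the effect size, and the variable names time, fail and agediff50) is made up for the example and has nothing to do with the original poster's data.

    ```stata
    * simulated data only; all names and numbers are illustrative
    clear
    set seed 2020
    set obs 1000
    generate agediff = round(rnormal(2, 4))

    * exponential survival times with a log hazard that is linear in agediff
    generate time = -ln(runiform()) / (0.1 * exp(-0.04 * agediff))
    generate byte fail = 1
    stset time, failure(fail)

    * agediff takes both negative and positive values here
    stcox agediff

    * shift agediff so that every value is positive, then refit
    generate agediff50 = agediff + 50
    stcox agediff50
    ```

    The two stcox calls should report the same hazard ratio and log likelihood (up to numerical precision), since shifting a covariate by a constant does not change the Cox partial likelihood.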

    The reason for caution is that I don't fully understand your code (I can't see agediff in the first formula but am assuming it's equivalent to wanted), your study design (I assume the data contains only one observation for each couple), or your research question. You write "I want to test if gender has an effect with respect to age differences in relationships" but it's not obvious your model allows you to do that since you don't mention what the outcome is.


    • #3
      Adding to Paul's excellent feedback, the way I always interpret hazard ratios is by taking the absolute difference between the hazard ratio and 1. In your case, this means each additional year that the male is older than the female corresponds to a 4.33% decrease (1 - 0.9567 = 0.0433) in the hazard of relationship failure.


      • #4
        Thank you Paul Dickman and Chris Boudreaux for your responses. That helps a lot. According to #2, the risk of dissolution of a relationship where the male is one year older than their female partner is 0.9567, and if the male is 5 years older than their partner this will reduce the risk to about 0.801 or about 80% that of a couple of the same age. Regarding Chris' explanation in #3, I'd reduce the hazard by 4.33% for every year that the male is older than the female. If it's five years, then that couple will have a hazard about 20% lower than a couple of the same age. Is there a way to interpret the effect if the female is older than the male? Would it be 104.33 times for each year that a female is older than a male? (Based on the results in #1, this variable is not significant - correct?).

        Yes, 'wanted' is 'agediff' (by gender). I am using panel data (18 waves) so there would be one 'agediff' value per couple per wave. My research question tests the effect of a couple of key variables, and a number of control variables on the dissolution of relationships, 'agediff' is one of my control variables. I hope that answers your question.

        Thank you again, kind regards, Chris


        • #5
          Thanks for the additional information. Much clearer now.

          the risk of dissolution of a relationship where the male is one year older than their female partner is 0.9567, and if the male is 5 years older than their partner this will reduce the risk to about 0.801 or about 80% that of a couple of the same age.
          Not quite. I think you understand, but your statement is not correct (apart from the last clause) since it refers to the absolute risk of dissolution. The estimates refer to the relative risk. Based on the fitted model, the rate of dissolution (whatever it may be) is 4.33 percent lower if the male is 1 year older than his partner. It's assumed to be 4.33% lower for each and every year the male is older than his female partner (so a 5 year difference corresponds to the risk being 20% lower than couples of the same age or 80% of the risk as you write).

          Is there a way to interpret the effect if the female is older than the male?
          The HR is 1/0.9567 = 1.045 so the risk of dissolution is 4.5% higher for every year the woman is older than the male.

          Compared to same-age couples:

          The risk of dissolution is 20% lower if the male is 5 years older than the female.
          The risk of dissolution is 25% higher if the female is 5 years older than the male.

          It's assumed that these effects are the same for all levels of your other variables and for every point in time (length of relationship).
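
          The arithmetic behind these figures can be checked directly in Stata. This is only a sketch of the calculation from the point estimate in #1; hr1 is an arbitrary scalar name.

          ```stata
          * hazard ratio per 1-year increase in agediff, taken from the output in #1
          scalar hr1 = 0.9566857

          display "male 5 years older:   " %6.4f hr1^5
          display "female 1 year older:  " %6.4f 1/hr1
          display "female 5 years older: " %6.4f (1/hr1)^5
          ```

          These should come out at roughly 0.80, 1.045 and 1.25, matching the percentages quoted above.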

          You could, if you chose, categorise agediff. For example:

          1: male 6 or more years older
          2: male 3-6 years older
          3: age difference less than 3 years
          4: female 3-6 years older
          5: female 6 or more years older

          Then put ib3.varname in the model. ib3 specifies that category 3 is the reference, so you'll get hazard ratios for each of the other categories compared to the reference.


          • #6
            Hi Paul Dickman. Thank you for your comments. I'm having difficulties coding the categories you suggested and I think it is because I'm not coding the negative values correctly. Note that agediff is agediff (by gender as discussed above):
            Code:
            gen agediff2 = 1 if agediff > 6 & agediff < .
            replace agediff2 = 2 if inrange(agediff, 3, 6) & agediff < .
            replace agediff2 = 3 if inrange(agediff, 2, -2) & agediff < .
            replace agediff2 = 4 if inrange(agediff, -3, -6) & agediff < .
            replace agediff2 = 5 if agediff > -6 & agediff < .
            I also tried replacing inrange with inlist for lines 2-4.
            Code:
            replace agediff2 = 2 if inlist(agediff, 3, 4, 5, 6) & agediff < .
            replace agediff2 = 3 if inrange(agediff, 2, 1, 0, -1, -2) & agediff < .
            replace agediff2 = 4 if inrange(agediff, -3, -4, -5, -6) & agediff < .
            without any change (as one should expect), which makes me think my code is not dealing with negative values appropriately. Do you have any suggestions? Kind regards, Chris


            • #7
              When using inrange(), you need to specify the lower value followed by the higher value, that is, inrange(agediff, -2, 2) rather than inrange(agediff, 2, -2). You also need < -6 in the final category (not > -6).

              As a matter of style, I prefer !missing(agediff) rather than "agediff < ." as I think it makes the code easier to read.
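
              A small sketch of why the guard matters at all, whichever style is used: Stata treats missing values as larger than any number, so agediff > 6 on its own is true for missing observations. The four data values and the variable names unguarded and guarded are made up for the illustration.

              ```stata
              clear
              input agediff
              -7
              0
              7
              .
              end

              * without a guard, the missing value satisfies agediff > 6
              generate byte unguarded = agediff > 6
              * with the guard, the missing value is excluded
              generate byte guarded = agediff > 6 & !missing(agediff)

              list, clean
              ```

              In the last row (agediff missing) unguarded is 1 while guarded is 0.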

              I would use egen to create the categories. Here's an example where I generated some data and applied both the (corrected) inrange approach and the egen approach.

              Code:
              clear
              set seed 123456
              set obs 1000
              generate agediff=rnormal(0,3)
              summarize agediff
              
              gen agediff2 = 1 if agediff > 6 & agediff < .
              replace agediff2 = 2 if inrange(agediff, 3, 6) & agediff < .
              replace agediff2 = 3 if inrange(agediff, -3, 3) & agediff < .
              replace agediff2 = 4 if inrange(agediff, -6, -3) & agediff < .
              replace agediff2 = 5 if agediff < -6 & agediff < .
              
              egen agediff3 = cut(agediff) if !missing(agediff), at(-99 -6 -3 3 6 99) icodes
              
              tab agediff2 agediff3
              Here's the final output

              Code:
                         |                        agediff3
                agediff2 |         0          1          2          3          4 |     Total
              -----------+-------------------------------------------------------+----------
                       1 |         0          0          0          0         27 |        27 
                       2 |         0          0          0        126          0 |       126 
                       3 |         0          0        691          0          0 |       691 
                       4 |         0        133          0          0          0 |       133 
                       5 |        23          0          0          0          0 |        23 
              -----------+-------------------------------------------------------+----------
                   Total |        23        133        691        126         27 |     1,000
              The icodes option to egen uses 0 as the first category; you could add 1 to all values if you wanted the first category to be 1.
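
              If you want the codes to follow the 1 to 5 numbering in #5 rather than the 0 to 4 icodes, note that the two run in opposite directions (icodes start at the most negative age difference, whereas #5 starts with the male being much older). A minimal sketch on the same kind of simulated data; agecat is just an illustrative name.

              ```stata
              clear
              set seed 123456
              set obs 1000
              generate agediff = rnormal(0, 3)

              egen agediff3 = cut(agediff), at(-99 -6 -3 3 6 99) icodes

              * icodes: 0 = female 6+ years older, ..., 4 = male 6+ years older
              * #5:     1 = male 6+ years older,   ..., 5 = female 6+ years older
              generate agecat = 5 - agediff3

              tabulate agediff3 agecat
              ```

              Simply adding 1 to agediff3 would also give codes 1 to 5, but with category 1 meaning "female 6 or more years older".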


              • #8
                Thank you Paul Dickman. I feel better knowing my code was close :-) Thanks for the comments regarding negative values (where I was tripped up). I wonder why you included -3, 3 in the third line of code for 'agediff2' when the rule in #5 (to apply to lines 2 & 4) is 'less than 3'. Will that not lead to duplication?

                Wow, -egen- makes the code much simpler. I need to learn more about that function, as I can see it will save a lot of time. Can you clarify why you included (-99, 99) in your -egen- alternative? Why not just include (-6, -3, 3, 6)? Given #5, is it not more correct to use (-6, -2, 2, 6) or, if -99 and 99 are needed, (-99, -6, -2, 2, 6, 99)? Kind regards, Chris


                • #9
                  Can you clarify why you included (-99, 99) in your -egen- alternative?
                  As it says in the help, "newvar is set to missing for observations with varname less than the first number specified in at() and for observations with varname greater than or equal to the last number specified in at()."

                  Given #5, is it not more correct to use (-6, -2, 2, 6) or, if -99 and 99 are needed, (-99, -6, -2, 2, 6, 99)?
                  Good question. I should have been more precise here. Whenever splitting or categorising, one should always be aware of what happens on the boundaries. This command (and most other Stata commands that perform this type of operation) creates intervals that are closed on the left and open on the right. Here's an illustration:

                  Code:
                  // generate a data set where agediff takes 
                  // integer values from -10 to 10
                  clear
                  set obs 21
                  generate agediff = _n-11
                  
                  egen agediff2 = cut(agediff), at(-6 -3 3 6) icodes
                  egen agediff3 = cut(agediff), at(-99 -6 -3 3 6 99) icodes
                  
                  list, clean
                  Gives the following:

                  Code:
                  . list, clean
                  
                         agediff   agediff2   agediff3  
                    1.       -10          .          0  
                    2.        -9          .          0  
                    3.        -8          .          0  
                    4.        -7          .          0  
                    5.        -6          0          1  
                    6.        -5          0          1  
                    7.        -4          0          1  
                    8.        -3          1          2  
                    9.        -2          1          2  
                   10.        -1          1          2  
                   11.         0          1          2  
                   12.         1          1          2  
                   13.         2          1          2  
                   14.         3          2          3  
                   15.         4          2          3  
                   16.         5          2          3  
                   17.         6          .          4  
                   18.         7          .          4  
                   19.         8          .          4  
                   20.         9          .          4  
                   21.        10          .          4
                  Here you can see the effect of not including the limiting endpoints (-99, 99). I would suggest choosing endpoints such that any larger value would be expected to be an error (an age difference greater than 25?).

                  The interval between 3 and 6 includes 3, but not 6. Mathematicians write this as [3,6) (square brackets denote closed and parentheses denote open).

                  In my example agediff contained real numbers (as opposed to integers) so with "at(-99 -6 -3 3 6 99)" the middle category (in practice) captured all values where the age difference was less than 3.

                  If your agediff only contains integer values then you probably want the following in order to be consistent with #5. Bottom line is to be precise in how you define the categories and write the code accordingly.

                  Code:
                  . egen agediff2 = cut(agediff), at(-99 -5 -2 3 6 99) icodes
                  
                  . 
                  . list, clean
                  
                         agediff   agediff2  
                    1.       -10          0  
                    2.        -9          0  
                    3.        -8          0  
                    4.        -7          0  
                    5.        -6          0  
                    6.        -5          1  
                    7.        -4          1  
                    8.        -3          1  
                    9.        -2          2  
                   10.        -1          2  
                   11.         0          2  
                   12.         1          2  
                   13.         2          2  
                   14.         3          3  
                   15.         4          3  
                   16.         5          3  
                   17.         6          4  
                   18.         7          4  
                   19.         8          4  
                   20.         9          4  
                   21.        10          4
                  One could reformulate the definitions in #5 as follows to improve clarity:

                  0: female 6 or more years older
                  1: female 3, 4, or 5 years older
                  2: age difference less than 3 years
                  3: male 3, 4, or 5 years older
                  4: male 6 or more years older
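
                  To keep these definitions attached to the variable, value labels can be added after egen. This is only a sketch on the integer example data; the names agecat and agecat_lbl and the label wording are illustrative choices, and the /// line continuations assume the code is run from a do-file.

                  ```stata
                  clear
                  set obs 21
                  generate agediff = _n - 11

                  egen agecat = cut(agediff), at(-99 -5 -2 3 6 99) icodes

                  * attach the category definitions as value labels
                  label define agecat_lbl 0 "female 6+ years older"  ///
                                          1 "female 3-5 years older" ///
                                          2 "age diff < 3 years"     ///
                                          3 "male 3-5 years older"   ///
                                          4 "male 6+ years older"
                  label values agecat agecat_lbl

                  * check each integer value against the category it landed in
                  tabulate agediff agecat
                  ```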


                  • #10
                    Thank you very much Paul Dickman. Very clear explanation. In the at() option of -egen-'s cut() function, I add 'capped' values to bound the first and last groupings; you used 99, but I could use any value. I will apply 25 as you suggested - only 0.04% of my sample is outside this value.

                    To confirm my understanding: when working with negative values in -egen-, I include the value below the start of my range; with positive numbers, however, I include the value from which my range starts (the value before it forms the end of the previous range). For example, based on the -egen- code in #9: -6 = values from -25 to -6; -2 = values -5, -4, -3. A positive 3 refers to all values from the last cut point (-2) up to the preceding value (-2, -1, 0, 1, 2), and 6 captures all values from 3 to 5 (3, 4, 5). Lastly, 25 refers to values >= 6 to 25.
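
                    One thing to check when swapping 99 for 25: by the rule quoted from the help in #9, a value equal to the last number in at() is set to missing, so an upper endpoint of 25 covers 6 to 24 for integer data, not 6 to 25. A quick sketch with made-up values (the names cat25 and cat26, and the choice of 26 as an alternative upper endpoint, are only illustrations):

                    ```stata
                    clear
                    input agediff
                    -26
                    -25
                    -6
                    -5
                    5
                    6
                    24
                    25
                    end

                    * upper endpoint 25: agediff == 25 is not covered
                    egen cat25 = cut(agediff), at(-25 -5 -2 3 6 25) icodes
                    * upper endpoint 26: agediff == 25 falls in the top category
                    egen cat26 = cut(agediff), at(-25 -5 -2 3 6 26) icodes

                    list, clean
                    ```

                    Here -26 is missing in both variables, while 25 is missing in cat25 but falls in the top category of cat26.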

                    . tab agediff2 agediff3
                    Code:
                                          |                        agediff3
                                 agediff2 | [0] femal  [1] femal  [2] age d  [3] male>  [4] male> 
                    ----------------------+-------------------------------------------------------
                    [0] female>male >= 6  |     3,072          0          0          0          0 
                    [1] female>male 3-5 y |         0      4,742          0          0          0 
                    [2] age diff < 3 year |         0          0     42,476          0          0 
                    [3] male>female 3-5 y |         0          0          0     20,902          0 
                    [4] male>female >= 6  |         0          0          0          0     17,090
                    ----------------------+-------------------------------------------------------
                                    Total |     3,072      4,742     42,476     20,902     17,090
                    . list agediff2 agediff3
                    Code:
                                             agediff2                     agediff3  
                      1.       [2] age diff < 3 years       [2] age diff < 3 years  
                      2.       [2] age diff < 3 years       [2] age diff < 3 years  
                      3.       [2] age diff < 3 years       [2] age diff < 3 years  
                      4.       [2] age diff < 3 years       [2] age diff < 3 years  
                      5.   [4] male>female >= 6 years   [4] male>female >= 6 years  
                      6.   [4] male>female >= 6 years   [4] male>female >= 6 years  
                      7.   [4] male>female >= 6 years   [4] male>female >= 6 years  
                      8.   [4] male>female >= 6 years   [4] male>female >= 6 years  
                     24.       [2] age diff < 3 years       [2] age diff < 3 years  
                     25.       [2] age diff < 3 years       [2] age diff < 3 years  
                     26.       [2] age diff < 3 years       [2] age diff < 3 years
                    After adding i.agediff3, it appears not to be significant in my model:
                    Code:
                    -----------------------------------------------------------------------------------------
                                         _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
                    ------------------------+----------------------------------------------------------------
                                   agediff3 |
                    [1] female>male 3-5 ..  |   1.513371   .5885467     1.07   0.287     .7061792    3.243216
                    [2] age diff < 3 years  |   1.162385   .4569099     0.38   0.702     .5379725     2.51154
                    [3] male>female 3-5 ..  |   1.125298    .518667     0.26   0.798     .4559694     2.77715
                    [4] male>female >= 6..  |   1.685616   .9647151     0.91   0.362     .5490314    5.175115


                    • #11
                      Yes, i.agediff3 appears not to be significant but to formally test this you should test that all 4 parameters are jointly zero using, for example, the test or lrtest commands.

                      I suggest using the "diff < 3 years" category as the reference (use ib2.agediff3 instead of i.agediff3).

                      I suggest turning on showbaselevels, especially when posting on a public forum where readers aren't familiar with your data and don't have access to it.

                      I think it should be the default. You can make it permanent with:

                      Code:
                      set showbaselevels on, permanently


                      • #12
                        Very helpful, thank you Paul Dickman. My mistake: I realised after posting that I forgot to use ib2.agediff3. Here are the results with ib2.agediff3 and "set showbaselevels on" as suggested:
                        Code:
                                                       _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
                        ----------------------------------+----------------------------------------------------------------
                                                 agediff3 |
                              [0] female>male >= 6 years  |   .9716035   .3054677    -0.09   0.927     .5246566    1.799297
                               [1] female>male 3-5 years  |   1.376661   .3109827     1.42   0.157     .8841862    2.143435
                                  [2] age diff < 3 years  |          1  (base)
                               [3] male>female 3-5 years  |    .925072   .1401981    -0.51   0.607     .6873419    1.245026
                              [4] male>female >= 6 years  |   1.282837   .1894952     1.69   0.092     .9603638    1.713592
                        I'm not yet familiar with -lrtest-, so I used -test-
                        Code:
                        . testparm i(0/4).agediff3
                        
                        H0: All 4 parameters are jointly zero
                        
                         ( 1)  0.agediff3 = 0
                         ( 2)  1.agediff3 = 0
                         ( 3)  3.agediff3 = 0
                         ( 4)  4.agediff3 = 0
                        
                                   chi2(  4) =    5.80
                                 Prob > chi2 =    0.2147
                        I'm not sure how to interpret the results, so I appreciate your help with that. Kind regards, Chris


                        • #13
                          There is no evidence of a statistically significant association (p=0.2) between the within-couple age difference (classified in 5 categories) and risk of dissolution.

                          Neither do we see, from the estimated hazard ratios, evidence of any pattern suggesting an association.

                          It's possible the HRs could have been something like 1.8, 1.4, 1, 1.4, 1.8 (age differences in either direction increase risk) but that would be missed when modelling agediff as a linear effect.
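
                          To make that last point concrete, here is a sketch on simulated data in which the hazard rises with the age difference in either direction. All of it is invented for illustration (the seed, the effect sizes, the control variable x, and the names agecat, full and reduced); it also shows the likelihood-ratio alternative to testparm mentioned in #11.

                          ```stata
                          * simulated data only; all names and numbers are illustrative
                          clear
                          set seed 12345
                          set obs 2000
                          generate agediff = round(rnormal(0, 4))
                          generate x = rnormal()
                          egen agecat = cut(agediff), at(-99 -5 -2 3 6 99) icodes

                          * U-shaped effect: the hazard increases with abs(agediff)
                          generate time = -ln(runiform()) / (0.1 * exp(0.1 * abs(agediff) + 0.2 * x))
                          generate byte fail = 1
                          stset time, failure(fail)

                          * linear term: positive and negative differences tend to cancel out
                          stcox agediff x

                          * categorical term with the middle category as reference
                          stcox ib2.agecat x
                          testparm i(0/4).agecat
                          estimates store full

                          * likelihood-ratio test of the same joint hypothesis
                          stcox x
                          estimates store reduced
                          lrtest full reduced
                          ```

                          With data generated this way, one would expect the hazard ratio for the linear term to sit close to 1, while the categorical model picks up elevated hazards at both ends.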


                          • #14
                            Hi Paul Dickman. Interestingly, adding age as a continuous variable for the respondent (hgage) and their partner (p_hgage) yields significant results:
                            Code:
                                                        hgage |    .931602   .0249011    -2.65   0.008     .8840532    .9817082
                                                      p_hgage |   1.054282   .0281721     1.98   0.048     1.000487     1.11097
                            Age proves to be more important when combined with union type:
                            Code:
                                                     ageunion |
                                  [1] de facto both under 25  |          1  (base)
                                   [2] married both under 25  |   1.05e-14   4.37e-08    -0.00   1.000            0           .
                                       [3] de facto both 25+  |   .5424278   .1346233    -2.46   0.014     .3334918    .8822642
                                        [4] married both 25+  |   .1645743   .0403909    -7.35   0.000     .1017313    .2662376
                            Even more so when looking at the age of the female:
                            Code:
                                [1] female de facto under 25  |          1  (base)
                                 [2] female married under 25  |   .2316685   .1218652    -2.78   0.005     .0826242    .6495711
                                     [3] female de facto 25+  |   .4584185   .0837781    -4.27   0.000     .3204064     .655878
                                      [4] female married 25+  |   .1333624   .0238447   -11.27   0.000      .093938    .1893326
                            Last edited by Chris Boulis; 26 Aug 2020, 22:33.
