Help interpreting the hazard ratio of a variable with positive and negative values in a Cox regression model

Chris Boulis

Join Date: Feb 2019

Posts: 368
#1

20 Aug 2020, 05:25

Hi Statalist.

I created a variable that measures 'Age Difference by Gender' within male-female couples, as I want to test whether gender has an effect with respect to age differences in relationships. The code I used is below (legend: 'hgsex' and 'hgage' are the sex and age of the respondent; 'p_hgsex' and 'p_hgage' are the sex and age of their partner):

Code:

gen wanted = cond(hgsex == 1 , hgage - p_hgage, cond(hgsex == 2, p_hgage - hgage, .))

As such, this variable contains both positive values (the number of years that the male is older than the female) and negative values (the number of years that the female is older than the male). I added it to my Cox regression model as an indicator variable (i.agediff); however, Stata noted that a factor variable cannot contain negative values, so I changed it to a continuous variable (c.agediff) and received the following output:

Code:

          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     agediff |   .9566857   .0552811    -0.77   0.443     .8542469    1.071408

Can I include such a variable in a Cox regression and if so how do I interpret the result. (This has been reposted from https://www.statalist.org/forums/for...er-stata/page2, as it was at odds with the original thread topic).

Last edited by Chris Boulis; 20 Aug 2020, 05:28.
Paul Dickman

Join Date: Apr 2014

Posts: 294
#2

20 Aug 2020, 08:09

Can I include such a variable in a Cox regression and if so how do I interpret the result.

With caution (details later), yes you can include such a variable in your model. The (standard) interpretation is that for each 1 year increase in the underlying variable (agediff), the hazard is multiplied by 0.9567. That some values are negative doesn't matter. You could add any constant to agediff and the model fit would be unchanged.
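The point about adding a constant can be checked directly. This is a minimal sketch, assuming Stata's shipped cancer dataset as a stand-in for the couples data: age plays the role of agediff, and age_shifted is a hypothetical variable created only for the illustration.

```stata
* Stand-in data: sysuse cancer is an assumption, not the poster's data
sysuse cancer, clear
stset studytime, failure(died)

* Fit with the covariate as recorded and keep the log hazard ratio
stcox age
scalar b_orig = _b[age]

* Shift the covariate by an arbitrary constant (so some values become negative) and refit
generate age_shifted = age - 60
stcox age_shifted
scalar b_shift = _b[age_shifted]

* The two log hazard ratios should agree up to numerical tolerance
display "original: " b_orig "   shifted: " b_shift
```

The shift changes only the (unestimated) baseline hazard, which is why the hazard ratio and its standard error should not move.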

That is, the event rate is lower in couples where the male is older than his partner. The effect of agediff is assumed to be linear (on the log-hazard scale), in addition to the other assumptions inherent in a Cox model. If the male is 5 years older, then the HR is 0.9566857^5 (about 0.80) relative to a same-age couple.
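Spelled out, the scaling follows from the log-linear form of the Cox model. In the sketch below, \beta denotes the coefficient on agediff (so e^{\beta} is the reported hazard ratio) and d is the age difference in years relative to a same-age couple; the notation is introduced here for illustration only.

```latex
\mathrm{HR}(d)
  = \frac{h(t \mid \text{agediff} = d)}{h(t \mid \text{agediff} = 0)}
  = e^{\beta d}
  = \left(e^{\beta}\right)^{d},
\qquad
\mathrm{HR}(1) = 0.9566857,
\qquad
\mathrm{HR}(5) = 0.9566857^{5} \approx 0.801 .
```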

The reason for caution is that I don't fully understand your code (I can't see agediff in the first formula but am assuming it's equivalent to wanted), your study design (I assume the data contains only one observation for each couple), or your research question. You write "I want to test if gender has an effect with respect to age differences in relationships" but it's not obvious your model allows you to do that since you don't mention what the outcome is.
Chris Boudreaux

Join Date: Jul 2020

Posts: 83
#3

20 Aug 2020, 08:47

Adding to Paul's excellent feedback, the way I always interpret hazard ratios is by taking the absolute difference between the hazard ratio and 1. In your case, this means each additional year that the male is older than the female corresponds to a 4.33% (1 - 0.9567 = 0.0433) decrease in the hazard of relationship failure.
Chris Boulis

Join Date: Feb 2019

Posts: 368
#4

20 Aug 2020, 23:29

Thank you Paul Dickman and Chris Boudreaux for your responses. That helps a lot. According to #2, the risk of dissolution of a relationship where the male is one year older than their female partner is 0.9567, and if the male is 5 years older than their partner this will reduce the risk to about 0.801 or about 80% that of a couple of the same age. Regarding Chris' explanation in #3, I'd reduce the hazard by 4.33% for every year that the male is older than the female. If it's five years, then that couple will have a hazard about 20% lower than a couple of the same age. Is there a way to interpret the effect if the female is older than the male? Would it be 104.33 times for each year that a female is older than a male? (Based on the results in #1, this variable is not significant - correct?).

Yes, 'wanted' is 'agediff' (by gender). I am using panel data (18 waves) so there would be one 'agediff' value per couple per wave. My research question tests the effect of a couple of key variables, and a number of control variables on the dissolution of relationships, 'agediff' is one of my control variables. I hope that answers your question.

Thank you again, kind regards, Chris
Paul Dickman

Join Date: Apr 2014

Posts: 294
#5

21 Aug 2020, 01:20

Thanks for the additional information. Much clearer now.

the risk of dissolution of a relationship where the male is one year older than their female partner is 0.9567, and if the male is 5 years older than their partner this will reduce the risk to about 0.801 or about 80% that of a couple of the same age.

Not quite. I think you understand, but your statement is not correct (apart from the last clause) since it refers to the absolute risk of dissolution. The estimates refer to the relative risk. Based on the fitted model, the rate of dissolution (whatever it may be) is 4.33 percent lower if the male is 1 year older than his partner. It's assumed to be 4.33% lower for each and every year the male is older than his female partner (so a 5 year difference corresponds to the risk being 20% lower than couples of the same age or 80% of the risk as you write).

Is there a way to interpret the effect if the female is older than the male?

The HR is 1/0.9567 = 1.045 so the risk of dissolution is 4.5% higher for every year the woman is older than the male.

Compared to same-age couples:

The risk of dissolution is 20% lower if the male is 5 years older than the female.
The risk of dissolution is 25% higher if the female is 5 years older than the male.

It's assumed that these effects are the same for all levels of your other variables and for every point in time (length of relationship).
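Multi-year comparisons like these can also be read off the fitted model, with confidence intervals, rather than computed by hand. This is a hedged sketch using Stata's shipped cancer data as a stand-in: age stands in for agediff, and the poster's actual model has more covariates.

```stata
* Stand-in data and model; not the poster's data
sysuse cancer, clear
stset studytime, failure(died)
stcox age

* Hazard ratio for a 5-unit increase in the covariate: exp(5*b)
lincom 5*age, hr

* Hazard ratio for a 5-unit decrease: exp(-5*b), the reciprocal of the above
lincom -5*age, hr
```

With the estimate from #1, the same arithmetic gives 0.9567^5 (about 0.80) and 1/0.9567^5 (about 1.25), matching the two statements above.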

You could, if you chose, categorise agediff. For example:

1: male 6 or more years older
2: male 3-6 years older
3: age difference less than 3 years
4: female 3-6 years older
5: female 6 or more years older

Then put ib3.varname in the model. ib3 specifies that category 3 is the reference, so you'll get hazard ratios for each of the other categories compared to the reference.
Chris Boulis

Join Date: Feb 2019

Posts: 368
#6

22 Aug 2020, 21:09

Hi Paul Dickman. Thank you for your comments. I'm having difficulties coding the categories you suggested and I think it is because I'm not coding the negative values correctly. Note that agediff is agediff (by gender as discussed above):

Code:

gen agediff2 = 1 if agediff > 6 & agediff < .
replace agediff2 = 2 if inrange(agediff, 3, 6) & agediff < .
replace agediff2 = 3 if inrange(agediff, 2, -2) & agediff < .
replace agediff2 = 4 if inrange(agediff, -3, -6) & agediff < .
replace agediff2 = 5 if agediff > -6 & agediff < .

I also tried replacing inrange with inlist for lines 2-4.

Code:

replace agediff2 = 2 if inlist(agediff, 3, 4, 5, 6) & agediff < .
replace agediff2 = 3 if inrange(agediff, 2, 1, 0, -1, -2) & agediff < .
replace agediff2 = 4 if inrange(agediff, -3, -4, -5, -6) & agediff < .

without any change (as one should expect), which makes me think my code is not dealing with negative values appropriately. Do you have any suggestions? Kind regards, Chris

Paul Dickman

Join Date: Apr 2014
Posts: 294
#7

23 Aug 2020, 11:59

When using inrange(), you need to specify the lower value followed by the higher value, that is, inrange(agediff, -2, 2) rather than inrange(agediff, 2, -2). You also need < -6 in the final category (not > -6).

As a matter of style, I prefer !missing(agediff) rather than "agediff < ." as I think it makes the code easier to read.

I would use egen to create the categories. Here's an example where I generated some data and applied both the (corrected) inrange approach and the egen approach.

Code:

clear
set seed 123456
set obs 1000
generate agediff=rnormal(0,3)
summarize agediff

gen agediff2 = 1 if agediff > 6 & agediff < .
replace agediff2 = 2 if inrange(agediff, 3, 6) & agediff < .
replace agediff2 = 3 if inrange(agediff, -3, 3) & agediff < .
replace agediff2 = 4 if inrange(agediff, -6, -3) & agediff < .
replace agediff2 = 5 if agediff < -6 & agediff < .

egen agediff3 = cut(agediff) if !missing(agediff), at(-99 -6 -3 3 6 99) icodes

tab agediff2 agediff3

Here's the final output

Code:

           |                        agediff3
  agediff2 |         0          1          2          3          4 |     Total
-----------+-------------------------------------------------------+----------
         1 |         0          0          0          0         27 |        27 
         2 |         0          0          0        126          0 |       126 
         3 |         0          0        691          0          0 |       691 
         4 |         0        133          0          0          0 |       133 
         5 |        23          0          0          0          0 |        23 
-----------+-------------------------------------------------------+----------
     Total |        23        133        691        126         27 |     1,000

The icodes option to egen uses 0 as the first category; you could add 1 to all values if you wanted the first category to be 1.


Chris Boulis

Join Date: Feb 2019

Posts: 368
#8

25 Aug 2020, 00:14

Thank you Paul Dickman. I feel better knowing my code was close :-) thanks for the comments regarding negative values (where I was tripped up). I wonder why you included -3, 3 in the third line of code for 'agediff2' when the rule in #5 (to apply to lines 2 & 4) is 'less than 3'. Will that not lead to duplication?

Wow, -egen- makes the code much simpler. I need to learn more about that function, as I can see it will save a lot of time. Can you clarify why you included (-99, 99) in your -egen- alternative? Why not just include (-6, -3, 3, 6)? Given #5, is it not more correct to use (-6, -2, 2, 6) or, if -99, 99 are needed, (-99, -6, -2, 2, 6, 99)? Kind regards, Chris
Paul Dickman

Join Date: Apr 2014

Posts: 294
#9

25 Aug 2020, 01:26

Can you clarify why you included (-99, 99) in your -egen- alternative?

As it says in the help, "newvar is set to missing for observations with varname less than the first number specified in at() and for observations with varname greater than or equal to the last number specified in at()."

Given #5, is it not more correct to use (-6, -2, 2, 6) or, if -99, 99 are needed, (-99, -6, -2, 2, 6, 99)?

Good question. I should have been more precise here. Whenever splitting or categorising, one should always be aware of what happens on the boundaries. This command (and most other Stata commands that perform this type of operation) creates intervals that are closed on the left and open on the right. Here's an illustration:

Code:

// generate a data set where agediff takes
// integer values from -10 to 10
clear
set obs 21
generate agediff = _n-11
egen agediff2 = cut(agediff), at(-6 -3 3 6) icodes
egen agediff3 = cut(agediff), at(-99 -6 -3 3 6 99) icodes
list, clean

Gives the following:

Code:

. list, clean

       agediff   agediff2   agediff3
  1.       -10          .          0
  2.        -9          .          0
  3.        -8          .          0
  4.        -7          .          0
  5.        -6          0          1
  6.        -5          0          1
  7.        -4          0          1
  8.        -3          1          2
  9.        -2          1          2
 10.        -1          1          2
 11.         0          1          2
 12.         1          1          2
 13.         2          1          2
 14.         3          2          3
 15.         4          2          3
 16.         5          2          3
 17.         6          .          4
 18.         7          .          4
 19.         8          .          4
 20.         9          .          4
 21.        10          .          4

Here you can see the effect of not including the limiting endpoints (-99, 99). I would suggest choosing endpoints such that any larger value would be expected to be an error (an age difference greater than 25?).

The interval between 3 and 6 includes 3, but not 6. Mathematicians write this as [3,6) (square brackets denote closed and parentheses denote open).

In my example agediff contained real numbers (as opposed to integers) so with "at(-99 -6 -3 3 6 99)" the middle category (in practice) captured all values where the age difference was less than 3.

If your agediff only contains integer values then you probably want the following in order to be consistent with #5. Bottom line is to be precise in how you define the categories and write the code accordingly.

Code:

. egen agediff2 = cut(agediff), at(-99 -5 -2 3 6 99) icodes

. list, clean

       agediff   agediff2
  1.       -10          0
  2.        -9          0
  3.        -8          0
  4.        -7          0
  5.        -6          0
  6.        -5          1
  7.        -4          1
  8.        -3          1
  9.        -2          2
 10.        -1          2
 11.         0          2
 12.         1          2
 13.         2          2
 14.         3          3
 15.         4          3
 16.         5          3
 17.         6          4
 18.         7          4
 19.         8          4
 20.         9          4
 21.        10          4

One could reformulate the definitions in #5 as follows to improve clarity:

0: female 6 or more years older
1: female 3,4, or 5 years older
2: age difference less than 3 years
3: male 3,4, or 5 years older
4: male 6 or more years older
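To make the category definitions visible in output and to confirm where each boundary falls, value labels and a min/max check can be added. This is a sketch under stated assumptions: the variable name agecat and the label name agecat_lbl are hypothetical, the data are the same integer-valued toy data as in the illustration, and the `///` line continuations mean it should be run from a do-file.

```stata
* Toy data: integer age differences from -10 to 10
clear
set obs 21
generate agediff = _n - 11

* Cutpoints from the integer-valued example: intervals are [lower, upper)
egen agecat = cut(agediff), at(-99 -5 -2 3 6 99) icodes

* Attach labels matching the reformulated definitions
label define agecat_lbl 0 "female 6+ years older"   ///
                        1 "female 3-5 years older"  ///
                        2 "age diff < 3 years"      ///
                        3 "male 3-5 years older"    ///
                        4 "male 6+ years older"
label values agecat agecat_lbl

* Check the boundaries: smallest and largest agediff in each category
tabstat agediff, by(agecat) statistics(min max n)
```

Reading the min and max per category is a quick way to catch an off-by-one cutpoint before the variable goes into a model.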

Chris Boulis

Join Date: Feb 2019
Posts: 368

#10

25 Aug 2020, 21:12

Thank you very much Paul Dickman. Very clear explanation. In the at() option of -egen-'s cut() function, I add 'capped' values to bound the first and last groupings; you used 99, but I could use any value. I will apply 25 as you suggested, since only 0.04% of my sample is outside this value.

To confirm my understanding: when working with negative values in -egen-, I include the value below the start of my range; with positive numbers, I include the value from which my range starts (the previous value forms the end of the previous range). For example, based on the -egen- code in #9: -6 = values from -25 to -6; -2 = values -5, -4, -3. A positive 3 refers to all values from the last value (-2) up to the preceding value (-2, -1, 0, 1, 2), and 6 captures all values from 3 to 5 (3, 4, 5). Lastly, 25 refers to values >=6 to 25.

. tab agediff2 agediff3

Code:

                      |                        agediff3
             agediff2 | [0] femal  [1] femal  [2] age d  [3] male>  [4] male> 
----------------------+-------------------------------------------------------
[0] female>male >= 6  |     3,072          0          0          0          0 
[1] female>male 3-5 y |         0      4,742          0          0          0 
[2] age diff < 3 year |         0          0     42,476          0          0 
[3] male>female 3-5 y |         0          0          0     20,902          0 
[4] male>female >= 6  |         0          0          0          0     17,090
----------------------+-------------------------------------------------------
                Total |     3,072      4,742     42,476     20,902     17,090

. list agediff2 agediff3

Code:

                         agediff2                     agediff3  
  1.       [2] age diff < 3 years       [2] age diff < 3 years  
  2.       [2] age diff < 3 years       [2] age diff < 3 years  
  3.       [2] age diff < 3 years       [2] age diff < 3 years  
  4.       [2] age diff < 3 years       [2] age diff < 3 years  
  5.   [4] male>female >= 6 years   [4] male>female >= 6 years  
  6.   [4] male>female >= 6 years   [4] male>female >= 6 years  
  7.   [4] male>female >= 6 years   [4] male>female >= 6 years  
  8.   [4] male>female >= 6 years   [4] male>female >= 6 years  
 24.       [2] age diff < 3 years       [2] age diff < 3 years  
 25.       [2] age diff < 3 years       [2] age diff < 3 years  
 26.       [2] age diff < 3 years       [2] age diff < 3 years

Adding i.agediff3 appears not to be significant in my model

Code:

-----------------------------------------------------------------------------------------
                     _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------------+----------------------------------------------------------------
               agediff3 |
[1] female>male 3-5 ..  |   1.513371   .5885467     1.07   0.287     .7061792    3.243216
[2] age diff < 3 years  |   1.162385   .4569099     0.38   0.702     .5379725     2.51154
[3] male>female 3-5 ..  |   1.125298    .518667     0.26   0.798     .4559694     2.77715
[4] male>female >= 6..  |   1.685616   .9647151     0.91   0.362     .5490314    5.175115


Paul Dickman

Join Date: Apr 2014

Posts: 294
#11

26 Aug 2020, 00:02

Yes, i.agediff3 appears not to be significant but to formally test this you should test that all 4 parameters are jointly zero using, for example, the test or lrtest commands.
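For readers unfamiliar with the two commands, here is a rough sketch of both joint tests. It assumes Stata's shipped cancer data as a stand-in, with the three-level variable drug playing the role of agediff3; the stored-estimate names full and reduced are arbitrary.

```stata
* Stand-in data; drug (3 levels) plays the role of the categorised age difference
sysuse cancer, clear
stset studytime, failure(died)

* Full model: categorical covariate plus a continuous covariate
stcox age i.drug

* Wald test that all non-base coefficients of the categorical covariate are jointly zero
testparm i.drug
estimates store full

* Reduced model without the categorical covariate, restricted to the same sample
stcox age if e(sample)
estimates store reduced

* Likelihood-ratio test of the same hypothesis
lrtest full reduced
```

The degrees of freedom equal the number of non-base categories (2 here, 4 for a five-category agediff3).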

I suggest using the "diff < 3 years" category as the reference (use ib2.agediff3 instead of i.agediff3).

I suggest turning on showbaselevels, especially when posting on a public forum where readers aren't familiar with your data and don't have access to it.

I think it should be the default. You can make it permanent with:

Code:

set showbaselevels on, permanently

Chris Boulis

Join Date: Feb 2019
Posts: 368

#12

26 Aug 2020, 04:56

Very helpful thank you Paul Dickman. My mistake, I realised I forgot to use ib2.agediff3 after posting, here are the results for that below with "set showbaselevels on" as suggested:

Code:

                               _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------------------------+----------------------------------------------------------------
                         agediff3 |
     [0] female>male >= 6 years   |   .9716035   .3054677    -0.09   0.927     .5246566    1.799297
       [1] female>male 3-5 years  |   1.376661   .3109827     1.42   0.157     .8841862    2.143435
          [2] age diff < 3 years  |          1  (base)
       [3] male>female 3-5 years  |    .925072   .1401981    -0.51   0.607     .6873419    1.245026
      [4] male>female >= 6 years  |   1.282837   .1894952     1.69   0.092     .9603638    1.713592

I'm not yet familiar with -lrtest-, so I used -test-

Code:

. testparm i(0/4).agediff3

H0: All 4 parameters are jointly zero

 ( 1)  0.agediff3 = 0
 ( 2)  1.agediff3 = 0
 ( 3)  3.agediff3 = 0
 ( 4)  4.agediff3 = 0

           chi2(  4) =    5.80
         Prob > chi2 =    0.2147

I'm not sure how to interpret the results, so I appreciate your help with that. Kind regards, Chris


Paul Dickman

Join Date: Apr 2014

Posts: 294
#13

26 Aug 2020, 06:54

There is no evidence of a statistically significant association (p=0.2) between the within-couple age difference (classified in 5 categories) and risk of dissolution.

Neither do we see, from the estimated hazard ratios, evidence of any pattern suggesting an association.

It's possible the HRs could have been something like 1.8, 1.4, 1, 1.4, 1.8 (age differences in either direction increase risk) but that would be missed when modelling agediff as a linear effect.

Chris Boulis

Join Date: Feb 2019
Posts: 368

#14

26 Aug 2020, 22:22

Hi Paul Dickman. Interestingly, adding age as a continuous variable for the respondent (hgage) and their partner (p_hgage) yields significant results

Code:

                            hgage |    .931602   .0249011    -2.65   0.008     .8840532    .9817082
                          p_hgage |   1.054282   .0281721     1.98   0.048     1.000487     1.11097

Age proves to be more important when combined with union type:

Code:

                         ageunion |
      [1] de facto both under 25  |          1  (base)
       [2] married both under 25  |   1.05e-14   4.37e-08    -0.00   1.000            0           .
           [3] de facto both 25+  |   .5424278   .1346233    -2.46   0.014     .3334918    .8822642
            [4] married both 25+  |   .1645743   .0403909    -7.35   0.000     .1017313    .2662376

Even more so when looking at the age of the female:

Code:

    [1] female de facto under 25  |          1  (base)
     [2] female married under 25  |   .2316685   .1218652    -2.78   0.005     .0826242    .6495711
         [3] female de facto 25+  |   .4584185   .0837781    -4.27   0.000     .3204064     .655878
          [4] female married 25+  |   .1333624   .0238447   -11.27   0.000      .093938    .1893326

Last edited by Chris Boulis; 26 Aug 2020, 22:33.
