Plotting restricted cubic-spline graph, stratified by another binary variable

Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#1

Plotting restricted cubic-spline graph, stratified by another binary variable

08 Jan 2016, 06:38

Dear Statalist members,

This topic is an extended version of another topic I had posted "How to plot a restricted cubic spline among 2 groups using a logistic regression model fitted on a case control data
with extra info (data set, codes, graph). I have closed that thread with a comment.
I am trying to plot a restricted cubic spline graph for a continuous variable stratified by another binary variable. My broad aim is visualization of interaction between these two variables. I am planing to do this by observing the change in curves among categories of the binary variable. Base data is case-control. If you have any better suggestion, you are most welcome to advice. I am giving below the details of variables used, the data set generated by dataex, syntax I have used in Stata 13, graph generated, and my 3 queries.

Variables: outcome - binary; ENdrink- binary, ever never variable for alcohol consumption; drinkL- main exposure alcohol in liters, continuous variable; drinkL_C- centered drinkL uisng the mean among alcohol consumers, i.e, ENdrink=1. Other covariates include X - binary; edu- continuous; smoke- continuous.

My dataset has 7 variables and 819 observations. I was not able to paste the code created using dataex here due to character restriction. I am pasting below an uploaded link of the same. I hope you can access this. if not, please suggest a way to upload the data.

Thekke purakkal_dataex code alcohol RCspline .docx

Syntax I have tried out

//FINDING KNOT POSITIONS AT 5, 50th AND 95th PERCENTILES OF drinkL_C
centile(drinkL_C) if ENdrink==1, centile(5 50 95)

// TRANSFORMING drinkL_C TO SPLINE FUNCTION WITH THE 3 KNOTS AT POSITIONS DERIVED ABOVE
//STORING THE ESTIMATES
mkspline2 RCS3 = drinkL_C, cubic knots( -609.0704 -375.2204 1578.77) displayknots
mat knots3 = r(knots)

// FITTING THE LOGISTIC REGRESSION MODEL USING THE SPLINE VARIABLES RCS3*. MODEL SHOULD CONTAIN THE ENdrink variable.
//STORING THE ESTIMATES
logistic outcome RCS3* edu smoke ENdrink
estimates store cknots3

// ESTIMATING POINTWISE ODDS RATIOS TO USE IN THE GRAPH. REFERENCE IS ZERO
xbrcspline RCS3, values(-614.5905(30)8210.63) matknots(knots3) ref(0) gen(drink3 or3 lb3 ub3)

// RUG PLOT TO SHOW OBSERVATION DENSITY - CREATING A VARIABLE WHICH DENOTES THE POSITION OF RUG PLOT IN THE GRAPH
// CREATING A SYMBOL FOR RUG PLOT
gen where = -3
gen pipe = "|"

// PLOTTING THE GRAPH - Linear(blue) and Restricted cubic spline(red) in the same graph. 0.0000879 USED IN THE CODE IS THE POINT ESTIMATE OF drinkL_C OBTAINED FROM LOGISTIC REGRESSION MODEL ASSUMING LINEARITY OF ALCOHOL (logistic outcome drinkL_C edu smoke ENdrink)

twoway (line lb3 ub3 or3 drink3, lp(- - l) lc(black black red)) || function y= (.0000831*x), ra(drink3) clpat(dash) lc(blue) lp(-) || scatter where drinkL_C, ms(none) mlabel(pipe) mlabpos(0)

My Queries are:
1) x- axis has centered life time consumption of alcohol. Y axis has odds ratios. but originally there are no drinkers with value below 0. So how can i transform this graph from 0 and above?
2)Most of the observations, as observed from the rug plot are concentrated below 5000L of alcohol. How can I restrict my x-axis and graph to around 5000 without dropping those observations? Also how to add axis titles
3) The above graph shows linear and RC spline in a single graph for the whole data set . Now I want to know how these curves behave among the 2 categories A and B of variable X, represented in the same graph, i.e, this graph stratified by variable X. What changes should I make in the above code for introduction of this variable X.

Thankyou for your consideration
Attached Files

Thekke purakkal_dataex code alcohol RCspline .docx (31.2 KB, 1 view)

Last edited by Thekke Purakkal; 08 Jan 2016, 06:42.
Tags: None

Roman Mostazir

Join Date: Apr 2014
Posts: 870

08 Jan 2016, 15:38

Originally posted by Thekke Purakkal View Post

My Queries are:
1) x- axis has centered life time consumption of alcohol. Y axis has odds ratios. but originally there are no drinkers with value below 0. So how can i transform this graph from 0 and above?

This claim perhaps is not right. You fitted a logistic regression and saved the coefficients. The logistic command gave you the odds ratio which is exponentiated in background before being presented. The original coefficient is stored in log form. You need to exponentiate the stored coefficients to have them in odds ratio. Forexample:

Code:

//Example data:

set obs 30
set seed 123
gen int y = (1-0+1)*runiform()+0
gen x = float((20-0.5)*runiform()+0.5)

list y x in 1/10,clean

     y          x  
  1.   1   3.608357  
  2.   0   11.02754  
  3.   0   10.06257  
  4.   0    8.66411  
  5.   1   12.24431  
  6.   1   9.571083  
  7.   1   1.933932  
  8.   1   8.319518  
  9.   0   15.03819  
 10.   0   11.51881  


logistic y x,

Logistic regression                               Number of obs   =         30
                                                  LR chi2(1)      =       0.45
                                                  Prob > chi2     =     0.5041
Log likelihood = -20.504586                       Pseudo R2       =     0.0108

------------------------------------------------------------------------------
           y | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .9554149   .0657805    -0.66   0.508      .834808    1.093446
       _cons |   1.365452    1.04193     0.41   0.683     .3060229    6.092553
------------------------------------------------------------------------------

est sto m1 //store the coefficient

esti table m1, //display the original coefficient which is actually stored in log form

---------------------------
    Variable |     m1      
-------------+-------------
           x | -.04560957  
       _cons |  .31148585  
---------------------------

esti table m1, eform //eform  option gives you the exponentiation of the log odds.

---------------------------
    Variable |     m1      
-------------+-------------
           x |  .95541492  
       _cons |  1.3654525  
---------------------------

Originally posted by Thekke Purakkal View Post

2)Most of the observations, as observed from the rug plot are concentrated below 5000L of alcohol. How can I restrict my x-axis and graph to around 5000 without dropping those observations? Also how to add axis titles

Defining specific xlabel should achieve this:

Code:

twoway (line lb3 ub3 or3 drink3, lp(- - l) lc(black black red)) || ///
function y= (.0000831*x), ra(drink3) clpat(dash) lc(blue) lp(-) ||  ///
scatter where drinkL_C, ms(none) mlabel(pipe) mlabpos(0) xlab(0 (2500) 5000)

Originally posted by Thekke Purakkal View Post

3) The above graph shows linear and RC spline in a single graph for the whole data set . Now I want to know how these curves behave among the 2 categories A and B of variable X, represented in the same graph, i.e, this graph stratified by variable X. What changes should I make in the above code for introduction of this variable X.

Use of by option should achieve this:

Code:

twoway (line lb3 ub3 or3 drink3, lp(- - l) lc(black black red)) || ///
function y= (.0000831*x), ra(drink3) clpat(dash) lc(blue) lp(-) ||  ///
scatter where drinkL_C, by(X) ms(none) mlabel(pipe) mlabpos(0)

Please use the code delimeters from the right hand corner side [A] option for providing examples of your data. Read the FAQ section-12 on how to post effectively to obain a useful answer.

Last edited by Roman Mostazir; 08 Jan 2016, 15:41.

Roman

Comment

Roman Mostazir

Join Date: Apr 2014

Posts: 870
#3

10 Jan 2016, 14:58

Apologies to have delayd in response. Glad to see you sorted out the problem in graphing. I don't quite follow the research process you have undertaken. For example, I don't get the reason of centering a variable (drinkL) only for the values of 1 for another variable (Enddrink). Neither I have the answer for the ncecessity to use splines here whereas use of a full logit model and use of Stata's fabuluous post estimation 'margins' command to plot the customized predications would have provided more flexibility. I am sure you have valid reasons for doing this but I can't figure out and therefore, will restrict myself commenting on what is better. My suggestions were simply mechanisitc. In regards to the odds ratio in #3, i) yes they are in log of the odds form. It is counter intuitive to grasp what the difference in log scale corresponds to in terms of original scale. A better picture will be the exponentiation form. After the 'xbrcspline' command, use of 'eform' should give you exponentiated form in probability scale. Have a look at 'xbrcsplice' help file for the 'eform' option. Rgarding the x-scale, no idea, as your scale for (DrinkLc) is centered for a sub sample rather the whole sample but you are making a prediction on the whole range. Sorry can't help as I am not getting your process. Hope someone else can help.

Roman
Comment
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#4

10 Jan 2016, 17:29

Thanks for getting back Roman. In the code I have used to solve my problem, I have not used a centered variable(drinkL is uncentered). However, the motivation for using the centered variable and indicator in my original query comes from the paper by Leffondre et al, Modeling Smoking History: A Comparison of Different Approaches,2002 (refer the sub heading: Including never smokers in page 817). It explains the advantage of using a centered continuous variable (e.g., drinkL_C) and the indicator (e.g., endrink, binary) in the model. By using the indicator, effect of alcohol in liters gets estimated only among drinkers (avoids non drinkers). Mean centering the continuous form, while keeping 0 for never drinkers, allows the effect of ever drinkers to compare average drinkers with never drinker, since both groups are assigned a value of 0 for centered drink in liters. They say that, without this transformation of continuous variable, the point estimates for ever drinking would be more difficult to interpret, as it would compare never drinkers and hypothetical drinkers with 0 liters of alcohol. What do you think about this?

My first try on this was the sue of stats margins command. from those plots I could really see how the probability of the outcome varies for specific levels of drinking in liters. Also I could see if there was a difference in the two groups of X and if the difference was significant or not. The probability scale also provided a cool non linear transformation.of my variable. However a criticism arose that by using margins command, I am plotting risks (probability) instead of odds ratios which cannot be done as mine is a 'case-control data'. The argument is that to plot the risk of binary outcome across the range of x variable, we need an estimate of the baseline risk, which is not available from a case-control data. Furthermore, when I compared linear models with spline models with various knot positions, the likelihood ratio test tells me my present non-linear RC spline model is better than other non linear and linear models.This is my reason for using RC splines. What's your opinion? (I had tried to get opinion on this in the topic "Looping (forvalues) to calculate pack-years of cigarette smoked" (some hw i am not able to place an html tag around this? it results in html code ).

Regarding the y axis to use odds ratio or log odds. having read multiple papers and arguments. Some papers have directly used odds ratios while most have used odds ratios in the log scale. A simple reasoning based on similar to what you mentioned in your comment (difference in log scale corresponds to in terms of original scale) is given in this link "https://andrewpwheeler.wordpress.com/2013/10/26/odds-ratios-need-to-be-graphed-on-log-scales/".

These topics are open for discussion.
Thanks

Last edited by Thekke Purakkal; 10 Jan 2016, 17:39.
Comment
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#5

11 Jan 2016, 02:44

Hi, So I received the answer of why a margins command cannot be used for a case-control data. Please refer this topic.
http://www.statalist.org/forums/foru...27#post1322427
Comment
Thekke Purakkal

Join Date: Dec 2015

Posts: 95
#6

11 Jan 2016, 04:10

I guess i made a mistake in the terminology used. By log odds i was meaning log of odds ratio. My bad. Essentially i need the y axis in the log scale
http://blogs.sas.com/content/iml/2015/07/29/or-plots-log-scale.html

Last edited by Thekke Purakkal; 11 Jan 2016, 04:33.
Comment

Announcement