  • Multiple linear regression using Splined data as separate models vs mkspline & combined model

    Hi Statalisters,

    I have a dataset that I would like to fit a multiple linear regression using spline regression with mkspline.

    My issue relates to reproducibility, validation, and probably understanding the "under the hood" mechanics of piecewise regression. I can run a simple linear regression using regress with the splined variables included, and I can then replicate the coefficients by running separate simple linear regressions with if conditions to delineate which spline segment is being fit. When I replicate this method in a multiple linear regression, however, my coefficients (and hence slopes) differ depending on whether I run separate regressions or one regression with multiple splines.

    Ideally, I would like to replicate the results of the combined spline regression model by conducting separate regression models, as confirmation that I am conducting the spline regression correctly.

    I have included test code using the auto dataset for explanatory purposes which is largely based on the UCLA FAQ page titled "How can I run a piecewise regression in Stata":

    Code:
    sysuse auto
    
    graph twoway ///
        (scatter price mpg) ///
        (lowess price mpg)
        
    regress price mpg, vce(robust)
    
    //Method 1: simple regression with mkspline    
    nl hockey price mpg
    local knotvalue = 26.24305
    
    mkspline mpg_knot1 `knotvalue' mpg_knot2 = mpg
    
    regress price mpg_knot1 mpg_knot2
    
    
    //Method 2: simple regression with separate regressions - no centering
    regress price mpg if mpg < `knotvalue'
    
    regress price mpg if mpg >= `knotvalue'
    
    //Method 3: simple regression with separate regressions and centering
    capture drop mpg_knot*
    gen mpg_knot2 = mpg - `knotvalue'
    
    regress price mpg_knot2 if mpg < `knotvalue'
    
    regress price mpg_knot2 if mpg >= `knotvalue'
    
    *Method 4: Multiple regression with mkspline in combined model
    capture drop mpg_knot*
    mkspline mpg_knot1 `knotvalue' mpg_knot2 = mpg
    
    regress price weight mpg_knot1 mpg_knot2
    
    *Method 5: Multiple regression with separate regressions
    regress price weight mpg if mpg < `knotvalue'
    
    regress price weight mpg if mpg >= `knotvalue'
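
    For intuition about what mkspline builds under the hood, here is a minimal sketch in Python (illustration only, not part of the Stata code above). With the default non-marginal parameterization and a single knot k, the two generated variables are essentially min(x, k) and max(x - k, 0), and their coefficients are the segment slopes on either side of the knot, with the two segments forced to meet at x = k:

    ```python
    import numpy as np

    # Assumed mkspline-style basis for one knot k (default, non-marginal):
    #   x1 = min(x, k)        -> coefficient = slope below the knot
    #   x2 = max(x - k, 0)    -> coefficient = slope above the knot
    k = 26.0
    rng = np.random.default_rng(0)
    x = rng.uniform(10, 40, 200)

    # Continuous piecewise-linear truth: slope -300 below the knot, +100 above
    y = 14000 - 300 * np.minimum(x, k) + 100 * np.maximum(x - k, 0)

    X = np.column_stack([np.ones_like(x), np.minimum(x, k), np.maximum(x - k, 0)])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    print(b)  # recovers [14000, -300, 100]
    ```

    Because the second basis variable is zero below the knot, the fitted line cannot jump at x = k, which is the constraint that separate regressions do not impose.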

    I have also included an excerpt of the data in question. The aim is to fit a multiple spline regression of egfr_final on egfr_baseline, with treatment group, age category, and presence of diabetes as covariates.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(pid age_cat) byte(treatmentgroup diabetes) int(egfr_baseline egfr_final)
    189 3 1 0  85 111
    136 3 2 0 110 117
    120 4 1 0 117 123
     25 3 2 0 127  96
    141 2 2 0 137 154
    115 3 1 0 110 109
    266 3 2 0  74  78
     74 3 1 0 111  98
     58 4 1 0 118 101
    134 2 2 0 127 129
    end
    label values age_cat labelagecat
    label values treatmentgroup labeltreatmentgroup
    label def labeltreatmentgroup 1 "AmphoB", modify
    label def labeltreatmentgroup 2 "Fluconazole", modify
    label values diabetes labelyesno
    I have consulted the mkspline PDF documentation, Michael Mitchell's textbook on interpreting and visualizing regression models using Stata, and a few other resources that have been recommended on Statalist.

    Stata version: 16.1 IC

    Appreciate the assistance.

  • #2
    Using the spline variables in the regression is not equivalent to running separate regressions on either side of the knot.

    Splines do not, in fact, implement piecewise linear regression. Here's the difference. When you use a linear spline, the regression lines on either side of the knot are constrained to meet at the knot. In piecewise linear regression (or if you just regress separately on either side of the knot) the lines do not have to meet at the knot, and in most cases there will actually be a gap between them there.

    If you want to emulate piecewise linear regression in a single regression command, you could do it like this:
    Code:
    local knotvalue 26.24305
    gen byte bigger = mpg >= `knotvalue'
    
    regress price i.bigger##c.mpg
    margins bigger, dydx(mpg)
    Here the i.bigger term provides the gap between the line segments to the left and right of the knot. (In the auto dataset, the gap happens to be very small, so this may not be the best example to use, but the principle applies anyway.)
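
    As a numerical sanity check (a Python sketch on simulated data, not the auto dataset): a full dummy-by-slope interaction model fits each side of the cut point with its own intercept and slope, so its per-segment slopes coincide exactly with those from two separate regressions.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(10, 40, 300)
    big = (x >= 26.0).astype(float)          # the i.bigger indicator
    y = 14000 - 400 * x + big * (-12000 + 450 * x) + rng.normal(0, 500, 300)

    # One regression with the full interaction: intercept, dummy, slope, dummy*slope
    X = np.column_stack([np.ones_like(x), big, x, big * x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]

    # Two separate regressions, one per side of the cut point
    def fit(mask):
        Xm = np.column_stack([np.ones(mask.sum()), x[mask]])
        return np.linalg.lstsq(Xm, y[mask], rcond=None)[0]

    lo, hi = fit(big == 0), fit(big == 1)
    # Slopes agree: b[2] matches lo[1], and b[2] + b[3] matches hi[1]
    print(np.allclose(b[2], lo[1]), np.allclose(b[2] + b[3], hi[1]))  # True True
    ```

    This is why the interaction model emulates piecewise regression: nothing constrains the two fitted segments to meet at the cut point.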



    • #3
      Thanks Clyde for this response.

      Perhaps these are simply differences in terminology, or else possibly one of the largest misconceptions in statistics, but many references refer to these types of regressions as being the same. Mitchell's textbook on interpreting and visualizing regression models says the following:
      "A piecewise regression goes by several names, including spline regression, broken line regression, broken stick regression, and even hockey stick models."
      A PDF document titled Nonlinear relationships by Richard Williams, a frequent contributor on Statalist, also appears to allude to these techniques being similar. I think he may also have referred to a technique similar to the one you used in the previous example as "switching regression", the only difference being that he used marginal differences between subsequent splines via the marginal option of mkspline.

      Is it possible that piecewise and spline regression are similar, but spline regression is a subtype where the segments are always constrained to join at the knots, while piecewise regressions allow discontinuities by including distinct intercepts/jumps?

      Using your example above, I obtain the following output:

      Code:
      regress price i.knot2##c.mpg
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(3, 70)        =      9.16
             Model |   179048954         3  59682984.7   Prob > F        =    0.0000
          Residual |   456016442        70   6514520.6   R-squared       =    0.2819
      -------------+----------------------------------   Adj R-squared   =    0.2512
             Total |   635065396        73  8699525.97   Root MSE        =    2552.4
      
      ------------------------------------------------------------------------------
             price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
           1.knot2 |  -11992.34   6521.374    -1.84   0.070    -24998.81    1014.134
               mpg |  -388.0199   86.54718    -4.48   0.000    -560.6328    -215.407
                   |
       knot2#c.mpg |
                1  |   456.9718   215.0009     2.13   0.037     28.16598    885.7777
                   |
             _cons |    14056.5   1716.008     8.19   0.000     10634.03    17478.97
      ------------------------------------------------------------------------------
      
      margins knot2, dydx(mpg)
      
      Average marginal effects                        Number of obs     =         74
      Model VCE    : OLS
      
      Expression   : Linear prediction, predict()
      dy/dx w.r.t. : mpg
      
      ------------------------------------------------------------------------------
                   |            Delta-method
                   |      dy/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
      mpg          |
             knot2 |
                0  |  -388.0199   86.54718    -4.48   0.000    -560.6328    -215.407
                1  |   68.95189   196.8121     0.35   0.727    -323.5774    461.4812
      ------------------------------------------------------------------------------
      Do the dy/dx values refer to the coefficients/slopes of their separate splines? Would I need to perform a separate hypothesis test to assess whether the two splines are significantly different, or do the interaction term results from the regress output (knot2#c.mpg) provide that information for me?



      • #4
        Statistical terminology is not always used as consistently as it should be. The terms "broken stick" or "hockey stick" regression make sense as synonyms for spline regression because a broken stick or hockey stick still has a connection at the change in slope. But the term piecewise linear is a mathematical term whose use predates the birth of modern statistics, and it should be respected. A piecewise linear function is linear in pieces, but the pieces do not need to connect to each other. Indeed, the main purpose of the term in most mathematical uses is to represent relationships that do have discontinuities. That said, usage has a way of entrenching itself, and we have to learn to live with it.

        Do the dy/dx values refer to the coefficients/slopes of their separate splines? Would I need to perform a separate hypothesis test to assess whether the two splines are significantly different, or do the interaction term results from the regress output (knot2#c.mpg) provide that information for me?
        Yes, the dy/dx values represent the slopes of their separate line segments. If you want to compare the two slopes, their difference is the coefficient of the knot2#c.mpg interaction term in the regression output itself, and it has associated standard errors etc. No other calculations are needed for that purpose.
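
        A quick arithmetic check (sketched in Python) on the output posted in #3 confirms this: the difference between the two marginal slopes reproduces the knot2#c.mpg coefficient.

        ```python
        # Values copied from the regress and margins output in #3
        slope_below = -388.0199   # dy/dx at knot2 == 0
        slope_above = 68.95189    # dy/dx at knot2 == 1
        interaction = 456.9718    # coefficient on knot2#c.mpg

        # The slope difference equals the interaction coefficient
        # (up to rounding in the displayed output)
        print(abs((slope_above - slope_below) - interaction) < 1e-3)  # True
        ```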



        • #5
          Fantastic. Thank you!

