  • Multiple linear regression using Splined data as separate models vs mkspline & combined model

    Hi Statalisters,

    I have a dataset that I would like to fit a multiple linear regression using spline regression with mkspline.

    My issue relates to reproducibility, validation, and probably understanding the "under the hood" mechanics of piecewise regression. I can run a simple linear regression using regress with the splined variables included, and I can then replicate the coefficients by running separate simple linear regressions with if conditions to delineate which spline segment is being fit. When I replicate this method in a multiple linear regression, however, my coefficients (and hence slopes) differ depending on whether I run separate regressions or one regression with multiple splines.

    Ideally, I would like to replicate the results of the combined spline regression model by conducting separate regression models, as confirmation that I am conducting the spline regression correctly.

    I have included test code using the auto dataset for explanatory purposes which is largely based on the UCLA FAQ page titled "How can I run a piecewise regression in Stata":

    Code:
    sysuse auto
    
    graph twoway ///
        (scatter price mpg) ///
        (lowess price mpg)
        
    regress price mpg, vce(robust)
    
    //Method 1: simple regression with mkspline    
    nl hockey price mpg
    local knotvalue = 26.24305
    
    mkspline mpg_knot1 `knotvalue' mpg_knot2 = mpg
    
    regress price mpg_knot1 mpg_knot2
    
    
    //Method 2: simple regression with separate regressions - no centering
    regress price mpg if mpg < `knotvalue'
    
    regress price mpg if mpg >= `knotvalue'
    
    //Method 3: simple regression with separate regressions and centering
    capture drop mpg_knot*
    gen mpg_knot2 = mpg - `knotvalue'
    
    regress price mpg_knot2 if mpg < `knotvalue'
    
    regress price mpg_knot2 if mpg >= `knotvalue'
    
    *Method 4: Multiple regression with mkspline in combined model
    capture drop mpg_knot*
    mkspline mpg_knot1 `knotvalue' mpg_knot2 = mpg
    
    regress price weight mpg_knot1 mpg_knot2
    
    *Method 5: Multiple regression with separate regressions
    regress price weight mpg if mpg < `knotvalue'
    
    regress price weight mpg if mpg >= `knotvalue'
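
    For intuition about what mkspline builds under the hood, here is a minimal sketch in Python (illustration only, not part of the Stata code above). With the default non-marginal parameterization and a single knot k, the two generated variables are essentially min(x, k) and max(x - k, 0), and their coefficients are the segment slopes on either side of the knot, with the two segments forced to meet at x = k:

    ```python
    import numpy as np

    # Assumed mkspline-style basis for one knot k (default, non-marginal):
    #   x1 = min(x, k)        -> coefficient = slope below the knot
    #   x2 = max(x - k, 0)    -> coefficient = slope above the knot
    k = 26.0
    rng = np.random.default_rng(0)
    x = rng.uniform(10, 40, 200)

    # Continuous piecewise-linear truth: slope -300 below the knot, +100 above
    y = 14000 - 300 * np.minimum(x, k) + 100 * np.maximum(x - k, 0)

    X = np.column_stack([np.ones_like(x), np.minimum(x, k), np.maximum(x - k, 0)])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    print(b)  # recovers [14000, -300, 100]
    ```

    Because the second basis variable is zero below the knot, the fitted line cannot jump at x = k, which is the constraint that separate regressions do not impose.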

    I have also included an excerpt of the data in question. The aim is to fit a multiple spline regression of egfr_final on egfr_baseline, with treatment group, age category, and presence of diabetes as covariates.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float(pid age_cat) byte(treatmentgroup diabetes) int(egfr_baseline egfr_final)
    189 3 1 0  85 111
    136 3 2 0 110 117
    120 4 1 0 117 123
     25 3 2 0 127  96
    141 2 2 0 137 154
    115 3 1 0 110 109
    266 3 2 0  74  78
     74 3 1 0 111  98
     58 4 1 0 118 101
    134 2 2 0 127 129
    end
    label values age_cat labelagecat
    label values treatmentgroup labeltreatmentgroup
    label def labeltreatmentgroup 1 "AmphoB", modify
    label def labeltreatmentgroup 2 "Fluconazole", modify
    label values diabetes labelyesno
    I have consulted the mkspline PDF documentation, Michael Mitchell's textbook on interpreting and visualizing regression models using Stata, and a few other resources that have been recommended on Statalist.

    Stata version: 16.1 IC

    Appreciate the assistance.

  • #2
    Using the spline variables in the regression is not equivalent to running separate regressions on either side of the knot.

    Splines do not, in fact, implement piecewise linear regression. Here's the difference. When you use a linear spline, the regression lines on either side of the knot are constrained to meet at the knot. In piecewise linear regression (or if you just regress separately on either side of the knot) the lines do not have to meet at the knot, and in most cases there will actually be a gap between them there.

    If you want to emulate piecewise linear regression in a single regression command, you could do it like this:
    Code:
    local knotvalue 26.24305
    gen byte bigger = mpg >= `knotvalue'
    
    regress price i.bigger##c.mpg
    margins bigger, dydx(mpg)
    Here the i.bigger term provides the gap between the line segments to the left and right of the knot. (In the auto dataset, the gap happens to be very small, so this may not be the best example to use, but the principle applies anyway.)
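
    As a numerical sanity check (a Python sketch on simulated data, not the auto dataset): a full dummy-by-slope interaction model fits each side of the cut point with its own intercept and slope, so its per-segment slopes coincide exactly with those from two separate regressions.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(10, 40, 300)
    big = (x >= 26.0).astype(float)          # the i.bigger indicator
    y = 14000 - 400 * x + big * (-12000 + 450 * x) + rng.normal(0, 500, 300)

    # One regression with the full interaction: intercept, dummy, slope, dummy*slope
    X = np.column_stack([np.ones_like(x), big, x, big * x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]

    # Two separate regressions, one per side of the cut point
    def fit(mask):
        Xm = np.column_stack([np.ones(mask.sum()), x[mask]])
        return np.linalg.lstsq(Xm, y[mask], rcond=None)[0]

    lo, hi = fit(big == 0), fit(big == 1)
    # Slopes agree: b[2] matches lo[1], and b[2] + b[3] matches hi[1]
    print(np.allclose(b[2], lo[1]), np.allclose(b[2] + b[3], hi[1]))  # True True
    ```

    This is why the interaction model emulates piecewise regression: nothing constrains the two fitted segments to meet at the cut point.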



    • #3
      Thanks Clyde for this response.

      Perhaps these are simply differences in terminology, or else possibly one of the largest misconceptions in statistics, but many references refer to these types of regressions as being the same. Mitchell's textbook on interpreting and visualizing regression models says the following:
      "A piecewise regression goes by several names, including spline regression, broken line regression, broken stick regression, and even hockey stick models."
      A PDF document titled Nonlinear relationships by Richard Williams, a frequent contributor on Statalist, also appears to allude to these techniques being similar. I think he may also have referred to a technique similar to the one you used in the previous example as "switching regression", the only difference being that he used marginal differences between subsequent splines via the marginal option of mkspline.

      Is it possible that piecewise and spline regression are similar, but spline regression is a subtype where the segments are always constrained to join at the knots, while piecewise regressions allow discontinuities by including distinct intercepts/jumps?

      Using your example above, I obtain the following output:

      Code:
      regress price i.knot2##c.mpg
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(3, 70)        =      9.16
             Model |   179048954         3  59682984.7   Prob > F        =    0.0000
          Residual |   456016442        70   6514520.6   R-squared       =    0.2819
      -------------+----------------------------------   Adj R-squared   =    0.2512
             Total |   635065396        73  8699525.97   Root MSE        =    2552.4
      
      ------------------------------------------------------------------------------
             price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
           1.knot2 |  -11992.34   6521.374    -1.84   0.070    -24998.81    1014.134
               mpg |  -388.0199   86.54718    -4.48   0.000    -560.6328    -215.407
                   |
       knot2#c.mpg |
                1  |   456.9718   215.0009     2.13   0.037     28.16598    885.7777
                   |
             _cons |    14056.5   1716.008     8.19   0.000     10634.03    17478.97
      ------------------------------------------------------------------------------
      
      margins knot2, dydx(mpg)
      
      Average marginal effects                        Number of obs     =         74
      Model VCE    : OLS
      
      Expression   : Linear prediction, predict()
      dy/dx w.r.t. : mpg
      
      ------------------------------------------------------------------------------
                   |            Delta-method
                   |      dy/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
      mpg          |
             knot2 |
                0  |  -388.0199   86.54718    -4.48   0.000    -560.6328    -215.407
                1  |   68.95189   196.8121     0.35   0.727    -323.5774    461.4812
      ------------------------------------------------------------------------------
      Do the dy/dx values refer to the coefficients/slopes of their separate splines? Would I need to perform a separate hypothesis test to assess whether the two splines are significantly different, or do the interaction term results from the regress output (knot2#c.mpg) provide that information for me?



      • #4
        Statistical terminology is not always used as consistently as it should be. The terms "broken stick" or "hockey stick" regression make sense as synonyms for spline regression because a broken stick or hockey stick still has a connection at the change in slope. But the term piecewise linear is a mathematical term whose use predates the birth of modern statistics, and it should be respected. A piecewise linear function is linear in pieces, but the pieces do not need to connect to each other. Indeed, the main purpose of the term in most mathematical uses is to represent relationships that do have discontinuities. That said, usage has a way of entrenching itself, and we have to learn to live with it.

        Do the dy/dx values refer to the coefficients/slopes of their separate splines? Would I need to perform a separate hypothesis test to assess whether the two splines are significantly different, or do the interaction term results from the regress output (knot2#c.mpg) provide that information for me?
        Yes, the dy/dx values represent the slopes of their separate line segments. If you want to compare the two slopes, their difference is the coefficient of the knot2#c.mpg interaction term in the regression output itself, and it has associated standard errors etc. No other calculations are needed for that purpose.
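
        A quick arithmetic check (sketched in Python) on the output posted in #3 confirms this: the difference between the two marginal slopes reproduces the knot2#c.mpg coefficient.

        ```python
        # Values copied from the regress and margins output in #3
        slope_below = -388.0199   # dy/dx at knot2 == 0
        slope_above = 68.95189    # dy/dx at knot2 == 1
        interaction = 456.9718    # coefficient on knot2#c.mpg

        # The slope difference equals the interaction coefficient
        # (up to rounding in the displayed output)
        print(abs((slope_above - slope_below) - interaction) < 1e-3)  # True
        ```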



        • #5
          Fantastic. Thank you!

