Piecewise regression

Tron Hersleth

#1


09 Aug 2019, 02:50

Dear Statalist,

I am running a piecewise regression on a dataset that contains two years, 2016 and 2017, and I want to use an if qualifier to tell Stata which year to run the regression on. However, when I run the regression I get the error message "starting values invalid or some RHS variables have missing values". Any ideas on how to solve this?

This is my command:

nl (KOSTBHG = SIZE*{b1} + (SIZE>{c})*( SIZE-{c})*{b2} if Year==2017), variables(SIZE) initial(b1 0 c 0 b2 0) noconstant trace

Phil Bromiley

#2

14 Aug 2019, 12:50

You didn't get a quick answer. You'll increase your chances of a helpful answer by following the FAQ on asking questions.

It is hard to diagnose things without being able to run them. The first thing might be to delete all the observations you don't think will have good values for your estimation and try again. Next, check that you have good observations on all the RHS variables. Next, try a simpler model and build up.

I doubt that (SIZE>{c})*( SIZE-{c})*{b2} is what you want, but I could be wrong. The > seems odd to me. I'm also not sure you need to subtract c. Does this do anything?
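The checks suggested above might look like the sketch below. It uses Stata's bundled auto dataset so that it runs as-is; price, mpg, and foreign == 0 are stand-ins (assumptions for illustration only) for KOSTBHG, SIZE, and Year == 2017. As far as I know, nl expects the if qualifier after the closing parenthesis of the model and before the comma, which is what the commented template in step 4 shows.

```stata
sysuse auto, clear

* 1. Count observations in the estimation subsample that have a missing
*    outcome or regressor (foreign == 0 stands in for Year == 2017)
count if foreign == 0 & missing(price, mpg)

* 2. Inspect the range of the regressor. A starting value for the
*    breakpoint should lie strictly inside this range; if every value of
*    the regressor is positive, a start of c = 0 makes the hinge term
*    equal to the regressor itself.
summarize price mpg if foreign == 0

* 3. Fit the simplest model first, then build up
regress price mpg if foreign == 0

* 4. Placement of the if qualifier in nl (template only, left commented
*    out; y, x, year and the # start values are placeholders):
* nl (y = {b0} + {b1}*x + {b2}*(x - {c})*(x > {c})) if year == 2017, variables(x) initial(b0 # b1 # b2 # c #)
```

If the count in step 1 is zero and the simple regression runs, the starting values become the more likely culprit.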

FernandoRios

#3

14 Aug 2019, 13:49

Hi Tron,
In addition to Phil's advice, I would also give nl good starting values.
See the following example:

Code:

** Goal is to find the optimal threshold for mpg to explain prices
sysuse auto, clear
nl (price={b0}+{b1}*mpg+{b2}*(mpg-{c})*(mpg>{c})), variables(mpg) initial(b1 0 c 0 b2 0) 
** These are BAD initial values, which is why you get the output below

(obs = 74)

Iteration 0:  residual SS =  4.96e+08
Iteration 1:  residual SS =  4.96e+08
Iteration 2:  residual SS =  4.96e+08
Iteration 3:  residual SS =  4.96e+08
Iteration 4:  residual SS =  4.96e+08
Iteration 5:  residual SS =  4.96e+08


      Source |      SS            df       MS
-------------+----------------------------------    Number of obs =         74
       Model |  1.394e+08          1   139449474    R-squared     =     0.2196
    Residual |  4.956e+08         72  6883554.48    Adj R-squared =     0.2087
-------------+----------------------------------    Root MSE      =   2623.653
       Total |  6.351e+08         73  8699525.97    Res. dev.     =   1373.079

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         /b0 |   11253.06   1170.813     9.61   0.000     8919.088    13587.03
         /b1 |  -238.8823   53.07669    -4.50   0.000    -344.6888   -133.0759
         /b2 |  -.0120003          .        .       .            .           .
          /c |          0  (constrained)
------------------------------------------------------------------------------
** instead we can provide "good" initial values
. sum mpg

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
         mpg |         74     21.2973    5.785503         12         41
** Using 22 (close to the mean of mpg) as a reasonable initial value for the threshold,
** I create the second part of the linear spline

gen mpg2=(mpg-22)*(mpg>22)
** and estimate the model
. reg price mpg mpg2

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     17.42
       Model |   209062289         2   104531144   Prob > F        =    0.0000
    Residual |   426003107        71  6000043.77   R-squared       =    0.3292
-------------+----------------------------------   Adj R-squared   =    0.3103
       Total |   635065396        73  8699525.97   Root MSE        =    2449.5

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |   -575.543   110.5615    -5.21   0.000    -795.9964   -355.0895
        mpg2 |   578.1094   169.7238     3.41   0.001     239.6899     916.529
       _cons |   17289.98   2082.322     8.30   0.000     13137.95    21442.02
------------------------------------------------------------------------------
** This gives me good initial values for c, b0, b1, and b2

nl (price={b0}+{b1}*mpg+{b2}*(mpg-{c})*(mpg>{c})), variables(mpg) initial(b0 17289 b1 -575  b2 578 c 22) 

      Source |      SS            df       MS
-------------+----------------------------------    Number of obs =         74
       Model |  2.781e+08          3  92714494.7    R-squared     =     0.4380
    Residual |  3.569e+08         70  5098884.46    Adj R-squared =     0.4139
-------------+----------------------------------    Root MSE      =   2258.071
       Total |  6.351e+08         73  8699525.97    Res. dev.     =   1348.786

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         /b0 |   27100.74   3675.264     7.37   0.000     19770.66    34430.83
         /b1 |  -1197.627   228.0927    -5.25   0.000    -1652.543   -742.7103
         /b2 |   1137.282   237.5921     4.79   0.000     663.4195    1611.144
          /c |   18.00002   .7268266    24.77   0.000     16.55041    19.44963
------------------------------------------------------------------------------
  Parameter b0 taken as constant term in model & ANOVA table
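A less arbitrary way to pick the starting value for c is a coarse grid search: for each candidate threshold, build the hinge term, run regress, and keep the candidate with the smallest residual sum of squares. The sketch below is only an illustration of that idea; the 13 to 40 grid and the macro names best_c and best_rss are my own choices, not part of the example above. Since the residual sum of squares is not smooth in c, nl may stop at a local minimum, so starting near the grid minimum seems a reasonable precaution.

```stata
sysuse auto, clear

local best_rss = .
local best_c   = .

* Candidate thresholds strictly inside the observed mpg range (12 to 41)
forvalues c = 13/40 {
    tempvar hinge
    generate double `hinge' = (mpg - `c') * (mpg > `c')
    quietly regress price mpg `hinge'
    * A missing value compares larger than any number in Stata,
    * so the first candidate always replaces the initial "."
    if e(rss) < `best_rss' {
        local best_rss = e(rss)
        local best_c   = `c'
    }
    drop `hinge'
}

display "grid-search start value for c: `best_c' (RSS = `best_rss')"
```

The coefficients from regress at the winning threshold can then be passed to initial() together with that value of c.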

Hope this helps
Fernando


Tron Hersleth

#4

15 Aug 2019, 02:19

Thank you very much for the reply! FernandoRios, I have now tested your method for choosing initial values and it gives the right cut-off values. Could you help me understand what the initial values mean?

FernandoRios

#5

15 Aug 2019, 04:05

Sure.
As far as I know, all numerical optimization methods run by computers are basically iterative: they update an initial guess until they find a value that maximizes or minimizes the objective function, something like b_1 = b_0 + change.
Here b_0 is the initial value, and the change depends on that initial value. If it is a good initial value, say close to what solves the problem, the algorithm will find a solution fast.
If it is a bad initial value, meaning too far or too different from the solution, the algorithm may just wander around without finding a meaningful solution.
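A toy sketch of the b_1 = b_0 + change idea, assuming a made-up objective f(b) = (b - 3)^2 and a plain gradient-descent step with step size 0.25. Both are chosen only for illustration; this is not the algorithm nl uses internally.

```stata
* Toy objective: f(b) = (b - 3)^2, which is minimized at b = 3
* Initial value b_0
local b = 0

forvalues i = 1/10 {
    * change = -(step size) * gradient, where the gradient is 2*(b - 3)
    local change = -0.25 * 2 * (`b' - 3)
    * update: b_1 = b_0 + change
    local b = `b' + `change'
    display "iteration `i': b = " %9.6f `b'
}
```

Each pass halves the distance to 3, so a start close to 3 needs fewer iterations to get within a given tolerance than a start far away.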
hth
fernando

Tron Hersleth

#6

16 Aug 2019, 03:58

Thank you for a well-explained answer! I was also wondering how to check the robustness of the results from a piecewise regression.

I have divided the procedure into the following two steps:

1) Step 1: nl (KOSTBHG = SIZE*{b1} + (SIZE>{c})*( SIZE-{c})*{b2}), variables(SIZE) initial(b1 0 c 28 b2 0) noconstant, if(Year==2014)

Is there any value in looking at the t-statistics, p-values, and R-squared here?

2) Step 2: regress KOSTBHG SIZE SIZE2 , noconstant, if(Year==2012)

Can the results here be interpreted as in an ordinary regression? That is, can I look at the p-values and t-statistics to judge the robustness of the model?

FernandoRios

#7

16 Aug 2019, 06:55

Hi Tron,
It all depends on what you mean by robustness.
First of all, why are you not including a constant? That seems odd, and unless you can say for sure that the constant is zero (as in a demeaned regression model), I always recommend that my own students include a constant.
Second, you can't compare the two regressions, as they analyze different years.
Third, I'm guessing SIZE2 is simply SIZE^2.

So compare the following two models:

Code:

sysuse auto, clear


. nl (price={b0}+{b1}*mpg+{b2}*(mpg-{c})*(mpg>{c})), variables(mpg) initial(b0 27162.17 b1 -1201.957  b2 1142.664 c 18)
(obs = 74)

Iteration 0:  residual SS =  3.57e+08


      Source |      SS            df       MS
-------------+----------------------------------    Number of obs =         74
       Model |  2.782e+08          3  92717612.6    R-squared     =     0.4380
    Residual |  3.569e+08         70  5098750.83    Adj R-squared =     0.4139
-------------+----------------------------------    Root MSE      =   2258.041
       Total |  6.351e+08         73  8699525.97    Res. dev.     =   1348.784

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         /b0 |   27162.17   3675.264     7.39   0.000     19832.09    34492.25
         /b1 |  -1201.957   228.0927    -5.27   0.000    -1656.874   -747.0405
         /b2 |   1142.664   237.5921     4.81   0.000     668.8015    1616.526
          /c |         18    .723393    24.88   0.000     16.55724    19.44276
------------------------------------------------------------------------------
  Parameter b0 taken as constant term in model & ANOVA table
gen mpg4=(mpg-_b[/c])*(mpg>_b[/c])

. reg price mpg mpg4

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     27.67
       Model |   278152835         2   139076418   Prob > F        =    0.0000
    Residual |   356912561        71  5026937.47   R-squared       =    0.4380
-------------+----------------------------------   Adj R-squared   =    0.4222
       Total |   635065396        73  8699525.97   Root MSE        =    2242.1

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         mpg |  -1201.957   188.8696    -6.36   0.000    -1578.552   -825.3618
        mpg4 |   1142.664   217.5337     5.25   0.000     708.9136    1576.413
       _cons |   27162.17   3189.672     8.52   0.000     20802.14    33522.19
------------------------------------------------------------------------------

These models give identical point estimates, and you can interpret their statistics directly. The difference between the nl and regress outputs is that regress (the second option) assumes you already know c to be 18. There is therefore one less parameter to estimate, and the standard errors are smaller.
The first option, nl, does not assume c is known and estimates it along with the other parameters. There is one additional parameter to estimate, so the standard errors are larger.
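To see the two sets of standard errors side by side without reading two output tables, something like the following sketch could be used. It assumes Stata 15 or later, where _b[/c] and _se[/b1] refer to nl's free parameters; the scalar names and the variable name hinge are my own, chosen for illustration.

```stata
sysuse auto, clear

* c is estimated: its uncertainty feeds into the other standard errors
quietly nl (price = {b0} + {b1}*mpg + {b2}*(mpg - {c})*(mpg > {c})), variables(mpg) initial(b0 17289 b1 -575 b2 578 c 22)
scalar c_hat    = _b[/c]
scalar se_b1_nl = _se[/b1]

* c is treated as known: build the hinge term at the estimated threshold
generate double hinge = (mpg - c_hat) * (mpg > c_hat)
quietly regress price mpg hinge
scalar se_b1_reg = _se[mpg]

display "SE of first-segment slope, nl (c estimated): " se_b1_nl
display "SE of first-segment slope, regress (c fixed): " se_b1_reg
```

If c is itself estimated from the data, the nl standard errors seem the safer ones to report, since the regress ones ignore the uncertainty in the threshold.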

HTH
Fernando
