  • Piecewise regression

    Dear Statalist,

    I am running a piecewise regression on a dataset that contains two different years, 2016 and 2017. I include an if qualifier to tell Stata which year to run the regression on. However, when I run the regression I get an error message saying: "starting values invalid or some RHS variables have missing values". Any ideas on how to solve this?

    This is my command: nl (KOSTBHG = SIZE*{b1} + (SIZE>{c})*( SIZE-{c})*{b2} if Year==2017), variables(SIZE) initial(b1 0 c 0 b2 0) noconstant trace

  • #2
    You didn't get a quick answer. You'll increase your chances of a helpful answer by following the FAQ on asking questions.

    It is hard to diagnose things without being able to run them. The first thing might be to delete all the observations you don't think will have good values for your estimation and try again. Next, check that you have good observations on all the RHS variables. Next, try a simpler model and build up.

    I doubt that (SIZE>{c})*( SIZE-{c})*{b2} is what you want, but I could be wrong. The > seems odd to me. I'm also not sure you need to subtract c; does this do anything?
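
    A minimal sketch of those checks, assuming the variable names from the first post (KOSTBHG, SIZE, Year) and that the data are already in memory. It is only one way to go about it, not a definitive diagnosis. The comment at the end reflects nl's documented syntax, in which the if qualifier follows the closing parenthesis of the model rather than sitting inside it.

    ```stata
    * Assumption: KOSTBHG, SIZE and Year exist in the data in memory

    * How many 2017 observations have a missing outcome or regressor?
    count if Year == 2017 & missing(KOSTBHG, SIZE)

    * Look at the range of both variables in the estimation sample
    summarize KOSTBHG SIZE if Year == 2017

    * Start with a simpler model: a single slope through the origin
    regress KOSTBHG SIZE if Year == 2017, noconstant

    * For nl, the qualifier goes after the parenthesised model, e.g.
    *   nl (KOSTBHG = ...) if Year == 2017, variables(SIZE) initial(...)
    ```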


    • #3
      Hi Tron,
      In addition to Phil's advice, I would also give nl good starting values.
      See the following example:
      Code:
      ** Goal is to find the optimal threshold for mpg to explain prices
      sysuse auto, clear
      nl (price={b0}+{b1}*mpg+{b2}*(mpg-{c})*(mpg>{c})), variables(mpg) initial(b1 0 c 0 b2 0) 
      ** These are BAD initial values, which is why you get the output below
      
      (obs = 74)
      
      Iteration 0:  residual SS =  4.96e+08
      Iteration 1:  residual SS =  4.96e+08
      Iteration 2:  residual SS =  4.96e+08
      Iteration 3:  residual SS =  4.96e+08
      Iteration 4:  residual SS =  4.96e+08
      Iteration 5:  residual SS =  4.96e+08
      
      
            Source |      SS            df       MS
      -------------+----------------------------------    Number of obs =         74
             Model |  1.394e+08          1   139449474    R-squared     =     0.2196
          Residual |  4.956e+08         72  6883554.48    Adj R-squared =     0.2087
      -------------+----------------------------------    Root MSE      =   2623.653
             Total |  6.351e+08         73  8699525.97    Res. dev.     =   1373.079
      
      ------------------------------------------------------------------------------
             price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
               /b0 |   11253.06   1170.813     9.61   0.000     8919.088    13587.03
               /b1 |  -238.8823   53.07669    -4.50   0.000    -344.6888   -133.0759
               /b2 |  -.0120003          .        .       .            .           .
                /c |          0  (constrained)
      ------------------------------------------------------------------------------
      ** instead we can provide "good" initial values
      . sum mpg
      
          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
               mpg |         74     21.2973    5.785503         12         41
      ** Using 22 as a reasonably good initial value for the threshold,
      ** I create the second part of the linear spline
      
      gen mpg2=(mpg-22)*(mpg>22)
      ** and estimate the model
      . reg price mpg mpg2
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(2, 71)        =     17.42
             Model |   209062289         2   104531144   Prob > F        =    0.0000
          Residual |   426003107        71  6000043.77   R-squared       =    0.3292
      -------------+----------------------------------   Adj R-squared   =    0.3103
             Total |   635065396        73  8699525.97   Root MSE        =    2449.5
      
      ------------------------------------------------------------------------------
             price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
               mpg |   -575.543   110.5615    -5.21   0.000    -795.9964   -355.0895
              mpg2 |   578.1094   169.7238     3.41   0.001     239.6899     916.529
             _cons |   17289.98   2082.322     8.30   0.000     13137.95    21442.02
      ------------------------------------------------------------------------------
      ** Together with the threshold of 22, this gives me good initial values for c, b0, b1 and b2
      
      nl (price={b0}+{b1}*mpg+{b2}*(mpg-{c})*(mpg>{c})), variables(mpg) initial(b0 17289 b1 -575  b2 578 c 22) 
      
            Source |      SS            df       MS
      -------------+----------------------------------    Number of obs =         74
             Model |  2.781e+08          3  92714494.7    R-squared     =     0.4380
          Residual |  3.569e+08         70  5098884.46    Adj R-squared =     0.4139
      -------------+----------------------------------    Root MSE      =   2258.071
             Total |  6.351e+08         73  8699525.97    Res. dev.     =   1348.786
      
      ------------------------------------------------------------------------------
             price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
               /b0 |   27100.74   3675.264     7.37   0.000     19770.66    34430.83
               /b1 |  -1197.627   228.0927    -5.25   0.000    -1652.543   -742.7103
               /b2 |   1137.282   237.5921     4.79   0.000     663.4195    1611.144
                /c |   18.00002   .7268266    24.77   0.000     16.55041    19.44963
      ------------------------------------------------------------------------------
        Parameter b0 taken as constant term in model & ANOVA table
      Hope this helps
      Fernando
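
      The same recipe could be applied to the data from the first post along these lines. This is a sketch only: the variable names KOSTBHG, SIZE and Year come from that post, the helper variable SIZE_above and the local macros are hypothetical names, and starting the threshold at the median of SIZE is an assumption, not something the thread prescribes. The model here includes a constant, as in the auto example.

      ```stata
      * Assumption: start the threshold at the median of SIZE in 2017
      quietly summarize SIZE if Year == 2017, detail
      local c0 = r(p50)

      * Second part of the linear spline, evaluated at the starting threshold
      generate double SIZE_above = (SIZE - `c0') * (SIZE > `c0')

      * Linear regression with the threshold held fixed at c0
      regress KOSTBHG SIZE SIZE_above if Year == 2017
      local b0 = _b[_cons]
      local b1 = _b[SIZE]
      local b2 = _b[SIZE_above]

      * Feed those estimates to nl as starting values and let it estimate c
      nl (KOSTBHG = {b0} + {b1}*SIZE + {b2}*(SIZE - {c})*(SIZE > {c})) if Year == 2017, variables(SIZE) initial(b0 `b0' b1 `b1' b2 `b2' c `c0')
      ```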


      • #4
        Thank you very much for the reply! FernandoRios, I have now tested your method for choosing initial values and it gives the right cut-off values. I was just wondering if you could help me understand what the initial values mean?


        • #5
          Sure.
          As far as I know, the optimization methods that computers use are basically iterative: they repeatedly update an initial guess to find a value that maximizes or minimizes the objective function,
          something like b_1=b_0+change.
          Here b_0 is the initial value, and the change depends on that initial value. If it is a good initial value, say it's close to what solves the problem, the algorithm will find a solution fast.
          If it's a bad initial value, meaning too far or too different from the solution, the algorithm may just wander around without finding a meaningful solution.
          hth
          fernando
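
          To make the update rule b_1=b_0+change concrete, here is a toy sketch. The objective f(b) = (b - 3)^2, the step size of 0.25, and the scalar names b_cur and change are all assumptions chosen for illustration. This is plain gradient descent, not the algorithm nl actually uses.

          ```stata
          * Toy problem (assumption): minimise f(b) = (b - 3)^2
          * b_cur holds the current guess; it starts at the initial value b_0 = 0
          scalar b_cur = 0

          forvalues i = 1/10 {
              * change = -(step size) * gradient, where the gradient is 2*(b - 3)
              scalar change = -0.25 * 2 * (b_cur - 3)
              * the update rule: new value = old value + change
              scalar b_cur = b_cur + change
              display "iteration `i': b = " %9.6f b_cur
          }
          ```

          Each pass moves the guess closer to the minimum at 3. In a problem like the threshold model, a poor starting point can leave the updates with nothing to work with, as in the first nl run above where c stayed at 0.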


          • #6
            Thank you for a well-explained answer! I was also wondering how to check the robustness of the results from a piecewise regression.

            I have divided the procedure into two. I carry out the two following steps:

            1) Step 1: nl (KOSTBHG = SIZE*{b1} + (SIZE>{c})*( SIZE-{c})*{b2}) if Year==2014, variables(SIZE) initial(b1 0 c 28 b2 0) noconstant

            Is there any value in looking at the t-statistics, p-values and R-squared here?

            2) Step 2: regress KOSTBHG SIZE SIZE2 if Year==2012, noconstant

            Can you interpret these results as you would a normal regression? I.e., can you look at the p-values and t-statistics to judge the robustness of the model?




            • #7
              Hi Tron,
              It all depends on what you mean by robustness.
              First of all, why are you not including a constant? That seems odd, and unless you can say for sure the constant is zero (as in a demeaned regression model), I always recommend that my own students include a constant.
              Second, you can't compare both regressions, as they are analyzing different years.
              Third, I'm guessing SIZE2 is simply SIZE^2.

              So compare the following two models:
              Code:
              sysuse auto, clear
              
              
              . nl (price={b0}+{b1}*mpg+{b2}*(mpg-{c})*(mpg>{c})), variables(mpg) initial(b0 27162.17 b1 -1201.957  b2 1142.664 c 18)
              (obs = 74)
              
              Iteration 0:  residual SS =  3.57e+08
              
              
                    Source |      SS            df       MS
              -------------+----------------------------------    Number of obs =         74
                     Model |  2.782e+08          3  92717612.6    R-squared     =     0.4380
                  Residual |  3.569e+08         70  5098750.83    Adj R-squared =     0.4139
              -------------+----------------------------------    Root MSE      =   2258.041
                     Total |  6.351e+08         73  8699525.97    Res. dev.     =   1348.784
              
              ------------------------------------------------------------------------------
                     price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                       /b0 |   27162.17   3675.264     7.39   0.000     19832.09    34492.25
                       /b1 |  -1201.957   228.0927    -5.27   0.000    -1656.874   -747.0405
                       /b2 |   1142.664   237.5921     4.81   0.000     668.8015    1616.526
                        /c |         18    .723393    24.88   0.000     16.55724    19.44276
              ------------------------------------------------------------------------------
                Parameter b0 taken as constant term in model & ANOVA table
              gen mpg4=(mpg-_b[/c])*(mpg>_b[/c])
              
              . reg price mpg mpg4
              
                    Source |       SS           df       MS      Number of obs   =        74
              -------------+----------------------------------   F(2, 71)        =     27.67
                     Model |   278152835         2   139076418   Prob > F        =    0.0000
                  Residual |   356912561        71  5026937.47   R-squared       =    0.4380
              -------------+----------------------------------   Adj R-squared   =    0.4222
                     Total |   635065396        73  8699525.97   Root MSE        =    2242.1
              
              ------------------------------------------------------------------------------
                     price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                       mpg |  -1201.957   188.8696    -6.36   0.000    -1578.552   -825.3618
                      mpg4 |   1142.664   217.5337     5.25   0.000     708.9136    1576.413
                     _cons |   27162.17   3189.672     8.52   0.000     20802.14    33522.19
              ------------------------------------------------------------------------------
              These models give identical coefficient estimates, and you can interpret their statistics directly. The difference between the nl and regress outputs is that regress (the second option) assumes that you already know c to be 18. Given that, there is one fewer parameter to estimate, and the standard errors are smaller.
              The first option, nl, does not assume c is known, so it estimates c along with the other parameters. There is therefore an additional parameter to estimate, and the standard errors are larger.

              HTH
              Fernando
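
              A short sketch of how the two standard errors could be pulled out and shown side by side, reusing the auto data and the starting values from post #3. The names se_nl, se_ols and mpg_above are hypothetical, and the _b[/c] and _se[/b1] forms assume a Stata version that names nl parameters this way, as the output above does.

              ```stata
              sysuse auto, clear

              * Threshold estimated jointly with the slopes
              nl (price={b0}+{b1}*mpg+{b2}*(mpg-{c})*(mpg>{c})), variables(mpg) initial(b0 17289 b1 -575 b2 578 c 22)
              scalar se_nl = _se[/b1]

              * Threshold treated as known and fixed at the nl estimate
              generate double mpg_above = (mpg - _b[/c]) * (mpg > _b[/c])
              regress price mpg mpg_above
              scalar se_ols = _se[mpg]

              * Compare the standard error of the first slope under both approaches
              display "SE of mpg slope, nl: " se_nl "   regress: " se_ols
              ```

              In the outputs above, the corresponding values are 228.0927 for nl and 188.8696 for regress.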
