  • Lasso and cross validation: model selection

    I am starting to use lasso with cross-validation for model selection, to explain a dependent variable with a linear model, but I cannot understand why not all coefficients in the selected model have p-values below 0.05.

    I followed the steps of the example posted at:

    https://www.stata.com/features/overv...on-prediction/

    Code:
    sysuse auto, clear
    splitsample, generate(sample) nsplit(2) rseed(1234)
    
    lasso linear mpg i.foreign i.rep78 headroom weight turn gear_ratio price trunk length displacement if sample == 1, selection(bic)
    estimates store bic
    
    lasso linear mpg i.foreign i.rep78 headroom weight turn gear_ratio price trunk length displacement if sample == 1
    estimates store cv
    
    lasso linear mpg i.foreign i.rep78 headroom weight turn gear_ratio price trunk length displacement if sample == 1, selection(adaptive)
    estimates store adaptive
    
    lassocoef cv bic adaptive, sort(coef, standardized)
    
    
    ----------------------------------------------
                 |    cv        bic      adaptive
    -------------+--------------------------------
          weight |     x         x    
         5.rep78 |     x         x          x    
          length |     x         x          x    
      gear_ratio |     x                    x    
           price |     x                    x    
           _cons |     x         x          x    
    ----------------------------------------------
    Legend:
      b - base level
      e - empty cell
      o - omitted
      x - estimated
    
    
    
    
    lassogof cv bic adaptive, over(sample) postselection
    
    Postselection coefficients
    -------------------------------------------------------------
    Name             sample |         MSE    R-squared        Obs
    ------------------------+------------------------------------
    cv                      |
                          1 |    10.92984       0.7046         35
                          2 |    10.77016       0.6496         34
    ------------------------+------------------------------------
    bic                     |
                          1 |    11.82234       0.6805         35
                          2 |     11.0608       0.6401         34
    ------------------------+------------------------------------
    adaptive                |
                          1 |    10.98369       0.7032         35
                          2 |    10.56047       0.6564         34
    -------------------------------------------------------------
    
    * adaptive has the lowest MSE and the highest R^2 in sample 2
    
    * I select adaptive as the best model:
    
    . reg mpg length 5.rep78 gear_ratio price
    
          Source |       SS           df       MS      Number of obs   =        69
    -------------+----------------------------------   F(4, 64)        =     38.57
           Model |  1654.03213         4  413.508033   Prob > F        =    0.0000
        Residual |  686.170766        64  10.7214182   R-squared       =    0.7068
    -------------+----------------------------------   Adj R-squared   =    0.6885
           Total |   2340.2029        68  34.4147485   Root MSE        =    3.2744
    
    ------------------------------------------------------------------------------
             mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
          length |  -.1484211   .0270004    -5.50   0.000    -.2023607   -.0944816
         5.rep78 |   3.380391   1.163954     2.90   0.005     1.055125    5.705657
      gear_ratio |   1.558014   1.251733     1.24   0.218    -.9426098    4.058637
           price |  -.0002964   .0001545    -1.92   0.060    -.0006049    .0000122
           _cons |   45.84562   7.960131     5.76   0.000     29.94343    61.74781
    ------------------------------------------------------------------------------
    Here gear_ratio was selected, but its p-value is 0.218. Is that too high for it to explain mpg?

    Am I missing a step or concept in model selection using lasso and cross-validation?

    I know that lasso does not use p-values to select the model, but should I remove gear_ratio from the final model?
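
    For reference, here is a minimal sketch of two checks related to the questions above, assuming the same auto data and the same split as in my code. The regression I posted uses all 69 observations, while the lasso was fit only on sample == 1, so the first command refits on the training half. The second uses dsregress, one of Stata's lasso commands for inference, with gear_ratio as the variable of interest; treating the other candidate predictors as controls is my assumption here, not something taken from the linked example. As far as I understand, p-values from a plain regress after lasso selection do not account for the selection step, so they are shown only for comparison.

```stata
* Sketch only: same data and split as in the post above.
sysuse auto, clear
splitsample, generate(sample) nsplit(2) rseed(1234)

* Refit the variables chosen by the adaptive lasso on the training half,
* so the OLS sample matches the sample the lasso was estimated on.
regress mpg length 5.rep78 gear_ratio price if sample == 1

* Double-selection lasso: inference on gear_ratio, with the remaining
* candidate predictors offered as controls for the lasso to select from.
dsregress mpg gear_ratio, ///
    controls(i.foreign i.rep78 headroom weight turn price trunk length displacement)
```

    The dsregress output reports a coefficient and p-value only for gear_ratio; the selected controls are not meant to be interpreted.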

    I would be grateful for any comments.

    Thanks in advance
    Rodrigo
    Last edited by Rodrigo Badilla; 15 Dec 2024, 18:27.