I am starting to use lasso with cross-validation for model selection, to explain a dependent variable with a linear model, but I cannot understand why not all coefficients in the selected model have p-values below 0.05.
I followed the steps in the example posted at:
https://www.stata.com/features/overv...on-prediction/
In the final regression, gear_ratio was selected, but its p-value is 0.218. Isn't that too high for it to help explain mpg?
Code:
sysuse auto, clear
splitsample, generate(sample) nsplit(2) rseed(1234)

lasso linear mpg i.foreign i.rep78 headroom weight turn gear_ratio price trunk length displacement if sample == 1, selection(bic)
estimates store bic

lasso linear mpg i.foreign i.rep78 headroom weight turn gear_ratio price trunk length displacement if sample == 1
estimates store cv

lasso linear mpg i.foreign i.rep78 headroom weight turn gear_ratio price trunk length displacement if sample == 1, selection(adaptive)
estimates store adaptive

lassocoef cv bic adaptive, sort(coef, standardized)

----------------------------------------------
             |    cv       bic    adaptive
-------------+--------------------------------
      weight |     x        x
     5.rep78 |     x        x        x
      length |     x        x        x
  gear_ratio |     x                 x
       price |     x                 x
       _cons |     x        x        x
----------------------------------------------
Legend:
  b - base level
  e - empty cell
  o - omitted
  x - estimated

lassogof cv bic adaptive, over(sample) postselection

Postselection coefficients
-------------------------------------------------------------
Name             sample |         MSE    R-squared        Obs
------------------------+------------------------------------
cv                      |
                      1 |    10.92984       0.7046         35
                      2 |    10.77016       0.6496         34
------------------------+------------------------------------
bic                     |
                      1 |    11.82234       0.6805         35
                      2 |     11.0608       0.6401         34
------------------------+------------------------------------
adaptive                |
                      1 |    10.98369       0.7032         35
                      2 |    10.56047       0.6564         34
-------------------------------------------------------------

* adaptive has the lowest MSE and the highest R-squared in sample 2
* I select adaptive as the best model:

reg mpg length 5.rep78 gear_ratio price

      Source |       SS           df       MS      Number of obs   =        69
-------------+----------------------------------   F(4, 64)        =     38.57
       Model |  1654.03213         4  413.508033   Prob > F        =    0.0000
    Residual |  686.170766        64  10.7214182   R-squared       =    0.7068
-------------+----------------------------------   Adj R-squared   =    0.6885
       Total |   2340.2029        68  34.4147485   Root MSE        =    3.2744

------------------------------------------------------------------------------
         mpg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      length |  -.1484211   .0270004    -5.50   0.000    -.2023607   -.0944816
     5.rep78 |   3.380391   1.163954     2.90   0.005     1.055125    5.705657
  gear_ratio |   1.558014   1.251733     1.24   0.218    -.9426098    4.058637
       price |  -.0002964   .0001545    -1.92   0.060    -.0006049    .0000122
       _cons |   45.84562   7.960131     5.76   0.000     29.94343    61.74781
------------------------------------------------------------------------------
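One cross-check that might be relevant, given here only as a minimal sketch that was not run for this post: the regression above uses all 69 observations, while the lassos were fit on sample == 1 only (35 observations). Assuming the stored estimates named adaptive and the variable sample are still in memory, the postselection coefficients could be displayed directly and the regression refit on the same half:

```stata
* Sketch only: assumes the results stored above as "adaptive"
* and the split variable "sample" are still in memory.

* Display the postselection (unpenalized) coefficients of the adaptive lasso
lassocoef adaptive, display(coef, postselection)

* Refit the selected variables on the training half only,
* so the estimation sample matches the one the lasso used
regress mpg length 5.rep78 gear_ratio price if sample == 1
```

The p-values from either regression are still computed as if the variables had been chosen in advance, so they do not account for the selection step.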
Am I missing some step or concept in model selection using lasso and cross-validation?
I know that lasso does not use p-values to select the model, but should I remove gear_ratio from the final model?
I would be grateful for any comments.
Thanks in advance
Rodrigo