LASSO (lars) analysis with first variable fixed.

Florian Maissan

Join Date: Nov 2015

Posts: 19
#1

21 Nov 2015, 08:09

Hi,

I have a quick question regarding the LASSO technique using the lars command in Stata.
I am interested in the effect of education on health-related quality of life. The variable Education_level is the first variable I want to include in my regression.
When I use the lars method:

Code:

lars HRQoL Education_level smoking_behaviour extensive_drinker blood_pressure bmi_categories cancer COPD smoking_COPD diabetes blood_diabetes muskulo diabetes_muskulo age age2 gender marital_status, algorithm(lars)

I get output that says I should include Education_level in 8th place (before muskulo and age, for example).
Is there a way to use the lars method with Education_level fixed as the first variable added to the regression? I could not find anything like this on the web.

Thanks in advance!

Florian
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#2

21 Nov 2015, 12:48

You are asked in the FAQ to identify the source of unofficial commands. lars, by Adrian Mander, can be downloaded from SSC. FAQ 12 asks that you show not only commands, but results too. I'm guessing that when you mention "8th place" you are referring to the steps at which variables are added. However the results also include the model with minimum \(C_p\); that would usually be the one to select. If Education is in that model, then the step at which it was added is irrelevant.

If Education was not included in that model, see the suggestion for forcing a set of variables into the model on p. 422 of Efron et al. (2004): do the one-variable regression of your outcome on Education; generate the residuals r; then apply LARS to the regression of r on the remaining variables. Take the optimum model, add Education to those predictors, and do the final regression (LARS-OLS hybrid, p. 421).
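The steps above can be sketched in Stata as follows. This is only a sketch: it assumes the original poster's dataset is in memory and that lars has been installed from SSC. The residual name r_educ and the contents of the local macro selected are hypothetical placeholders; the selected variables have to be read off the lars output by hand.

```stata
* Sketch only; run from a do-file (the /// line continuation does not work interactively)
* Step 1: one-variable regression of the outcome on the forced-in predictor
regress HRQoL Education_level
predict double r_educ, residuals      // r_educ is a hypothetical name for the residuals r

* Step 2: LARS of the residuals on the remaining candidate predictors
lars r_educ smoking_behaviour extensive_drinker blood_pressure bmi_categories ///
    cancer COPD smoking_COPD diabetes blood_diabetes muskulo diabetes_muskulo ///
    age age2 gender marital_status, algorithm(lars)

* Step 3: copy the variables in the minimum-Cp model into a local macro
* (the three names below are placeholders, not a result)
local selected muskulo COPD gender

* Step 4: final OLS regression with Education added back (LARS-OLS hybrid)
regress HRQoL Education_level `selected'
```

Because Education_level is removed from the candidate list in Step 2, the step at which lars would have entered it no longer matters.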

However, your model contains only linear terms in the predictors, a strong assumption. You can force the linear terms into the model, as above, then have LARS screen interactions. All predictors, including interactions, should be centered first (p. 497).
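A minimal sketch of centering and interaction screening, again assuming the original poster's dataset and the SSC lars command. The choice of age, gender, and Education_level, the c_ prefix, and the names r_lin, ia_age_gender, ia_age_educ, and ia_gender_educ are all illustrative assumptions; which interactions are worth screening is a substantive decision.

```stata
* Sketch only: center a few predictors at their sample means
foreach v of varlist Education_level age gender {
    summarize `v', meanonly
    generate double c_`v' = `v' - r(mean)
}

* Build interactions from the centered variables, then center the products too
generate double ia_age_gender  = c_age * c_gender
generate double ia_age_educ    = c_age * c_Education_level
generate double ia_gender_educ = c_gender * c_Education_level
foreach v of varlist ia_age_gender ia_age_educ ia_gender_educ {
    summarize `v', meanonly
    replace `v' = `v' - r(mean)
}

* Force the linear terms in, then let lars screen only the interactions
regress HRQoL c_Education_level c_age c_gender
predict double r_lin, residuals
lars r_lin ia_age_gender ia_age_educ ia_gender_educ, algorithm(lars)
```

Centering a categorical predictor such as gender this way treats it as numeric, which may or may not suit the coding in the actual data.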

Unfortunately, there is no simple way to estimate standard errors after the lasso. For example, those reported after OLS regression on the predictors selected by LARS/lasso are not valid. Bootstrapping is the only suggestion that Rob Tibshirani, inventor of the lasso, had in his 2011 retrospective (Tibshirani, 2011, page 281).

References:

Efron, Bradley, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. 2004. Least angle regression. The Annals of Statistics 32, no. 2: 407-499. Available from
http://projecteuclid.org/download/pd...aos/1083178935

Tibshirani, Robert. 2011. Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73, no. 3: 273-282.

Last edited by Steve Samuels; 21 Nov 2015, 13:01.

Steve Samuels
Statistical Consulting

Stata 14.2
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#3

22 Nov 2015, 08:16

Correction: Since you choose Education in advance to be in the LARS-OLS hybrid model, the standard error of its coefficient in that model will be valid. The problematic standard errors belong to the variables selected by lars. In fact, if you bootstrap lars, the variables in the minimum \(C_p\) set may vary from replicate to replicate.


Florian Maissan

Join Date: Nov 2015
Posts: 19

23 Nov 2015, 06:56

Hi,

I'm sorry, I did not know that I needed to mention the source when using the lars software. Firstly, thank you very much for your reply! I'm still a little confused. Below you will find the output of my lasso analysis. Based on this output, should I include all the variables, because the minimum Cp value is at the last step?

The reason for my question is as follows. I'm investigating the effect of education on health-related quality of life when adding other explanatory variables. At first I did this intuitively, choosing myself when to add each variable. Now I want a systematic approach to adding the variables, which is where the lasso analysis comes in. The problem is that I'm interested in the effect of education on HRQoL, so education has to be the first variable included in my regression, not muskulo (as the lasso analysis suggests). Isn't there an option in the lars software that keeps education fixed and then finds the optimal order for adding the rest of the explanatory variables?

Again, thank you very very much!

Florian.

Code:

Algorithm is lars

Cp, R-squared and Actions along the sequence of models

+----------------------------------------------------+
| Step |      Cp     | R-square |  Action            |
|------+-------------+----------+--------------------|
|    1 |  1665.4299  |  0.0000  |                    |
|    2 |   810.8905  |  0.0728  | +muskulo           |
|    3 |   726.3101  |  0.0802  | +COPD              |
|    4 |   712.4775  |  0.0815  | +marital_status    |
|    5 |   507.4564  |  0.0991  | +gender            |
|    6 |   456.2396  |  0.1036  | +diabetes_muskulo  |
|    7 |   319.9618  |  0.1154  | +blood_pressure    |
|    8 |   253.2218  |  0.1212  | +Education_level   |
|    9 |   223.8042  |  0.1239  | +smoking_COPD      |
|   10 |   187.2914  |  0.1272  | +cancer            |
|   11 |    83.7418  |  0.1361  | +diabetes          |
|   12 |    76.7140  |  0.1369  | +bmi_categories    |
|   13 |    47.9668  |  0.1395  | +extensive_drinker |
|   14 |    36.0369  |  0.1407  | +smoking_behaviour |
|   15 |    28.7482  |  0.1415  | +age2              |
|   16 |    21.9654  |  0.1422  | +blood_diabetes    |
|   17 |    17.0000 *|  0.1428  | +age               |
+----------------------------------------------------+
* indicates the smallest value for Cp

The coefficient values for the minimum Cp

+----------------------------------+
| Variable          |  Coefficient |
|-------------------+--------------|
| Education_level   |       0.0066 |
| smoking_behaviour |       0.0020 |
| extensive_drinker |      -0.0220 |
| blood_pressure    |      -0.0200 |
| bmi_categories    |       0.0048 |
| cancer            |      -0.0208 |
| COPD              |      -0.0399 |
| smoking_COPD      |      -0.0277 |
| diabetes          |      -0.0160 |
| blood_diabetes    |      -0.0025 |
| muskulo           |      -0.0625 |
| diabetes_muskulo  |      -0.0252 |
| age               |       0.0011 |
| gender            |      -0.0216 |
| marital_status    |       0.0262 |
+----------------------------------+


Steve Samuels

Join Date: Mar 2014

Posts: 1786
#5

23 Nov 2015, 13:02

I described in my post what you must do to force education into the model before running lars on the remaining variables. The best model so far looks like the full model, but as you have not examined interactions, nor run diagnostics or fit checks, I would consider it preliminary only. Perhaps I did not make clear that the final step in the LARS-OLS hybrid is an ordinary multiple linear regression (OLS). I myself would use mmregress as a check to guard against outliers and to identify high-leverage points.
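A short note on why the last step shows exactly 17.0000. This assumes lars computes the usual Mallows form of \(C_p\) with the error variance estimated from the full model, which the output is consistent with but which is not confirmed in this thread. Here \(k\) counts fitted parameters including the intercept, \(n\) is the sample size, and \(p\) is the parameter count of the full model:

```latex
C_p(k) = \frac{\mathrm{RSS}_k}{\hat\sigma^2} - n + 2k,
\qquad
\hat\sigma^2 = \frac{\mathrm{RSS}_{\mathrm{full}}}{n - p}
\quad\Longrightarrow\quad
C_p(p) = (n - p) - n + 2p = p .
```

With 16 predictors plus an intercept, \(p = 17\), so under that assumption the full model's \(C_p\) equals 17 by construction. What points to the full model is that every smaller model in the table has a larger \(C_p\), not the value 17 itself.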

Last edited by Steve Samuels; 23 Nov 2015, 13:11.

