
  • Stepwise regression using different criteria for retention of variables - how to code?

    Dear Statalist,

    I would like to automate a backward elimination procedure to build a multivariable model. I am building a predictive model, and I would like to compare the final models selected under different approaches, using either the p-value, the AIC, or the R-squared from leave-one-out cross-validation (LOOCV) as the criterion for retaining variables. I want to compare the various models and see which one has the highest predictive power.

    I have some knowledge of programming in Stata, but this is getting quite advanced for me. I would be grateful if somebody could point me in the right direction by naming a few commands or a structure I could use for this. In particular, I am not sure how to tell Stata that I want something done on a vector of variables, and then the same thing done on that same vector minus one variable, with the dropped variable changing at each iteration... (I know how to use foreach/forvalues and while; it's the vector part I am stuck on.)
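    A minimal sketch of that pattern, assuming a placeholder outcome y and predictors x1-x5 and using -foreach- together with the -: list- extended macro function (the regression is just a stand-in for whatever is to be done at each step), might look like this:

    Code:
    * act on a varlist, then on the same varlist minus one variable at a time
    local candidates "x1 x2 x3 x4 x5"
    foreach v of local candidates {
        * remove the current variable from the candidate list
        local reduced : list candidates - v
        display as text "Without `v': regressing y on `reduced'"
        quietly regress y `reduced'
        display as result "  R-squared = " %6.4f e(r2)
    }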

    Many thanks for any input!

    Sandra

  • #2
    What you are seeking to do is a complicated piece of programming, and it would surprise me if anyone on the forum would put in the time necessary to do it. To get a sense of the complexity, you might want to look at the code for -stepwise- (-viewsource stepwise.ado-), which is Stata's official program for stepwise variable selection. (It only uses p-value-based criteria, so it would not be sufficient for your purposes.) You would have to either hack that code or write something comparable from scratch that, perhaps, is more tailored to your data.
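    For reference, a minimal illustration of both suggestions, with a placeholder outcome y and predictors x1-x5 standing in for real variables:

    Code:
    * backward elimination, removing variables with p >= 0.2
    stepwise, pr(0.2): regress y x1 x2 x3 x4 x5

    * inspect the official implementation
    viewsource stepwise.ado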

    That said, I should also point out that most of the forum members who respond frequently (myself included) would tell you that you shouldn't do this anyway, even if it were easy. You might want to take a look at http://www.stata.com/support/faqs/st...ems/index.html to see some of the many reasons why.



    • #3
      I would have written essentially the same post as Clyde had he not got there first. In addition to the FAQ cited, Frank Harrell's Regression Modeling Strategies went into its second edition from Springer this year. If the discussion there doesn't change your mind on stepwise, then you have a big project ahead of you.



      • #4
        Thanks both for your feedback. My impression was that most of the criticism of stepwise regression revolves around the fact that it is based on p-values, which lose their meaning given the large number of tests performed. I thought that by using the LOOCV R-squared as the criterion for retention I would circumvent that, and also have a strategy more targeted towards my final aim: a highly predictive model. I am new to predictive modelling, but I thought that in this type of model it was not as important as in an explanatory model to include the right confounders and account for the correct causal pathways; rather, I should aim for whatever combination of variables most closely predicts the value of my observations. Anyway, I think I will ask Father Christmas to bring me Frank Harrell's book, since it has been recommended to me more than once already. Perhaps all the answers are in there.
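        As a rough sketch of the criterion described, a leave-one-out cross-validated R-squared for a single candidate model might be computed like this (y and x1 x2 x3 are placeholders for real variables; missing data and per-step variable selection are not handled here):

        Code:
        tempvar loopred sqerr
        quietly generate double `loopred' = .
        quietly count
        local N = r(N)
        forvalues i = 1/`N' {
            * fit on all observations except i, then predict observation i
            quietly regress y x1 x2 x3 if _n != `i'
            tempvar phat
            quietly predict double `phat' if _n == `i'
            quietly replace `loopred' = `phat' if _n == `i'
            drop `phat'
        }
        * total sum of squares of y and out-of-sample squared errors
        quietly summarize y
        local tss = r(Var) * (r(N) - 1)
        quietly generate double `sqerr' = (y - `loopred')^2
        quietly summarize `sqerr'
        display as result "LOOCV R-squared = " %6.4f (1 - r(sum)/`tss')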



        • #5
          Given your interest in prediction, this blog entry by Paul Allison may be of interest to you. I am not sure I agree with it, and if I ever get a chance I may go over it with him some time.

          http://statisticalhorizons.com/predi...ssion-analysis

          Outside of academia, however, regression (in all its forms) is primarily used for prediction. And with the rise of Big Data, predictive regression modeling has undergone explosive growth in the last decade. It’s important, then, to ask whether our current ways of teaching regression methods really meet the needs of those who primarily use those methods for developing predictive models.
          -------------------------------------------
          Richard Williams, Notre Dame Dept of Sociology
          StataNow Version: 19.5 MP (2 processor)

          EMAIL: [email protected]
          WWW: https://www3.nd.edu/~rwilliam



          • #6
            I have not followed the blog link, but while it is clear that we need to respect different goals, I think the distinction is oversold.

            Focusing on predictive success without thinking about what the model means makes no more sense than focusing on meaning without thinking about predictive success.

            It's no doubt unfair to single out someone who is likely new to the game, so I will merely allude to a recent thread elsewhere, using different software, in which a dataset with about 30 observations was the subject of a regression model with about 15 predictors. The R-square was nearly 0.9, but only one predictor was significant at conventional levels. The poster seemed very, very puzzled. That's unfortunate, but the idea evidently being followed to the exclusion of all sense was to get the R-square high.

            Naturally, no approach is indicted by its most outrageous failures or by being carried to extremes, and stepwise methods are not the culprit here; indeed, they are advertised as a way to arrive at better-chosen models. This is just my knee-jerk reaction to the idea that sometimes prediction really is all-important. (Richard won't be saying that either.)

