  • Nonparametric Linear regression

    Hello everyone,

    I have a dataset whose dependent variable is heavily skewed to the left. The independent variables are all categorical. Although the residuals are normally distributed, there is strong evidence of heteroskedasticity. There are over 500 observations. The dependent variable was collected using an instrument calibrated on a Likert scale. I am building a predictive model, using xi: sw, pr(0.15) pe(0.1): reg .... to obtain the determinants/predictors of the outcome. One of my supervisors suggests that I use nonparametric tests and report the median and interquartile ranges. When I use the Stata command "npregress kernel y i.x1 i.x2...i.xn", I get the output with the warning "Convergence not achieved". I have also tried other Stata commands, including the quantile-regression commands sqreg, iqreg and qreg, as well as rreg and reg y i.x1 i.x2,...i.xn, robust. These give me different results.

    Can anyone guide me on the best way to handle such a problematic dataset? Which of the Stata commands produces near-valid results? How does one get the most out of the npregress kernel command? The Stata instructions are not clear to me, and when I try, Stata produces the error output.

    Thank you

  • #2
    Hi Paul,
    Not sure about your other problems, but regarding npregress, the convergence issue must be related to the internal process it uses to find the optimal bandwidths.
    Unfortunately, I don't think there is much to do other than perhaps recategorizing your dummy variables so that you avoid having classes with too few observations.
    Regarding the other models, they will give you different results because they are different models (for example, reg models the conditional mean, whereas qreg models the conditional median).
    I would say that rather than trying different models, you should start by analyzing what is happening with your theoretical model. See if it makes sense, and then go on to "robustness" checks using other methods.
    Fernando
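
    As a rough sketch of the recategorizing suggestion above, one could tabulate a predictor to spot sparse levels and then merge them. The variable x1, its level codes, and the new name x1_c are hypothetical; which levels are sensible to merge depends on the data and the subject matter.

    ```stata
    * Hypothetical predictor x1 with levels 1-5: count observations per level
    tabulate x1, missing

    * Assumed for illustration only: levels 4 and 5 are sparse, so merge them
    recode x1 (4 5 = 4), generate(x1_c)

    * Check the new cell counts, then use the collapsed variable in the model
    tabulate x1_c
    regress y i.x1_c, vce(robust)
    ```

    Merging levels changes what the coefficients mean, so the collapsed categories should still be interpretable.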


    • #3
      Thank you, Fernando. Basically, the outcome variables are heteroskedastic and have significant influential points. All the other conditions of OLS/MLS are met. I have run the hetregress command as well as the other "robust methods". However, since I am using a predictive modelling approach, I want to use backward stepwise selection with the qreg command.

      A bit about the dataset: We collected data on potential determinants/predictors of health-related quality of life among postpartum women. Quality of life was measured using the WHOQOL-BREF questionnaire, but instead of the 5-point Likert scale traditionally used in the tool, we used a 3-point Likert scale due to the nature of the population we were collecting data from. Quality of life has 4 domains, each derived from a given set of questions. All the domains apart from the environmental domain are skewed to the left. All domains have strong heteroskedasticity.

      The predictors are 12 categorical variables. Using the Stata command xi: sw, pe(0.1) pr(0.15): qreg social i.y1 i.y2 .....i.y12, vce(robust) I get the following Stata output:

      VCE computation failed; try increasing the maximum number of iterations or try bsqreg r(498);

      The other outcomes (physical, psychological and environmental) are okay. Unfortunately, the sw command supports only qreg and not iqreg, bsqreg, rreg or sqreg, so I am in a spot of bother. The social domain is calculated from 3 questions, the physical from 7, the psychological from 6 and the environmental from 8. I wonder whether the problem is connected to this.

      When I run npregress kernel y i.x1 i.x2....i.x12 with the vce(bootstrap) or reps(#) option,

      I get the following Stata output:

      Warning: Convergence not achieved

      Is there any way around these?

      NB: I am neither a statistician nor an expert in Stata.

      I thank you
      Last edited by Paul Lokubal; 29 Jul 2020, 09:34.


      • #4
        Hi again Paul,
        I'll try to be broad in my own analysis of your model.
        1. While this varies from field to field, many will strongly object to the use of stepwise regression because it may overfit the model. You will find some discussion of that in the forum.
        2. The problem you see with qreg is that the algorithm behind qreg cannot estimate your model (probably there is not enough variation). One option is to add the option "wlsiter(200)" or some other large number (400?). In my experience, this helps qreg to estimate the model. But it will take longer, and there is no certainty it will work.
        3. The way you are defining your variables may be incorrect for the kind of analysis you want.
        Consider the model "reg y i.x", where x has 4 values.
        When you run this, Stata internally creates 4 dummies and, after excluding the first one, gives you:
        reg y x_2 x_3 x_4 (where x_k = 1 if x = k and zero otherwise)
        When you use this with stepwise and, say, x_3 is dropped, everyone with x = 3 is put into the base group (x = 1). This would be wrong if x is ordered.
        I would instead suggest creating the dummies by hand (for all 12) like this:

        x_2 = (x >= 2), x_3 = (x >= 3), etc.
        That way, if x_3 is dropped, those observations are grouped with x = 2, which I think is correct.

        4. npregress will likely not work for you. I think you have too many variables considering how small your data is. You may need to contact Stata technical support for more details on why the non-convergence warning appears.
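
        A minimal sketch of points 2 and 3 above, assuming a single ordered predictor named x with levels 1-4 and an outcome named y (both names are placeholders, and 200 is an arbitrary iteration count). Because Stata treats missing values as larger than any number, the sketch excludes them explicitly so that they are not coded as 1.

        ```stata
        * Threshold dummies for a hypothetical ordered predictor x (levels 1-4):
        * x_k = 1 when x >= k, 0 otherwise, and missing when x is missing
        generate byte x_2 = (x >= 2) if !missing(x)
        generate byte x_3 = (x >= 3) if !missing(x)
        generate byte x_4 = (x >= 4) if !missing(x)

        * Backward stepwise median regression on the hand-made dummies,
        * allowing more weighted least-squares iterations as in point 2
        stepwise, pr(0.15) pe(0.1): qreg y x_2 x_3 x_4, wlsiter(200)
        ```

        With this coding, each coefficient is the step from one level to the next, so dropping x_3 merges level 3 with level 2 rather than with the base level. No xi: prefix is needed here because the dummies already exist as ordinary variables.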

        Perhaps you should try to get a summary index from the variables (PCA perhaps) before you do the regression analysis.
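
        One possible way to build such an index, sketched under the assumption that the variables to be summarized are called v1 to v5 (hypothetical names) and that a single component is enough:

        ```stata
        * v1 ... v5 are hypothetical names for the variables to be summarized
        pca v1 v2 v3 v4 v5, components(1)

        * Store the score on the first principal component as a summary index
        predict index1, score

        * Illustrative model only: use the index in place of the separate variables
        regress y index1, vce(robust)
        ```

        Whether one component summarizes the variables adequately can be judged from the proportion of variance that pca reports.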
        HTH
        Fernando


        • #5
          Thank you so much Fernando. This is very helpful.


          • #6
            A bit about the dataset: We collected data on potential determinants/predictors of health-related quality of life among postpartum women. Quality of life was measured using the WHOQOL-BREF questionnaire, but instead of the 5-point Likert scale traditionally used in the tool, we used a 3-point Likert scale due to the nature of the population we were collecting data from. Quality of life has 4 domains, each derived from a given set of questions. All the domains apart from the environmental domain are skewed to the left. All domains have strong heteroskedasticity.
            This sounds like a scenario where structural equation models (sem or the generalized version gsem) thrive.

            Stata has great machinery to deal with both situations.
            Best regards,

            Marcos
