  • Violation of Regression Assumption 'Linearity in Parameters'

    Hello everyone,

    I'm doing a multiple linear regression for my master's thesis.
    My research involves a dependent variable, "Number of employees participating in further training," and my main independent variable, "Number of vacant positions."
    The regression assumption that the model is linear in the parameters does not seem to be met. Now I am wondering what corrections I can make to address this. Are there alternative approaches, or is linear regression perhaps not the best choice for my dataset?

    I would appreciate your experiences and advice!

    Lisa

  • #2
    Lisa:
    what makes you think that your model is not linear in the parameters (I guess you're fitting it by OLS)?
    If that were true, you should switch to -nl-.
    Kind regards,
    Carlo
    (StataNow 18.5)


    • #3
      Hello Carlo,

      Thank you for your prompt response.
      I've plotted the residuals of my linear regression against the independent variable and observed numerous deviations from the zero line (see attachment).
      [Attached image: rvpplot_h1b.png]

      I interpreted this as an indication of non-linearity in my parameters. But I might be wrong?
      Additionally, I conducted the Ramsey (1969) RESET test, and it returned as not significant.

      If by '-nl-' you are referring to non-linear regression, what type of regression could I employ if my dependent variable is continuously discrete? Moreover, if my model truly exhibits non-linearity in its parameters, is there a possibility to still utilize linear regression, perhaps through certain corrections or adjustments?

      Kind regards,
      Lisa


      • #4
        You discuss essentially the same broad question in another thread. Nothing wrong with that, but you should let people know.

        If you type regress, then Stata fits a model via OLS. This model is linear in its parameters. As Carlo has pointed out, if you want a model that is non-linear in the parameters, you need to use nl (the Stata command for non-linear [regression] models/functions) and specify that model. Both residual plots and the RESET test are primarily concerned with non-linearity in the predictors, not in the parameters.
        Last edited by daniel klein; 10 Jan 2024, 07:49.
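
        A minimal sketch of this distinction, on simulated data. The variable names (x, x2, y, y2) and all coefficients are made up for illustration; they are not Lisa's data.

```stata
* Simulated data; all names and coefficients are hypothetical
clear
set seed 12345
set obs 500
generate x  = rpoisson(2)
generate x2 = x^2
generate y  = 1 + 0.5*x - 0.05*x2 + rnormal()
generate y2 = 1 + 2*exp(-0.3*x) + rnormal(0, 0.2)

* Curved in x, but still linear in the parameters: regress (OLS) applies
regress y x x2

* Both diagnostics look at the functional form in the predictors
rvpplot x
estat ovtest

* Non-linear in the parameters (b2 sits inside exp()): this needs nl
nl (y2 = {b0=1} + {b1=1}*exp(-1*{b2=0.1}*x))
```

        The first model handles curvature by adding a squared term and stays within OLS; only the second model, where a parameter enters non-linearly, requires nl.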


        • #5
          Lisa: Does your Y variable equal zero a lot of the time? Is its value constrained by your X variable? The X variable has lots of zeros, but that in itself is not a problem. But what about Y? Please show summary statistics, including the range of possible values.


          • #6

            Thank you for your answers!


            And thank you for bringing that up, Daniel. I wasn't certain if it was precisely the same issue, which is why I didn't reference the other thread.

            I might have explained it wrong then. As far as I understand it, multiple linear regression is based on the assumption that the population model is linear in its parameters (MLR1). There are graphical and statistical tests to diagnose potential non-linearities. If non-linearities are present, conducting a meaningful linear regression might be challenging. I've come across the suggestion to address this issue by including polynomials.
            I could be mistaken, but that's the takeaway from my statistical course.

            Here are summary statistics for my Y value. To address your question, Jeff: yes Y equals zero a lot of the time:
            [Attached image: Unbenannt.JPG]
            Last edited by lisa Uuuz; 10 Jan 2024, 10:32.


            • #7
              Lisa:
              you may want to take a look at https://en.wikipedia.org/wiki/Hurdle_model
              Kind regards,
              Carlo
              (StataNow 18.5)


              • #8
                Are you sure you have the number of employees? Seems more like the proportion/fraction of employees to me. In that case, neither the linear model nor poisson might be the best choice. Perhaps a beta distribution* or another fractional response model?

                If some zeros arise because of a specific firm policy as opposed to vacant positions (more generally, if two different processes result in a fraction of 0), you might also want to think about a zero-inflated model.


                *Edit: I just realized that the beta regression cannot be used when 0 and 1 (the extremes) are observed, so probably some other model. See help fracreg.
                Last edited by daniel klein; 10 Jan 2024, 12:21.


                • #9
                  Hi Daniel,
                  Sorry, you are right. The original variable is the number of employees. I transformed it into the proportion of employees who have undergone further training, measured against all employees.
                  Are you saying that it is not possible at all to perform a linear regression on this?

                  I just came across information about the Generalized Linear Model (GLM). Do you think that would be a better choice for me?
                  If it is the best option, could you let me know if it's challenging to implement? All I have experience with is multiple linear regression, so I'm quite inexperienced with Statistics and Stata. I'm wondering if it's feasible within the time I have left for my Master's thesis.

                  Kind regards
                  Lisa


                  • #10
                    Originally posted by lisa Uuuz:
                    Are you saying that it is not possible at all to perform a linear regression on this?
                    No. I am saying that the linear model might be considered an oversimplification.

                    Originally posted by lisa Uuuz:
                    I just came across information about the Generalized Linear Model (GLM). Do you think that would be a better choice for me?
                    If it is the best option, could you let me know if it's challenging to implement? All I have experience with is multiple linear regression, so I'm quite inexperienced with Statistics and Stata. I'm wondering if it's feasible within the time I have left for my Master's thesis.
                    If I remember correctly, you have suggested poisson as an alternative. The poisson model is a generalized linear model. Implementation in Stata is usually trivial. Whether you will be able to understand the model well enough to communicate and explain your choices, I cannot tell. If your courses only covered linear regression models, why do you think you are now supposed to apply something else in your master thesis? Perhaps you are fine applying a linear model and discussing its limitations. I can only speculate. You should probably talk to your supervisor.


                    • #11
                      Hello Daniel,

                      I looked into the fracreg command you suggested and it seems suitable for my data. Thank you very much for that tip!
                      I have run a fractional outcome regression (fracreg logit) in Stata and calculated the marginal effects (dydx).
                      However, I have one more question: Are there any diagnostics, assumptions or steps in my methodology that I should include?

                      Best Regards
                      Lisa


                      • #12
                        My understanding is that fractional response models are quasi-likelihood estimators. These estimators require only a few assumptions for consistency. Using robust standard errors, you just need to get the conditional mean right. That is, your linear predictor should capture the relevant determinants of your outcome. As usual, this assumption cannot be tested (without more assumptions).
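
                        As a rough sketch of what this looks like in practice, on simulated data. The variable names (vacancies, employees, trained, share) and the data-generating values are assumptions for illustration only. To my knowledge fracreg already reports robust standard errors by default; vce(robust) is spelled out here to make the choice visible.

```stata
* Simulated firm data; all names and values are hypothetical
clear
set seed 12345
set obs 500
generate vacancies = rpoisson(2)
generate employees = 1 + rpoisson(5)
* Binomial counts over few employees, so shares of exactly 0 and 1 occur
generate trained = rbinomial(employees, invlogit(-2 + 0.3*vacancies))
generate share   = trained/employees

* Fractional logit with robust standard errors
fracreg logit share vacancies, vce(robust)

* The same conditional-mean model written as a GLM
glm share vacancies, family(binomial) link(logit) vce(robust)
```

                        Both commands model the conditional mean of the share with a logit link, so the coefficients should agree; glm may print a note that the outcome has noninteger values.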


                        • #13
                          Agree with Daniel. You test functional form by including squares and interactions of some explanatory variables, and then compute the average partial (marginal) effects to see if they are sensitive.

                          The so-called RESET adds powers of the fitted linear index, usually (x*bhat)^2 and (x*bhat)^3. A joint test of these is a general functional form test. See the Papke and Wooldridge (1996) Journal of Econometrics paper for details.

                          BTW, this is one case where the robust standard errors (the correct ones) are almost always smaller than the nonrobust standard errors.
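
                          The two checks described above could be sketched as follows, on simulated data. All variable names (vacancies, employees, trained, share, xbhat, xb2, xb3) are hypothetical, and this is one possible way to code the checks, not necessarily how the cited paper implements them.

```stata
* Simulated firm data; all names and values are hypothetical
clear
set seed 12345
set obs 500
generate vacancies = rpoisson(2)
generate employees = 1 + rpoisson(5)
generate trained   = rbinomial(employees, invlogit(-2 + 0.3*vacancies))
generate share     = trained/employees

* Baseline model and its average marginal effect
fracreg logit share vacancies, vce(robust)
margins, dydx(vacancies)

* Add a square; factor-variable notation lets margins account for it
fracreg logit share c.vacancies##c.vacancies, vce(robust)
margins, dydx(vacancies)

* RESET-style test: add powers of the fitted linear index, test jointly
quietly fracreg logit share vacancies, vce(robust)
predict double xbhat, xb
generate double xb2 = xbhat^2
generate double xb3 = xbhat^3
fracreg logit share vacancies xb2 xb3, vce(robust)
test xb2 xb3
```

                          If the two margins results are close and the joint test is not significant, the simpler specification is probably adequate for the average effect.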


                          • #14
                            Thanks for all the answers so far, your tips were really helpful!
                            Just one more thing: When I run a fractional logit regression and look at the average marginal effects, how should I interpret these results?
                            Are they in percent or percentage points (i.e., does a 1% increase in x increase y by # percent or # percentage points)?
                            I've seen conflicting information in different articles which was quite confusing to me.

                            Kind regards
                            Lisa


                            • #15
                              Originally posted by lisa Uuuz:
                              Just one more thing: When I run a fractional logit regression and look at the average marginal effects, how should I interpret these results?
                              If you typed something like:

                              Code:
                              fracreg logit ...
                              margins , dyex(...)
                              margins will give you semi-elasticities: the change in the outcome for a proportional (percent) change in the predictor. Note that dydx() leaves the predictors on their original scale; it gives the approximate change in the outcome for a one-unit change. Either way, your outcome is essentially a probability already, hence the change in the outcome is in percentage points (after multiplying by 100); unless, of course, you use eyex() or eydx(), which express the change in the outcome in percent.

                              I do not recall all the details from this thread but you could probably also look at exponentiated coefficients, which would represent odds ratios here.
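
                              A small sketch of the options side by side, on simulated data. The variable names (vacancies, employees, trained, share) are hypothetical, and the or option is, as far as I know, only available with fracreg logit.

```stata
* Simulated firm data; all names and values are hypothetical
clear
set seed 12345
set obs 500
generate vacancies = rpoisson(2)
generate employees = 1 + rpoisson(5)
generate trained   = rbinomial(employees, invlogit(-2 + 0.3*vacancies))
generate share     = trained/employees

fracreg logit share vacancies, vce(robust)

* dy/dx: change in the share (0-1 scale) per additional vacancy;
* multiply by 100 to read it in percentage points
margins, dydx(vacancies)

* dy/d(ln x): a 1% increase in vacancies changes the share
* by about this value divided by 100 (0-1 scale)
margins, dyex(vacancies)

* d(ln y)/d(ln x): elasticity, percent change in the share
* per 1% increase in vacancies
margins, eyex(vacancies)

* Exponentiated coefficients (odds ratios)
fracreg logit share vacancies, vce(robust) or
```

                              Which of these to report depends on whether a one-unit or a one-percent change in the predictor is the more natural comparison for the research question.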
