
  • Residuals in q-q plot for a simple regression do not follow a 45 degree angle

    Hi All,

    When performing a simple regression of a binary variable on a continuous variable, I checked the distribution of my residuals using the following two commands (as suggested by https://www.statology.org/qq-plots-stata/):

    predict resid_varname_w, residuals
    qnorm resid_varname_w

    My variable is winsorized, so I think that explains the results at the tail ends. I tried various transformations (e.g. taking the square root), but this did not improve the result much. I have this problem in four regressions, as shown below:
    [Attached images: Picture1.png, Picture2.png, Picture3.png, Picture4.png]

    All four dependent variables have means between -0.5 and 3.5 and standard deviations between 0.2 and 3. It would be really helpful to get some suggestions on how to improve the results.

    Cheers,

    Aron


    Last edited by Aron Polour; 03 Aug 2023, 08:01.

  • #2
    I can't follow easily what you've done.

    a simple regression of a binary variable on a continuous variable
    So, the outcome is binary and the predictor is continuous? Or the other way round?


    My variable is winsorized
    That's the continuous variable? How much Winsorization?

    I have this problem in four regressions
    I think this means the same continuous variable, sometimes as it came and sometimes transformed.

    What can be said easily is that the residual plots show that you are a long way from a normal distribution whatever you do. But that isn't absolutely fatal to the idea of regression, as normality of residuals (or errors) is just about the least important ideal condition for regression (or assumption, as many people persist in saying). But the plots aren't encouraging either. You have clutches of residuals with more or less the same value, which implies lots of original data that are identical or very similar.

    I am going to stop guessing there. It would be immensely more revealing to see scatter plots of your original data; confirmation of what you want to predict from what; and a data example. Telling us what the variables mean would help too.
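
    As a rough sketch of what such a post could include, the commands below produce a plot of a continuous variable by a binary variable, a check for ties, and a data example. The auto dataset shipped with Stata is only a stand-in here; price and foreign are placeholders for the real outcome and binary variable, which are not known from the thread.

    ```stata
    * Illustration only: the auto dataset stands in for the real data;
    * price and foreign are placeholders for the outcome and the binary variable.
    sysuse auto, clear

    * Distribution of the continuous variable within each level of the binary variable
    graph box price, over(foreign)

    * Ties and spikes show up in the detailed summary and in a histogram
    summarize price, detail
    histogram price, by(foreign)

    * Data example in a form that can be pasted into a forum post
    dataex price foreign in 1/20
    ```

    dataex is part of official Stata in recent versions; on older installations it may need to be installed from SSC first.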



    • #3
      Dear Nick,

      Thanks for your reply.

      To clarify things: I am looking into the effect of COVID (binary variable) on certain company characteristics (outcome variables) such as Net Profit Margin (NPM, a continuous variable). Because NPM can take very large extreme values, I winsorized it at the 5th and 95th percentiles: every observation below the 5th percentile is set to the 5th percentile value, and every observation above the 95th percentile is set to the 95th percentile value. My dataset contains about 300,000 observations. To establish the causal effect I added control variables; although this improved the result slightly, the q-q plot still looks nothing like what it is supposed to be. What I meant by four regressions is that I look for the COVID effect on four different company characteristics.
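
      A minimal simulated sketch of this setup, for illustration only: the variable names (covid, npm, npm_w, resid_w), the sample size, and the data-generating process are all assumptions, not the actual data. It winsorizes a heavy-tailed outcome at the 5th and 95th percentiles using official commands, regresses it on a binary predictor, and draws the same q-q plot of residuals.

      ```stata
      * Simulated illustration only: names and data-generating process are assumed
      clear
      set seed 12345
      set obs 10000

      * Hypothetical binary predictor and heavy-tailed continuous outcome
      generate byte covid = runiform() < 0.5
      generate double npm = 0.05 - 0.02*covid + 0.1*rt(3)

      * Winsorize at the 5th and 95th percentiles
      _pctile npm, percentiles(5 95)
      scalar p5  = r(r1)
      scalar p95 = r(r2)
      generate double npm_w = min(max(npm, scalar(p5)), scalar(p95))

      * Regress the winsorized outcome on the binary predictor
      regress npm_w i.covid
      predict double resid_w, residuals

      * Roughly 10% of outcomes are tied at two values, so with a binary
      * predictor the residuals pile up at a few values in both tails
      qnorm resid_w
      ```

      With only a binary predictor, each residual is the outcome minus its group mean, so the tied winsorized values translate directly into tied residuals. That would be one possible source of flat stretches at the ends of a q-q plot, although whether it accounts for the plots in this thread cannot be told without the data.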

      Thanks again for taking the time to read through my question. I am also intrigued by your remark that normality of residuals is ‘least important ideal condition for regression’. Could you elaborate on this a bit more?
      Last edited by Aron Polour; 07 Aug 2023, 06:47.



      • #4
        Working backwards: the point about the normality ideal condition (assumption) is elaborated on at enormous length in every decent regression or econometrics text. For example, I recommend Jeff Wooldridge's introductory econometrics text for anyone who is a worker in economics or finance. I am neither, but I wouldn't expect anything but very weak relationships between Covid (as a binary predictor) and any economic or financial outcome measure.

        It sounds as if you're using some kind of panel model and if so it is vital to know what it is.

        I don't see any graphs in your answer, or a data example, so although you've helpfully provided much extra detail, I don't have different advice as yet. Others may want to add to this small answer.
        Last edited by Nick Cox; 07 Aug 2023, 07:36.
