
  • How necessary is it to do regression diagnostics for a data report?

    Hello! And to some, Merry Christmas!

    I am currently writing a data report on my modelling of the British electorate in 2017 - N approx 2000. I am using OLS with my ordinal dependent variable, measured on an 11-point scale (Prioritisation of Economy/Environment) - which I am treating as continuous here - and my categorical variable, education level, which I have divided into dummy variables.

    Code:
    regress dependentvariable i.independentcatvariable
    This is my first data report so I'm worried I'm working hard in all the wrong places.

    The burning question is: do I need to test for every OLS assumption (Google says there are 7)? Are some more important than others? In a previous post I asked specifically about formal tests of normality, and the general consensus I received was no, but perhaps do a qnorm plot to see.

    Furthermore: what happens if some assumptions are violated and some are not? Do I just acknowledge this in my analysis or is it back to the drawing board? Is there any situation where perhaps I could say that my large sample size reduces concern on having violated these assumptions?

    Lastly, should I present these assumptions before or after I present my regression table in my report?

    Thank you again and I wish everyone a great winter holiday.

    Edit: Unrelated, but as a general rule of thumb, is having more graphs a good thing in a data report?
    Last edited by Cassie Wright; 24 Dec 2021, 07:31.

  • #2
    Hi Cassie,
    The answer to this question can largely be found in the various responses to your previous question. With a large sample, OLS is rather robust to the sorts of violations you're worried about. Since the effect mostly concerns standard error estimates, the use of robust standard errors will help protect against anti-conservative estimates. And this only matters if you are going to report standard errors/confidence intervals.
    What I would say, though, is pay attention to your residuals. Plot both residuals versus predicted and residuals versus each variable (especially continuous ones). Patterns or trends in residual plots indicate problematic fit; perhaps an important variable has been omitted from your model or one of your variables is nonlinear with the DV. The solution to the former is self-evident, and the latter simply requires that you transform the variable so that it is linear (log or square transformations will usually suffice, but more complex procedures are available if needed).
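    The residuals-versus-fitted check can be sketched numerically. Below is a minimal Python illustration (invented toy data, not the British Election survey; Stata users would get the same quantities from -predict-): fit a line by the closed-form OLS formulas, then inspect the residuals, which should centre on zero and show no trend against the fitted values.

```python
# A minimal sketch of the residuals-versus-fitted check, using
# invented toy data (hypothetical values, purely for illustration).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.8, 16.2]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form OLS: slope = cov(x, y) / var(x); intercept from the means.
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
     / sum((xi - mean_x) ** 2 for xi in x)
b0 = mean_y - b1 * mean_x

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# With an intercept, OLS residuals sum to (numerically) zero; what matters
# for diagnostics is whether they trend or fan out against the fitted values.
for fi, ri in zip(fitted, residuals):
    print(f"fitted={fi:6.2f}  residual={ri:+.3f}")
```

    With an intercept in the model the residuals sum to zero by construction, so any visible pattern against the fitted values points at specification problems, not at the estimator.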
    More broadly, I totally get your mindset right now. You want to do things right, and according to Google doing things right means ensuring that all these various assumptions are met. And to be sure they should be met, but minor violations here and there are not cause for great concern. That said, nonlinearity in a continuous predictor can dramatically alter interpretation of both the nonlinear parameter and others in the model. So look out for this above all else.
    Finally, if you're looking to present regression diagnostics in a paper or report of sorts, I would primarily focus on residuals, and perhaps something like leverage (to identify outliers) and/or dfbeta/Cook's d (to identify influential outliers). For example, say you identified nonlinearity in a continuous variable and wanted to show the effect of transforming the parameter. You could perhaps plot the difference in residuals between models, or maybe the difference in residuals against the difference in leverage. All these diagnostic measures can be computed via -predict-. Of course, if your model looks good, then just report that. Solid residuals via plot, low(ish) VIF, no highly influential outliers via plot, solid R-squared - and you're most probably good to go. So go!
    All the very best,
    Matt
    Last edited by Matthew Alexander; 24 Dec 2021, 09:02.



    • #3
      Originally posted by Matthew Alexander View Post
      Thank you for taking the time to respond with such a thorough and helpful explanation to my question. I really appreciate it! I hope you have a wonderful Christmas (if you celebrate it), and if not, I hope you have a great weekend.



      • #4
        Cassie:
        let's try to skim through all your points:
        1) the first threat is endogeneity. It can take, in general, three forms:
        a) latent variable: lurking within the residuals, it is correlated with both the regressand and one (or more) predictors (example: individual ability is correlated with both income [regressand] and education level [predictor]);
        b) simultaneous equations (example: does price determine quantity or the other way round? Actually, they are determined at the same time, where demand and supply intersect);
        c) reverse causation (example: a very low income can cause depression, but the reverse causation can work as well).
        Endogeneity may appear as an omitted variable (see -estat ovtest-, -linktest- and -imtest-);
        2) heteroskedasticity: already covered in your previous post-and-reply chains;
        3) autocorrelation of the -epsilon- error term: can be managed with the -vce(cluster clusterid)- option;
        4) you should report both regression and regression postestimation test outcomes.
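        Point 1a can be made concrete with a small simulation (a sketch in Python with invented coefficients, not a Stata procedure): a latent "ability" variable drives both the predictor and the regressand, and leaving it out biases the OLS slope.

```python
import random

# Toy simulation of the latent-variable case (hypothetical numbers):
# latent ability z drives both education x and income y, so regressing
# y on x alone biases the slope estimate.
random.seed(42)
n = 20_000
z = [random.gauss(0, 1) for _ in range(n)]        # latent ability
x = [zi + random.gauss(0, 1) for zi in z]         # education, correlated with z
y = [1.0 * xi + 1.0 * zi + random.gauss(0, 1)     # true effect of x is 1.0
     for xi, zi in zip(x, z)]

mean_x = sum(x) / n
mean_y = sum(y) / n
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
        / sum((xi - mean_x) ** 2 for xi in x)

# Omitted-variable formula: plim = 1 + cov(x, z) / var(x) = 1 + 1/2 = 1.5
print(round(slope, 2))
```

        Here the true effect of x is 1.0, but because cov(x, z) > 0 the omitted-variable formula predicts an estimate near 1.5, and that is roughly what the simulation recovers.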

        Merry Christmas
        Last edited by Carlo Lazzaro; 24 Dec 2021, 09:36.
        Kind regards,
        Carlo
        (StataNow 18.5)



        • #5
          Hi Cassie,
          It's no problem at all.
          I've been in your exact position, so I totally get that meeting all these requirements can seem a bit overwhelming. Carlo gives excellent advice - in particular, I failed to mention -linktest- as a means of assessing whether you have omitted important variables. Also, I have found that the multivariable fractional polynomial procedure - implemented by -mfp-, available from SSC - is a valuable and efficient tool for both assessing fit and significance, even more so if you are uncertain about your interpretation of your residuals.
          If you have any further questions, feel free to shoot me a message.
          Merry Christmas,
          Matt
          Last edited by Matthew Alexander; 24 Dec 2021, 09:24.



          • #6
            Originally posted by Matthew Alexander View Post
            Hi Matt! So many hours later (I'm going at a snail's pace because I'm new to data analysis and Stata) I have finally managed to do what you recommended. I hope you don't mind but I've got a few questions:

            1. I've created a regression table for my analysis, and I'm wanting to include the standard errors and confidence intervals. Is there a particular way I can do this without stuffing my analysis with numbers? I'm just concerned this is not good etiquette in a report. Or is this just something that is inevitable considering I want to include said S.E. and CI? Apologies if this seems like a low effort question.

            2. If I was to include a margins plot to show how the indicator variables - from my categorical variable - are not linear in relationship to my dependent variable, would I include this after my regression analysis? Or would this come before it? Again, apologies if this is really not something you can answer.

            3. Can I do a nonlinear regression - with my independent categorical variable - using log or square transformations? Or is it only for continuous variables?

            Thank you so much again for your help.



            • #7
              Cassie:
              I do hope Matt does not mind if I chime in:
              1) the way Stata -regress- tables are displayed is a good example of how OLS results should be reported;
              2) a non-linear relationship with the regressand refers to continuous predictors, whereas -margins- mainly focuses on categorical predictors. The usual goal is to investigate whether turning points (a maximum or minimum) exist.
              In this respect, both linear and squared terms should be included in the right-hand side of the regression equation;
              3) logging categorical variables does not help. Conversely, you can log the regressand and/or the predictors, provided they are continuous. However, be careful that log-linear, linear-log and log-log OLS regressions imply different interpretations of the effect of their coefficients on the regressand.
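              As a numeric footnote to point 3 (illustrative coefficient value only, not from any fitted model): in a log-linear regression ln(y) = b0 + b1*x, a one-unit increase in x multiplies y by exp(b1), i.e. roughly a 100*(exp(b1) - 1) percent change in y, whereas in a log-log model b1 reads directly as an elasticity.

```python
import math

# Hypothetical log-linear coefficient, for illustration only.
b1 = 0.05

# Exact percent change in y for a one-unit increase in x in ln(y) = b0 + b1*x.
pct_change = (math.exp(b1) - 1) * 100
print(round(pct_change, 3))  # close to, but not exactly, 100 * b1 = 5
```

              For small coefficients the familiar shortcut "b1 = percent change / 100" is close, but the exact exp() form is the safe interpretation to report.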
              Kind regards,
              Carlo
              (StataNow 18.5)



              • #8
                Hey Cassie. And, Merry Christmas.

                I'll just sort of extend what others have said, most of which is more stylistic than statistical.

                Regarding the violations of OLS assumptions and whether or not you discuss them, my advice is that you format this according to whatever your assignment directs you to do. If it's for a class (as it seems) and the whole point is to show knowledge of OLS or make the case for its utility, then by all means make the case. If not, then don't, unless you feel it necessary for some reason. In this business, all your models will be incorrect and misspecified somehow. The only difference is finding the right tool for the right job, being transparent about it, and showing how you took steps to correct for biases where you could. Knowing and discussing in detail what our estimators can and can't do is very important. As others have said, "There are no standard solutions, only standard problems" we face in some way. If we stopped at not being able to verify every assumption, nobody (least of all me!) would begin a new project.

                To your question on graphs... in my opinion, I always try as best as possible to give graphical results using the user-written coefplot or something else. Graphics are generally the best way to report findings, and there are literally several dozen articles on the subject I could recommend if you want. If you're not skilled enough with graphics yet, I would advise using the user-written estout/esttab (but maybe I'm out of date here, as other updates have been made since and I pretty much never use tables). So I prefer well-made and informative graphs.



                • #9
                  Hi Cassie,
                  Carlo pretty much covered everything that you need to know with his response. As he says, categorical/indicator variables don't require transformation since they cannot be nonlinear - so on that account you're pretty well set if none of your variables are continuous.
                  Elsewhere, marginsplot is generally used for post-estimation - that is, for showing the substantive effect of your predictors. Jared's suggestion to use -coefplot- from SSC is a good one. In this way, you could show the coefficient or average marginal (discrete) effect of your IV, and that of the other predictors, in the same plot. This sort of plot would likely belong near the end of your analysis.
                  And to summarise what has been said above: if you look hard enough, you will likely find some assumption that, on the basis of some article or post, appears to have been violated by your model. As Jared implies, be open about this, but don't get too hung up on it, especially if the (prospective) fix requires more time than you have.
                  All the very best,
                  Matt
                  Last edited by Matthew Alexander; 24 Dec 2021, 18:38.



                  • #10
                    Originally posted by Carlo Lazzaro View Post
                    Thank you so much Carlo for this advice and your previous comment. I always find it so difficult to find a straight answer on Google. Really appreciate the time you took and I hope you had a lovely Christmas!



                    • #11
                      Originally posted by Jared Greathouse View Post
                      So as you might have guessed - I'm doing this for an assessment. The question is pretty straightforward, but we aren't told exactly how to format it. All I've had in my lectures is how to do regression - that's it. There's nothing about OLS assumptions etc. However, I am really interested in data analysis, and I want to understand OLS better, hence why I fell down the rabbit hole of "OLS assumptions".

                      I would love to know more about coefplot, and if you have any articles you would recommend, please send them my way! I spend about 10 hours a day, 5 days a week just learning about data analysis and Stata, so I'm always happy to learn more! I've been practicing graphs in Stata, and I find myself regularly googling "beautiful graphs" to find good examples haha.

                      Thank you so much for your advice and time. I hope you had a wonderful Christmas (if that's something you celebrate).



                      • #12
                        Originally posted by Matthew Alexander View Post
                        If I'm using a marginsplot, or coefplot, for post estimation, how should I explain it? Apologies if this is an obvious question. Do I say what the coefficient is for each indicator variable, say whether or not it's disproved my hypothesis and leave it at that?



                        • #13
                          Originally posted by Carlo Lazzaro View Post
                          I have a question and I apologise if it's an obvious one. How do I find out whether the turning points exist? Are these just the coefficients that come up when I type - margins independent variable - after my regression? Or is this something I can interpret from the graph?



                          • #14
                            Cassie:
                            I do hope you had a lovely Christmas too.
                            Sorry for my late reply, but as the year comes to its end, I have to skim through tons of red tape (that I deliberately put off before).
                            The issue boils down to a quadratic equation (parabola).
                            In the following toy-example:
                            Code:
                            . use "https://www.stata-press.com/data/r17/nlswork.dta"
                            (National Longitudinal Survey of Young Women, 14-24 years old in 1968)
                            
                            . xtreg ln_wage c.age##c.age
                            
                            Random-effects GLS regression                   Number of obs     =     28,510
                            Group variable: idcode                          Number of groups  =      4,710
                            
                            R-squared:                                      Obs per group:
                                 Within  = 0.1087                                         min =          1
                                 Between = 0.1015                                         avg =        6.1
                                 Overall = 0.0870                                         max =         15
                            
                                                                            Wald chi2(2)      =    3388.51
                            corr(u_i, X) = 0 (assumed)                      Prob > chi2       =     0.0000
                            
                            ------------------------------------------------------------------------------
                                 ln_wage | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
                            -------------+----------------------------------------------------------------
                                     age |   .0590339   .0027172    21.73   0.000     .0537083    .0643596
                                         |
                             c.age#c.age |  -.0006758   .0000451   -15.00   0.000    -.0007641   -.0005876
                                         |
                                   _cons |   .5479714   .0397476    13.79   0.000     .4700675    .6258752
                            -------------+----------------------------------------------------------------
                                 sigma_u |   .3654049
                                 sigma_e |  .30245467
                                     rho |  .59342665   (fraction of variance due to u_i)
                            ------------------------------------------------------------------------------
                            
                            . di .0590339/(-(2* -.0006758))
                            43.677049
                            
                            . sum age
                            
                                Variable |        Obs        Mean    Std. dev.       Min        Max
                            -------------+---------------------------------------------------------
                                     age |     28,510    29.04511    6.700584         14         46
                            
                            .
                            the squared term is telling us that there's evidence of a quadratic relationship between -ln_wage- and -age- (a maximum, in fact, as the sign of the squared term coefficient is negative).
                            To check whether the maximum is included in the range of -age-, we -summarize- the predictor, and the result supports the evidence of the abovementioned turning point.
                            Therefore, you can investigate whether a turning point exists just by plugging both the linear and the squared terms into the right-hand side of your regression equation.
                            If they are both significant and the value of the turning point falls within the range of your predictor, you have a maximum or a minimum (depending on the sign of the squared term coefficient being < or > 0).
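                            The arithmetic of the turning point shown in the -di- line above can be replicated in a few lines (Python here purely for illustration; the coefficients are those from the -xtreg- output):

```python
# Coefficients taken from the xtreg output above.
b_age = 0.0590339      # linear term on age
b_age2 = -0.0006758    # squared term (negative => the parabola has a maximum)

# Setting the derivative to zero: b_age + 2 * b_age2 * age = 0
# gives the turning point age* = -b_age / (2 * b_age2).
turning_point = -b_age / (2 * b_age2)
print(round(turning_point, 6))  # 43.677049, matching the -di- line
```

                            Since the turning point (about 43.7) falls inside the observed range of -age- (14 to 46), the maximum is substantively meaningful.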
                            Kind regards,
                            Carlo
                            (StataNow 18.5)



                            • #15
                              Originally posted by Cassie Wright View Post

                              Hi Cassie,
                              The general approach to presenting the results of these kinds of plots would be to
                              a) highlight the coefficient/marginal estimate of the main effect in terms of direction (positive or negative), strength and statistical significance (is p < .05);
                              b) compare the coefficient/marginal estimate of the main effect to that of your controls. Perhaps there is one particular control that, for theoretical reasons, is a natural comparison. Or you may find that when you graph the data, one particular control estimate stands out in some way as an informative point of comparison. I would say, though, that the first method - using a priori comparison points - is the more theoretically defensible approach.

                              In terms of your hypothesis, the direction, strength and statistical significance of your main effect should, together, inform the bulk of your conclusion. If your hypothesis is that x has a strong, positive effect on y, and the coefficient for x is positive, strong and statistically significant, then your findings can be used as evidence to support your hypothesis. And, if the coefficient of x is, say, stronger or at least as strong as a number of controls with established association with y, then you may use this finding to further underline the association between x and y in comparative terms.

                              Of course, be careful not to overstate the extent to which your findings support your hypothesis (if they do). Inevitably, your model will be limited in some way. You need to acknowledge this, and briefly explain how these limitations may have affected your findings and thus the reliability and/or generalisability of your conclusions. Also, by limitations I do not just mean things largely outside of your control (e.g. omitted variables, sample bias), but also things within your control that you have not accounted for because they are beyond the scope of your analysis, e.g. variability between sub-populations, or variability due to the potential dependency of the effect of x on the effect of one or more of your controls (aka interaction effects).

                              Hope you had a smashing Christmas,
                              Matt
                              Last edited by Matthew Alexander; 27 Dec 2021, 18:15.
