
  • Two-way ANOVA or hierarchical multiple linear regression? And outliers

    Dear forum,

    I have two questions that I am struggling with: the first is about which model I should use in Stata, and the second is about outlier problems with each model.
    I'm a bit of a beginner, but I will try to explain things to the best of my ability.

    I compiled a dataset of around 130 CEO successions at about 120 different companies. I want to test the relationship between post-succession ROA and CEO type, moderated by board composition. So, in short, a moderation relationship.

    DV: post-succession ROA (continuous)
    IV1: CEO type (nominal with 3 types)
    IV2: Board composition (nominal with 2 types)

    I have several other control variables:
    - Year (nominal)
    - Board size (continuous)
    - pre-succession ROA (continuous)
    - Industry SIC 2-digit (nominal)

    My first question is whether I should run a two-way ANOVA or a hierarchical multiple linear regression, or perhaps another model altogether?
    I tried the two-way ANOVA but got stuck on the no-outliers assumption. I cannot decide whether I need to remove my outliers, and I do not know how transforming my data would affect my research. I installed the user-written extremes command to identify my outliers and used it together with the scatter command to assess them. I do not know whether my outliers matter, and I am unsure how to deal with them, let alone whether to remove them.
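
    For reference, this is roughly what I ran to look at the outliers (just a sketch; POST_ROA and PRE_ROA are my post- and pre-succession ROA variables):

    Code:
    * rough sketch of the outlier checks described above
    ssc install extremes            // user-written -extremes- (Nicholas J. Cox, SSC)
    extremes POST_ROA               // list the most extreme values of the DV
    extremes POST_ROA, n(10)        // show the 10 highest and lowest values
    scatter POST_ROA PRE_ROA        // eyeball the joint distribution for unusual points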

    For the hierarchical regression, I think this is simply a multiple linear regression in which I test three nested equations: the first has the DV with IV1 and IV2, the second adds the interaction between IV1 and IV2, and the third adds the controls. If I am wrong, please correct me.
    The dwstat command gives me 2.1, so I believe the independence assumption is met.
    Linearity is checked using a twoway scatter with lfit.
    Here I also wonder about outliers, as some IVs and controls show widely deviating values, which makes it hard to tell whether linearity holds. Also, linearity is automatically satisfied for categorical dummies, right? I read this somewhere and wonder if it is true.
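
    To be concrete, the linearity check I mean looks roughly like this (a sketch, using two continuous variables as examples):

    Code:
    * sketch: scatter plus fitted line for a continuous predictor against the DV
    twoway (scatter POST_ROA PRE_ROA) (lfit POST_ROA PRE_ROA)
    twoway (scatter POST_ROA BSIZE) (lfit POST_ROA BSIZE)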

    Thank you for your time; I hope you can help me.

  • #2
    In case it helps to answer my question, I could share the data using dataex.



    • #3
      You are dumping a whole study into a listserv. There are lots of questions here, maybe too many for a listserv setting. Nothing in your preparation sounds off (it is interesting, actually), but it does sound like you have a model more complicated than your experience with similar models. There is absolutely nothing wrong with that; I just think you will get more mileage from a sit-down with a statistician or econometrician.

      Moderation can be tested with an interaction, yes. It also sounds like one of the xt or panel-data commands could be part of your solution. You can use anova, but I would not in this context; use regression or a panel-data regression model. Start with simpler models and build your way up to more complex ones, perhaps producing margins plots or plots of model predictions to understand what the model is doing. Outliers are typically examined in the context of a model, but you can also look at them variable by variable or as multivariate outliers. Models typically come with a host of assumptions, and you may want graphs or tests to evaluate each of them.
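
      As a rough illustration of what I mean by building up and plotting predictions (a sketch only, with placeholder names: y for your outcome, a and b for your two categorical predictors):

      Code:
      * sketch with placeholder names -- not your actual variables
      regress y i.a                  // simplest model: one factor
      regress y i.a i.b              // add the second factor
      regress y i.a##i.b             // add the interaction (the moderation)
      margins a#b                    // predicted means for each cell of the interaction
      marginsplot, xdimension(a)     // plot them to see what the model is doing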



      • #4
        Dear Dave,

        Thank you for your reply. I would like to have a sit-down, but unfortunately I do not know a professor at my university who would be available, so I will try to make the most of this thread.
        Based on your response I will stick with multiple linear regression. I do not think my data are structured as panel data, and I thought that xtset was only meant for panel data, but I am not sure.

        About my data: the control YEAR is only present to account for year fixed effects. The dataset therefore does not include years in which no CEO successions took place. However, ROA is measured as a 3-year average before and after the succession, so a certain time element is present.

        At this point, I think it is best to share my data with you to improve our conversation. It's the first 20 observations; I removed the company-name variable as I do not think I am allowed to share that.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input int YEAR byte(DUAL BSIZE) double B_INS byte(BTYPE CTYPE) double(PRE_ROA POST_ROA) long SALES byte SIC2 float(FSIZE LSALES)
        2007 0  8              .375 0 1   .17606826423766586  .14884545826271098  7493520 27  15.82955  15.82955
        2007 0 10                .4 0 1    -.680418365231353   .2503206061462756  2407170 28 14.693962 14.693962
        2007 0 11 .2727272727272727 0 3  -.11191641273080939  .13209086440928794 14474280 28 16.487885 16.487885
        2008 0  8              .125 0 1  .059479414170512175   .1903741075077058   206130 28 12.236262 12.236262
        2007 0  7 .2857142857142857 0 3  .026865259690350498 -.00321642565911325  3428340 35 15.047586 15.047586
        2008 0  6 .3333333333333333 0 1  -.06682236643692334  -.5178135323261129  2393030 36  14.68807  14.68807
        2007 0  7 .2857142857142857 0 1 .0032080218032827404 -.10856845717065468   465270 36 13.050373 13.050373
        2005 0  8                 0 1 1   -.2525682835847794  .20809135010036087 51390000 38 17.754953 17.754953
        2009 0  9 .1111111111111111 0 2   2.9975395803475595   .1110348008984093 11719020 60 16.276724 16.276724
        2005 0  8               .25 1 1   .16252462319546235 -.19194964942522288 12025950 73 16.302578 16.302578
        end


        I am aware of the assumptions that I need to meet. Currently I have independence of observations, according to dwstat.
        Linearity is a bit of a struggle. I can see fairly consistent linear relationships between my continuous variables, but some observations sit far off the line, hence the outliers. My first guess was simply to delete these observations because I suspected some of the data were off. However, I have no way of checking whether these observations are actually bogus or just extreme real cases. For instance, if a company's performance is very low, it could simply be making a gigantic loss, which would show up as an outlier in my dataset.

        I do not entirely understand what you mean by starting with simpler models. I planned to start with a simple model and then build on it. The models that I hope to run are as follows:

        1. reg POST_ROA i.CTYPE i.BTYPE
        2. reg POST_ROA i.CTYPE i.BTYPE i.CTYPE#i.BTYPE
        3. reg POST_ROA i.CTYPE i.BTYPE i.CTYPE#i.BTYPE i.YEAR c.LSALES c.BSIZE i.SIC2 i.DUAL

        This third and full equation would be:
        Post-succession ROA = beta0 (constant) + beta1(CEO type) + beta2(board type) + beta3(interaction of CTYPE and BTYPE) + betas for the controls (year fixed effects, firm size as log sales, board size, industry at the 2-digit SIC level, and CEO duality) + the error term.
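
        To see whether the interaction in model 2 adds anything over model 1, I was thinking of something like this (a sketch using the ## factor-variable shorthand):

        Code:
        * sketch: model 2 written with the ## shorthand, then a joint test of the interaction
        regress POST_ROA i.CTYPE##i.BTYPE
        testparm i.CTYPE#i.BTYPE       // joint F-test of all interaction terms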

        Maybe this helps.

        How would you advise me to continue? Should I be using xtset? And what should I do with outliers that are realistic?

        Thanks again



        • #5
          First, you need to understand that two-way ANOVA and multiple regression (MR) have exactly the same statistical basis -- least squares. You will get the same results from your second regression model as you would from a two-way ANOVA with an interaction. You might run it both ways to convince yourself that this is true. You will find that MR is much more flexible and yields additional information. You are correct that categorical variables such as CEO type (i.CTYPE) are "automatically" taken care of by your model specification, but you need to be sure you understand how to interpret the output. As to outliers, you might think in terms of assessing how they affect your results. Look at the help file for regress postestimation, particularly the dfbeta statistic.
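
          For example, something along these lines should give you the same F-test for the interaction both ways (a sketch using the variable names from your dataex):

          Code:
          * sketch: the two-way ANOVA and the regression are the same least-squares model
          anova POST_ROA CTYPE##BTYPE
          regress                             // replays the ANOVA fit as a regression table
          regress POST_ROA i.CTYPE##i.BTYPE   // fits the same model directly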
          Richard T. Campbell
          Emeritus Professor of Biostatistics and Sociology
          University of Illinois at Chicago



          • #6
            There's no stats clinic at your U?

            A model like:

            reg POST_ROA i.CTYPE##i.BTYPE i.YEAR

            with year entered additively (no interactions with year) means that the other effects in the model do not change across years; the fitted values simply shift by a constant from year to year. For example, you are allowing only an intercept shift in the i.CTYPE#i.BTYPE interaction from year to year: the shape of the interaction is constant across levels of year, even though it may shift up or down with year. This is very different from allowing CTYPE or BTYPE, or their interaction, to interact with time. Have you made the choice to include additive covariates deliberately and with some care? Plotting your model predictions helps you understand how the model restricts the fit; try plotting the interaction over time from your model. Are you going to evaluate how well the covariates fit, entered the way you have them? Might there also be nonlinear relationships for your continuous covariates? This is what I mean by building a model up. I'm not suggesting you overcomplicate your model, just that it might require some extra work.
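
            A sketch of the comparison I have in mind, plotting the cell predictions with and without letting the interaction vary by year (with a small sample some year cells may be empty, so treat this as illustration only):

            Code:
            * sketch: additive year effect vs. letting the CTYPE#BTYPE pattern vary by year
            regress POST_ROA i.CTYPE##i.BTYPE i.YEAR
            margins CTYPE#BTYPE
            marginsplot, xdimension(CTYPE)

            regress POST_ROA i.CTYPE##i.BTYPE##i.YEAR
            margins CTYPE#BTYPE, over(YEAR)
            marginsplot, xdimension(CTYPE) bydimension(YEAR)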



            • #7
              Dear Dick and Dave,

              Thank you both for your replies, and sorry for the late response; I did not fully understand what to do next. The suggestion to check DFBETA helped a lot. Thank you. I found the following link useful for understanding it further (in case someone needs it in the future): https://stats.idre.ucla.edu/stata/we...n-diagnostics/

              I have chosen to delete some of my observations because they showed high values on the studentized residuals, leverage, Cook's D, DFFITS, and DFBETA. My justification was that the outlying firms with those levels of performance were likely documented incorrectly, which would not be so strange, as other mistakes were present in the database. If the values were not due to measurement error, I reasoned that leaving the extreme outliers in would still not allow me to study the general effect of CEO succession, as 93% of my observations would be dominated by the other 7%.

              I hope this gives enough reason to drop the extreme outliers that I found. There were also other outliers, but they were quite mild in comparison, so they remained in the sample.
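
              For anyone reading this later, the diagnostics were computed roughly like this (a sketch; the cutoffs are the usual rules of thumb from the linked page):

              Code:
              * sketch: influence diagnostics after the full model
              regress POST_ROA i.CTYPE##i.BTYPE i.YEAR c.LSALES c.BSIZE i.SIC2 i.DUAL
              predict rstu, rstudent          // studentized residuals
              predict lev, leverage           // leverage (hat values)
              predict cooks, cooksd           // Cook's distance
              predict dfits, dfits            // DFFITS (Stata's option is dfits)
              dfbeta                          // creates _dfbeta_* for each coefficient
              * flag observations beyond the usual rules of thumb
              gen byte flag = (abs(rstu) > 2 | cooks > 4/e(N) | ///
                  abs(dfits) > 2*sqrt((e(df_m)+1)/e(N))) if e(sample)
              list rstu lev cooks dfits if flag == 1, clean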

              My question is whether this is acceptable from a research point of view.

              Kind regards,
              Warner





              • #8
                If you have good reason to believe that the observations you deleted were erroneous, it is certainly better to remove them than to leave them in. Of course, you will want to document what you did in any papers you submit for publication. Rather than deleting entire observations, you might treat this as a missing-data problem and consider appropriate imputation methods, as documented in Stata's multiple-imputation (MI) manuals. You've indicated you are a "bit of a beginner," and you have to decide how much you want to invest in learning additional methods at this point.
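
                A minimal sketch of that route, assuming you first set the suspect values of the outcome to missing (the flag variable and the imputation model here are purely illustrative; in practice the imputation model should include your analysis variables):

                Code:
                * sketch: impute the suspect values rather than drop the whole rows
                replace POST_ROA = . if flag == 1     // flag = your own indicator of suspect values
                mi set mlong
                mi register imputed POST_ROA
                mi impute regress POST_ROA i.CTYPE i.BTYPE c.PRE_ROA c.LSALES c.BSIZE, add(20) rseed(12345)
                mi estimate: regress POST_ROA i.CTYPE##i.BTYPE i.YEAR c.LSALES c.BSIZE i.SIC2 i.DUAL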
                Richard T. Campbell
                Emeritus Professor of Biostatistics and Sociology
                University of Illinois at Chicago

