
  • Two-way ANOVA or hierarchical multiple linear regression? And outliers

    Dear forum,

    I have two questions that I am struggling with: the first is about which model I should use in Stata, and the second is about outlier problems with each model.
    I'm a bit of a beginner, but I will try to explain things to the best of my ability.

    I compiled a dataset of around 130 CEO successions at about 120 different companies. I want to test the relationship between post-succession ROA and CEO type, moderated by board composition. So, in short, a moderation relationship.

    DV: post-succession ROA (continuous)
    IV1: CEO type (nominal with 3 types)
    IV2: Board composition (nominal with 2 types)

    I have several other control variables:
    - Year (nominal)
    - Board size (continuous)
    - pre-succession ROA (continuous)
    - Industry SIC 2-digit (nominal)

    My first question is whether I should run a two-way ANOVA or a hierarchical multiple linear regression, or perhaps another model altogether?
    I tried the two-way ANOVA but got stuck on the no-outliers assumption. I cannot decide whether I need to remove my outliers, and I do not know how transforming my data would affect my research. I installed the user-written extremes command to identify my outliers and used it together with the scatter command to assess them. I do not know whether my outliers matter, and I am unsure how to deal with them, let alone whether to remove them.
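
    For reference, this is roughly what I ran to look at the outliers (just a sketch; POST_ROA and PRE_ROA are my post- and pre-succession ROA variables):

    Code:
    * rough sketch of the outlier checks described above
    ssc install extremes            // user-written -extremes- (Nicholas J. Cox, SSC)
    extremes POST_ROA               // list the most extreme values of the DV
    extremes POST_ROA, n(10)        // show the 10 highest and lowest values
    scatter POST_ROA PRE_ROA        // eyeball the joint distribution for unusual points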

    For the hierarchical regression, I think this is simply a multiple linear regression in which I test three nested equations: the first has the DV with IV1 and IV2, the second adds the interaction between IV1 and IV2, and the third adds the controls. If I am wrong, please correct me.
    The dwstat command gives me 2.1, so I believe the independence assumption is met.
    Linearity is checked using a twoway scatter with lfit.
    Here I also wonder about outliers, as some IVs and controls show widely deviating values, which makes it hard to tell whether linearity holds. Also, linearity is automatically satisfied for categorical dummies, right? I read this somewhere and wonder if it is true.
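
    To be concrete, the linearity check I mean looks roughly like this (a sketch, using two continuous variables as examples):

    Code:
    * sketch: scatter plus fitted line for a continuous predictor against the DV
    twoway (scatter POST_ROA PRE_ROA) (lfit POST_ROA PRE_ROA)
    twoway (scatter POST_ROA BSIZE) (lfit POST_ROA BSIZE)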

    Thank you for your time; I hope you can help me.

  • #2
    In case it helps to answer my question, I could share the data using dataex.



    • #3
      You are dumping a whole study into a listserv. There are lots of questions here, maybe too many for a listserv setting. Nothing in your preparation sounds off (it is interesting, actually), but it does sound like you have a model more complicated than your experience with similar models. There is absolutely nothing wrong with that; I just think you will get more mileage from a sit-down with a statistician or econometrician.

      Moderation can be tested with an interaction, yes. It also sounds like one of the xt or panel-data commands could be part of your solution. You can use anova, but I would not in this context; use regression or a panel-data regression model. Start with simpler models and build your way up to more complex ones, perhaps producing margins plots or plots of model predictions to understand what the model is doing. Outliers are typically examined in the context of a model, but you can also look at them variable by variable or as multivariate outliers. Models typically come with a host of assumptions, and you may want graphs or tests to evaluate each of them.
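
      As a rough illustration of what I mean by building up and plotting predictions (a sketch only, with placeholder names: y for your outcome, a and b for your two categorical predictors):

      Code:
      * sketch with placeholder names -- not your actual variables
      regress y i.a                  // simplest model: one factor
      regress y i.a i.b              // add the second factor
      regress y i.a##i.b             // add the interaction (the moderation)
      margins a#b                    // predicted means for each cell of the interaction
      marginsplot, xdimension(a)     // plot them to see what the model is doing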



      • #4
        Dear Dave,

        Thank you for your reply. I would like to have a sit-down, but unfortunately I do not know a professor at my university who would be available, so I will try to make the most of this thread.
        Based on your response I will stick with multiple linear regression. I do not think my data are structured as panel data, and I thought that xtset was only meant for panel data, but I am not sure.

        About my data: the control YEAR is only present to account for year fixed effects. The dataset therefore does not include years in which no CEO successions took place. However, ROA is measured as a 3-year average before and after the succession, so a certain time element is present.

        At this point, I think it is best to share my data with you to improve our conversation. It's the first 20 observations; I removed the company-name variable as I do not think I am allowed to share that.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input int YEAR byte(DUAL BSIZE) double B_INS byte(BTYPE CTYPE) double(PRE_ROA POST_ROA) long SALES byte SIC2 float(FSIZE LSALES)
        2007 0  8              .375 0 1   .17606826423766586  .14884545826271098  7493520 27  15.82955  15.82955
        2007 0 10                .4 0 1    -.680418365231353   .2503206061462756  2407170 28 14.693962 14.693962
        2007 0 11 .2727272727272727 0 3  -.11191641273080939  .13209086440928794 14474280 28 16.487885 16.487885
        2008 0  8              .125 0 1  .059479414170512175   .1903741075077058   206130 28 12.236262 12.236262
        2007 0  7 .2857142857142857 0 3  .026865259690350498 -.00321642565911325  3428340 35 15.047586 15.047586
        2008 0  6 .3333333333333333 0 1  -.06682236643692334  -.5178135323261129  2393030 36  14.68807  14.68807
        2007 0  7 .2857142857142857 0 1 .0032080218032827404 -.10856845717065468   465270 36 13.050373 13.050373
        2005 0  8                 0 1 1   -.2525682835847794  .20809135010036087 51390000 38 17.754953 17.754953
        2009 0  9 .1111111111111111 0 2   2.9975395803475595   .1110348008984093 11719020 60 16.276724 16.276724
        2005 0  8               .25 1 1   .16252462319546235 -.19194964942522288 12025950 73 16.302578 16.302578
        end


        I am aware of the assumptions that I need to meet. Currently I have independence of observations, according to dwstat.
        Linearity is a bit of a struggle. I can see fairly consistent linear relationships between my continuous variables, but some observations sit far off the line, hence the outliers. My first guess was simply to delete these observations because I suspected some of the data were off. However, I have no way of checking whether these observations are actually bogus or just extreme real cases. For instance, if a company's performance is very low, it could simply be making a gigantic loss, which would show up as an outlier in my dataset.

        I do not entirely understand what you mean by starting with simpler models. I planned to start with a simple model and then build on it. The models that I hope to run are as follows:

        1. reg POST_ROA i.CTYPE i.BTYPE
        2. reg POST_ROA i.CTYPE i.BTYPE i.CTYPE#i.BTYPE
        3. reg POST_ROA i.CTYPE i.BTYPE i.CTYPE#i.BTYPE i.YEAR c.LSALES c.BSIZE i.SIC2 i.DUAL

        This third and full equation would be:
        Post-succession ROA = beta0 (constant) + beta1(CEO type) + beta2(board type) + beta3(interaction of CTYPE and BTYPE) + betas for the controls (year fixed effects, firm size as log sales, board size, industry at the 2-digit SIC level, and CEO duality) + the error term.
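
        To see whether the interaction in model 2 adds anything over model 1, I was thinking of something like this (a sketch using the ## factor-variable shorthand):

        Code:
        * sketch: model 2 written with the ## shorthand, then a joint test of the interaction
        regress POST_ROA i.CTYPE##i.BTYPE
        testparm i.CTYPE#i.BTYPE       // joint F-test of all interaction terms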

        Maybe this helps.

        How would you advise me to continue? Should I be using xtset? And what should I do with outliers that are realistic?

        Thanks again



        • #5
          First, you need to understand that two-way ANOVA and multiple regression (MR) have exactly the same statistical basis -- least squares. You will get the same results from your second regression model as you would from a two-way ANOVA with an interaction. You might run it both ways to convince yourself that this is true. You will find that MR is much more flexible and yields additional information. You are correct that categorical variables such as CEO type (i.CTYPE) are "automatically" taken care of by your model specification, but you need to be sure you understand how to interpret the output. As to outliers, you might think in terms of assessing how they affect your results. Look at the help file for regress postestimation, particularly the dfbeta statistic.
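
          For example, something along these lines should give you the same F-test for the interaction both ways (a sketch using the variable names from your dataex):

          Code:
          * sketch: the two-way ANOVA and the regression are the same least-squares model
          anova POST_ROA CTYPE##BTYPE
          regress                             // replays the ANOVA fit as a regression table
          regress POST_ROA i.CTYPE##i.BTYPE   // fits the same model directly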
          Richard T. Campbell
          Emeritus Professor of Biostatistics and Sociology
          University of Illinois at Chicago



          • #6
            There's no stats clinic at your U?

            A model like:

            reg POST_ROA i.CTYPE##i.BTYPE i.YEAR

            with year entered additively (no interactions with year) means that the other effects in the model do not change across years; the fitted values simply shift by a constant from year to year. For example, you are allowing only an intercept shift in the i.CTYPE#i.BTYPE interaction from year to year: the shape of the interaction is constant across levels of year, even though it may shift up or down with year. This is very different from allowing CTYPE or BTYPE, or their interaction, to interact with time. Have you made the choice to include additive covariates deliberately and with some care? Plotting your model predictions helps you understand how the model restricts the fit; try plotting the interaction over time from your model. Are you going to evaluate how well the covariates fit, entered the way you have them? Might there also be nonlinear relationships for your continuous covariates? This is what I mean by building a model up. I'm not suggesting you overcomplicate your model, just that it might require some extra work.
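
            A sketch of the comparison I have in mind, plotting the cell predictions with and without letting the interaction vary by year (with a small sample some year cells may be empty, so treat this as illustration only):

            Code:
            * sketch: additive year effect vs. letting the CTYPE#BTYPE pattern vary by year
            regress POST_ROA i.CTYPE##i.BTYPE i.YEAR
            margins CTYPE#BTYPE
            marginsplot, xdimension(CTYPE)

            regress POST_ROA i.CTYPE##i.BTYPE##i.YEAR
            margins CTYPE#BTYPE, over(YEAR)
            marginsplot, xdimension(CTYPE) bydimension(YEAR)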



            • #7
              Dear Dick and Dave,

              Thank you both for your replies, and sorry for the late response; I did not fully understand what to do next. The suggestion to check DFBETA helped a lot. Thank you. I found the following link useful for understanding it further (in case someone needs it in the future): https://stats.idre.ucla.edu/stata/we...n-diagnostics/

              I have chosen to delete some of my observations because they showed high values on the studentized residuals, leverage, Cook's D, DFFITS, and DFBETA. My justification was that the outlying firms with those levels of performance were likely documented incorrectly, which would not be so strange, as other mistakes were present in the database. If the values were not due to measurement error, I reasoned that leaving the extreme outliers in would still not allow me to study the general effect of CEO succession, as 93% of my observations would be dominated by the other 7%.

              I hope this gives enough reason to drop the extreme outliers that I found. There were also other outliers, but they were quite mild in comparison, so they remained in the sample.
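
              For anyone reading this later, the diagnostics were computed roughly like this (a sketch; the cutoffs are the usual rules of thumb from the linked page):

              Code:
              * sketch: influence diagnostics after the full model
              regress POST_ROA i.CTYPE##i.BTYPE i.YEAR c.LSALES c.BSIZE i.SIC2 i.DUAL
              predict rstu, rstudent          // studentized residuals
              predict lev, leverage           // leverage (hat values)
              predict cooks, cooksd           // Cook's distance
              predict dfits, dfits            // DFFITS (Stata's option is dfits)
              dfbeta                          // creates _dfbeta_* for each coefficient
              * flag observations beyond the usual rules of thumb
              gen byte flag = (abs(rstu) > 2 | cooks > 4/e(N) | ///
                  abs(dfits) > 2*sqrt((e(df_m)+1)/e(N))) if e(sample)
              list rstu lev cooks dfits if flag == 1, clean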

              My question is whether this is acceptable from a research point of view.

              Kind regards,
              Warner





              • #8
                If you have good reason to believe that the observations you deleted were erroneous, it is certainly better to remove them than to leave them in. Of course, you will want to document what you did in any papers you submit for publication. Rather than deleting entire observations, you might treat this as a missing-data problem and consider appropriate imputation methods, as documented in Stata's multiple-imputation (MI) manuals. You've indicated you are a "bit of a beginner," and you have to decide how much you want to invest in learning additional methods at this point.
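
                A minimal sketch of that route, assuming you first set the suspect values of the outcome to missing (the flag variable and the imputation model here are purely illustrative; in practice the imputation model should include your analysis variables):

                Code:
                * sketch: impute the suspect values rather than drop the whole rows
                replace POST_ROA = . if flag == 1     // flag = your own indicator of suspect values
                mi set mlong
                mi register imputed POST_ROA
                mi impute regress POST_ROA i.CTYPE i.BTYPE c.PRE_ROA c.LSALES c.BSIZE, add(20) rseed(12345)
                mi estimate: regress POST_ROA i.CTYPE##i.BTYPE i.YEAR c.LSALES c.BSIZE i.SIC2 i.DUAL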
                Richard T. Campbell
                Emeritus Professor of Biostatistics and Sociology
                University of Illinois at Chicago

