  • panel data outliers detection useful commands

    Hello Statalist community,
    I hope all of you are fine.
    I am running a panel data analysis to study the effect of some bank-specific variables on NPLs (non-performing loans on banks' balance sheets). I am carrying out fixed-effects and random-effects estimation with the command xtreg on 109 cross-sectional units, using the vce(cluster id) option to obtain standard errors robust to within-panel correlation.
    However, I would like to carry out some post-estimation analysis and better "shape" my dataset, by which I mean get rid of some outliers in my cross-sectional units.
    I have read papers and previous discussions here on the forum, and it emerges that there is no clear consensus about how to detect outliers; much depends on the type of data you are working with and, once you have discovered an outlier, on how to deal with it. The judgment of the researcher is of great importance as well.
    My request is the following:
    What Stata commands are considered the most appropriate to detect outliers?
    Scatterplots between two variables (independent and dependent)? Summary statistics showing quartile values?
    I know maybe considering the nature of my request you may be of limited help, but still, I would like to hear different opinions.
    I must say that I took note of some cross-sectional units that, for a given variable, show a trend diverging from what is prevalent and dominant among the other cross-sectional units. I did this in Excel.
    I don't think scatterplots will be very helpful for detecting these "different" or "suspicious" trends (it does not matter what we call them, as long as we agree they show a trait of divergence from the main observed path).

    For this reason, what Stata commands and graphs would you suggest I should use? The purpose is to get some hints and pieces of advice as I am not expecting any "right" answers.
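
    A minimal sketch of some commonly suggested starting points, purely for illustration — the variable names npl, id, and year are placeholders for your own, and extremes is a user-written command by Nick Cox that must be installed from SSC:

    ```
    * quick look at the distribution, including percentiles
    summarize npl, detail

    * with 109 units, an overlaid panel line plot makes diverging
    * trends visible directly (assumes the data are -xtset id year-)
    xtline npl, overlay legend(off)

    * -extremes- (SSC) lists the most extreme observations of a
    * variable alongside identifying variables
    ssc install extremes
    extremes npl id year
    ```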

    Thanks to everybody,
    Greetings,

  • #2
    Since you want a variety of opinions, I'll share my own views, which I think are somewhat extreme, but I believe they are quite well justified.

    The only fully legitimate reason to exclude an outlier is if it is a data error. And, indeed, when I begin cleaning data sets prior to analysis, I look for values that are outside the range of the possible or are suspicious. For example (I'm an epidemiologist) I will reject a diastolic blood pressure of 10 if the data are from an outpatient setting because free-living survival with a diastolic blood pressure that low is not possible. I would not reject that same diastolic blood pressure in an intensive care unit, where it probably foretells a poor outcome, but is a biologic possibility. Returning to the outpatient setting, if I see a diastolic blood pressure of 45, I will flag it as questionable and contact the people who collected the data to ask them to verify if it was correct. But if they respond that, yes, that was really the value, then it stays.
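
    This kind of range screening might be sketched in Stata as follows — the variable names and cutoffs here are hypothetical, echoing the blood-pressure example above:

    ```
    * list impossible values for correction as data errors
    list patid dbp if dbp <= 10 & !missing(dbp)

    * flag low-but-possible values to verify with the data collectors;
    * if they confirm the value, the observation stays
    gen byte verify_dbp = dbp > 10 & dbp <= 45 if !missing(dbp)
    list patid dbp if verify_dbp == 1
    ```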

    The common practice of excluding outliers just because they lie outside a desired range is one I find problematic, for reasons that depend on the role the variable in question plays in the study.

    Excluding outliers on explanatory variables (predictors or covariates) limits the generalizability of your study findings. You cannot apply any conclusions you draw from your study to other cases where an outlying value is realized. Now, it may be that beyond a certain range of values, the necessary assumptions of your modeling break down. For example, linearity of relationship to outcome may fail outside some range. Fine. But then you have to accept that your model only applies within that range. And if possible, the best science would be to then continue to develop a different (but, presumably, related) model that does work with the outlying values. Or perhaps a transformation of the outcome variable will enable you to incorporate all of the data in a single model. If none of that is possible, then I would reluctantly exclude the outliers and live (unhappily) with a model of restricted applicability.

    Far worse is excluding outliers on the outcome variable of a study. Your predictive model, no matter how well structured in other respects, becomes useless because you cannot know to whom it applies. It makes no sense to say I am going to predict your five year risk of heart attack, but my prediction only works if your risk is within a certain range. Because you do not know in advance whether the risk is in that range or not, such a prediction is of no use to anybody.

    Worse still, the exclusion of outliers on outcome variables is often undertaken for reasons that are based on statistical myths. For example, people sometimes do this because they want their outcome variable to have a normal distribution. But that is just wrong: linear regressions do not require the outcome variable to have a normal distribution. At most they require the residuals to have a normal distribution, and even that is only necessary with small samples; for large samples the central limit theorem rescues the statistical inferences regardless. Frankly, in the contemporary real world, if one is working with a sample small enough that the normality of residuals really matters to inference in a linear regression, the study is probably a waste of effort anyway: any effect large enough to be detected in a sample that small is likely to have been confirmed scientifically decades ago, and may well have been folklore for centuries.



    • #3
      Salvatore:
      another hint (partially overlapping Clyde's excellent one) rests on the data-generating process you're investigating, along with its predictors.
      For instance, healthcare costs follow a Gamma distribution (positively skewed), as there are patients who unfortunately pass away after the first administration of a given drug (left tail) and others who, due to repeated adverse events which are also costly to manage, will need inpatient hospitalization and tons of medications, tests and specialist visits (right tail). This is a matter of fact, and right-tail patients are by no means outliers.
      Obviously, we are far from normality here, but reality is not forced to be normal; it is simply what it is, and we can only approximate it with the best fitting theoretical probability distribution (if any).
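
      A minimal sketch of modeling such right-skewed costs directly, rather than trimming the tail — the variable names cost, age, and treat are purely hypothetical:

      ```
      * Gamma GLM with log link for positively skewed cost data
      glm cost age i.treat, family(gamma) link(log) vce(robust)
      ```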
      Last edited by Carlo Lazzaro; 08 Aug 2022, 03:54.
      Kind regards,
      Carlo
      (Stata 19.0)



      • #4
        To Carlo & Clyde:
        Thanks to both of you for the rich insights you provided me with. I am sure that these will significantly enrich my background and be helpful for my future work.
        Thanks



        • #5
          Hi, I'm running fixed-effects IV regressions to identify the causal effects of housing costs on individual health. I want to identify outliers in the housing cost measure, based on second stage residuals.

          How would I compute the residual mean, so that I can flag residuals more than 3 SD away from it? The command I'm currently using for the IV regressions is xtivreg2.
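
          One possible sketch, assuming the estimation command is xtivreg2 from SSC and using placeholder variable names (health, hcost, instr, etc.); check -help xtivreg2- for the exact residual option name after estimation:

          ```
          xtivreg2 health (hcost = instr) age income, fe
          predict double ehat, e          // second-stage residuals
          summarize ehat
          gen byte outlier3sd = abs(ehat - r(mean)) > 3*r(sd) if !missing(ehat)
          list id year if outlier3sd == 1
          ```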

          Thank you.
