Outliers with non-normally distributed panel data

Mike Rouven

Join Date: Mar 2021

Posts: 24
#1


03 Mar 2021, 07:39

Hello,

I have an unbalanced panel data set whose variables are not normally distributed. One variable also has outliers. These values definitely do not make sense and, from my point of view, are input errors in the data that I cannot correct afterwards. I want to identify the outliers and then exclude them from my calculations. Because my variables are not normally distributed, I cannot use many of the common methods for identifying and handling outliers. I have therefore tried the "median absolute deviation" (MAD) and "double MAD" methods. However, from my point of view, these excluded too many cases that are not outliers. Are there other methods that I can use here?
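For readers unfamiliar with the two rules mentioned, here is a minimal sketch of how MAD and double MAD flags are commonly computed. It is not the poster's actual code: the data are simulated, the variable name sales is reused only for illustration, and the scaling constant 1.4826 and the cutoff of 3 are assumed conventions rather than anything stated in the thread.

```stata
* Sketch only: simulated, strongly right-skewed data (an assumption)
clear
set seed 12345
set obs 500
generate double sales = exp(rnormal(10, 2))

* Plain MAD: median of absolute deviations from the overall median
quietly summarize sales, detail
scalar med = r(p50)
generate double absdev = abs(sales - scalar(med))
quietly summarize absdev, detail
scalar mad = r(p50)
* Assumed rule: flag if the scaled deviation exceeds 3
generate byte flag_mad = absdev / (1.4826 * scalar(mad)) > 3

* Double MAD: separate MADs for values below and above the median
quietly summarize absdev if sales <= scalar(med), detail
scalar mad_lo = r(p50)
quietly summarize absdev if sales >= scalar(med), detail
scalar mad_hi = r(p50)
generate byte flag_dmad = ///
    cond(sales < scalar(med), absdev / (1.4826 * scalar(mad_lo)), ///
                              absdev / (1.4826 * scalar(mad_hi))) > 3

* How many observations does each rule flag?
count if flag_mad
count if flag_dmad
```

On strongly right-skewed data such as this simulated sample, both rules tend to flag a sizeable share of the upper tail even though no value was entered wrongly, which is consistent with the experience described above.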

I use Stata 14.2 and here is also some information about the variable called "sales":

tabstat sales

qnorm sales

graph box sales

Thanks for the support.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

03 Mar 2021, 09:21

Two quite different kinds of issue are tangled together here. Whether a value like 1.18 billion is utterly wrong is a substantive issue on which we can't comment. But I don't see anything obviously pathological in the display. You'd learn more from quantile sales, ysc(log) (which will need some work on the axis labels).
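As a rough illustration of the suggestion above, the following sketch draws a quantile plot on a logarithmic scale with explicitly chosen axis labels. The data are simulated, and the label values and display format are assumptions that would need adjusting to the range of the real sales variable.

```stata
* Sketch only: simulated positive, right-skewed data (an assumption)
clear
set seed 2021
set obs 500
generate double sales = exp(rnormal(10, 2))

* A log scale needs strictly positive values
assert sales > 0

* Label values are illustrative; pick ones that span the observed range
quantile sales, yscale(log) ///
    ylabel(1e2 1e3 1e4 1e5 1e6 1e7 1e8, angle(horizontal) format(%9.0e))
```

Powers of ten are a natural first choice for labels on a log scale; with real data, summarize sales shows the minimum and maximum to be covered.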
Mike Rouven

Join Date: Mar 2021

Posts: 24
#3

03 Mar 2021, 22:49

Thank you very much for your reply. I have implemented your hint and got out the following graph:

quantile Sales, rlopts(connect(ascending)) yscale(log) ylabel(minmax) xmtick(minmax)

If I understand the graph correctly, there are anomalies in the data in the first and last quantiles. In particular, I would like to deal with the supposed outliers in the first quantile and unified values in the last quantile. The only problem is the non-normal distribution of the data.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

04 Mar 2021, 01:22

I would never expect sales to be normally distributed.

Identifying outliers that make no substantive sense is a substantive issue, and you need substantive knowledge to make an informed decision. But I will guess that this is panel data, say sales for several different firms over several years. If so, you could plot the time series for, say, the firms with the lowest medians and the firms with the highest medians as a further check.
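One way such a check could be set up is sketched below. The panel is simulated, and the names firm, year, sales, med_sales and rank_med, as well as the choice of three firms at each end, are assumptions made only for illustration.

```stata
* Sketch only: simulated panel of 20 firms over 10 years (an assumption)
clear
set seed 42
set obs 20
generate firm = _n
generate double level = rnormal(10, 2)      // firm-specific log level
expand 10
bysort firm: generate year = 2010 + _n
generate double sales = exp(level + rnormal(0, 0.3))

* Median sales per firm
egen double med_sales = median(sales), by(firm)

* Rank firms by their median, using one tagged observation per firm
egen byte tag = tag(firm)
egen rank_med = rank(med_sales) if tag, unique
* Copy each firm's rank to all of its observations (missing sorts last)
bysort firm (rank_med): replace rank_med = rank_med[1]

* Plot the three lowest and three highest firms against time
quietly summarize rank_med
local nfirms = r(max)
xtset firm year
xtline sales if rank_med <= 3 | rank_med > `nfirms' - 3, ///
    overlay yscale(log)
```

A single wild value in an otherwise smooth series points to a possible input error, whereas a firm that is consistently very small or very large is more likely to be genuine.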

There isn't a general purpose method that will reliably identify which data points are bad and should be ignored.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#5

04 Mar 2021, 04:07

(The thread title does say that this is panel data!)
Mike Rouven

Join Date: Mar 2021

Posts: 24
#6

04 Mar 2021, 05:52

"There isn't a general purpose method that will reliably identify which data points are bad and should be ignored."

As stupid as it sounds, this is a great insight for me, thank you very much! I always thought that in statistics all paths are pre-drawn. The fact that I can proceed independently based on the theory and data is wonderful.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#7

04 Mar 2021, 06:43

For more on what I think, see, if you wish, e.g. https://stats.stackexchange.com/ques...iers-with-mean !!! (The answers there range wider than the question.)

To be fair, there isn't one united view on all this. There are fields where the line is that a small fraction of a big dataset is just bad or at least freakish and we can't possibly drill down to see what is genuine, so just throw out the outliers.

As a geographer my experience is mostly the opposite: the outlier is a big flood, or a big glacier, or the Amazon, or Amazon, or the populations of China and India, and almost always very real.