panel data outliers

Amaa Ahmed

Join Date: Jun 2023

Posts: 11
#1


06 Jun 2023, 11:12

Dear members,
I have daily-average data covering 4 years for 60 stations and 6 variables. While visualizing the data, I graphed each variable separately and noticed that I have many outliers. Is there any way to tabulate these outliers? I would also like to know whether I can draw a 3-sigma control chart in Stata. Thanks in advance.
Rich Goldstein

Join Date: Mar 2014

Posts: 4439
#2

06 Jun 2023, 12:36

first, the choices for QC in Stata are very poor (I tried to get this beefed up many years ago and was told "no"); I know of nothing built in to do what you request; you can see what is available via

Code:

help qc

second, if you have "many outliers" then my guess is that your, possibly implicit, model is not correct for those data - but you don't really give us any information, or a data example (see the FAQ on the right way to give data examples), so it is hard to say more
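For what it is worth, a 3-sigma rule can be put together by hand from standard commands. The following is only a sketch, not a built-in QC routine: it assumes the variable names from the data example posted later in this thread (Stations, Date, PM10_AVG), it assumes limits computed separately within each station, and the new variable names (pm10_mean, pm10_sd, pm10_out) are made up for illustration.

```stata
* Sketch only: 3-sigma limits computed separately within each station.
* pm10_mean, pm10_sd and pm10_out are hypothetical names.
bysort Stations: egen double pm10_mean = mean(PM10_AVG)
bysort Stations: egen double pm10_sd   = sd(PM10_AVG)

* Flag observations more than 3 standard deviations from the station mean
generate byte pm10_out = abs(PM10_AVG - pm10_mean) > 3*pm10_sd if !missing(PM10_AVG, pm10_sd)

* Tabulate the flagged observations by station
tabulate Stations pm10_out

* Simple control chart for one station: the series plus centre line and limits
summarize PM10_AVG if Stations == "CA01R"
local cl  = r(mean)
local ucl = r(mean) + 3*r(sd)
local lcl = r(mean) - 3*r(sd)
twoway line PM10_AVG Date if Stations == "CA01R", sort yline(`cl' `lcl' `ucl') ytitle("PM10 daily average")
```

The flagging lines could be wrapped in a foreach loop over the six pollutant variables. Note that extreme values inflate the standard deviation used for the limits, so this rule may well understate the number of unusual days.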

Amaa Ahmed

Join Date: Jun 2023

Posts: 11
#3

02 Jul 2023, 08:34

Rich Goldstein, it is a large panel dataset of 67206 observations, as follows:

Stations	Date	PM10_AVG	PM25_AVG	SO2_AVG	NO2_AVG	O3_AVG	CO_AVG	cooks	cooks_pr_chi2	cooks_pr_F
CA01R	1-Jan-18	22.154	8.967	0.0005	0.0019	0.0305	0.356	7.54E-05	1.93E-12	1.93E-12
CA01R	2-Jan-18	25.912	11.405	0.0005	0.0021	0.0349	0.394	0.000137	1.16E-11	1.16E-11
CA01R	3-Jan-18	22.494	8.776	0.0005	0.0035	0.0235	0.431	4E-05	2.87E-13	2.87E-13

I applied Cook's distance (the variables are not normally distributed) and got the last 3 columns on the right. Should I compare the values in the cooks column with a threshold? If yes, which variable would the outlier belong to? Thanks.


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17674
#4

02 Jul 2023, 09:39

Amaa:
variables need not be normally distributed in linear panel data regression (normality is only a weak requirement, and it concerns the distribution of the residuals).
In addition, and more substantively, except for blatant cases of mistaken data entry, it may well be that the data generating process you're investigating allows "weird" values.
As an aside, I do echo Rich's helpful recommendation about providing more details about the issue you're facing and/or sharing an excerpt/example of your dataset via -dataex-. Thanks.
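To illustrate the point above about residuals rather than raw variables, here is a minimal sketch. The regression itself is purely hypothetical (the thread never states a model), the variable names come from the data example posted in this thread, and station_id and e_hat are made-up names.

```stata
* Sketch only: the model below is an assumption, not the poster's model.
* xtset needs a numeric panel identifier, so encode the string station code.
encode Stations, generate(station_id)
xtset station_id Date

* Hypothetical fixed-effects regression
xtreg PM25_AVG PM10_AVG NO2_AVG, fe

* Idiosyncratic residuals from the fitted model
predict double e_hat, e

* Inspect the residual distribution instead of the raw variables
histogram e_hat, normal
qnorm e_hat
```

With a sample of this size, mild departures from normality in the residuals are usually of limited concern; the graphs are mainly useful for spotting grossly unusual observations.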

Kind regards,
Carlo
(Stata 19.0)
Amaa Ahmed

Join Date: Jun 2023

Posts: 11
#5

02 Jul 2023, 10:07

Carlo Lazzaro, thank you for your reply. I tested normality only to decide whether to use Grubbs' test or Cook's distance for detecting outliers.

Here is an example of my dataset:

clear
input str5 Stations int Date double(PM10_AVG PM25_AVG SO2_AVG NO2_AVG O3_AVG CO_AVG)
"CA01R" 21185 22.154 8.967 .0005 .0019 .0305 .356
"CA01R" 21186 25.912 11.405 .0005 .0021 .0349 .394
"CA01R" 21187 22.494 8.776 .0005 .0035 .0235 .431
"CA01R" 21188 22.417 9.058 .0005 .0042 .0214 .438
"CA01R" 21189 20.607 11.961 .0005 .004 .0205 .513
"CA01R" 21190 27.446 17.772 .0005 .005 .018 .629
end
format %td Date

This is only a small part of one station, CA01R; I have 65 stations in total.
After applying cooksd2, I got the following columns:

clear
input double(cooks cooks_pr_chi2 cooks_pr_F)
.000029306059194225886 1.3824347496457198e-11 1.3825118885328571e-11
.00007332342759780745 1.3687490962514275e-10 1.3688254634838553e-10
3.327775302851359e-06 6.006980445989446e-14 6.007315652009467e-14
5.191890392704617e-06 1.8263481763149803e-13 1.826450091110865e-13
1.113697297470317e-07 1.2308117720173556e-17 1.2308804552233237e-17
6.679639360915986e-07 1.0843159177225232e-15 1.0843764259143623e-15
end

Thanks

Last edited by Amaa Ahmed; 02 Jul 2023, 10:23.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17674
#6

02 Jul 2023, 10:16

Amaa:
why not present two regression tables, with and without the so-called outliers, for one station only?
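A comparison of that kind could be sketched as follows. This is an illustration under stated assumptions only: the regression is hypothetical (no model has been given in the thread), Cook's distance is taken from the official -predict, cooksd- after -regress- rather than from cooksd2, the 4/N cutoff is just one common rule of thumb, and cd, with_outliers and without_outliers are made-up names. Note that Cook's distance belongs to an observation within a fitted model, not to any single variable.

```stata
* Sketch only: hypothetical model for one station.
regress PM25_AVG PM10_AVG NO2_AVG if Stations == "CA01R"
estimates store with_outliers

* Cook's distance for each observation used in the regression
predict double cd if e(sample), cooksd

* Rule-of-thumb cutoff (an assumption, not a formal test)
local cutoff = 4/e(N)

* Refit without the influential observations; missing cd is excluded as well
regress PM25_AVG PM10_AVG NO2_AVG if Stations == "CA01R" & cd < `cutoff'
estimates store without_outliers

* Coefficients and standard errors side by side
estimates table with_outliers without_outliers, b(%9.4f) se stats(N r2)
```

If the two columns tell the same story, the flagged observations are probably not driving the results.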

Kind regards,
Carlo
(Stata 19.0)
