  • panel data outliers

    Dear members,
    I have daily averages for 4 years from 60 stations and 6 variables. While visualizing the data, I graphed each variable separately and noticed many outliers. Is there any way to tabulate these outliers? I would also like to know whether I can draw a 3-sigma control chart in Stata. Thanks in advance.

  • #2
    First, the choices for QC in Stata are very poor (I tried to get this beefed up many years ago and was told "no"); I know of nothing built in to do what you request. You can see what is available via
    Code:
    help qc
    Second, if you have "many outliers" then my guess is that your, possibly implicit, model is not correct for those data. But you don't really give us any information or a data example (see the FAQ on the right way to give data examples), so it is hard to say more.
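
As a rough illustration of the two requests in the opening post, the sketch below flags values lying more than 3 standard deviations from their station mean, tabulates the flags, and draws a simple 3-sigma chart for one station. It is only a sketch under stated assumptions: it uses the six rows posted later in this thread, treats the overall within-station SD as sigma (a simplification; classical control charts usually estimate sigma differently), and the names `m_*`, `s_*`, and `out_*` are made up for this example.

```stata
* Sketch only: six rows copied from the data example posted later in this thread
clear
input str5 Stations int Date double(PM10_AVG PM25_AVG SO2_AVG NO2_AVG O3_AVG CO_AVG)
"CA01R" 21185 22.154 8.967 .0005 .0019 .0305 .356
"CA01R" 21186 25.912 11.405 .0005 .0021 .0349 .394
"CA01R" 21187 22.494 8.776 .0005 .0035 .0235 .431
"CA01R" 21188 22.417 9.058 .0005 .0042 .0214 .438
"CA01R" 21189 20.607 11.961 .0005 .004 .0205 .513
"CA01R" 21190 27.446 17.772 .0005 .005 .018 .629
end
format %td Date

* Flag values more than 3 SD from the station mean
* (assumption: overall within-station SD stands in for sigma;
*  extend the varlist to the other pollutants as needed)
foreach v of varlist PM10_AVG PM25_AVG {
    bysort Stations: egen double m_`v' = mean(`v')
    bysort Stations: egen double s_`v' = sd(`v')
    generate byte out_`v' = abs(`v' - m_`v') > 3 * s_`v' if !missing(`v', s_`v')
}

* Count flagged and unflagged days per station for one pollutant
tabulate Stations out_PM10_AVG, missing

* List the flagged observations themselves (may list nothing)
list Stations Date PM10_AVG if out_PM10_AVG == 1, noobs

* Simple 3-sigma chart for one station: centre line plus dashed limits
summarize PM10_AVG if Stations == "CA01R"
local centre = r(mean)
local lcl = r(mean) - 3 * r(sd)
local ucl = r(mean) + 3 * r(sd)
twoway connected PM10_AVG Date if Stations == "CA01R", ///
    yline(`centre') yline(`lcl' `ucl', lpattern(dash)) ///
    ytitle("PM10 daily average") title("Station CA01R: mean and 3-SD limits")
```

With 60-odd stations, the `tabulate` call gives one row per station, which may be a quicker way to see where flagged days cluster than inspecting each graph.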


    • #3
      Rich Goldstein, it is a large panel dataset of 67206 observations, as follows:
      Stations  Date      PM10_AVG  PM25_AVG  SO2_AVG  NO2_AVG  O3_AVG  CO_AVG  cooks     cooks_pr_chi2  cooks_pr_F
      CA01R     1-Jan-18  22.154    8.967     0.0005   0.0019   0.0305  0.356   7.54E-05  1.93E-12       1.93E-12
      CA01R     2-Jan-18  25.912    11.405    0.0005   0.0021   0.0349  0.394   0.000137  1.16E-11       1.16E-11
      CA01R     3-Jan-18  22.494    8.776     0.0005   0.0035   0.0235  0.431   4E-05     2.87E-13       2.87E-13
      I applied Cook's distance (the variables are not normally distributed) and got the last 3 columns shown above. Should I compare the values in the cooks column with a threshold? If so, which variable would the outlier belong to? Thanks.


      • #4
        Amaa:
        variables need not be normally distributed in linear panel data regression (normality is only a weak requirement, and it concerns the distribution of the residuals).
        In addition, and more substantively, apart from blatant examples of mistaken data entry, it may well be that the data generating process you're investigating allows "weird" values.
        As an aside, I do echo Rich's helpful recommendation about providing more details about the issue you're facing and/or sharing an excerpt/example of your dataset via -dataex-. Thanks.
        Kind regards,
        Carlo
        (Stata 19.0)
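
On the threshold question in #3: Cook's distance is computed for each observation of a fitted regression, so it marks influential rows of the model rather than outliers in any single variable. A minimal sketch follows, assuming a purely hypothetical model (the thread never states which regression was run), Stata's built-in `predict, cooksd` after `regress` rather than the poster's `cooksd2`, and the conventional 4/n rule of thumb as the cutoff. The names `cd` and `influential` are invented for the example.

```stata
* Sketch only: six rows copied from the data example posted in this thread
clear
input str5 Stations int Date double(PM10_AVG PM25_AVG SO2_AVG NO2_AVG O3_AVG CO_AVG)
"CA01R" 21185 22.154 8.967 .0005 .0019 .0305 .356
"CA01R" 21186 25.912 11.405 .0005 .0021 .0349 .394
"CA01R" 21187 22.494 8.776 .0005 .0035 .0235 .431
"CA01R" 21188 22.417 9.058 .0005 .0042 .0214 .438
"CA01R" 21189 20.607 11.961 .0005 .004 .0205 .513
"CA01R" 21190 27.446 17.772 .0005 .005 .018 .629
end
format %td Date

* Hypothetical model for illustration; replace with the regression of interest
regress PM25_AVG PM10_AVG NO2_AVG

* Cook's distance for every observation used in the fit
predict double cd if e(sample), cooksd

* Rule-of-thumb cutoff of 4/n (an assumption, not a formal test)
local cutoff = 4 / e(N)
generate byte influential = cd > `cutoff' if !missing(cd)

* Show which station-days exceed the cutoff
list Stations Date cd if influential == 1, noobs
```

An observation flagged this way is influential for that particular model as a whole, which is why the output cannot be assigned to one pollutant.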


        • #5
          Carlo Lazzaro, thank you for your reply. I only tested normality to decide whether to use Grubbs' test or Cook's distance to detect outliers.

          Here is an example of my dataset:

          clear
          input str5 Stations int Date double(PM10_AVG PM25_AVG SO2_AVG NO2_AVG O3_AVG CO_AVG)
          "CA01R" 21185 22.154 8.967 .0005 .0019 .0305 .356
          "CA01R" 21186 25.912 11.405 .0005 .0021 .0349 .394
          "CA01R" 21187 22.494 8.776 .0005 .0035 .0235 .431
          "CA01R" 21188 22.417 9.058 .0005 .0042 .0214 .438
          "CA01R" 21189 20.607 11.961 .0005 .004 .0205 .513
          "CA01R" 21190 27.446 17.772 .0005 .005 .018 .629
          end
          format %td Date

          This is only a small part of one station, CA01R; I have 65 stations in total.
          After applying cooksd2 I got the following columns:

          clear
          input double(cooks cooks_pr_chi2 cooks_pr_F)
          .000029306059194225886 1.3824347496457198e-11 1.3825118885328571e-11
          .00007332342759780745 1.3687490962514275e-10 1.3688254634838553e-10
          3.327775302851359e-06 6.006980445989446e-14 6.007315652009467e-14
          5.191890392704617e-06 1.8263481763149803e-13 1.826450091110865e-13
          1.113697297470317e-07 1.2308117720173556e-17 1.2308804552233237e-17
          6.679639360915986e-07 1.0843159177225232e-15 1.0843764259143623e-15
          end

          Thanks

          Last edited by Amaa Ahmed; 02 Jul 2023, 10:23.


          • #6
            Amaa:
            why not present two regression tables, with and without the so-called outliers, for one station only?
            Kind regards,
            Carlo
            (Stata 19.0)
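
One way the comparison suggested in #6 might look is sketched below. Everything specific here is an assumption for illustration: the model (PM25_AVG on PM10_AVG), the use of Cook's distance above 4/n to define the "so-called outliers", and the stored-estimate names `all_obs` and `trimmed`. It runs on the six posted rows for station CA01R, so the numbers it prints say nothing about the full dataset.

```stata
* Sketch only: six rows copied from the data example posted in this thread
clear
input str5 Stations int Date double(PM10_AVG PM25_AVG SO2_AVG NO2_AVG O3_AVG CO_AVG)
"CA01R" 21185 22.154 8.967 .0005 .0019 .0305 .356
"CA01R" 21186 25.912 11.405 .0005 .0021 .0349 .394
"CA01R" 21187 22.494 8.776 .0005 .0035 .0235 .431
"CA01R" 21188 22.417 9.058 .0005 .0042 .0214 .438
"CA01R" 21189 20.607 11.961 .0005 .004 .0205 .513
"CA01R" 21190 27.446 17.772 .0005 .005 .018 .629
end
format %td Date

* Regression 1: all observations of one station (hypothetical model)
regress PM25_AVG PM10_AVG if Stations == "CA01R"
estimates store all_obs

* Define "so-called outliers" as Cook's distance above 4/n (an assumption)
predict double cd if e(sample), cooksd
local cutoff = 4 / e(N)

* Regression 2: same model without the flagged observations
regress PM25_AVG PM10_AVG if Stations == "CA01R" & cd <= `cutoff'
estimates store trimmed

* Side-by-side table of coefficients, standard errors, N, and R-squared
estimates table all_obs trimmed, b(%9.4f) se stats(N r2)
```

If the two columns tell the same story, the flagged observations matter little for the conclusions; if they differ sharply, that difference is itself worth reporting.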
