  • Why does Stata graph box not ignore missing values?

    Hi Stata gurus,
    I am trying to check for outliers in my data, for example in the total assets (ta) of European banks at the bank-quarter level.

    . list ta in 10/20

         +----------+
         |       ta |
         |----------|
     10. |        . |
     11. |        . |
     12. |        . |
     13. |        . |
     14. | 1.27e+09 |
         |----------|
     15. | 1.30e+09 |
     16. | 1.30e+09 |
     17. | 1.31e+09 |
     18. | 1.36e+09 |
     19. | 1.39e+09 |
         |----------|
     20. | 1.41e+09 |
         +----------+

    . sum ta

        Variable |        Obs        Mean    Std. dev.       Min        Max
    -------------+---------------------------------------------------------
              ta |      2,748    1.62e+08    3.91e+08     152426   2.67e+09

    . codebook ta

    -------------------------------------------------------------------------------------------------------
    ta                                                                                          (unlabeled)
    -------------------------------------------------------------------------------------------------------

                      Type: Numeric (float)

                     Range: [152426.03,2.673e+09]         Units: .01
             Unique values: 2,744                     Missing .: 6,300/9,048

                      Mean: 1.6e+08
                 Std. dev.: 3.9e+08

               Percentiles:     10%       25%       50%       75%       90%
                            1.7e+06   5.4e+06   2.0e+07   7.6e+07   4.5e+08

    I tried to detect the outliers with a box plot, and all of the observations that are NOT missing are drawn as outliers.
    . graph box ta seems to assign a numeric value to missing observations and include them in the graph:
    https://goetheuniversitaet-my.sharepoint.com/:i:/g/personal/d8_6qmtk8o_goetheuniversitaet_onmicrosoft_com/EaGD2KcyY9NJtobXIPjZ0sUBmN54uKuBxdhK8CYA-06ssQ?e=unJuZe

    I tried to tell Stata to graph only the non-missing values, but nothing works:
    . graph box ta if ta >= 152426    // graph only observations at or above min(ta)
    . graph box ta if !missing(ta)

    All suggestions are appreciated,

    Best

  • #2
    Welcome to Statalist!

    I think there is some confusion about terms here. The points labeled with a number in the box plot are "outside values" (some may call them "outliers"), and they are NOT missing. A numeric missing value in Stata is displayed as a period (.). In your data, cases 10 to 13 are missing. Missing values are not plotted. If you look at the summary table, the maximum is 2.67e+09, and in the box plot the highest data point is also labeled 2.67e+09. So all those labeled points are real data.

    If you do not want to see those extreme points, you can add -nooutsides- as an option after the graph box command like:
    Code:
    graph box ta, nooutsides
    Note that this changes only the graph, not the data. Those extreme values are still there.
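
    To check this directly, here is a minimal sketch that counts the missing values and the values beyond the upper fence. It assumes the usual box plot rule that points more than 1.5 times the interquartile range above the upper quartile are drawn as outside values. The scalar names iqr_ta and hi_fence are made up for illustration. With the rounded percentiles from the codebook output (25% = 5.4e+06, 75% = 7.6e+07), the upper fence is roughly 7.6e+07 + 1.5 * 7.06e+07, or about 1.8e+08, so values such as 1.27e+09 fall well outside it.

    ```stata
    * How many observations are missing, and how many are not?
    count if missing(ta)
    count if !missing(ta)

    * Quartiles of ta; summarize ignores missing values
    quietly summarize ta, detail
    scalar iqr_ta   = r(p75) - r(p25)
    scalar hi_fence = r(p75) + 1.5 * iqr_ta
    display "upper fence = " hi_fence

    * Non-missing values above the fence. Numeric missing counts as larger
    * than any number, so it has to be excluded explicitly here.
    count if ta > hi_fence & !missing(ta)
    ```

    If the last count is large, that is a sign of a strongly right-skewed variable rather than of missing values being plotted.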


    • #3
      When a variable ranges from 152426 to 2.67 billion, you'll find that working on a logarithmic scale is a good idea. A box plot is fairly useless for such a variable unless you do that.
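
      Two possible ways to do that, as a sketch only. The variable name log10_ta and the axis title are made up for illustration. As far as I understand, the first form still computes the quartiles and whisker limits on the original scale and only draws the axis logarithmically, while the second computes them on the log scale, so the two graphs can flag different points as outside values.

      ```stata
      * Option 1: leave the data alone and draw the y axis on a log scale
      graph box ta, yscale(log)

      * Option 2: transform first, then plot the transformed variable
      * (missing values of ta stay missing in log10_ta)
      generate double log10_ta = log10(ta)
      graph box log10_ta, ytitle("log10 of total assets")
      ```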
