  • help cleaning dataset

    Hi everybody,

    I'm trying to clean up my dataset so I can actually start analyzing the data. I started off by using [tab var1] to get an overview and then [replace var1 =. if var1 .....] to remove the 'wrong' observations from my summary statistics. However, for variables with many distinct values I can't use the [tab] command to get an overview and spot the 'wrong' outliers. Could anyone please show me a method to clean variables with many values?

    For example, (sum output)


        Variable |        Obs        Mean   Std. Dev.        Min        Max
    -------------+---------------------------------------------------------
           curcd |          0
              at |     436197    15494.29    98727.65          0    3771200
          bkvlps |     381156    9665.382    569540.6  -1.38e+07   9.47e+07
             ceq |     435607    2661.754     10210.6    -136332     284434
            csho |     413093    233.7273    40434.87          0   2.60e+07
    -------------+---------------------------------------------------------
    (at = asset total, bkvlps = book value per share, ceq = common equity total, csho = common shares outstanding)

    I made the min / max observations that seem quite off to me bold. So my question is: how can I get an overview of variables with many distinct values, and how can I clean them up?

    Thanks a lot for helping me!








  • #2
    The start of any data cleaning imo is always br[owse].



    • #3
      So, it seems like Rick Ert is interested in identifying extreme values that are likely to be data errors. -tab- is very inefficient at doing this if there are a large number of different values. Some other ways of doing this would be -summ variable, detail- which will highlight the four highest and lowest values, and also give the 1st and 99th percentiles. These will provide clues to observations that might warrant additional exploration.
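
      As a rough sketch of that first approach, using csho from the summary table in post #1 (the choice of variable and the use of the 1st/99th percentiles as cutoffs are only assumptions for illustration, not a rule for what counts as an error):

      ```stata
      * Show the smallest/largest values and the percentiles for csho
      summarize csho, detail

      * Save the percentiles right away: later r-class commands overwrite r()
      local p1  = r(p1)
      local p99 = r(p99)

      * How many non-missing observations fall outside the 1st-99th percentile band?
      count if (csho < `p1' | csho > `p99') & !missing(csho)

      * Inspect those observations in the Data Browser
      browse csho if (csho < `p1' | csho > `p99') & !missing(csho)
      ```

      Values flagged this way are only candidates for a closer look; a legitimately large firm will also land in the tails.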

      But better, if there are known upper and lower limits that the values should normally fall within, then you can just explore the violations of the rule with:

      Code:
      browse if !inrange(variable, lower_limit, upper_limit)
      This will show only observations with an out-of-range value.
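
      A slightly fuller sketch of the same idea for at (total assets) follows. The limits here are placeholders picked for illustration, not values from the thread. Note that inrange() returns 0 when its first argument is missing, so missing values also show up as "out of range" unless they are excluded explicitly:

      ```stata
      * Hypothetical limits: replace with bounds that make sense for your data
      local lower_limit = 0
      local upper_limit = 1000000

      * Count out-of-range values, leaving missing values out of the tally
      count if !inrange(at, `lower_limit', `upper_limit') & !missing(at)

      * Browse only the non-missing violations
      browse at if !inrange(at, `lower_limit', `upper_limit') & !missing(at)
      ```

      Counting first gives a sense of how many rows the browse will bring up, which helps with a dataset of several hundred thousand observations.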



      • #4
        Thanks a lot for your fast reply, Clyde! That's exactly what I was looking for.
