Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Remove 2.5% smallest and 2.5% largest observations from dataset

    Quick question: I want to remove the 2.5% smallest and 2.5% largest observations from a dataset. I have been searching quite some time for it today, but don't seem to find an easy solution. Any idea on how to do this.
    So for example, I have two colums:
    • debt
    • growth
    I want to remove all observations where debt is smaller than the 2.5th percentile and remove all observations where debt is larger than the 97.5th perctile.

    Thanks.

  • #2
    not sure why you want to do this (sounds like a bad idea to me), but suggest you look at "help pctile"; also, a search of statalist should find you plenty of other people who have wanted to do similar things

    Comment


    • #3
      This is a bad idea and has no theoretical justification. If you hope to use your variable debt as an outcome in regressSee: http://www.statalist.org/forums/foru...winsorize-data and Nick Cox's post at http://www.statalist.org/forums/foru...orize-by-group.
      Steve Samuels
      Statistical Consulting
      [email protected]

      Stata 14.2

      Comment


      • #4
        Originally posted by Rich Goldstein View Post
        not sure why you want to do this (sounds like a bad idea to me), but suggest you look at "help pctile"; also, a search of statalist should find you plenty of other people who have wanted to do similar things

        Just want to play around a little bit what that to see what happens.
        I have tried the search option, but did not really find what I was looking for.

        And I know how to generate that those percentile values, but don't seem to be able to figure out how to drop data that is larger or smaller than that specific value. I can of course do it manually (e.g. drop if debt > 80), but I want to do it for different datasets, so I don't have to visually look up the percentile values.
        Last edited by Willem Vanlaer; 06 Jul 2015, 11:18.

        Comment


        • #5
          Originally posted by Steve Samuels View Post
          This is a bad idea and has no theoretical justification. If you hope to use your variable debt as an outcome in regressSee: http://www.statalist.org/forums/foru...winsorize-data and Nick Cox's post at http://www.statalist.org/forums/foru...orize-by-group.
          As I mentioned above, not really looking for a theoretical justification. Just want to play around with my data a bit.

          Comment


          • #6
            Don't drop: just use an if qualifier to do some analyses so that you can compare with the full dataset. (Sorry, but I sign up with those saying this is usually a very bad idea, so I have no interest in spelling out exactly how to do it.)

            Comment


            • #7
              Originally posted by Nick Cox View Post
              Don't drop: just use an if qualifier to do some analyses so that you can compare with the full dataset. (Sorry, but I sign up with those saying this is usually a very bad idea, so I have no interest in spelling out exactly how to do it.)
              I of course want to compare it with the full dataset!

              The problem I have, is the following.
              I use this code to find the percentile I am looking for:
              Code:
              _pctile debt, p(97.5)
              return list
              This then gives me for example:
              scalars:
              r(r1) = 319
              However, if I want to save this value, I try the following code:

              Code:
              . gen pct3 = r(r1)
              But this then gives me a vector which has the same size as the vector of debt.
              And I just want to store it as a value (i.e. one number).

              Any suggestion on how to do that?
              Or is that helping me too much towards my "goal". :-)



              Comment


              • #8
                Code:
                // strategy 1
                sysuse nlsw88, clear
                summarize wage
                sort wage, stable
                local n1=floor(0.025*_N)
                local n2=ceil(0.975*_N)
                keep in `n1'/`n2'
                summarize wage
                
                // strategy 2
                sysuse nlsw88, clear
                _pctile wage, p(2.5)
                local w1=r(r1)
                _pctile wage, p(97.5)
                local w2=r(r2)
                keep if inrange(wage,`w1',`w2')
                summarize wage

                Comment


                • #9
                  Now that Sergiy has given the game away, in Stata terms we should warn against Strategy 1. It's blind to missing values and doesn't generalize so easily to support subset selection.

                  Comment


                  • #10
                    Originally posted by Sergiy Radyakin View Post
                    Code:
                    // strategy 1
                    sysuse nlsw88, clear
                    summarize wage
                    sort wage, stable
                    local n1=floor(0.025*_N)
                    local n2=ceil(0.975*_N)
                    keep in `n1'/`n2'
                    summarize wage
                    
                    // strategy 2
                    sysuse nlsw88, clear
                    _pctile wage, p(2.5)
                    local w1=r(r1)
                    _pctile wage, p(97.5)
                    local w2=r(r2)
                    keep if inrange(wage,`w1',`w2')
                    summarize wage

                    Ah, I was quite close with what I already had. Just had to use the local command. Thanks a lot!

                    Comment


                    • #11
                      Originally posted by Sergiy Radyakin View Post
                      Code:
                      // strategy 1
                      sysuse nlsw88, clear
                      summarize wage
                      sort wage, stable
                      local n1=floor(0.025*_N)
                      local n2=ceil(0.975*_N)
                      keep in `n1'/`n2'
                      summarize wage
                      
                      // strategy 2
                      sysuse nlsw88, clear
                      _pctile wage, p(2.5)
                      local w1=r(r1)
                      _pctile wage, p(97.5)
                      local w2=r(r2)
                      keep if inrange(wage,`w1',`w2')
                      summarize wage

                      Hey, I am checking my results, but something appears to go wrong. It seems that only the values below the 2.5 percentile are removed, but that the ones above the 97.5 are still there. I can't seem to figure out what's wrong with the syntax of the inrange command. any ideas?

                      Comment


                      • #12
                        The code

                        Code:
                         
                        _pctile wage, p(97.5) 
                        local w2 = r(r2)
                        asks for one percentile, but then picks up the second result. But there is no second result, so Stata uses missing. That then bites you downstream.

                        The intended code was possibly

                        Code:
                         
                        _pctile wage, p(2.5 97.5) 
                        local w1 = r(r1)
                        local w2 = r(r2)

                        Comment


                        • #13
                          I know this is an old post, but can you please guide how to write this in a loop?

                          Code:
                          Code:
                          foreach var of varlist x1 x2 x3 x4 x6 x6 x7 x8 {
                              _pctile `var', p(1 99)
                              local w1`var' = r(r1)
                              local w2`var' = r(r2)
                              keep if inrange(`var', `w1`var'',`w2`var'')
                          }
                          My guess is that the last line of code to put a local in a local in making the problem.

                          Thanks in advance!

                          Comment


                          • #14
                            I suggest that you do this instead.



                            Code:
                            gen inside = 1
                            foreach var of varlist x1 x2 x3 x4 x6 x6 x7 x8 {
                                _pctile `var', p(1 99)
                                replace inside = 0 if !inrange(`var', r(r1), r(r2))
                            }
                            That way you get an indicator variable with value 1 if all variables have values between the designated percentiles and 0 otherwise.

                            In your approach, the extra locals are not needed, as you can and should use the saved results directly.

                            Getting rid of observations like this is crude and arbitrary. It is not even guaranteed to lose observations you might want to think outliers, as there is no consideration of what is an outlier in the full data space. You can easily sketch a scatter plot with a strong outlier that lies within that interval on each of the two variables, and same applies to any higher-dimensional space.

                            By all means compare results for the full dataset and any subset of interest. by using a qualifier such as
                            if inside

                            If you are under orders to do what you sketch in #13, my approach does not rule that out. You just keep all the data to allow checks on the procedure.
                            Last edited by Nick Cox; 06 Feb 2022, 05:03.

                            Comment


                            • #15
                              There is another problem with #13. The results will in practice depend on the order of mention of the variables.

                              Comment

                              Working...
                              X