  • winsorizing until 20%

    [Image: Screenshot 2024-11-03 064409.png]


    I have a variable whose box plot (from graph box) looks like this. Can I winsorize it at up to 20%?

  • #2
    Yes, you can, see here -- although (as always) the question is why you want to do this.

    BTW: Please read the Stata Forum's FAQ before posting (especially #12.2). It is always better to show us your data than to present a graph.



    • #3
      Thank you for your attention to this matter, Sir. I want to do a panel data regression analysis, but most of my data look like that. I am not sure whether my results are valid with these data.
      Please help check my data in the picture below.

      [Image: Screenshot 2024-11-03 104705.png]


      Best regards,
      Dirah



      • #4
        I see absolutely no reason to Winsorize here. Winsorizing is least crazy if there is a bunch of dubious-looking outliers that in some sense might not belong with the main body of the data. Here there is no hint of anything of the kind, just moderate skewness and a longer tail than you might have guessed. Also, whether that drastic maltreatment of your data makes sense depends on what the variable is (no hint here; indeed why deliberately omit or obscure that detail?), whether it is an outcome or a predictor, and so forth.

        The plot

        Code:
        twoway function log(x), ra(0.2553 19.73)
        shows that working on a logarithmic scale is, to me, an alternative to leaving the data as they are.
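A minimal sketch of how one might check what a logarithmic scale does to a right-skewed variable. The data here are simulated and purely hypothetical (a lognormal stand-in named `x`, not the poster's variable), and the sample size of 1385 is borrowed from later in the thread only for illustration.

```stata
* Hypothetical stand-in for a positive, right-skewed variable
clear
set obs 1385
set seed 2024
generate double x = exp(rnormal(1, 0.8))
generate double log_x = ln(x)

* Skewness on the raw scale and on the logarithmic scale
quietly summarize x, detail
scalar skew_raw = r(skewness)
quietly summarize log_x, detail
scalar skew_log = r(skewness)
display "skewness, raw scale: " skew_raw
display "skewness, log scale: " skew_log

* Box plots on each scale, drawn separately because the axes differ
graph box x, name(raw, replace)
graph box log_x, name(logged, replace)
```

With real data, replace the simulated `x` with the variable of interest; the logarithm is only defined for strictly positive values.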

        I will repeat my standard challenge: Please cite a reputable text that explains why you should do this and indeed why 20% not 5, 10, 15, 25, 30%, and so on.



        • #5
          Thank you for the information, sir. I am still a beginner and at an early stage of learning data analysis with winsorising.

          Reasons for winsorising 20%:
          Firstly, I looked at the box plot (graph box) that I provided earlier. I considered the point above the whisker to be an outlier, so I tried winsorising from 1% to 20%; at 20% winsorising, the point above the whisker was gone.

          Source:
          The winsorising method replaces the top and bottom p% of the data with the highest and lowest of the remaining values. The percentage can be determined by the researcher based on need or experience (CH'NG CHEE, 2016).



          • #6
            Thanks for your reply.

            My suggestion is that Winsorizing is a highly contentious procedure, so no beginner is well advised to use it unless they can defend that use convincingly.

            Your procedural rule is to ensure that no data point is more than 1.5 IQR from the nearer quartile.

            It follows that you would Winsorize samples of your size even if they came from a normal (Gaussian) distribution. Here is a simple simulation of samples of your size from such a distribution. So by construction the data are random samples from a normal distribution, they are about as well behaved as data could be imagined, and there is no skewness in the underlying population. There are always points in the tails on box plots beyond the ends of the whiskers. That is not exceptional, let alone pathological.

            Code:
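            * 12 samples, each of 1385 observations
            clear
            set obs `=1385*12'
            set seed 314159

            * draws from a standard normal distribution
            gen y = rnormal(0, 1)

            * sample identifier 1, ..., 12
            gen which = ceil(_n/1385)

            tab which

            * one box plot per sample
            graph box y, by(which)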
            [Image: dontwinsorize.png]
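As a hedged companion to the box plots, the points beyond the whiskers can also be counted directly. This sketch repeats the simulation above and flags values more than 1.5 IQR from the nearer quartile; the variable names `p25`, `p75`, `iqr` and `beyond` are introduced here for illustration only. For a normal distribution, theory puts roughly 0.7% of values beyond these limits, so a few flagged points in every sample of this size are to be expected.

```stata
* Same simulation as above: 12 samples of 1385 from a standard normal
clear
set obs `=1385*12'
set seed 314159
generate y = rnormal(0, 1)
generate which = ceil(_n/1385)

* Quartiles and IQR within each sample
egen p25 = pctile(y), p(25) by(which)
egen p75 = pctile(y), p(75) by(which)
generate iqr = p75 - p25

* Flag points more than 1.5 IQR from the nearer quartile
generate byte beyond = (y < p25 - 1.5*iqr) | (y > p75 + 1.5*iqr)

* Number of flagged points in each sample
tabulate which beyond
```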

            Now the convention of showing such points individually on a box plot was never suggested as a criterion for identifying bad or suspicious data points that should be omitted or modified. It was always intended only as a way of identifying data points that should be thought about and as an initial procedure that might guide further analysis. See for example Tukey, J. W. 1977. Exploratory Data Analysis. Reading, MA: Addison-Wesley. The context of Tukey's work was very different from ours 50 years later. He was focusing on what could be done with pen, paper and mental arithmetic -- not even slide rules or calculators. Also, most examples in his work were for much smaller samples than you have.

            The cost of doing that Winsorizing in your case was deciding that 40% of your values should be modified!
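To make the 40% figure concrete, here is a minimal sketch of Winsorizing both tails at 20% by hand and counting how many values change. The data are simulated and hypothetical (the lognormal `x` is not the poster's variable), and `x_w20` and `changed` are names made up for this illustration.

```stata
* Hypothetical right-skewed variable
clear
set obs 1385
set seed 2024
generate double x = exp(rnormal(1, 0.8))

* The 20th and 80th percentiles act as the cut-offs
_pctile x, percentiles(20 80)
scalar lo = r(r1)
scalar hi = r(r2)

* Winsorized copy: values beyond the cut-offs are pulled in to the cut-offs
generate double x_w20 = min(max(x, lo), hi)

* Share of observations whose value was changed
generate byte changed = (x_w20 != x)
summarize changed
```

The mean of `changed` is the fraction of observations altered; with 20% cut from each tail it comes out at about 0.4, whatever the shape of the distribution.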

            Many variables arrive with plots like yours in #1. Also, if the problem is in the tail of high values, why does that imply hacking at the bottom 20% too?

            Sorry, but I don't recognise the reference CH'NG CHEE, 2016, and what you cite is vacuous. Researchers should always make decisions based on need or experience. Unless you have prior experience with Winsorizing, what is the need here?

            On the evidence you give, two routes seem defensible:

            1. Leave the data as they are.

            2. Work on a logarithmic scale.

            Indeed, trying both would be a way to find out how much difference the choice makes.
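One way to try both routes might look like the following sketch. Everything in it is an assumption made for illustration: the panel is simulated, the variable is assumed to be the outcome rather than a predictor, and the names `id`, `t`, `x`, `y` and `ln_y` as well as the fixed-effects specification are hypothetical.

```stata
* Hypothetical panel: 200 units observed for 7 periods each
clear
set seed 2024
set obs 200
generate id = _n
* unit-specific effect
generate u = rnormal(0, 0.5)
expand 7
bysort id: generate t = _n

* Predictor and a positive, right-skewed outcome
generate x = rnormal(0, 1) + 0.5*u
generate y = exp(1 + 0.3*x + u + rnormal(0, 0.5))
generate ln_y = ln(y)

xtset id t

* Route 1: leave the outcome as it is
xtreg y x, fe
estimates store raw

* Route 2: work on a logarithmic scale
xtreg ln_y x, fe
estimates store logged

* Coefficients and standard errors side by side
estimates table raw logged, b se
```

The two sets of coefficients are on different scales (units of `y` versus proportional changes in `y`), so the comparison is about substantive conclusions, not about the raw numbers.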

            I don't know your context here -- whether you are a student working on an assignment or project or a researcher working towards a paper or thesis. But either way, you would be well advised to talk this over at your workplace. If you have teachers or mentors telling you this is a good idea, they in turn should be explaining why.



            • #7
              Dirah Lestari : But Ch'ng Chee (2016) only seems to show how to do winsorising (or detect outliers?), not why it should be done at all.

              In a Statalist thread 10 years ago Nick Cox nicely put it:
              "If the question is simple 'How to get rid of outliers?' then there is a good simple long answer: 'Don't (usually)' and a good simple short answer 'Don't'."
              A concise list of ways to deal with outliers has been put together by Nick here to which Richard Goldstein added:
              ... an outlier is a surprising result and it is often surprising because we have used a particular model -- thinking about why we obtained the surprise can sometimes lead to a different model without any outliers.
              (and for a general conclusion see here).

              BTW: Simply referring to Ch'ng Chee (2016) does not allow anyone to find the paper, we would need a full reference.
              Last edited by Dirk Enzmann; 03 Nov 2024, 05:32.



              • #8
                In addition to Dirk Enzmann's references see my reply agreeing with Rich

                https://www.stata.com/statalist/arch.../msg00241.html


                and this thread (which is much wider-ranging than the title implies)

                https://stats.stackexchange.com/ques...iers-with-mean

                Evidently in 2013 I was recycling parts of what had already been said on Statalist!



                • #9
                  Okay, Sir. I will follow your advice. Thank you, Sir.

                  Best regards,
                  Dirah



                  • #10
                    Dear Mr. Dirk,

                    This is the full reference.
                    CH’NG CHEE, K. (2016) Winsorize tree algorithm for handling outlier in classification problem. Doctor of Philosophy.



                    • #11
                      Thank you. The thesis has since been turned into an article, published together with Mahat as

                      Ch’ng, C. K., & Mahat, N. I. (2020). Winsorize tree algorithm for handling outlier in classification problem. International Journal of Operational Research, 38(2), 278–293. https://doi.org/10.1504/IJOR.2020.107073

                      and is freely available here.



                      • #12
                        Thank you, Mr. Dirk.

                        I take the point quoted below from the post "st: RE: IQR". So, as advised by Mr. Nick, I will try using the data as they are or working on a logarithmic scale.
                        "Dropping values more than 3 IQR away from the nearer quartile will in most instances throw out important information. It would throw away most major cities compared with cities in their country"



                        • #13
                          For what it's worth: In teaching I use this paper to provide a concrete example of why much caution (like Dirk Enzmann and Nick Cox are urging) should be exercised before trimming what might seem to be an "outlier."
                          https://www.jacc.org/doi/10.1016/073...2893%2990390-M



                          • #14
                            John Mullahy : Are you sure that the link pointing to the article you are using as an example of the costs of trimming is correct? I briefly scanned the article but couldn't find any mention of trimming or outlier treatment.



                            • #15
                              You are correct that the Powe et al. article doesn't mention trimming.

                              I use it to teach this point because of the excerpt below, which serves as a warning against automatically considering trimming for "outliers" without understanding that such data may contain real information.


                              [Image: Pages from Powe Outlier Osmolality Contrast 1-s2.0-073510979390390M-main.png]
