  • Winsorization of treated group dummy in difference-in-differences analysis.

    Dear members,
    I am applying difference-in-differences (DiD) analysis in my study. My question is: if I want to winsorize the variables, should I do it for the whole sample or separately by group?

    Thanks and Regards

  • #2
    My first thought is this (though not determinative): If you are testing for a means difference across treated/untreated, then you don't do something to one group differently than you do for the other. If you do, then you are testing 2 things (treatment, winsor).

    With winsor, you are looking for extreme values in the distribution of a variable. There are cases where you'd do it by subgroups of a sample. Say you had price data on car models: Toyota Corolla and a Lexus RX450. $60,000 would be a weird value for a Corolla, but not the Lexus. If you winsor together, then you're not really doing what you want to do: you alter values for the Corolla at the low end and values for the Lexus at the high end, adjusting/tossing legitimate values.

    Say you had many car models, and maybe imported luxury cars are subject to a new tax (a treatment). I'd think you could winsor by model, but not because they are treated or untreated, but because their mean prices (distributions) are different and there's reason to believe some of the prices are peculiar.
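A minimal Python sketch of the car example above, using made-up prices (the thread itself contains no data or code). The helper `winsorize` and its count-based argument `k` are hypothetical names chosen for illustration; this is not the `winsor` command from SSC. It contrasts winsorizing the pooled sample with winsorizing each model separately.

```python
def winsorize(values, k):
    """Winsorize k values in each tail (hypothetical helper).

    The k lowest values are set to the (k+1)-th lowest value and the
    k highest values are set to the (k+1)-th highest value.
    """
    ordered = sorted(values)
    if 2 * k >= len(ordered):
        raise ValueError("k is too large for this sample size")
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]


# Invented prices: one Corolla is recorded at an implausible 60,000.
corolla = [21000, 22000, 22500, 23000, 23500, 24000, 60000]
lexus = [55000, 58000, 59000, 60000, 61000, 62000, 65000]

# Pooled: one set of cutoffs for both models together.
pooled = winsorize(corolla + lexus, 1)
pooled_corolla = pooled[:len(corolla)]
pooled_lexus = pooled[len(corolla):]

# By model: separate cutoffs for each model.
by_model_corolla = winsorize(corolla, 1)
by_model_lexus = winsorize(lexus, 1)

print("pooled Corolla:  ", pooled_corolla)
print("by-model Corolla:", by_model_corolla)
print("pooled Lexus:    ", pooled_lexus)
print("by-model Lexus:  ", by_model_lexus)
```

With these illustrative numbers, pooling leaves the 60,000 Corolla untouched and instead changes the cheapest Corolla and the most expensive Lexus, whereas winsorizing by model pulls the 60,000 Corolla in to 24,000. Whether either choice is a good idea for a given analysis is a separate question.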




    • #3
      Thanks, George.
      Now it is clear to me.

      Regards.



      • #4
        These questions are always weird for me, because I wrote a command in this territory, winsor from SSC.

        That was just the result of trying to help on some question or questions on Statalist in 1998; the original emails have long since disappeared, so a larger story would depend on someone else's fallible memory. Perhaps only Rich Goldstein of currently active Statalist members was also active in 1998.

        I see the point of winsorizing as a way of moving towards a robust summary of the level or variability of a variable that discounts outliers. Using it to modify the data in advance of some other analysis is quite a different ball game. The most important issue is how can you be confident that this is not only a good idea but also the best idea to deal with perceived problems. I get the impression that winsorizing is common in some areas of economics and finance and most people wanting to do it are essentially imitating some papers in a sub-field that did it previously. I have often asked here for references to authoritative texts or review papers that explain systematically that this is a good idea and never got any answer to that.

        The issue is not whether you have outliers or long tails that worry you. The issue is why winsorizing individual variables at a certain level is a good strategy to deal with that. I can't detail all the difficulties, but here's one more. Whether a particular observation is or is not an outlier that will prove awkward is a multivariate question; it's not best tackled by looking at marginal distributions one at a time.

        Working on a logarithmic scale somehow -- or using not means, but some other summary -- are, I guess, far more likely to be good ideas.



        • #5
          Dear Nick,

          "I get the impression that winsorizing is common in some areas of economics and finance and most people wanting to do it are essentially imitating some papers in a sub-field that did it previously."
          I agree with you about imitation. But people imitate it because it has become a standard procedure in these fields.

          "The issue is not whether you have outliers or long tails that worry you. The issue is why winsorizing individual variables at a certain level is a good strategy to deal with that. I can't detail all the difficulties, but here's one more. Whether a particular observation is or is not an outlier that will prove awkward is a multivariate question; it's not best tackled by looking at marginal distributions one at a time."
          Yes, you are right about the homogeneous winsor treatment for all variables; it is not the right thing to do.

          "Working on a logarithmic scale somehow -- or using not means, but some other summary -- are, I guess, far more likely to be good ideas."
          I understand the use of a logarithmic scale as a solution. But what reference can we cite for using log scaling for all the variables? Generally, in these areas, authors scale some variables by some firm characteristic (say size) and take the log of some variables (like size itself or age, etc.)

          What do you mean by "or using not means"?

          Thanks and Regards
          Last edited by Pranshu Tripathi; 30 May 2024, 06:53.



          • #6
            Statistical science, and science generally, has many local practices that may be pervasive for a subgroup but are a really bad idea. Graphics is close to me, so I will give just two examples: illustrating ANOVA with box plots that don't even show means; and so-called dynamite, detonator or plunger plots that just summarize data by a summary bar and an error bar when much more detail is both possible and desirable.

            I once advised someone else's graduate student who was using PCA on the grounds that "this is the technique that people use in my field", when the real problem, once identified, turned out to need logit regression (which he had never heard of, but was happy to learn about).

            Most comparisons in this territory, and in statistical science more generally, seem to boil down to comparisons of means. If distributions make that problematic, use some other summary, say medians or geometric means, or work on a transformed scale.

            Using logarithms rarely means using logarithms on all variables. That is often impossible (e.g. with (0, 1) variables) or not a good idea. I have in mind principally an outcome variable, and even in that context I tend to consider that working with a logarithmic link (in GLM jargon) is usually a better strategy than logarithmic transformation of the outcome.
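A small Python sketch of the alternative summaries mentioned above (medians, geometric means), on invented values with one very large observation. The data and variable names are assumptions for illustration only; the functions come from Python's standard `statistics` module.

```python
import statistics

# Hypothetical right-skewed, strictly positive outcome values.
outcome = [1, 2, 4, 8, 1024]

mean = statistics.mean(outcome)
median = statistics.median(outcome)
# Geometric mean: the exponential of the mean of the logs,
# so it is only defined for positive values.
geo_mean = statistics.geometric_mean(outcome)

print("mean:          ", mean)
print("median:        ", median)
print("geometric mean:", geo_mean)
```

Here the arithmetic mean is dominated by the single large value, while the median and the geometric mean are far less affected, and no data value has been altered to get there.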



            • #7
              While I appreciate Nick Cox's belief in my memory, I admit that I do not remember the event that led to him writing his program (yes, I was active then). However, my general feeling about winsorization is that one should not do it, as it gives the impression, at least, of playing with the data to obtain a desired result and, generally, has no theoretical support.



              • #8
                Rich Goldstein: Indeed. I once wrote a program, but what people do with it is beyond my control and often not at all what I would advise or support. No one else needs to care about that, except that I do.

                There is also a winsor2 which (as is fine) acknowledges code from winsor but otherwise is nothing to do with me.

                I found a use for Winsorizing as a Winsorized variance turns out to be useful in getting a rough confidence interval for trimmed means. There again, the point is just to get a robust or resistant summary, not to change the data in advance of something else.
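A rough Python sketch of that use, on invented data. It assumes one common textbook approximation, in which the standard error of a trimmed mean is the Winsorized standard deviation divided by (1 - 2k/n) times the square root of n; formulas differ slightly between texts, and the function name `trimmed_mean_se` is hypothetical. A confidence interval would then typically use a t quantile with n - 2k - 1 degrees of freedom, which is left out here because the standard library has no t distribution.

```python
import math
import statistics


def trimmed_mean_se(values, k):
    """Trimmed mean, Winsorized variance and an approximate standard error.

    k is the number of values trimmed (or Winsorized) in each tail.
    """
    ordered = sorted(values)
    n = len(ordered)
    if 2 * k >= n:
        raise ValueError("k is too large for this sample size")
    kept = ordered[k:n - k]
    trimmed_mean = statistics.mean(kept)
    # Winsorized sample: the trimmed values are replaced by the nearest kept value.
    winsorized = [kept[0]] * k + kept + [kept[-1]] * k
    winsorized_var = statistics.variance(winsorized)  # divisor n - 1
    # Assumed approximation: s_w / ((1 - 2k/n) * sqrt(n)).
    se = math.sqrt(winsorized_var) / ((1 - 2 * k / n) * math.sqrt(n))
    return trimmed_mean, winsorized_var, se


# Hypothetical sample with one wild value.
sample = [2, 4, 5, 6, 7, 8, 9, 11, 12, 95]
t_mean, w_var, t_se = trimmed_mean_se(sample, 1)
print("trimmed mean:       ", t_mean)
print("Winsorized variance:", w_var)
print("approximate SE:     ", t_se)
```

Note that the Winsorized values are used only to compute a variance for the summary; the data passed to any later analysis are unchanged.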

                There might be a view that working on a transformed scale is also changing the data and there are pitfalls there too; transformations too can be singularly ill-advised.



                • #9
                  Dear Nick

                  I missed that you are the writer of the winsor command.

                  It feels good that the Stata forum lets me put my queries to contributors like you. Sadly, people like me are misusing this command.

                  For my current study, sadly, I have to follow the imitation game for publication requirements. But next time, I would love to explore your suggestions.


                  Thanks and regards.



                  • #10
                    Thanks for your thanks. My own preference to explain my position is not important here except to me. But I would tend to make the same points even if I had not written that command.

                    I appreciate that students at various levels are often under some obligation to follow particular analyses. At the same time, this phenomenon is one contributor to the tendency of strange practices to persist.
