  • Removing outliers

    Hi all,

    I am currently working on my undergraduate dissertation, titled "The Macroeconomic Determinants of Mental Health". My supervisor advised me to remove (or transform) the outliers of my "Inflation" variable. I watched a tutorial on YouTube to help with this, but it has not worked.

    The code I used is:
    Code:
    ssc install winsor2 
    clonevar RealInflation=Inflation
    su RealInflation, d 
    replace RealInflation=r(p99) if RealInflation>=r(p99) & RealInflation <.
    
    xtreg DepressionPrevM lnGDPpc Gini Unemp RealInflation FLaborforce GovExpE HealthExp i.Year, fe cluster(CountryNum)
    estimates store RMAllCountriesDY
    xtreg DepressionPrevM lnGDPpc Gini Unemp RealInflation FLaborforce GovExpE HealthExp i.Year if EconomicRegion2019=="L", fe cluster(CountryNum)
    estimates store RMLLMIncomeDY
    xtreg DepressionPrevM lnGDPpc Gini Unemp RealInflation FLaborforce GovExpE HealthExp i.Year if EconomicRegion2019=="M", fe cluster(CountryNum)
    estimates store RMUMIncomeDY
    xtreg DepressionPrevM lnGDPpc Gini Unemp RealInflation FLaborforce GovExpE HealthExp i.Year if EconomicRegion2019=="H", fe cluster(CountryNum)
    estimates store RMHIncomeDY
    estimates table RMAllCountriesDY RMLLMIncomeDY RMUMIncomeDY RMHIncomeDY, star(0.1 0.05 0.01) stats(N r2)
    As I understand it, this takes the values of RealInflation greater than the 99th percentile and replaces them with the 99th percentile value. However, when I run my regressions, the change does not seem to have been taken into account, and the inflation values are as before.

    Is anyone able to see where I am going wrong with my code?
    Thank you

  • #2
    The 99th percentile will be returned by summarize as the maximum value in a sample so long as the sample size is 99 or less.

    So, your procedure depends on sample sizes being 100 or more -- as otherwise no value will be greater than the maximum.

    Consider also the possibility of tied values. Suppose you have 200 observations and the top 10 values are all 42. Then 42 will be the 99th percentile but no value will be greater. Note that the = sign in
    Code:
      
     replace RealInflation=r(p99) if RealInflation>=r(p99) & RealInflation <.
    does no harm but is redundant as replacing values by the same values is no change.
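
    The tied-values case above can be checked on a small simulated example. This is only a sketch: the data are made up, and the names x and cut are hypothetical, not from the dataset in this thread.

    ```stata
    * Simulated data (hypothetical): 200 observations, top 10 tied at 42
    clear
    set obs 200
    generate double x = cond(_n <= 190, _n/10, 42)

    * Save the percentile straight away; later r-class commands overwrite r()
    summarize x, detail
    scalar cut = r(p99)
    display "99th percentile = " cut

    * Strictly greater: no observation qualifies
    count if x > cut & x < .

    * Greater or equal: the 10 tied values qualify, but replacing 42 by 42 is no change
    count if x >= cut & x < .
    ```

    Copying r(p99) into a scalar before doing anything else guards against a later command overwriting the stored results.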

    It should be easier to see what is going on if you copy the results of

    Code:
    su Inflation, detail 



    • #3
      Hello,

      After you have run this line:
      Code:
      replace RealInflation=r(p99) if RealInflation>=r(p99) & RealInflation <.
      did Stata's Results window report "___ real changes made"? That should tell you whether the replacement worked.

      Also note that this procedure changed only about 1% of the data. If you didn't have many extremely high points to begin with, or if those points were not influential, you may not see much change in the regression.
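
      One way to check the replacement directly is to compare the clone with the original variable. The sketch below uses simulated data, since the thread's dataset is not available; the simulated Inflation values and the scalar name p99 are assumptions for illustration only.

      ```stata
      * Simulated stand-in for the Inflation variable (hypothetical values)
      clear
      set seed 12345
      set obs 5000
      generate double Inflation = exp(rnormal())

      * Same steps as in the thread, with the percentile saved in a scalar
      clonevar RealInflation = Inflation
      summarize RealInflation, detail
      scalar p99 = r(p99)
      replace RealInflation = p99 if RealInflation > p99 & RealInflation < .

      * How many observations differ from the original?
      count if RealInflation != Inflation

      * The maximum of the clone should now equal the saved percentile
      summarize Inflation RealInflation
      ```

      The count should match the "real changes made" message from replace, and the two maxima in the final summarize show whether the top values were pulled in.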



      • #4
        Hi,

        Thank you both for your responses. I should mention that I am working with country-level panel data; I don't know whether that makes a difference. Anyway, below are my results of
        Code:
        su RealInflation, d
        [Screenshot: output of su RealInflation, d (image not preserved)]

        When I ran
        Code:
        replace RealInflation=r(p99) if RealInflation>=r(p99) & RealInflation <.
        It said "47 real changes made".

        Instead, I tried running this
        Code:
        graph box Inflation
        hist Inflation
        ssc install winsor2
        clonevar RealInflation=Inflation
        su RealInflation, d
        replace RealInflation=r(p95) if RealInflation>=r(p95) & RealInflation <.
        
        xtreg DepressionPrevM lnGDPpc Gini Unemp RealInflation FLaborforce GovExpE HealthExp i.Year, fe cluster(CountryNum)
        estimates store RMAllCountriesDY
        xtreg DepressionPrevM lnGDPpc Gini Unemp RealInflation FLaborforce GovExpE HealthExp i.Year if EconomicRegion2019=="L", fe cluster(CountryNum)
        estimates store RMLLMIncomeDY
        xtreg DepressionPrevM lnGDPpc Gini Unemp RealInflation FLaborforce GovExpE HealthExp i.Year if EconomicRegion2019=="M", fe cluster(CountryNum)
        estimates store RMUMIncomeDY
        xtreg DepressionPrevM lnGDPpc Gini Unemp RealInflation FLaborforce GovExpE HealthExp i.Year if EconomicRegion2019=="H", fe cluster(CountryNum)
        estimates store RMHIncomeDY
        estimates table RMAllCountriesDY RMLLMIncomeDY RMUMIncomeDY RMHIncomeDY, star(0.1 0.05 0.01) stats(N r2)
        All I have done here is change values above the 95th percentile instead of the 99th percentile.

        The results table was then:
        [Screenshot: results table after replacing values above the 95th percentile (image not preserved)]

        For reference, the first column is for the full sample, followed by low, middle, and high income groupings.

        Without transforming the inflation outliers, the results table was:
        [Screenshot: results table without the outlier transformation (image not preserved)]

        As you can see, although the significance of the coefficients has not changed, their values have, so it looks like the outlier transformation worked. As the change is very small, I think this shows that altering the Inflation outliers doesn't have a notable effect on the model, so the original model is robust. It seems the significance of the inflation coefficient in the second column isn't attributable to outliers.

        Does this explanation seem correct to you?

        Thanks



        • #5
          In #1 you reported

          "it has not worked"

          "when I run my regressions, it seems it has not taken into account this change, and the inflation values are as before."

          So, that was wrong, or else you misinterpreted your results. Otherwise put, there is thus nothing to explain here.

          Whether Winsorizing is a good idea is a different issue.



          • #6
            Hi Nick,

            In #1, I used the code:
            Code:
            graph box Inflation
            hist Inflation
            ssc install winsor2 
            clonevar RealInflation=Inflation
            su RealInflation, d 
            replace RealInflation=r(p99) if RealInflation>=r(p99) & RealInflation <.
            
            xtreg DepressionPrevM lnGDPpc Gini Unemp RealInflation FLaborforce GovExpE HealthExp i.Year, fe cluster(CountryNum)
            estimates store RMAllCountriesDY
            xtreg DepressionPrevM lnGDPpc Gini Unemp RealInflation FLaborforce GovExpE HealthExp i.Year if EconomicRegion2019=="L", fe cluster(CountryNum)
            estimates store RMLLMIncomeDY
            xtreg DepressionPrevM lnGDPpc Gini Unemp RealInflation FLaborforce GovExpE HealthExp i.Year if EconomicRegion2019=="M", fe cluster(CountryNum)
            estimates store RMUMIncomeDY
            xtreg DepressionPrevM lnGDPpc Gini Unemp RealInflation FLaborforce GovExpE HealthExp i.Year if EconomicRegion2019=="H", fe cluster(CountryNum)
            estimates store RMHIncomeDY
            estimates table RMAllCountriesDY RMLLMIncomeDY RMUMIncomeDY RMHIncomeDY, star(0.1 0.05 0.01) stats(N r2)
            I thought it didn't work because my results didn't change. I think what actually happened is that the code did work, but my original model was robust enough that altering these outliers didn't change the regression results.

            Thanks



            • #7
              It defies plausibility that Winsorizing changed about 1% of nearly 5000 observations, as intended, but that the regression results were unchanged as compared with before. xtreg does not employ robust estimation. In talking about results I am thinking totally literally, as I think experienced users typically do, about the command output.
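
              A literal comparison of command output can be set up by storing both fits and tabulating them side by side. This is a minimal sketch on simulated panel data; the variables id, year, x, y, xw and the scalar cut are hypothetical and stand in for the thread's variables.

              ```stata
              * Simulated panel (hypothetical): 50 units observed for 20 periods
              clear
              set seed 2021
              set obs 50
              generate id = _n
              expand 20
              bysort id: generate year = _n
              xtset id year
              generate double x = exp(rnormal())
              generate double y = 0.5*x + rnormal()

              * Cap the upper tail of a clone at the 99th percentile
              clonevar xw = x
              summarize xw, detail
              scalar cut = r(p99)
              replace xw = cut if xw > cut & xw < .

              * Fit with the original and the capped predictor, then compare
              xtreg y x, fe vce(cluster id)
              estimates store raw
              xtreg y xw, fe vce(cluster id)
              estimates store capped
              estimates table raw capped, b(%9.4f) se stats(N r2_w)
              ```

              With both columns in one table, any difference in coefficients or standard errors is visible directly in the output.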

              If your basic interpretation is essentially similar to before, that's good news and perhaps implies that Winsorizing isn't essential at all.



              • #8
                Hi Nick,

                Sorry, I have a feeling I might be misunderstanding you.
                "It defies plausibility that Winsorizing changed about 1% of nearly 5000 observations, as intended, but that the regression results were unchanged as compared with before."

                Are you saying that my code has not changed the outliers as intended? I'm not familiar with Winsorizing, so I'm not really sure whether I'm using it correctly. Sorry if this is elementary; I'm pretty new to Stata, as you can probably tell.

                Thanks



                • #9
                  Not at all. I am saying that it did.



                  • #10
                    Hello Nick Cox
                    Could you please contact me at [email protected]? I need some help. Thank you.



                    • #11
                      ASMA ALSHAREEF Please start a new thread with your question. Sorry if this disappoints, but I don't offer or provide private consultancy.

                      Indeed you already asked a question at https://www.statalist.org/forums/for...2664-quartiles



                      • #12
                        Hi Nick,

                        Thanks so much for your help. I really appreciate it, you have no idea
