  • Outliers in panel data for countries

    Hi, I'm attempting to find the determinants of FDI in developing countries using multiple variables. I have already completed the analysis using a fixed effects model; however, I am now concerned that I have not done enough to address outliers, with a low R2 as a consequence. Obviously, large variation in the variables is to be expected due to country-specific factors. I have logged some variables when fitting the model, but large variations still exist within my panel, with Z-scores etc. indicating outliers. Should I do more to address these points, or would that reduce heterogeneity? Attached are the summary statistics for the variables.
    [Attached image: Image 03-01-2025 at 17.56.jpeg]

  • #2
    What makes you think they are outliers and not just the variability within developing countries? Do prior studies address outliers?

    • #3
      James:
      as an aside to George's helpful reply:
      1) it is not clear which R-sq you're referring to. Please note that the within R-sq is the one to look at when using the -fe- estimator;
      2) once the time-invariant variables are wiped out, the -fe- estimator needs substantial within-panel variation to work at its best. Is that your case?
      3) have you checked the specification of the functional form of the regressand, repeating by hand the procedure described in the -linktest- entry of the Stata .pdf manual? See the following toy example, if interested:
      Code:
      . use https://www.stata-press.com/data/r18/nlswork2.dta
      (National Longitudinal Survey of Young Women, 14-24 years old in 1968)
      
      . xtreg ln_wage c.age##c.age i.year, fe vce(cluster idcode)
      
      Fixed-effects (within) regression               Number of obs     =     16,085
      Group variable: idcode                          Number of groups  =      3,913
      
      R-squared:                                      Obs per group:
           Within  = 0.1044                                         min =          1
           Between = 0.0554                                         avg =        4.1
           Overall = 0.0494                                         max =          9
      
                                                      F(10, 3912)       =      78.73
      corr(u_i, Xb) = -0.0091                         Prob > F          =     0.0000
      
                                   (Std. err. adjusted for 3,913 clusters in idcode)
      ------------------------------------------------------------------------------
                   |               Robust
           ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
               age |   .0893661   .0202488     4.41   0.000      .049667    .1290653
                   |
       c.age#c.age |  -.0017257   .0002332    -7.40   0.000    -.0021829   -.0012685
                   |
              year |
               69  |   .0802262   .0192531     4.17   0.000     .0424792    .1179733
               70  |   .0687796   .0360133     1.91   0.056    -.0018271    .1393863
               71  |   .1203864   .0532847     2.26   0.024      .015918    .2248549
               72  |   .1398369   .0704264     1.99   0.047     .0017609    .2779129
               73  |   .1517485   .0880691     1.72   0.085    -.0209171    .3244141
               75  |   .1711454   .1219438     1.40   0.161    -.0679341    .4102249
               77  |   .2449266   .1571156     1.56   0.119    -.0631096    .5529629
               78  |   .3026813   .1753501     1.73   0.084    -.0411049    .6464676
                   |
             _cons |   .2982693   .3554177     0.84   0.401    -.3985523    .9950908
      -------------+----------------------------------------------------------------
           sigma_u |  .37193126
           sigma_e |   .2654505
               rho |  .66252376   (fraction of variance due to u_i)
      ------------------------------------------------------------------------------
      
      . predict fitted, xb
      (9 missing values generated)
      
      . g sq_fitted=fitted^2
      (9 missing values generated)
      
      . xtreg ln_wage fitted sq_fitted , fe vce(cluster idcode)
      
      Fixed-effects (within) regression               Number of obs     =     16,085
      Group variable: idcode                          Number of groups  =      3,913
      
      R-squared:                                      Obs per group:
           Within  = 0.1047                                         min =          1
           Between = 0.0559                                         avg =        4.1
           Overall = 0.0506                                         max =          9
      
                                                      F(2, 3912)        =     361.79
      corr(u_i, Xb) = -0.0060                         Prob > F          =     0.0000
      
                                   (Std. err. adjusted for 3,913 clusters in idcode)
      ------------------------------------------------------------------------------
                   |               Robust
           ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
            fitted |   2.263008   .7442291     3.04   0.002     .8038945    3.722122
         sq_fitted |  -.4017954    .236566    -1.70   0.090    -.8655997    .0620088
             _cons |  -.9887445   .5845951    -1.69   0.091    -2.134884    .1573954
      -------------+----------------------------------------------------------------
           sigma_u |  .37183192
           sigma_e |   .2653217
               rho |  .66262133   (fraction of variance due to u_i)
      ------------------------------------------------------------------------------
      .
      As sq_fitted has no explanatory power (p>0.05), there is no evidence of model misspecification (despite a low within R-sq).
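      On point 2), a minimal sketch of how within-panel variation could be inspected, reusing the same toy dataset. The choice of variables is purely illustrative and is not taken from the FDI panel; -xtsum- reports the overall, between and within standard deviations of each variable:
      Code:
      * reload the toy dataset and declare the panel structure
      use https://www.stata-press.com/data/r18/nlswork2.dta, clear
      xtset idcode year
      
      * decompose the variation of the regressand and a regressor
      xtsum ln_wage age
      A within standard deviation that is small relative to the between one suggests that the -fe- estimator has little variation to work with for that variable.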
      Kind regards,
      Carlo
      (StataNow 18.5)

      • #4
        Speaking very generally:

        You need a definition or criterion of outliers before you can identify them. One definition is that a data point is an outlier if it is surprising in the light of a model for the data, which is deliberately a chicken-and-egg definition, in that you need the model too to make any decision. A simple but common example is that outliers on raw scales may seem quite natural on transformed (e.g. logarithmic) scales, or that what doesn't look good compared with a normal distribution makes more sense compared with a gamma or lognormal (always noting that typically what matters are conditional distributions, not marginal distributions).
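
        A purely illustrative sketch of the raw versus logarithmic scale point, using simulated data rather than the FDI panel. The seed, the sample size and the |z| > 3 cutoff are arbitrary assumptions, and the variable names are hypothetical:
        Code:
        clear
        set seed 12345
        set obs 1000
        
        * a lognormal variable and its logarithm
        generate double y = exp(rnormal(0, 1))
        generate double ln_y = ln(y)
        
        * z-scores on each scale
        egen double z_raw = std(y)
        egen double z_log = std(ln_y)
        
        * number of points flagged by the same cutoff on each scale
        count if abs(z_raw) > 3
        count if abs(z_log) > 3
        On the raw scale any flagged points can only be in the right tail, because y is positive. On the log scale the variable is normal by construction, so only about 0.3% of draws are expected beyond 3 standard deviations.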

        It's not obvious that outliers would act to deflate R-square or any other measure of goodness of fit; they might inflate it, and that might be natural or acceptable too. It can be obvious that points are outliers on, say, a scatter plot, but it is far more difficult to tell with several variables.

        It seems unfashionable but I am in favour of glancing at diagnostic plots once you have a tentative model. Simple plots like residuals vs fitted and observed vs fitted should be used much more than they appear to be.
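
        A minimal sketch of such plots after -xtreg, fe-, reusing the toy model from #3 rather than the FDI data. The names yhat and ehat are hypothetical, and taking the fitted values to include the estimated fixed effect (-predict, xbu-) is one choice among several:
        Code:
        use https://www.stata-press.com/data/r18/nlswork2.dta, clear
        xtreg ln_wage c.age##c.age i.year, fe vce(cluster idcode)
        
        * fitted values including the estimated fixed effect, and idiosyncratic residuals
        predict double yhat, xbu
        predict double ehat, e
        
        * residuals vs fitted, with a reference line at zero
        scatter ehat yhat, yline(0) msymbol(oh) name(rvf, replace)
        
        * observed vs fitted, with the line of equality
        twoway (scatter ln_wage yhat, msymbol(oh)) (line yhat yhat, sort), legend(off) ytitle("Observed ln_wage") xtitle("Fitted values") name(ovf, replace)
        Points far from zero in the first plot, or far from the line of equality in the second, are the ones that are surprising in the light of this particular model.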
