Outliers in panel data for countries

James Shepherd

Join Date: Jan 2025

Posts: 6
#1

03 Jan 2025, 10:57

Hi, I'm trying to identify the determinants of FDI in developing countries using several variables. I have already completed the analysis with a fixed-effects model, but I am now concerned that I have not done enough to address outliers, and that a low R2 is the consequence. Obviously, large variation in the variables is to be expected because of country-specific factors. I logged some of the variables before fitting the model, but large variations remain within my panel, with Z-scores etc. indicating outliers. Should I do more to address these points, or would that reduce heterogeneity? Attached is the summary statistic for the variables.
George Ford

Join Date: Aug 2014

Posts: 3152
#2

03 Jan 2025, 13:36

What makes you think they are outliers and not just the variability within developing countries? Do prior studies address outliers?

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17709

04 Jan 2025, 02:49

James:
as an aside to George's helpful reply:
1) it is not clear which R-sq you're referring to. Please note that the within R-sq is the one to look at when using the -fe- estimator;
2) once the time-invariant variables are wiped out, the -fe- estimator needs substantial within-panel variation to work at its best. Is that the case here?
3) have you checked the specification of the functional form of the regressand by repeating by hand the procedure described in the -linktest- entry of the Stata .pdf manual? See the following toy-example, if interested:

Code:

. use https://www.stata-press.com/data/r18/nlswork2.dta
(National Longitudinal Survey of Young Women, 14-24 years old in 1968)

. xtreg ln_wage c.age##c.age i.year, fe vce(cluster idcode)

Fixed-effects (within) regression               Number of obs     =     16,085
Group variable: idcode                          Number of groups  =      3,913

R-squared:                                      Obs per group:
     Within  = 0.1044                                         min =          1
     Between = 0.0554                                         avg =        4.1
     Overall = 0.0494                                         max =          9

                                                F(10, 3912)       =      78.73
corr(u_i, Xb) = -0.0091                         Prob > F          =     0.0000

                             (Std. err. adjusted for 3,913 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
     ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |   .0893661   .0202488     4.41   0.000      .049667    .1290653
             |
 c.age#c.age |  -.0017257   .0002332    -7.40   0.000    -.0021829   -.0012685
             |
        year |
         69  |   .0802262   .0192531     4.17   0.000     .0424792    .1179733
         70  |   .0687796   .0360133     1.91   0.056    -.0018271    .1393863
         71  |   .1203864   .0532847     2.26   0.024      .015918    .2248549
         72  |   .1398369   .0704264     1.99   0.047     .0017609    .2779129
         73  |   .1517485   .0880691     1.72   0.085    -.0209171    .3244141
         75  |   .1711454   .1219438     1.40   0.161    -.0679341    .4102249
         77  |   .2449266   .1571156     1.56   0.119    -.0631096    .5529629
         78  |   .3026813   .1753501     1.73   0.084    -.0411049    .6464676
             |
       _cons |   .2982693   .3554177     0.84   0.401    -.3985523    .9950908
-------------+----------------------------------------------------------------
     sigma_u |  .37193126
     sigma_e |   .2654505
         rho |  .66252376   (fraction of variance due to u_i)
------------------------------------------------------------------------------

. predict fitted, xb
(9 missing values generated)

. g sq_fitted=fitted^2
(9 missing values generated)

. xtreg ln_wage fitted sq_fitted , fe vce(cluster idcode)

Fixed-effects (within) regression               Number of obs     =     16,085
Group variable: idcode                          Number of groups  =      3,913

R-squared:                                      Obs per group:
     Within  = 0.1047                                         min =          1
     Between = 0.0559                                         avg =        4.1
     Overall = 0.0506                                         max =          9

                                                F(2, 3912)        =     361.79
corr(u_i, Xb) = -0.0060                         Prob > F          =     0.0000

                             (Std. err. adjusted for 3,913 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
     ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
      fitted |   2.263008   .7442291     3.04   0.002     .8038945    3.722122
   sq_fitted |  -.4017954    .236566    -1.70   0.090    -.8655997    .0620088
       _cons |  -.9887445   .5845951    -1.69   0.091    -2.134884    .1573954
-------------+----------------------------------------------------------------
     sigma_u |  .37183192
     sigma_e |   .2653217
         rho |  .66262133   (fraction of variance due to u_i)
------------------------------------------------------------------------------
.

As sq_fitted has no explanatory power (p = 0.090 > 0.05), there is no evidence of model misspecification (despite a low within R-sq).
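
On point 2), one way to gauge how much within-panel variation the data carry is -xtsum-, which splits the standard deviation of each variable into between and within components. What follows is only a minimal sketch on the same toy dataset; the two variables are illustrative, and -xtset idcode- is included on the assumption that re-declaring the panel identifier does no harm:

```stata
* Same toy dataset as in the example above
use https://www.stata-press.com/data/r18/nlswork2.dta, clear

* Declare the panel identifier (assumed harmless if it is already set)
xtset idcode

* Overall, between and within standard deviations of each variable
xtsum ln_wage age
```

A within standard deviation that is small relative to the between one suggests that the -fe- estimator has little variation left to work with for that variable.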

Kind regards,
Carlo
(Stata 19.0)

Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

06 Jan 2025, 04:44

Speaking very generally:

You need a definition or criterion of outliers before you can identify them. One definition is that a data point is an outlier if it is surprising in the light of a model for the data, which is deliberately a chicken-and-egg definition, in that you need the model too to make any decision. A simple but common example is that outliers on raw scales may seem quite natural on transformed (e.g. logarithmic) scales, or that what doesn't look good compared with a normal distribution makes more sense compared with a gamma or lognormal (always noting that typically what matters are conditional distributions, not marginal distributions).
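
A rough sketch of the transformation point, assuming the toy dataset used in the earlier reply. Here wage is not in the dataset; it is constructed by exponentiating ln_wage purely for illustration. The plots look only at marginal distributions, so they are a first glance rather than a check on the conditional distributions that matter for the model:

```stata
use https://www.stata-press.com/data/r18/nlswork2.dta, clear

* wage is a constructed variable (not in the dataset): back-transform ln_wage
generate wage = exp(ln_wage)

* Skewness and the upper tail can look very different on the two scales
summarize wage ln_wage, detail

* Normal quantile plots: points far from the reference line on the raw
* scale may sit much closer to it on the logarithmic scale
qnorm wage, name(raw, replace)
qnorm ln_wage, name(logged, replace)
graph combine raw logged
```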

It's not obvious that outliers would act to deflate R-square or any other measure of goodness of fit; they might inflate it, and that might be natural or acceptable too. It can be obvious that points are outliers on, say, a scatter plot, but it is far more difficult to tell with several variables.

It seems unfashionable but I am in favour of glancing at diagnostic plots once you have a tentative model. Simple plots like residuals vs fitted and observed vs fitted should be used much more than they appear to be.
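
The two plots just mentioned might be sketched as follows after a fixed-effects fit, reusing the toy model from the earlier reply. The names yhat and ehat are made up for this example, and using xb + u_i as the fitted value is one choice among several:

```stata
use https://www.stata-press.com/data/r18/nlswork2.dta, clear
xtreg ln_wage c.age##c.age i.year, fe vce(cluster idcode)

* Fitted values including the estimated fixed effect u_i
predict double yhat, xbu
* Idiosyncratic residuals e_it
predict double ehat, e

* Residuals vs fitted
scatter ehat yhat, yline(0) msymbol(oh) name(rvf, replace)

* Observed vs fitted, with the line of equality for reference
twoway (scatter ln_wage yhat, msymbol(oh))            ///
       (function y = x, range(yhat)),                 ///
       legend(off) ytitle("Observed ln_wage")         ///
       xtitle("Fitted (xb + u_i)") name(ovf, replace)
```

Points that sit far from the bulk of the cloud in either plot are candidates for a closer look, which is not the same as candidates for deletion.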