PPML for gravity model estimation, question on CLUSTERING and ROBUST STANDARD ERRORS

Erik Katovich

Join Date: Apr 2016

Posts: 3
#1

06 Apr 2016, 13:47

Dear all,

I am estimating a gravity model with PPML (Poisson Pseudo-Maximum Likelihood estimator) in order to account for zero trade values. The data set includes bilateral trade between a reference country and 58 partner countries for a single year. My dependent variable (trade) is scaled into thousands of dollars, and is left in levels. Explanatory variables include gdp (scaled into thousands and natural logged), distance (natural logged), and dummies for common language and contiguity. I also include an indexed policy variable (an index of GMO regulations, my variable of interest) ranging from 0 to 5.

My current model thus looks like:
xi: ppml trade ln(gdp) ln(distance) i.contiguity i.common_language gmoindex
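
A note on the schematic command above: Stata varlists do not accept function calls such as ln(gdp), so the logged regressors are usually generated before estimation. The following is a minimal sketch on simulated data. The variable names (lngdp, lndist, contiguity, common_language) and all data-generating values are assumptions for illustration, not the poster's data, and the sketch assumes that ppml has been installed from SSC.

```stata
* Sketch on simulated data; all names and values here are hypothetical.
* Assumes the user-written command is installed: ssc install ppml
clear
set seed 2016
set obs 58
generate double gdp      = exp(rnormal(12, 1.5))   // GDP in thousands of dollars
generate double distance = exp(rnormal(9, 0.5))
generate byte contiguity      = mod(_n, 10) == 0
generate byte common_language = mod(_n, 4) == 0
generate byte gmoindex        = mod(_n, 6)         // index from 0 to 5

* Create the logged regressors before estimation
generate double lngdp  = ln(gdp)
generate double lndist = ln(distance)

* Trade stays in levels; the multiplicative error has mean one
generate double trade = exp(-5 + 0.8*lngdp - 0.5*lndist + 0.5*contiguity ///
    + 0.3*common_language - 0.2*gmoindex) * exp(rnormal(-0.125, 0.5))

ppml trade lngdp lndist contiguity common_language gmoindex
```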

When I run this in Stata, my output looks like:

xi: ppml trade gdp dist i.contig i.comlang gmoindex
i.contig _Icontig_0-1 (naturally coded; _Icontig_0 omitted)
i.comlang _Icomlang_0-1 (naturally coded; _Icomlang_0 omitted)

note: checking the existence of the estimates
WARNING: trade has very large values, consider rescaling
WARNING: gdp has very large values, consider rescaling or recentering

Number of regressors excluded to ensure that the estimates exist: 0
Number of observations excluded: 0

note: starting ppml estimation
note: trade has noninteger values

Iteration 1: deviance = 1.35e+07
Iteration 2: deviance = 7885809
Iteration 3: deviance = 5257384
Iteration 4: deviance = 4276665
Iteration 5: deviance = 4098061
Iteration 6: deviance = 4089311
Iteration 7: deviance = 4089280
Iteration 8: deviance = 4089280
Iteration 9: deviance = 4089280

Number of parameters: 6
Number of observations: 58
Pseudo log-likelihood: -2044797.2
R-squared: .98220469
Option strict is: off
WARNING: The model appears to overfit some observations with trade=0
------------------------------------------------------------------------------
             |               Semirobust
 soyaArg2008 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gdp |    2.28884   .4709613     4.86   0.000     1.365773    3.211908
        dist |   9.712658   1.676844     5.79   0.000     6.426104    12.99921
  _Icontig_1 |   19.45055   4.171823     4.66   0.000     11.27393    27.62717
 _Icomlang_1 |   5.550667   1.874701     2.96   0.003      1.87632    9.225013
    gmoindex |  -10.93835   3.844079    -2.85   0.004     -18.4726   -3.404092
       _cons |  -126.2497   23.13639    -5.46   0.000    -171.5962   -80.90324
------------------------------------------------------------------------------

I have two concerns about my output.

First, I am concerned about controlling for heteroscedasticity, and thus want robust standard errors. However, my various attempts have only produced “semirobust standard errors”, and the , robust option does not work with ppml. After glancing through other posts, it appears that clustering may resolve this problem. However, I don’t understand what type of clusters I should use or which variables to cluster on. Would this give robust standard errors, or is there another way to get robust results?

Second, the above output gives the warning that the model “appears to overfit some observations with trade=0.” I believe this problem has to do with defining/omitting dummy variables (based on the Statalist post: http://www.statalist.org/forums/foru...ariance-matrix). I tried using xi, noomit: ppml [model], but the warning did not go away. I also tried dropping the i. prefix from my dummies (which I had already created manually in Excel), but this didn’t remove the warning either.

I appreciate your help, and I apologize that much of this content is new for me, so some of my problems may be quite naïve.

Best regards,
Erik
Joao Santos Silva

Join Date: Apr 2014

Posts: 3011
#2

07 Apr 2016, 00:23

Erik,

The standard errors are robust; "semi-robust" is just the label Stata uses. Having said that, it is customary to cluster by country pair.

Your estimation results suggest that you are indeed estimating a model in which some coefficients are not identified. Please estimate the model with dummies for all categories and drop the constant.

Finally, notice that your sample is very small.

Best wishes,

Joao
Erik Katovich

Join Date: Apr 2016

Posts: 3
#3

26 Apr 2016, 09:10

Hi Joao,

Thank you so much for your response!

Following your suggestion, I re-estimated my model without a constant term. This, however, appears to have a large effect on the signs and magnitude of my results, and I am trying to understand what exactly including/excluding the constant is doing, and which of my results I should believe.
When I estimate my model with a constant term, I get the following output (which includes the warning that the model appears to overfit some observations):

. xi: ppml soyaArg2008 gdp2008 distArg i.contigArg i.comlangArg gmoindex

i.contigArg _IcontigArg_0-1 (naturally coded; _IcontigArg_0 omitted)
i.comlangArg _IcomlangAr_0-1 (naturally coded; _IcomlangAr_0 omitted)

note: checking the existence of the estimates
WARNING: soyaArg2008 has very large values, consider rescaling
WARNING: gdp2008 has very large values, consider rescaling or recentering

Number of regressors excluded to ensure that the estimates exist: 0
Number of observations excluded: 0

note: starting ppml estimation
note: soyaArg2008 has noninteger values

Iteration 1: deviance = 1.35e+07
Iteration 2: deviance = 7885809
Iteration 3: deviance = 5257384
Iteration 4: deviance = 4276665
Iteration 5: deviance = 4098061
Iteration 6: deviance = 4089311
Iteration 7: deviance = 4089280
Iteration 8: deviance = 4089280
Iteration 9: deviance = 4089280

Number of parameters: 6
Number of observations: 58
Pseudo log-likelihood: -2044797.2
R-squared: .98220469
Option strict is: off
WARNING: The model appears to overfit some observations with soyaArg2008=0
-------------------------------------------------------------------------------
              |               Semirobust
  soyaArg2008 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
      gdp2008 |    2.28884   .4709613     4.86   0.000     1.365773    3.211908
      distArg |   9.712658   1.676844     5.79   0.000     6.426104    12.99921
_IcontigArg_1 |   19.45055   4.171823     4.66   0.000     11.27393    27.62717
_IcomlangAr_1 |   5.550667   1.874701     2.96   0.003      1.87632    9.225013
     gmoindex |  -10.93835   3.844079    -2.85   0.004     -18.4726   -3.404092
        _cons |  -126.2497   23.13639    -5.46   0.000    -171.5962   -80.90324
-------------------------------------------------------------------------------

However, when I estimate it without a constant term, I get this output, which is significantly different in sign and magnitude:

xi: ppml soyaArg2008 gdp2008 distArg i.contigArg i.comlangArg gmoindex, noconstant

i.contigArg _IcontigArg_0-1 (naturally coded; _IcontigArg_0 omitted)
i.comlangArg _IcomlangAr_0-1 (naturally coded; _IcomlangAr_0 omitted)

note: checking the existence of the estimates
WARNING: soyaArg2008 has very large values, consider rescaling
WARNING: gdp2008 has very large values, consider rescaling or recentering

Number of regressors excluded to ensure that the estimates exist: 0
Number of observations excluded: 0

note: starting ppml estimation
note: soyaArg2008 has noninteger values

Iteration 1: deviance = 4.22e+07
Iteration 2: deviance = 2.48e+07
Iteration 3: deviance = 2.01e+07
Iteration 4: deviance = 1.92e+07
Iteration 5: deviance = 1.91e+07
Iteration 6: deviance = 1.91e+07
Iteration 7: deviance = 1.91e+07
Iteration 8: deviance = 1.91e+07

Number of parameters: 5
Number of observations: 58
Pseudo log-likelihood: -9561509
R-squared: .05665912
Option strict is: off
-------------------------------------------------------------------------------
              |               Semirobust
  soyaArg2008 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
      gdp2008 |   .6238821   .1778681     3.51   0.000     .2752669    .9724972
      distArg |  -.0493533   .3417985    -0.14   0.885    -.7192661    .6205594
_IcontigArg_1 |  -1.382808   1.750803    -0.79   0.430    -4.814319    2.048702
_IcomlangAr_1 |  -2.029582   1.064701    -1.91   0.057    -4.116357    .0571938
     gmoindex |   -1.83519   .9883912    -1.86   0.063    -3.772401    .1020208
-------------------------------------------------------------------------------

Furthermore, when I define my dummy variables within Stata (I’m not sure what exactly you meant by “estimating the model with dummies for all categories”), I get the following output, which differs for the no-constant regression, and is identical but with flipped signs on dummies for the regression with a constant.

tabulate contigArg, generate(marg)

  contigArg |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        166       97.08       97.08
          1 |          5        2.92      100.00
------------+-----------------------------------
      Total |        171      100.00

. tabulate comlangArg, generate(narg)

 comlangArg |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        151       88.30       88.30
          1 |         20       11.70      100.00
------------+-----------------------------------
      Total |        171      100.00

. xi: ppml soyaArg2008 gdp2008 distArg marg1 narg1 gmoindex, noconstant

note: checking the existence of the estimates
WARNING: soyaArg2008 has very large values, consider rescaling
WARNING: gdp2008 has very large values, consider rescaling or recentering

Number of regressors excluded to ensure that the estimates exist: 0
Number of observations excluded: 0

note: starting ppml estimation
note: soyaArg2008 has noninteger values

Iteration 1: deviance = 4.02e+07
Iteration 2: deviance = 2.46e+07
Iteration 3: deviance = 2.05e+07
Iteration 4: deviance = 1.97e+07
Iteration 5: deviance = 1.96e+07
Iteration 6: deviance = 1.96e+07
Iteration 7: deviance = 1.96e+07
Iteration 8: deviance = 1.96e+07

Number of parameters: 5
Number of observations: 58
Pseudo log-likelihood: -9824457.9
R-squared: .07352513
Option strict is: off
------------------------------------------------------------------------------
             |               Semirobust
 soyaArg2008 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gdp2008 |   .5483715   .1512694     3.63   0.000      .251889     .844854
     distArg |   .1305202   .4223417     0.31   0.757    -.6972542    .9582946
       marg1 |  -.7637511   1.319131    -0.58   0.563    -3.349201    1.821699
       narg1 |   .5504681   .5499231     1.00   0.317    -.5273615    1.628298
    gmoindex |  -1.809164   .8919584    -2.03   0.043    -3.557371   -.0609579
------------------------------------------------------------------------------

. xi: ppml soyaArg2008 gdp2008 distArg marg1 narg1 gmoindex

note: checking the existence of the estimates
WARNING: soyaArg2008 has very large values, consider rescaling
WARNING: gdp2008 has very large values, consider rescaling or recentering

Number of regressors excluded to ensure that the estimates exist: 0
Number of observations excluded: 0

note: starting ppml estimation
note: soyaArg2008 has noninteger values

Iteration 1: deviance = 1.35e+07
Iteration 2: deviance = 7885809
Iteration 3: deviance = 5257384
Iteration 4: deviance = 4276665
Iteration 5: deviance = 4098061
Iteration 6: deviance = 4089311
Iteration 7: deviance = 4089280
Iteration 8: deviance = 4089280
Iteration 9: deviance = 4089280

Number of parameters: 6
Number of observations: 58
Pseudo log-likelihood: -2044797.2
R-squared: .98220469
Option strict is: off
WARNING: The model appears to overfit some observations with soyaArg2008=0
------------------------------------------------------------------------------
             |               Semirobust
 soyaArg2008 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     gdp2008 |    2.28884   .4709613     4.86   0.000     1.365773    3.211908
     distArg |   9.712658   1.676844     5.79   0.000     6.426104    12.99921
       marg1 |  -19.45055   4.171823    -4.66   0.000    -27.62717   -11.27393
       narg1 |  -5.550667   1.874701    -2.96   0.003    -9.225013    -1.87632
    gmoindex |  -10.93835   3.844079    -2.85   0.004     -18.4726   -3.404092
       _cons |  -101.2485   17.84926    -5.67   0.000    -136.2324   -66.26462
------------------------------------------------------------------------------

Thus, I am rather confused about the effect of the constant and definition of dummy variables on my result. I apologize that this post is so long and non-technical, but I’m struggling with what to do about such wide variation in my results.

Again, thank you for any response.

Best regards,

Erik
Joao Santos Silva

Join Date: Apr 2014

Posts: 3011
#4

26 Apr 2016, 13:07

Dear Erik,

I can see you are confused ;-)

What you have to do is to estimate the model with dummies for all categories and no constant. For example, if one of your dummies was for gender you should include both the dummy for males and the dummy for females, rather than excluding one of them and therefore defining the base category.

So, generate all the dummies you need (without excluding the base category) and estimate the model without a constant but including all the dummies (do not use the xi prefix). The results you get should be the ones you want. Please post them here so that we can compare them with the others you got, OK?
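
To see how the two parameterizations relate, here is a sketch on simulated data. It uses Stata's built-in poisson with vce(robust) as a stand-in for ppml; that substitution is an assumption made so the example runs without user-written commands, and the two should give the same point estimates whenever the estimates exist. All variable names and data-generating values are hypothetical. Note that only one complete set of dummies can take the place of the constant, because a second complete set would be collinear with the first.

```stata
* Sketch on simulated data; names and values are hypothetical.
clear
set seed 12345
set obs 2000
generate double x = rnormal()
generate byte contig  = runiform() < 0.2
generate byte comlang = runiform() < 0.3
generate double y = exp(1 + 0.5*x + 0.8*contig + 0.3*comlang) ///
    * exp(rnormal(-0.125, 0.5))

* (a) Usual parameterization: constant plus one dummy per variable
poisson y x contig comlang, vce(robust)
scalar b_cons   = _b[_cons]
scalar b_contig = _b[contig]

* (b) Dummies for all categories of contig, no constant
tabulate contig, generate(c)      // c1 = (contig==0), c2 = (contig==1)
poisson y x c1 c2 comlang, noconstant vce(robust)

* c1 takes the place of the constant; c2 - c1 is the contiguity effect
display "c1 minus constant in (a):        " _b[c1] - b_cons
display "(c2 - c1) minus contig in (a):   " (_b[c2] - _b[c1]) - b_contig
```

Both displayed differences should be zero up to the convergence tolerance, which indicates that the two specifications describe the same fitted model and differ only in how the intercept is expressed.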

Best wishes,

Joao
Erik Katovich

Join Date: Apr 2016

Posts: 3
#5

08 May 2016, 15:26

Dear Joao,

Thank you again for your response. As you suggested, I re-estimated my model with all dummies and without a constant term. I’ve posted an example of the results below (1). These results look better than what I was getting before! And I actually got identical results by again avoiding the xi prefix, but using only the default dummy variables and including a constant (2, below).

I do still have one concern. I previously estimated the same model (with different variables) using OLS and Tobit estimators, and each of these estimators produced results comparable to the other (3 and 4, below). Now that I estimate the model with PPML, however, my results change slightly (5 and 6, below). Specifically, my focus variable (in this case, called Labeling) goes from being insignificant under OLS and Tobit to significantly positive under PPML…

Question: Is this simply because of PPML’s alternative way of estimating the model, or is there still something going wrong in my PPML code?

Alternatively, could it be that the warnings of “dep. var. has very large values, consider rescaling” and “gdp has very large values, consider rescaling or recentering” are highlighting the existence of some outlier that alters the PPML results? There are a handful of trade values and gdp values in my data (China, for instance) that could be considered outliers. Note, though, that I have already scaled all dollar values into thousands of dollars.

I look forward to your thoughts! And sincerely, thank you for your patience.

Best regards,
Erik

1. SUGGESTED METHOD, WITH ALL DUMMY VARIABLES INCLUDED, NO CONSTANT:
ppml soyaArg gdp distArg contigArgentina2 contigArgentina1 comlangArgentina2 comlangArgentina1 gmoindex, noconstant

note: checking the existence of the estimates
WARNING: soyaArg has very large values, consider rescaling
WARNING: gdp has very large values, consider rescaling or recentering

Number of regressors excluded to ensure that the estimates exist: 0
Number of observations excluded: 0

note: comlangArgentina1 omitted because of collinearity

note: starting ppml estimation
note: soyaArg has noninteger values

Iteration 1: deviance = 3.03e+07
Iteration 2: deviance = 1.85e+07
Iteration 3: deviance = 1.41e+07
Iteration 4: deviance = 1.28e+07
Iteration 5: deviance = 1.27e+07
Iteration 6: deviance = 1.27e+07
Iteration 7: deviance = 1.27e+07
Iteration 8: deviance = 1.27e+07
Iteration 9: deviance = 1.27e+07

Number of parameters: 6
Number of observations: 174
Pseudo log-likelihood: -6330420.1
R-squared: .71772131
Option strict is: off
-----------------------------------------------------------------------------------
                  |               Semirobust
       defsoyaArg |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
              gdp |   1.827792   .2068734     8.84   0.000     1.422327    2.233256
          distArg |   7.212482   .9982246     7.23   0.000     5.255998    9.168967
 contigArgentina2 |  -79.94674   10.18034    -7.85   0.000    -99.89984   -59.99364
 contigArgentina1 |  -93.93674   12.19166    -7.70   0.000     -117.832   -70.04153
comlangArgentina2 |   4.975368   .8643526     5.76   0.000     3.281268    6.669468
         gmoindex |  -8.602154   1.816455    -4.74   0.000    -12.16234   -5.041968
-----------------------------------------------------------------------------------

2. ALTERNATIVE METHOD WITH DEFAULT VARIABLES AND CONSTANT (SAME RESULTS)
ppml soyaArg gdp distArg contigArg comlangArg gmoindex

note: checking the existence of the estimates
WARNING: soyaArg has very large values, consider rescaling
WARNING: gdp has very large values, consider rescaling or recentering

Number of regressors excluded to ensure that the estimates exist: 0
Number of observations excluded: 0

note: starting ppml estimation
note: soyaArg has noninteger values

Iteration 1: deviance = 3.03e+07
Iteration 2: deviance = 1.85e+07
Iteration 3: deviance = 1.41e+07
Iteration 4: deviance = 1.28e+07
Iteration 5: deviance = 1.27e+07
Iteration 6: deviance = 1.27e+07
Iteration 7: deviance = 1.27e+07
Iteration 8: deviance = 1.27e+07
Iteration 9: deviance = 1.27e+07

Number of parameters: 6
Number of observations: 174
Pseudo log-likelihood: -6330420.1
R-squared: .71772131
Option strict is: off
------------------------------------------------------------------------------
             |               Semirobust
     soyaArg |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gdp |   1.827792   .2068734     8.84   0.000     1.422327    2.233256
     distArg |   7.212482   .9982246     7.23   0.000     5.255998    9.168967
   contigArg |      13.99   2.315775     6.04   0.000     9.451169    18.52884
  comlangArg |   4.975368   .8643526     5.76   0.000     3.281268    6.669468
    gmoindex |  -8.602154   1.816455    -4.74   0.000    -12.16234   -5.041968
       _cons |  -93.93674   12.19166    -7.70   0.000     -117.832   -70.04153
------------------------------------------------------------------------------

3. OLS RESULTS (THIS TIME OF A DIFFERENT MODEL WITH FOCUS VARIABLE “LABELING”)
regress soyaBra gdp distBra i.contigBra i.comlangBra Labeling, robust

Linear regression                               Number of obs     =        174
                                                F(5, 168)         =      70.19
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2891
                                                Root MSE          =      4.773

------------------------------------------------------------------------------
             |               Robust
     soyaBra |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gdp |   1.759846    .231146     7.61   0.000     1.303521    2.216171
     distBra |   1.510898     1.1257     1.34   0.181    -.7114426    3.733239
 1.contigBra |  -.1928695   1.814515    -0.11   0.915    -3.775057    3.389318
1.comlangBra |   7.658716   .6122588    12.51   0.000     6.450003    8.867428
    Labeling |    .162902   1.164298     0.14   0.889    -2.135637    2.461441
       _cons |  -43.44134    9.80927    -4.43   0.000    -62.80666   -24.07603
------------------------------------------------------------------------------

4. TOBIT RESULTS
tobit soyaBra gdp distBra i.contigBra i.comlangBra Labeling, ll(0)

Tobit regression                                Number of obs     =        174
                                                LR chi2(5)        =      58.39
                                                Prob > chi2       =     0.0000
Log likelihood = -369.79639                     Pseudo R2         =     0.0732

------------------------------------------------------------------------------
     soyaBra |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gdp |   3.390641   .5126487     6.61   0.000     2.378621    4.402661
     distBra |   3.839523   2.170289     1.77   0.079     -.444846    8.123893
 1.contigBra |   2.505788   3.843805     0.65   0.515    -5.082269    10.09385
1.comlangBra |   12.74682   4.627751     2.75   0.007     3.611175    21.88247
    Labeling |  -1.256573   2.427927    -0.52   0.605    -6.049544    3.536398
       _cons |  -99.65792   22.31814    -4.47   0.000    -143.7162   -55.59968
-------------+----------------------------------------------------------------
      /sigma |   7.707858    .642088                      6.440312    8.975405
------------------------------------------------------------------------------
  Obs. summary:         83  left-censored observations at soyaBra<=0
                        91     uncensored observations
                         0 right-censored observations

5. PPML RESULTS
*Creating dummy variables*
.
. tabulate contigBra, generate(contigBrazil)

  contigBra |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        162       93.10       93.10
          1 |         12        6.90      100.00
------------+-----------------------------------
      Total |        174      100.00

. tabulate comlangBra, generate(comlangBrazil)

 comlangBra |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        171       98.28       98.28
          1 |          3        1.72      100.00
------------+-----------------------------------
      Total |        174      100.00

ppml soyaBra gdp distBra contigBrazil2 contigBrazil1 comlangBrazil2 comlangBrazil1 Labeling, noconstant

note: checking the existence of the estimates
WARNING: soyaBra has very large values, consider rescaling
WARNING: gdp has very large values, consider rescaling or recentering

Number of regressors excluded to ensure that the estimates exist: 1
Excluded regressors: comlangBrazil1
Number of observations excluded: 0

note: starting ppml estimation
note: soyaBra has noninteger values

Iteration 1: deviance = 8.89e+07
Iteration 2: deviance = 6.12e+07
Iteration 3: deviance = 5.38e+07
Iteration 4: deviance = 5.27e+07
Iteration 5: deviance = 5.27e+07
Iteration 6: deviance = 5.27e+07
Iteration 7: deviance = 5.27e+07
Iteration 8: deviance = 5.27e+07
Iteration 9: deviance = 5.27e+07

Number of parameters: 6
Number of observations: 174
Pseudo log-likelihood: -26332907
R-squared: .87513519
Option strict is: off
--------------------------------------------------------------------------------
               |               Semirobust
       soyaBra |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
           gdp |   1.280747   .1877322     6.82   0.000     .9127989    1.648695
       distBra |   2.041035   .6487462     3.15   0.002     .7695162    3.312554
 contigBrazil2 |  -36.75786   5.016741    -7.33   0.000    -46.59049   -26.92523
 contigBrazil1 |  -37.78452   5.725895    -6.60   0.000    -49.00707   -26.56197
comlangBrazil2 |   2.331063   .5565331     4.19   0.000     1.240278    3.421847
      Labeling |   6.048552   1.329141     4.55   0.000     3.443483     8.65362
--------------------------------------------------------------------------------

6. ALTERNATIVE PPML MODEL
ppml soyaBra gdp distBra contigBra comlangBra Labeling

note: checking the existence of the estimates
WARNING: soyaBra has very large values, consider rescaling
WARNING: gdp has very large values, consider rescaling or recentering

Number of regressors excluded to ensure that the estimates exist: 0
Number of observations excluded: 0

note: starting ppml estimation
note: soyaBra has noninteger values

Iteration 1: deviance = 8.89e+07
Iteration 2: deviance = 6.12e+07
Iteration 3: deviance = 5.38e+07
Iteration 4: deviance = 5.27e+07
Iteration 5: deviance = 5.27e+07
Iteration 6: deviance = 5.27e+07
Iteration 7: deviance = 5.27e+07
Iteration 8: deviance = 5.27e+07
Iteration 9: deviance = 5.27e+07

Number of parameters: 6
Number of observations: 174
Pseudo log-likelihood: -26332907
R-squared: .87513519
Option strict is: off
------------------------------------------------------------------------------
             |               Semirobust
     soyaBra |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         gdp |   1.280747   .1877322     6.82   0.000     .9127989    1.648695
     distBra |   2.041035   .6487462     3.15   0.002     .7695162    3.312554
   contigBra |   1.026663   1.058568     0.97   0.332    -1.048092    3.101418
  comlangBra |   2.331063   .5565331     4.19   0.000     1.240278    3.421847
    Labeling |   6.048552   1.329141     4.55   0.000     3.443483     8.65362
       _cons |  -37.78452   5.725895    -6.60   0.000    -49.00707   -26.56197
------------------------------------------------------------------------------
Joao Santos Silva

Join Date: Apr 2014

Posts: 3011
#6

08 May 2016, 16:12

Dear Erik,

It is not surprising that PPML leads to different results; that is why it is important to use it ;-)

Anyway, I note that when you estimate by OLS and by Tobit, you are not using the dependent variable in logs. So, in the models estimated by OLS and Tobit you have additive effects, but in the Poisson regression you have multiplicative effects. That makes the models difficult to compare.
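
A sketch of this distinction on simulated data, assuming a multiplicative data-generating process with hypothetical variables y and x. OLS in levels estimates an additive slope measured in units of y, whereas the Poisson regression and OLS on ln(y) estimate the proportional effect of x. Built-in poisson with vce(robust) stands in for ppml here, which is an assumption made to keep the example free of user-written commands.

```stata
* Sketch on simulated data; names and values are hypothetical.
clear
set seed 4321
set obs 5000
generate double x = rnormal()
* Multiplicative model: E[y|x] = exp(1 + 0.5*x)
generate double y   = exp(1 + 0.5*x) * exp(rnormal(-0.125, 0.5))
generate double lny = ln(y)

regress y x, vce(robust)       // additive: change in y per unit of x
scalar b_levels = _b[x]

regress lny x, vce(robust)     // log-linear: proportional change in y
scalar b_logols = _b[x]

poisson y x, vce(robust)       // exponential mean: proportional change in E[y|x]
scalar b_pois = _b[x]

display "OLS in levels: " b_levels
display "OLS in logs:   " b_logols
display "Poisson:       " b_pois
```

In this simulation the log-linear OLS and Poisson slopes are comparable only because y is strictly positive and the error is independent of x; the levels regression answers a different question.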

As for the warnings, they are not an indication of the presence of outliers. They are there because Stata sometimes has trouble handling variables with very large values. If you rescale everything into millions of dollars you may find that convergence is quicker, but the main results should not change.
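
The scale invariance can be checked directly. In an exponential-mean model, dividing the dependent variable by 1,000 shifts only the constant by ln(1000) and leaves the slope unchanged. A sketch with simulated data and hypothetical variable names, again using built-in poisson with vce(robust) in place of ppml:

```stata
* Sketch on simulated data; names and values are hypothetical.
clear
set seed 789
set obs 1000
generate double x = rnormal()
generate double y_thousands = exp(8 + 0.5*x) * exp(rnormal(-0.125, 0.5))
generate double y_millions  = y_thousands / 1000

poisson y_thousands x, vce(robust)
scalar slope_k = _b[x]
scalar cons_k  = _b[_cons]

poisson y_millions x, vce(robust)
display "difference in slopes:    " _b[x] - slope_k
display "difference in constants: " cons_k - _b[_cons]
display "ln(1000):                " ln(1000)
```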

All the best,

Joao
Killian Foubert

Join Date: Jan 2017

Posts: 44
#7

06 May 2017, 05:59

Originally posted by Joao Santos Silva

Erik,

The standard errors are robust; "semi-robust" is just the label Stata uses. Having said that, it is customary to cluster by country pair.

Your estimation results suggest that you are indeed estimating a model in which some coefficients are not identified. Please estimate the model with dummies for all categories and drop the constant.

Finally, notice that your sample is very small.

Best wishes,

Joao

Dear Joao,

I would also like to use robust standard errors in a PPML regression. Are you suggesting in your message that they are already included by default in the ppml command?

Thank you,
Killian

Last edited by Killian Foubert; 06 May 2017, 06:04.
Joao Santos Silva

Join Date: Apr 2014

Posts: 3011
#8

06 May 2017, 06:54

Indeed, by default -ppml- gives robust standard errors (but not clustered); if you want to cluster, please use the appropriate option.
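
For reference, a sketch of the clustering option on simulated data. It assumes ppml is installed from SSC and that a pair identifier exists. The names pair_id, x, and y are hypothetical, and the panel structure (100 pairs observed three times each) is invented for illustration.

```stata
* Assumes the user-written command is installed: ssc install ppml
* Sketch on simulated data; names and values are hypothetical.
clear
set seed 2017
set obs 300
generate int pair_id = ceil(_n/3)     // 100 hypothetical pairs, 3 observations each
generate double x = rnormal()
generate double y = exp(1 + 0.5*x) * exp(rnormal(-0.125, 0.5))

ppml y x                       // default: robust standard errors
ppml y x, cluster(pair_id)     // standard errors clustered by pair
```

With a single observation per pair, as in the cross-section at the top of this thread, each cluster contains one observation, so clustering by pair should coincide with the default robust standard errors.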

Best wishes,

Joao
Noemi Seng

Join Date: Jan 2024

Posts: 90
#9

24 Jul 2024, 06:22

Dear Joao Santos Silva,

I'm doing a ppmlhdfe regression on FDI data on a country-pair-sector level. Can you tell me whether it would then make sense to cluster my standard errors on country-pair-sector-level or on country pair level? Do I have to do any test to find out which would be appropriate? Thank you very much in advance!

Best wishes
Noemi
Joao Santos Silva

Join Date: Apr 2014

Posts: 3011
#10

24 Jul 2024, 06:52

Dear Noemi Seng,

If you have enough pairs, I would cluster at the country-pair level to account for the fact that sectors in the same pair may be correlated.

Best wishes,

Joao
Noemi Seng

Join Date: Jan 2024

Posts: 90
#11

24 Jul 2024, 07:04

Dear Joao Santos Silva,

thank you so much for your quick reply. When clustering on country-pair level, I have 4,079 clusters (271,156 observations); when clustering on country-pair-sector level, I have 25,188 clusters. Would you say those 4,079 clusters point to having enough pairs? codebook country_pair reveals that I have 4,192 country pairs (I'm not quite sure why this results in only 4,079 and not 4,192 clusters).

I appreciate your answer so much.

Best
Noemi
Joao Santos Silva

Join Date: Apr 2014

Posts: 3011
#12

24 Jul 2024, 07:45

Yes, that should be fine.
Noemi Seng

Join Date: Jan 2024

Posts: 90
#13

25 Jul 2024, 04:42

Dear Joao Santos Silva

thank you very much. Would you also cluster the standard errors at the country-pair level when the data are aggregated to the country-pair-year level? (I'm doing a robustness check where I aggregate the FDI data over sectors per country pair.) In general, we cluster the standard errors to account for groupwise heteroskedasticity, right? But is there a test for this kind of heteroskedasticity that works with ppmlhdfe? I'm only aware of the xttest3 (modified Wald test) command for panel data.

I really appreciate your advice.

Best,
Noemi
Joao Santos Silva

Join Date: Apr 2014

Posts: 3011
#14

25 Jul 2024, 04:51

Dear Noemi Seng,

I would cluster at the country-pair level, not country-pair-year nor country-pair-sector. We cluster to account for serial correlation, not heteroskedasticity.
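
A sketch of what pair-level clustering looks like with ppmlhdfe, assuming it is installed from SSC together with its dependencies reghdfe and ftools. The data are simulated and every name (source, host, sector, year, x, fdi, pair_id) is hypothetical. The fixed effects in absorb() are placeholders only, not a recommendation.

```stata
* Assumes: ssc install ftools, ssc install reghdfe, ssc install ppmlhdfe
* Sketch on simulated data; names and values are hypothetical.
clear
set seed 2024
set obs 20                                  // 20 hypothetical source countries
generate source = _n
expand 20
bysort source: generate host = _n
drop if source == host                      // 380 directed country pairs
expand 3
bysort source host: generate sector = _n    // 3 sectors per pair
expand 4
bysort source host sector: generate year = 2010 + _n
generate double x   = rnormal()
generate double fdi = exp(0.5*x) * exp(rnormal(-0.125, 0.5))

* One identifier per country pair, shared across sectors and years
egen pair_id = group(source host)

ppmlhdfe fdi x, absorb(source host year) vce(cluster pair_id)
```

Because pair_id is constant across sectors and years within a pair, the same clustering variable can be used whether the data are at the pair-sector-year level or aggregated to the pair-year level.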

Best wishes,

Joao
Noemi Seng

Join Date: Jan 2024

Posts: 90
#15

26 Jul 2024, 06:51

Dear Joao Santos Silva

thank you for the clarification. May I also ask you: in the country-pair-sector-level regression with standard errors clustered at country-pair-level, which fixed effects would you include? I included time-FE, source country and host country FE, no source*time, host*time FE as they would remove too much variation in my data set.

I would appreciate your thoughts on that.

Best
Noemi