PPML estimation - regressors excluded

Ruken Kirkan

Join Date: May 2020

Posts: 18
#1

24 Jul 2020, 14:27

Hello everyone,
I was hoping someone could help me understand what is happening in my estimations. I am writing my final undergraduate thesis and I am really lost.
I am using the PPML estimation method with country-year fixed effects. In my project I am trying to estimate the effect of an epidemic on a country's trade with other countries.

My model looks like this:
ppml export loggdpi loggdpj logdist contig comlang_off colony comcur gatt_i gatt_j fta_hmr ebola_only_i ebola_only_j ebola_both i_year* imp_time_fe* exp_time_fe*

where exp_time_fe and imp_time_fe are country-year fixed effects, and i_year are time dummies.
ebola_only_i is a dummy that takes the value 1 if only the origin country is infected with ebola, and ebola_only_j takes the value 1 if only the destination country is infected with ebola.

In my dataset I have 6 origin countries and 20 destination countries. And I have data for 19 years.

The results after I run my regression look like this:

. ppml export loggdpi loggdpj logdist contig comlang_off colony comcur gatt_i gatt_j fta_hmr
> ebola_only_i ebola_only_j ebola_both i_year* imp_time_fe* exp_time_fe*

note: checking the existence of the estimates

Number of regressors excluded to ensure that the estimates exist: 24
Excluded regressors: ebola_only_j ebola_both imp_time_fe159 imp_time_fe165 imp_time_fe248 i
> mp_time_fe249 imp_time_fe250 imp_time_fe251 imp_time_fe253 imp_time_fe263 imp_time_fe362 i
> mp_time_fe363 imp_time_fe364 imp_time_fe365 imp_time_fe366 imp_time_fe367 imp_time_fe368 i
> mp_time_fe369 imp_time_fe370 imp_time_fe376 imp_time_fe377 imp_time_fe378 imp_time_fe379 i
> mp_time_fe380
Number of observations excluded: 88

note: i_year1 omitted because of collinearity
note: imp_time_fe39 omitted because of collinearity
note: imp_time_fe57 omitted because of collinearity
note: imp_time_fe89 omitted because of collinearity
note: imp_time_fe153 omitted because of collinearity
note: imp_time_fe156 omitted because of collinearity
note: imp_time_fe160 omitted because of collinearity
note: imp_time_fe167 omitted because of collinearity
note: imp_time_fe169 omitted because of collinearity
note: imp_time_fe182 omitted because of collinearity
note: imp_time_fe185 omitted because of collinearity
note: imp_time_fe237 omitted because of collinearity
......(there are more omitted)

note: starting ppml estimation
note: export has noninteger values

Iteration 1: deviance = 173723.7
Iteration 2: deviance = 116180
Iteration 3: deviance = 103452.9
Iteration 4: deviance = 100695.1
Iteration 5: deviance = 100123.7
Iteration 6: deviance = 99994.54
Iteration 7: deviance = 99962.67
Iteration 8: deviance = 99953.85
Iteration 9: deviance = 99951.34
Iteration 10: deviance = 99950.61
Iteration 11: deviance = 99950.41
Iteration 12: deviance = 99950.36
Iteration 13: deviance = 99950.35
Iteration 14: deviance = 99950.34
Iteration 15: deviance = 99950.34
Iteration 16: deviance = 99950.34
Iteration 17: deviance = 99950.34

Number of parameters: 477
Number of observations: 2230
Pseudo log-likelihood: -52954.354
R-squared: .83270275
Option strict is: off
--------------------------------------------------------------------------------
               |               Robust
        export |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
       loggdpi |   .9721215   .2282221     4.26   0.000     .5248144    1.419429
       loggdpj |   1.062342   .0774175    13.72   0.000     .9106067    1.214078
       logdist |  -2.231952   .4825553    -4.63   0.000    -3.177743   -1.286161
        contig |   1.129319   .6018557     1.88   0.061    -.0502963    2.308935
   comlang_off |   .2631047   .1177515     2.23   0.025      .032316    .4938934
        colony |   .7595913   .2081103     3.65   0.000     .3517025     1.16748
        comcur |   .6996079   .1893037     3.70   0.000     .3285794    1.070636
        gatt_i |  -1.037189   .6142318    -1.69   0.091    -2.241061    .1666836
        gatt_j |  -4.393113   .7459926    -5.89   0.000    -5.855232   -2.930995
       fta_hmr |  -.1689226   1.355726    -0.12   0.901    -2.826097    2.488252
  ebola_only_i |   .1778964   .5403693     0.33   0.742     -.881208    1.237001
       i_year2 |   -1.66684   1.271552    -1.31   0.190    -4.159036    .8253567
       i_year3 |  -3.285794   1.281308    -2.56   0.010    -5.797111   -.7744776
       i_year4 |   1.886948   .8058063     2.34   0.019     .3075963    3.466299
       i_year5 |   2.266468   1.313945     1.72   0.085    -.3088165    4.841752
       i_year6 |   3.357903   1.303507     2.58   0.010     .8030753     5.91273

1) Can you tell me why my variables ebola_only_j ebola_both are omitted as I cannot see why myself?
2) Why are all the other variables omitted as well?

I hope anyone can tell me, as I am really confused about this. 😊
Tom Zylkin

Join Date: Nov 2016

Posts: 188
#2

25 Jul 2020, 18:03

Dear Ruken,
First, this is only indirectly related to the issue you raise but one thing not many people know about Stata syntax is that it drops perfectly collinear variables from right to left. Thus you should always put any fixed effects to the left of the variables you care about when you are specifying the fixed effects as dummies. Otherwise, an easy mistake to make is to report an estimate for a coefficient that is only identified when a fixed effect dummy is dropped. If you check where it says "note: i_year1 omitted because of collinearity", there are many fixed effect dummies being dropped here.
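
To illustrate the ordering advice, here is a sketch of the command from post #1 with the dummy sets moved to the left of the regressors of interest. The variable names are the ones from that post; nothing else in the specification is changed, and the `///` line continuations assume the command is run from a do-file.

```stata
* Sketch only: same ppml specification as in post #1, but with the
* fixed-effect dummies listed before the variables of interest
ppml export i_year* imp_time_fe* exp_time_fe* ///
    loggdpi loggdpj logdist contig comlang_off colony comcur ///
    gatt_i gatt_j fta_hmr ebola_only_i ebola_only_j ebola_both
```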

Alternatively, there are better ways to estimate models with fixed effects in some cases. In your particular case, you could consider using the ppmlhdfe command, which does not require creating a unique dummy variable for each fixed effect and is thus usually faster to compute.

To get back to your question, what ppml appears to be telling you is that you have variables in your model that perfectly predict a zero. It looks like most of the variables listed are dummy variables. So I would guess that when these dummies are equal to 1, "export" is always equal to 0. The same may be true for ebola_only_j if it is a dummy variable. Keep in mind that for ppml to perfectly predict a zero, it must be that one or more coefficients technically need to be equal to either infinity or negative infinity. This is what the output means when it says regressors have been dropped to ensure existence. There are papers by Santos Silva and Tenreyro and Correia, Guimaraes, and Zylkin that explain the issue in more detail.
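
One way to check for this pattern is to count the observations where a dummy equals 1 and the dependent variable is still positive. The sketch below uses a small made-up dataset (the values are illustrative, not taken from the thread); on real data only the `count` line would be needed.

```stata
* Toy data (illustrative values): the dummy equals 1 only in rows
* where export is zero
clear
input export ebola_only_j
120.5 0
0     0
33.2  0
0     1
0     1
0     1
end

* Count observations with positive exports while the dummy equals 1
count if ebola_only_j == 1 & export > 0

* A count of 0 (as here, by construction) means the dummy is "on" only
* where export is zero, the situation in which ppml excludes the regressor
display r(N)
```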

Regards,
Tom
Ruken Kirkan

Join Date: May 2020

Posts: 18
#3

26 Jul 2020, 06:06

Dear Tom,
Thank you for your response. I appreciate it very much.

When using ppmlhdfe, should I then not include all my fixed effects variables? I tried this, but the ebola dummies were omitted again.

I also tried to estimate the ppml model again with fewer dummy variables, and the ebola variables were omitted again. If I understand this correctly, does that mean these ebola variables are omitted because they do not tell us anything in practice? That is, the dummies take the value 1 only where our dependent variable is zero, so the dummy doesn't predict anything, and when the dummy doesn't predict anything it is dropped?

Best regards,
Ruken
Ruken Kirkan

Join Date: May 2020

Posts: 18
#4

26 Jul 2020, 07:17

Dear Tom,
I tried using the ppmlhdfe technique and this is what happened:

. ppmlhdfe export loggdpi loggdpj logdist contig comlang_off gatt_i gatt_j fta_hmr ebola_only_i ebola_only_j ebola_both, absorb(exp_time_fe*, imp_time_fe*)

(warning: absorbing 114 dimensions of fixed effects; check that you really want that)

_assert_abort(): 3498 Invalid options: imp_time_fe *

assert_msg(): - function returned error

GLM::init_fixed_effects(): - function returned error

<istmt>: - function returned error

r(3498);

Is there something I am doing wrong in my estimations? I have looked through forums but can't quite figure out what I am doing wrong.

Best regards,
Ruken
Tom Zylkin

Join Date: Nov 2016

Posts: 188
#5

26 Jul 2020, 09:31

Dear Ruken,

As it says in the output you should not be absorbing 100+ different variables. Instead, you should only be passing 2 variables to absorb: one with unique IDs for each exporter-time combination, and one with unique IDs for each importer-time combination. One way to do this is demonstrated in the example .do file provided here, which also demonstrates the correct syntax for ppmlhdfe.

As for why ebola_only_j is dropped, if export always = 0 when ebola_only_j = 1, then ebola_only_j perfectly predicts a zero. One other thing I notice, though, is that ebola_only_j is indexed only by j. If it does not vary by both i and j, it cannot be identified in the presence of exporter-time (it) and importer-time (jt) fixed effects, since these absorb all i- and j-specific predictors of trade.
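
A quick way to see whether a variable varies only at the importer-year level is to compute its standard deviation within each importer-year cell. This is a minimal sketch on made-up data; `imp_id`, `exp_id`, `year`, and `ebola_j` are hypothetical names chosen for the example.

```stata
* Toy data: two importers, two exporters, one year; ebola_j depends
* only on the importer and the year
clear
input imp_id year exp_id ebola_j
1 2014 1 1
1 2014 2 1
2 2014 1 0
2 2014 2 0
end

* Standard deviation of the dummy within each importer-year cell
bysort imp_id year: egen sd_within = sd(ebola_j)
summarize sd_within

* If the maximum of sd_within is 0, the dummy is constant inside every
* importer-year cell and is absorbed by importer-year fixed effects
```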

Regards,
Tom
Ruken Kirkan

Join Date: May 2020

Posts: 18
#6

26 Jul 2020, 10:47

Dear Tom,
Again I would like to thank you for your help.

I tried to use the example you provided and there seems to be something wrong with my estimations.

. qui ppmlhdfe export loggdpi loggdpj logdist contig comlang_off gatt_i gatt_j fta_hmr ebola

> _only_i ebola_only_j ebola_both, a(country_i#year country_j#year) d

country_i: string variables may not be used as factor variables

stata(): 3598 Stata returned error

fixed_effects(): - function returned error

GLM::init_fixed_effects(): - function returned error

<istmt>: - function returned error

r(3598);

It says that I cannot use country_i because it is a string variable, but when I instead try to include my imp-year and exp-year fixed effects in the a(.) it says that: imp-year ambiguous abbreviation. What am I doing wrong?

Your help is greatly appreciated, thank you.

Best regards,
Ruken
Tom Zylkin

Join Date: Nov 2016

Posts: 188
#7

26 Jul 2020, 10:52

Dear Ruken,
This will work if you create non-string variables that uniquely identify your exporter and importer. E.g., use "egen exp_id = group(country_i)" for the exporter. Then you can do the same for the importer.
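
Put together, the steps might look like the sketch below. It assumes the dataset from this thread is in memory, that `country_i` and `country_j` are the string country identifiers and `year` is numeric, and that ppmlhdfe and the packages it depends on are already installed.

```stata
* Numeric IDs for exporter and importer, built from the string variables
egen exp_id = group(country_i)
egen imp_id = group(country_j)

* Same regressors as in the post above, now absorbing exporter-year and
* importer-year fixed effects through the numeric IDs
ppmlhdfe export loggdpi loggdpj logdist contig comlang_off ///
    gatt_i gatt_j fta_hmr ebola_only_i ebola_only_j ebola_both, ///
    absorb(exp_id#year imp_id#year)
```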
Regards,
Tom
Ruken Kirkan

Join Date: May 2020

Posts: 18
#8

26 Jul 2020, 11:03

Dear Tom,
I did what you said, but the result appears as a variable in my dataset. I would like to show the coefficients in a table to see what effects the different variables have on my dependent variable. How can I show the results in a table?

Thank you,
Regards,
Ruken
Tom Zylkin

Join Date: Nov 2016

Posts: 188
#9

26 Jul 2020, 11:06

Hi Ruken,
Not sure what you mean... there should be output shown after you run the ppmlhdfe command, with estimates, standard errors, p-values, etc. If you want to create a regression table that can be exported to a word processing program, there are other packages you can check out for this, such as "estout".
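
As a minimal sketch of how such a table could be produced with the estout package, the example below uses Stata's built-in auto dataset and a Poisson regression with robust standard errors as a stand-in model; the file name `results.rtf` is arbitrary. With the thread's data, the `poisson` line would be replaced by the ppmlhdfe command.

```stata
* One-time install of the package that provides eststo and esttab
* ssc install estout

* Stand-in model on a built-in dataset
sysuse auto, clear
poisson price mpg weight, vce(robust)

* Store the estimates, show a table on screen, then write it to a file
eststo m1
esttab m1, se
esttab m1 using "results.rtf", se replace
```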

Regards,
Tom
Ruken Kirkan

Join Date: May 2020

Posts: 18
#10

26 Jul 2020, 11:16

Dear Tom,
I tried running it again, and this is what I got:

. ppmlhdfe export loggdpi loggdpj logdist contig comlang_off gatt_i gatt_j fta_hmr ebola_onl
> y_i ebola_only_j ebola_both, a(exp_id#year imp_id#year)

(dropped 88 observations that are either singletons or separated by a fixed effect)
warning: dependent variable takes very low values after standardizing (3.4567e-09)
note: 8 variables omitted because of collinearity: loggdpi loggdpj gatt_i gatt_j fta_hmr ebo
> la_only_i ebola_only_j ebola_both
Iteration 1: deviance = 1.7576e+05 eps = . iters = 6 tol = 1.0e-04 min(eta) =
> -4.59 P
Iteration 2: deviance = 1.1891e+05 eps = 4.78e-01 iters = 3 tol = 1.0e-04 min(eta) =
> -6.40
Iteration 3: deviance = 1.0636e+05 eps = 1.18e-01 iters = 4 tol = 1.0e-04 min(eta) =
> -8.77
Iteration 4: deviance = 1.0364e+05 eps = 2.62e-02 iters = 4 tol = 1.0e-04 min(eta) =
> -10.67
Iteration 5: deviance = 1.0308e+05 eps = 5.43e-03 iters = 4 tol = 1.0e-04 min(eta) =
> -11.68
Iteration 6: deviance = 1.0296e+05 eps = 1.22e-03 iters = 3 tol = 1.0e-04 min(eta) =
> -12.29
Iteration 7: deviance = 1.0293e+05 eps = 3.03e-04 iters = 2 tol = 1.0e-04 min(eta) =
> -13.54
Iteration 8: deviance = 1.0292e+05 eps = 8.38e-05 iters = 2 tol = 1.0e-04 min(eta) =
> -14.50
Iteration 9: deviance = 1.0291e+05 eps = 2.39e-05 iters = 2 tol = 1.0e-05 min(eta) =
> -15.26
Iteration 10: deviance = 1.0291e+05 eps = 6.89e-06 iters = 2 tol = 1.0e-05 min(eta) =
> -16.20 S
Iteration 11: deviance = 1.0291e+05 eps = 1.94e-06 iters = 2 tol = 1.0e-06 min(eta) =
> -17.12 S
Iteration 12: deviance = 1.0291e+05 eps = 4.97e-07 iters = 2 tol = 1.0e-06 min(eta) =
> -17.92 S
Iteration 13: deviance = 1.0291e+05 eps = 1.09e-07 iters = 2 tol = 1.0e-07 min(eta) =
> -18.81 S
Iteration 14: deviance = 1.0291e+05 eps = 1.97e-08 iters = 2 tol = 1.0e-07 min(eta) =
> -19.67 S
Iteration 15: deviance = 1.0291e+05 eps = 2.85e-09 iters = 2 tol = 1.0e-09 min(eta) =
> -20.35 S O
--------------------------------------------------------------------------------------------
> ----------------
(legend: p: exact partial-out s: exact solver h: step-halving o: epsilon below toleran
> ce)
Converged in 15 iterations and 42 HDFE sub-iterations (tol = 1.0e-08)

HDFE PPML regression                              No. of obs      =      2,230
Absorbing 2 HDFE groups                           Residual df     =      1,755
                                                  Wald chi2(3)    =      64.95
Deviance             =  102913.0865               Prob > chi2     =     0.0000
Log pseudolikelihood = -54435.72593               Pseudo R2       =     0.8426
------------------------------------------------------------------------------
             |               Robust
      export |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     loggdpi |          0  (omitted)
     loggdpj |          0  (omitted)
     logdist |  -2.077776   .5085646    -4.09   0.000    -3.074544   -1.081007
      contig |   1.312551    .633628     2.07   0.038      .070663    2.554439
 comlang_off |   .4285125   .0954937     4.49   0.000     .2413483    .6156768
      gatt_i |          0  (omitted)
      gatt_j |          0  (omitted)
     fta_hmr |          0  (omitted)
ebola_only_i |          0  (omitted)
ebola_only_j |          0  (omitted)
  ebola_both |          0  (omitted)
       _cons |   23.50971   4.356328     5.40   0.000     14.97146    32.04795
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-------------------------------------------------------+
   Absorbed FE | Categories  - Redundant  = Num. Coefs |
---------------+---------------------------------------|
   exp_id#year |       114           0         114     |
   imp_id#year |       377          19         358     |
-------------------------------------------------------+

Now most of my variables are omitted. Is there anything I am doing wrong?

Kind Regards,
Ruken
Tom Zylkin

Join Date: Nov 2016

Posts: 188
#11

26 Jul 2020, 12:05

Hi Ruken,
Your output looks mostly as expected given that you have exporter-time and importer-time fixed effects. It looks like most of the variables that are dropped are country-specific. Thus, they should be absorbed by the fixed effects and cannot be identified here. The same argument applies to "ebola_both" since, if I understand correctly, it is just a linear combination of the two country-specific ebola variables. The one exception is the fta variable, which ordinarily would be identified here. So long as that is coded correctly, the only way it would make sense for it to be dropped is if there are no fta pairs in your data.
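
A simple way to inspect how fta_hmr is coded is to tabulate it, overall and by year. This sketch assumes the thread's dataset is in memory with a numeric `year` variable.

```stata
* How many observations have an FTA, and are there missing values?
tabulate fta_hmr, missing

* Does the FTA dummy ever switch on, and in which years?
tabulate year fta_hmr
```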
Regards,
Tom
Ruken Kirkan

Join Date: May 2020

Posts: 18
#12

26 Jul 2020, 12:40

Hi Tom,
If I understand correctly, using exporter-time and importer-time fixed effects will drop all variables that are country-specific. The reason I would like to include the exporter-time and importer-time fixed effects is to account for multilateral resistance. But I see that these fixed effects will drop my variables. Therefore the effect of ebola cannot be identified here. Could this be because of the nature of my ebola variables? Or is there any other way to account for this?

Kind regards,
Ruken
Tom Zylkin

Join Date: Nov 2016

Posts: 188
#13

26 Jul 2020, 15:29

Hi Ruken,
You are correct that ordinarily it is not possible to identify country-specific variables simultaneously with multilateral resistance. One thing you can do is acquire information on internal trade flows for each country and then add a dummy that equals 1 for international trade flows (as opposed to internal flows). If you then also add an interaction between the dummy for international flows and the country-specific ebola dummy, it will tell you how much ebola affected international trade relative to internal trade.
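
A rough sketch of this idea follows. It rests on assumptions that do not hold for the data shown in this thread: that internal flows have been added as rows where exporter and importer are the same country, that `ebola_i` is a hypothetical 0/1 dummy for the exporter being affected in that year, and that `exp_id` and `imp_id` are numeric country IDs.

```stata
* Dummy for international (as opposed to internal) trade flows
generate byte intl = (country_i != country_j)

* Interaction of the international dummy with the country-specific
* ebola dummy (ebola_i is a hypothetical variable name)
generate byte intl_ebola_i = intl * ebola_i

* The interaction varies across partners within an exporter-year cell,
* so it is not absorbed by the exporter-time and importer-time effects
ppmlhdfe export intl intl_ebola_i logdist contig comlang_off, ///
    absorb(exp_id#year imp_id#year)
```

The coefficient on `intl_ebola_i` would then be read as the change in international trade relative to internal trade when the exporter is affected.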

The other option would be to relax the exporter-time and importer-time fixed effects and use country-specific controls instead.

Regards,
Tom

Ruken Kirkan

Join Date: May 2020
Posts: 18

#14

27 Jul 2020, 04:10

Hi Tom,
I tried the other option you recommended, and here are the results:

Code:

. ppmlhdfe export loggdpi loggdpj logdist contig comlang_off gatt_i gatt_j fta_hmr ebola_only_i ebola_only_j ebola_both i.exp_id i.imp_id
warning: dependent variable takes very low values after standardizing (3.5204e-09)
note: 4 variables omitted because of collinearity: ebola_only_j ebola_both 5bn.exp_id 17bn.i
> mp_id
Iteration 1:   deviance = 2.9429e+05  eps = .         iters = 1    tol = 1.0e-04  min(eta) =
>   -5.37  P   
Iteration 2:   deviance = 2.1559e+05  eps = 3.65e-01  iters = 1    tol = 1.0e-04  min(eta) =
>   -7.20      
Iteration 3:   deviance = 2.0040e+05  eps = 7.58e-02  iters = 1    tol = 1.0e-04  min(eta) =
>   -9.17      
Iteration 4:   deviance = 1.9787e+05  eps = 1.28e-02  iters = 1    tol = 1.0e-04  min(eta) =
>  -10.50      
Iteration 5:   deviance = 1.9749e+05  eps = 1.96e-03  iters = 1    tol = 1.0e-04  min(eta) =
>  -11.45      
Iteration 6:   deviance = 1.9739e+05  eps = 4.92e-04  iters = 1    tol = 1.0e-04  min(eta) =
>  -12.45      
Iteration 7:   deviance = 1.9736e+05  eps = 1.60e-04  iters = 1    tol = 1.0e-04  min(eta) =
>  -13.41      
Iteration 8:   deviance = 1.9735e+05  eps = 4.98e-05  iters = 1    tol = 1.0e-04  min(eta) =
>  -14.30      
Iteration 9:   deviance = 1.9735e+05  eps = 1.22e-05  iters = 1    tol = 1.0e-05  min(eta) =
>  -15.03      
Iteration 10:  deviance = 1.9735e+05  eps = 1.68e-06  iters = 1    tol = 1.0e-05  min(eta) =
>  -15.47   S  
Iteration 11:  deviance = 1.9735e+05  eps = 7.18e-08  iters = 1    tol = 1.0e-06  min(eta) =
>  -15.59   S  
Iteration 12:  deviance = 1.9735e+05  eps = 2.75e-10  iters = 1    tol = 1.0e-07  min(eta) =
>  -15.60   S O
--------------------------------------------------------------------------------------------
> ----------------
(legend: p: exact partial-out   s: exact solver   h: step-halving   o: epsilon below toleran
> ce)
Converged in 12 iterations and 12 HDFE sub-iterations (tol = 1.0e-08)

PPML regression                                   No. of obs      =      2,318
                                                  Residual df     =      2,285
                                                  Wald chi2(32)   =    2726.55
Deviance             =   197345.334               Prob > chi2     =     0.0000
Log pseudolikelihood = -101651.8496               Pseudo R2       =     0.7111
------------------------------------------------------------------------------
             |               Robust
      export |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     loggdpi |   .0934371   .1455971     0.64   0.521    -.1919279    .3788022
     loggdpj |   1.824241    .285273     6.39   0.000     1.265116    2.383366
     logdist |  -1.995096   .4898565    -4.07   0.000    -2.955197   -1.034995
      contig |   1.557701   .6209295     2.51   0.012      .340702    2.774701
 comlang_off |    .353602    .122546     2.89   0.004     .1134162    .5937878
      gatt_i |  -2.532956   .3076179    -8.23   0.000    -3.135876   -1.930036
      gatt_j |  -.3391768   .4353503    -0.78   0.436    -1.192448    .5140941
     fta_hmr |   1.027313   1.792266     0.57   0.567    -2.485464    4.540089
ebola_only_i |  -.5931968   .2234592    -2.65   0.008    -1.031169   -.1552249
ebola_only_j |          0  (omitted)
  ebola_both |          0  (omitted)

ebola_only_j and ebola_both might be omitted because of collinearity, and I don't know how to solve this, but ebola_only_i has become more significant now, which is good. Thank you very much.

I have read that the R^2 value is not that important in PPML estimation. Is that also the case here? And overall, are these results good?

Again, I would like to thank you for your help Tom. I appreciate it very much.

Kind regards,
Ruken


Tom Zylkin

Join Date: Nov 2016

Posts: 188
#15

27 Jul 2020, 06:37

Hi Ruken,
It looks to me like you may be having a similar problem as before, where you are not correctly specifying the fixed effects. When you are using ppmlhdfe, the fixed effects should be absorbed using the "absorb" option. Otherwise, i.e., for other commands, fixed effects should be placed to the left of your other variables, not to the right. See the second post in this thread for an explanation.

Basically, you should use "absorb(exp_id imp_id)" instead of specifying each exporter- and importer-specific dummy individually.
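
For concreteness, the command from the previous post rewritten this way might look like the sketch below. It assumes `exp_id` and `imp_id` are the numeric IDs created earlier with egen, and it keeps the same list of regressors.

```stata
* Exporter and importer fixed effects absorbed instead of being entered
* as i.exp_id and i.imp_id in the variable list
ppmlhdfe export loggdpi loggdpj logdist contig comlang_off ///
    gatt_i gatt_j fta_hmr ebola_only_i ebola_only_j ebola_both, ///
    absorb(exp_id imp_id)
```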

An R^2 of .7 isn't anything out of the ordinary.

I'm still curious about the coding of fta_hmr and why it was dropped before. Please make sure to check it is coded correctly.

Regards,
Tom

Last edited by Tom Zylkin; 27 Jul 2020, 06:41.