Dear Statalist,
I am a final-year undergraduate student working on a dissertation titled 'The Effect of Epidemic on International Tourism Flows: The Role of Public Healthcare Spending'. The study uses bilateral tourist arrivals from 191 origin countries to 180 destination countries, forming 15,276 country pairs, from 1995 to 2015. The unbalanced panel contains 206,171 observations after excluding missing values. I am interested in the moderating effect of public healthcare spending on the relationship between international tourism flows and past epidemic outbreaks. With that said, my Y = lflow, X = epidemic_d and the interaction term = epihgdp. Below is the description of the variables in the study:
Code:
 1. lflow               logarithm of bilateral tourist arrivals between origin and destination
 2. flow                bilateral tourist arrivals between origin and destination
 3. lgdp_o              logarithm of GDP per capita at origin (normalised by 10000)
 4. lgdp_d              logarithm of GDP per capita at destination (normalised by 10000)
 5. ldistw              logarithm of distance between origin and destination
 6. lpop_o              logarithm of population in origin country
 7. lpop_d              logarithm of population in destination country
 8. lRP_od              logarithm of relative price between origin and destination country
 9. epidemic_d          share of population affected by epidemic in destination country
10. epidemic_lagged_d   share of population affected by epidemic in destination country (one year lagged)
11. healthgdp_d         public healthcare expenditure (% of GDP)
12. healthgdp_lagged_d  public healthcare expenditure (% of GDP) (one year lagged)
13. epihgdp             interaction term between epidemic_d and healthgdp_d
14. epihgdp_lagged_d    interaction term between epidemic_lagged_d and healthgdp_lagged_d
The regression methods I am going to use are FE and PPML.
In the FE estimation, I estimated three specifications. The code is as follows:
Code:
eststo:xi:xtreg lflow epidemic_d lgdp_o lgdp_d lpop_o lpop_d lRP_od i.year , fe robust
Code:
eststo:xi:xtreg lflow epidemic_d healthgdp_d lgdp_o lgdp_d lpop_o lpop_d lRP_od i.year , fe robust
Code:
eststo:xi:xtreg lflow epidemic_d healthgdp_d epihgdp lgdp_o lgdp_d lpop_o lpop_d lRP_od i.year , fe robust
The direction and significance in the FE model look fine to me. The regression result for the last specification (with epihgdp) is as follows:
Code:
Fixed-effects (within) regression               Number of obs     =    152,289
Group variable: pairid                          Number of groups  =     13,283

R-squared:                                      Obs per group:
     Within  = 0.2618                                         min =          1
     Between = 0.3826                                         avg =       11.5
     Overall = 0.3539                                         max =         16

                                                F(23,13282)       =     418.61
corr(u_i, Xb) = 0.3785                          Prob > F          =     0.0000

                            (Std. err. adjusted for 13,283 clusters in pairid)
------------------------------------------------------------------------------
             |               Robust
       lflow | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
  epidemic_d |  -20.10204   6.929843    -2.90   0.004    -33.68552   -6.518561
 healthgdp_d |  -.0513586   .0044858   -11.45   0.000    -.0601514   -.0425659
     epihgdp |   3.198859   1.375551     2.33   0.020     .5025833    5.895135
      lgdp_o |   .3809364   .0215977    17.64   0.000     .3386018     .423271
      lgdp_d |   .4053506   .0199294    20.34   0.000     .3662861    .4444152
      lpop_o |   .1450008    .072187     2.01   0.045     .0035041    .2864976
      lpop_d |   .1948498   .0729338     2.67   0.008     .0518891    .3378106
      lRP_od |   .0843827   .0126933     6.65   0.000      .059502    .1092633
 _Iyear_1996 |          0  (omitted)
 _Iyear_1997 |          0  (omitted)
 _Iyear_1998 |          0  (omitted)
 _Iyear_1999 |          0  (omitted)
 _Iyear_2000 |  -.3337469   .0305551   -10.92   0.000    -.3936393   -.2738544
 _Iyear_2001 |  -.3341663   .0294902   -11.33   0.000    -.3919713   -.2763612
 _Iyear_2002 |  -.3272019   .0280088   -11.68   0.000    -.3821031   -.2723007
 _Iyear_2003 |  -.3627477   .0250352   -14.49   0.000    -.4118203   -.3136752
 _Iyear_2004 |  -.3262934   .0222754   -14.65   0.000    -.3699562   -.2826305
 _Iyear_2005 |  -.3073352   .0198753   -15.46   0.000    -.3462935   -.2683768
 _Iyear_2006 |  -.2940875   .0174447   -16.86   0.000    -.3282816   -.2598934
 _Iyear_2007 |  -.3162966   .0154593   -20.46   0.000     -.346599   -.2859942
 _Iyear_2008 |   -.337098   .0135715   -24.84   0.000    -.3637001   -.3104958
 _Iyear_2009 |  -.2683101    .012169   -22.05   0.000     -.292163   -.2444572
 _Iyear_2010 |  -.2578123   .0108087   -23.85   0.000    -.2789988   -.2366258
 _Iyear_2011 |  -.2565084   .0099581   -25.76   0.000    -.2760276   -.2369891
 _Iyear_2012 |   -.189837   .0086714   -21.89   0.000    -.2068342   -.1728397
 _Iyear_2013 |  -.1696128    .007937   -21.37   0.000    -.1851704   -.1540552
 _Iyear_2014 |  -.1530084   .0067724   -22.59   0.000    -.1662833   -.1397336
 _Iyear_2015 |          0  (omitted)
       _cons |  -.1052812   .3171258    -0.33   0.740     -.726893    .5163306
-------------+----------------------------------------------------------------
     sigma_u |  2.8881156
     sigma_e |   .6123985
         rho |  .95697322   (fraction of variance due to u_i)
------------------------------------------------------------------------------
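As a reading aid, the third FE specification and the effect of an outbreak that it implies can be written out as below. This is a sketch under the assumption that epihgdp is the plain product of epidemic_d and healthgdp_d; the symbols (the betas, the control vector X, the pair effect alpha and the year effect lambda) are notation introduced here for illustration and do not appear in the Stata output.

```latex
\begin{align*}
\ln(\mathit{flow}_{odt}) &= \beta_1\,\mathit{epidemic}_{dt}
  + \beta_2\,\mathit{healthgdp}_{dt}
  + \beta_3\,(\mathit{epidemic}_{dt}\times \mathit{healthgdp}_{dt})
  + \gamma' X_{odt} + \alpha_{od} + \lambda_t + \varepsilon_{odt} \\[4pt]
\frac{\partial \ln(\mathit{flow}_{odt})}{\partial\,\mathit{epidemic}_{dt}}
  &= \beta_1 + \beta_3\,\mathit{healthgdp}_{dt}
  \approx -20.10 + 3.20\,\mathit{healthgdp}_{dt}
\end{align*}
```

Taken at face value, these point estimates imply a negative effect of an outbreak for destinations whose public healthcare spending is below roughly 20.10 / 3.20 ≈ 6.3% of GDP, with the effect shrinking as spending rises. This reading uses the point estimates only and ignores their sampling uncertainty.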
The problem, however, is with the PPML estimation.
I repeated the three specifications above with PPML as a robustness check on the FE results. The majority of the variables become insignificant. The code is as follows:
Code:
eststo:xi:ppmlhdfe flow epidemic_d lgdp_o lgdp_d lpop_o lpop_d lRP_od, a(year pairid) nolog
Code:
eststo:xi:ppmlhdfe flow epidemic_d healthgdp_d lgdp_o lgdp_d lpop_o lpop_d lRP_od, a(year pairid ) nolog
Code:
eststo:xi:ppmlhdfe flow epidemic_d healthgdp_d epihgdp lgdp_o lgdp_d lpop_o lpop_d lRP_od , a(year pairid) nolog
The result is as follows:
Code:
HDFE PPML regression                              No. of obs      =    208,926
Absorbing 2 HDFE groups                           Residual df     =    195,620
                                                  Wald chi2(8)    =     202.95
Deviance             =  7.77752e+17               Prob > chi2     =     0.0000
Log pseudolikelihood = -3.88876e+17               Pseudo R2       =     0.9982
------------------------------------------------------------------------------
             |               Robust
        flow | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
  epidemic_d |  -194.7957    659.987    -0.30   0.768    -1488.347    1098.755
 healthgdp_d |  -.3241675   .0399636    -8.11   0.000    -.4024947   -.2458403
     epihgdp |   2.260448    108.162     0.02   0.983    -209.7333    214.2542
      lgdp_o |   .2913518   .1665717     1.75   0.080    -.0351228    .6178263
      lgdp_d |   .0382627   .1125117     0.34   0.734    -.1822561    .2587816
      lpop_o |   5.494733   .7255132     7.57   0.000     4.072754    6.916713
      lpop_d |  -1.594065   .7285146    -2.19   0.029    -3.021927   -.1662026
      lRP_od |   .3781714   .1781541     2.12   0.034     .0289959     .727347
       _cons |   35.84048   3.988092     8.99   0.000     28.02396    43.65699
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
        year |        16           0          16     |
      pairid |     13283           1       13282     |
-----------------------------------------------------+
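The two outputs are not estimated on the same sample (152,289 observations in the FE model against 208,926 in PPML), and the FE standard errors are clustered by pairid while the PPML ones are reported only as robust. Part of the difference may therefore come from the sample and the variance estimator rather than from PPML itself. A minimal sketch of a like-for-like comparison follows, assuming the data are already xtset on pairid; fe_sample is a hypothetical variable name, and clustering PPML by pair is an assumption made to mirror the FE output, not something done in the post.

```stata
* Third FE specification, as in the post
xi: xtreg lflow epidemic_d healthgdp_d epihgdp lgdp_o lgdp_d lpop_o lpop_d lRP_od i.year, fe robust

* Mark the observations actually used by the FE model (hypothetical name)
gen byte fe_sample = e(sample)

* PPML on exactly those observations, with standard errors clustered by pair
ppmlhdfe flow epidemic_d healthgdp_d epihgdp lgdp_o lgdp_d lpop_o lpop_d lRP_od ///
    if fe_sample, a(year pairid) vce(cluster pairid) nolog
```

If the PPML coefficients move towards the FE ones on this restricted sample, the extra observations are the more likely source of the discrepancy.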
My questions are:
1. I know the level of significance cannot be used to judge whether a regression is 'good' or 'bad'. Instead, it reveals some information to the researcher. In my case, is this possibly caused by a mistake of mine, or is the regression trying to tell me something? What might be the reason behind it?
2. I understand PPML is useful for dealing with the sample selection bias caused by zero observations. Indeed, the bilateral tourism data in this study has a large number of missing values. I replaced the missing values with 0 using the following command:
Code:
replace flow=0 if flow==.
So, does the insignificance in PPML indicate sensitivity to the zero observations?
3. If yes, what test/verification should I conduct next to justify this? Any recommendations on articles I could refer to?
4. If no, what should I do next? I have already checked my data and it appears to be correct.
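For question 2, one way to see how much the imputed zeros matter is to flag them before overwriting and compare PPML with and without them. This is a minimal sketch, not a definitive test: flow_imputed, ppml_all and ppml_observed are hypothetical names, and the flag has to be created on the raw data, before the replace command above is run.

```stata
* Flag the originally missing flows BEFORE overwriting them (hypothetical name)
gen byte flow_imputed = missing(flow)
replace flow = 0 if flow_imputed
tabulate flow_imputed

* Third PPML specification with the imputed zeros included
ppmlhdfe flow epidemic_d healthgdp_d epihgdp lgdp_o lgdp_d lpop_o lpop_d lRP_od, ///
    a(year pairid) nolog
estimates store ppml_all

* Same specification on the originally observed flows only
ppmlhdfe flow epidemic_d healthgdp_d epihgdp lgdp_o lgdp_d lpop_o lpop_d lRP_od ///
    if !flow_imputed, a(year pairid) nolog
estimates store ppml_observed

* Coefficients and standard errors side by side
estimates table ppml_all ppml_observed, se
```

Large shifts between the two columns would suggest the results are sensitive to treating missing flows as true zeros.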
Thank you everyone for your input!
Best regards,
Jacyln Hu.