Hello,
I am estimating a model using country-level panel data with System GMM. My dataset has 96 countries covering the period 2008-2022 with data available every two years (T=8). The panel is ‘strongly balanced’.
My dependent variable is adult per capita cigarette consumption, proxied by the total annual value of legal cigarette retail sales divided by the number of adults (aged 15 and older) in the population in that year. All values are positive, there are no zeros, and the distribution of this variable, which I have called pccons, is skewed (graph directly below).
The convention in my field is to log per capita consumption, and because the coefficients of interest are typically interpreted as elasticities, I specify the relationship between my dependent variable and my main independent variable as log-log. Below I show the distribution of the logged dependent variable (lnpccons) and of the logged main independent variable (lnTTI). lnTTI is the log of the total cigarette tax incidence (the share of all taxes in the price of a 20-pack of the most sold cigarette brand).
Code:
. hist lnpccons
(bin=27, start=3.3852437, width=.18245643)
Code:
hist lnTTI
I am especially interested in getting predicted values from the dynamic model that I fit. However, - predict y_hat, xb - gives me predictions of log per capita consumption, which are not meaningful to me on their own. An example of my situation is below.
Code:
. xtset id year, delta(2)

Panel variable: id (strongly balanced)
 Time variable: year, 2008 to 2022
         Delta: 2 units

. xi: xtabond2 lnpccons L.lnpccons L.highPOWE lnTTI lnGDPPC unem wap i.year, gmmstyle(L.lnpccons) ivstyle(L.highPOWE lnTTI lnGDPPC unem wap i.year) twostep robust small orthogonal
i.year            _Iyear_2008-2022    (naturally coded; _Iyear_2008 omitted)
Favoring space over speed. To switch, type or click on mata: mata set matafavor speed, perm.
_Iyear_2010 dropped due to collinearity
Warning: Two-step estimated covariance matrix of moments is singular.
  Using a generalized inverse to calculate optimal weighting matrix for two-step estimation.
  Difference-in-Sargan/Hansen statistics may be negative.

Dynamic panel-data estimation, two-step system GMM
------------------------------------------------------------------------------
Group variable: id                              Number of obs      =       668
Time variable : year                            Number of groups   =        96
Number of instruments = 39                      Obs per group: min =         5
F(12, 95)     = 141108.55                                      avg =      6.96
Prob > F      =     0.000                                      max =         7
------------------------------------------------------------------------------
             |              Corrected
    lnpccons | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    lnpccons |
         L1. |   1.030507   .0184528    55.85   0.000     .9938735     1.06714
             |
    highPOWE |
         L1. |  -.0098811   .0107595    -0.92   0.361    -.0312413    .0114791
             |
       lnTTI |     -.0694   .0136406    -5.09   0.000      -.09648     -.04232
     lnGDPPC |  -.0024295   .0023829    -1.02   0.311    -.0071601     .002301
        unem |  -.0016306   .0016559    -0.98   0.327     -.004918    .0016568
         wap |  -.0010921   .0013602    -0.80   0.424    -.0037924    .0016082
 _Iyear_2012 |   .0167286   .0183754     0.91   0.365    -.0197512    .0532084
 _Iyear_2014 |    -.02187   .0170215    -1.28   0.202    -.0556619     .011922
 _Iyear_2016 |  -.0006167   .0179358    -0.03   0.973    -.0362238    .0349904
 _Iyear_2018 |  -.0205002   .0163593    -1.25   0.213    -.0529775    .0119772
 _Iyear_2020 |  -.0246162   .0175492    -1.40   0.164    -.0594559    .0102234
 _Iyear_2022 |   .0297032    .021193     1.40   0.164    -.0123701    .0717765
       _cons |   .1223026   .0643969     1.90   0.061    -.0055415    .2501466
------------------------------------------------------------------------------
Instruments for orthogonal deviations equation
  Standard
    FOD.(L.highPOWE lnTTI lnGDPPC unem wap _Iyear_2010 _Iyear_2012 _Iyear_2014 _Iyear_2016 _Iyear_2018 _Iyear_2020 _Iyear_2022)
  GMM-type (missing=0, separate instruments for each period unless collapsed)
    L(1/7).L.lnpccons
Instruments for levels equation
  Standard
    L.highPOWE lnTTI lnGDPPC unem wap _Iyear_2010 _Iyear_2012 _Iyear_2014 _Iyear_2016 _Iyear_2018 _Iyear_2020 _Iyear_2022
    _cons
  GMM-type (missing=0, separate instruments for each period unless collapsed)
    D.L.lnpccons
------------------------------------------------------------------------------
Arellano-Bond test for AR(1) in first differences: z =  -3.68  Pr > z =  0.000
Arellano-Bond test for AR(2) in first differences: z =  -0.97  Pr > z =  0.330
------------------------------------------------------------------------------
Sargan test of overid. restrictions: chi2(26)   =  71.61  Prob > chi2 =  0.000
  (Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(26)   =  27.74  Prob > chi2 =  0.372
  (Robust, but weakened by many instruments.)

Difference-in-Hansen tests of exogeneity of instrument subsets:
  GMM instruments for levels
    Hansen test excluding group:     chi2(20)   =  18.80  Prob > chi2 =  0.535
    Difference (null H = exogenous): chi2(6)    =   8.93  Prob > chi2 =  0.177
  iv(L.highPOWE lnTTI lnGDPPC unem wap _Iyear_2010 _Iyear_2012 _Iyear_2014 _Iyear_2016 _Iyear_2018 _Iyear_2020 _Iyear_2022)
    Hansen test excluding group:     chi2(15)   =  20.67  Prob > chi2 =  0.148
    Difference (null H = exogenous): chi2(11)   =   7.07  Prob > chi2 =  0.794

. predict lnpccons_hat, xb
(100 missing values generated)

. sum lnpccons_hat

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
lnpccons_hat |        668    6.547359    1.006406   3.251544   8.253994
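For context, here is a minimal sketch of how the logged predictions above could be converted back to levels. It assumes the variables pccons, lnpccons and lnpccons_hat from the session above are in memory. The new variable names (uhat, exp_uhat, pccons_hat_naive, pccons_hat_smear) and the scalar smear are hypothetical names chosen for illustration. The adjustment shown is Duan's smearing factor, which is a swapped-in retransformation technique, not something xtabond2 provides, and it relies on the log-scale errors being roughly homoskedastic. Note also that the xb prediction excludes the country effect, so the residual computed here includes it.

```stata
* Residual on the log scale (missing wherever lnpccons_hat is missing)
generate double uhat = lnpccons - lnpccons_hat

* Duan smearing factor: sample mean of the exponentiated residuals
generate double exp_uhat = exp(uhat)
summarize exp_uhat, meanonly
scalar smear = r(mean)

* Naive back-transform, which ignores the retransformation bias
generate double pccons_hat_naive = exp(lnpccons_hat)

* Smearing-adjusted prediction in levels
generate double pccons_hat_smear = exp(lnpccons_hat) * smear

* Compare observed consumption with the two level predictions
summarize pccons pccons_hat_naive pccons_hat_smear
```

Comparing the last summarize output with the observed pccons gives a rough sense of how much the retransformation matters, though this does not by itself answer whether the model should be fit in levels.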
I have read on this forum about xtpoisson, fe with vce(robust) being used even with data that are not strictly count data, which is useful because you don’t need a log transform to get ‘better’ predictions of the dependent variable in levels. While this may be appropriate for a static model (no lagged dependent variable), I am wondering whether there is a way to account for my skewed dependent variable within the System GMM framework so that I can avoid logging it.
Thank you!
Sam