Accounting for the skewed distribution of the dependent variable in system GMM - prediction

Sam Murgatroyd

Join Date: Oct 2023
Posts: 33

10 Jun 2024, 21:47

Hello,

I am estimating a model using country-level panel data with System GMM. My dataset has 96 countries covering the period 2008-2022 with data available every two years (T=8). The panel is ‘strongly balanced’.

My dependent variable is adult per capita cigarette consumption (proxied by the total value of legal cigarette retail sales on an annual basis divided by the number of adults aged 15 and older in the population in a given year). All values are positive, there are no zeros, and the distribution of this variable, which I have called pccons, is skewed (graph directly below).

The convention in my field is to log per capita consumption, and since the coefficients of interest are typically interpreted as elasticities, the relationship between my dependent variable and my main explanatory variable is log-log. Below I show the distribution of the logged dependent variable (lnpccons) and of the logged main independent variable, lnTTI. lnTTI is the log of the total cigarette tax incidence (the share of all taxes in the price of a 20-pack of the most sold cigarette brand).

Code:

 hist lnpccons
(bin=27, start=3.3852437, width=.18245643)

[Figure: histogram of lnpccons (lnpccons graph.png)]

Code:

hist lnTTI

[Figure: histogram of lnTTI (lnTTI.png)]

I am especially interested in getting predicted values from the dynamic model that I fit. However, - predict y_hat, xb - gives me the prediction of log per capita consumption, which doesn’t mean anything to me. Example of my situation below.

Code:

 xtset id year, delta(2)

Panel variable: id (strongly balanced)
 Time variable: year, 2008 to 2022
         Delta: 2 units

.
. xi: xtabond2 lnpccons L.lnpccons L.highPOWE lnTTI lnGDPPC unem wap i.year, gmmstyle(L.lnpccons) ivstyle(L.highPOWE lnTTI lnGDPPC unem wap i.year) tw
> ostep robust small orthogonal
i.year            _Iyear_2008-2022    (naturally coded; _Iyear_2008 omitted)
Favoring space over speed. To switch, type or click on mata: mata set matafavor speed, perm.
_Iyear_2010 dropped due to collinearity
Warning: Two-step estimated covariance matrix of moments is singular.
  Using a generalized inverse to calculate optimal weighting matrix for two-step estimation.
  Difference-in-Sargan/Hansen statistics may be negative.

Dynamic panel-data estimation, two-step system GMM
------------------------------------------------------------------------------
Group variable: id                              Number of obs      =       668
Time variable : year                            Number of groups   =        96
Number of instruments = 39                      Obs per group: min =         5
F(12, 95)     = 141108.55                                      avg =      6.96
Prob > F      =     0.000                                      max =         7
------------------------------------------------------------------------------
             |              Corrected
    lnpccons | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    lnpccons |
         L1. |   1.030507   .0184528    55.85   0.000     .9938735     1.06714
             |
    highPOWE |
         L1. |  -.0098811   .0107595    -0.92   0.361    -.0312413    .0114791
             |
       lnTTI |     -.0694   .0136406    -5.09   0.000      -.09648     -.04232
     lnGDPPC |  -.0024295   .0023829    -1.02   0.311    -.0071601     .002301
        unem |  -.0016306   .0016559    -0.98   0.327     -.004918    .0016568
         wap |  -.0010921   .0013602    -0.80   0.424    -.0037924    .0016082
 _Iyear_2012 |   .0167286   .0183754     0.91   0.365    -.0197512    .0532084
 _Iyear_2014 |    -.02187   .0170215    -1.28   0.202    -.0556619     .011922
 _Iyear_2016 |  -.0006167   .0179358    -0.03   0.973    -.0362238    .0349904
 _Iyear_2018 |  -.0205002   .0163593    -1.25   0.213    -.0529775    .0119772
 _Iyear_2020 |  -.0246162   .0175492    -1.40   0.164    -.0594559    .0102234
 _Iyear_2022 |   .0297032    .021193     1.40   0.164    -.0123701    .0717765
       _cons |   .1223026   .0643969     1.90   0.061    -.0055415    .2501466
------------------------------------------------------------------------------
Instruments for orthogonal deviations equation
  Standard
    FOD.(L.highPOWE lnTTI lnGDPPC unem wap _Iyear_2010 _Iyear_2012 _Iyear_2014
    _Iyear_2016 _Iyear_2018 _Iyear_2020 _Iyear_2022)
  GMM-type (missing=0, separate instruments for each period unless collapsed)
    L(1/7).L.lnpccons
Instruments for levels equation
  Standard
    L.highPOWE lnTTI lnGDPPC unem wap _Iyear_2010 _Iyear_2012 _Iyear_2014
    _Iyear_2016 _Iyear_2018 _Iyear_2020 _Iyear_2022
    _cons
  GMM-type (missing=0, separate instruments for each period unless collapsed)
    D.L.lnpccons
------------------------------------------------------------------------------
Arellano-Bond test for AR(1) in first differences: z =  -3.68  Pr > z =  0.000
Arellano-Bond test for AR(2) in first differences: z =  -0.97  Pr > z =  0.330
------------------------------------------------------------------------------
Sargan test of overid. restrictions: chi2(26)   =  71.61  Prob > chi2 =  0.000
  (Not robust, but not weakened by many instruments.)
Hansen test of overid. restrictions: chi2(26)   =  27.74  Prob > chi2 =  0.372
  (Robust, but weakened by many instruments.)

Difference-in-Hansen tests of exogeneity of instrument subsets:
  GMM instruments for levels
    Hansen test excluding group:     chi2(20)   =  18.80  Prob > chi2 =  0.535
    Difference (null H = exogenous): chi2(6)    =   8.93  Prob > chi2 =  0.177
  iv(L.highPOWE lnTTI lnGDPPC unem wap _Iyear_2010 _Iyear_2012 _Iyear_2014 _Iyear_2016 _Iyear_2018 _Iyear_2020 _Iyear_2022)
    Hansen test excluding group:     chi2(15)   =  20.67  Prob > chi2 =  0.148
    Difference (null H = exogenous): chi2(11)   =   7.07  Prob > chi2 =  0.794


.
. predict lnpccons_hat, xb
(100 missing values generated)

.
. sum lnpccons_hat

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
lnpccons_hat |        668    6.547359    1.006406   3.251544   8.253994

I have read on this forum about xtpoisson, fe with vce(robust) being used even with data that are not strictly count data, which is useful because you don’t need a log transform to get ‘better’ predictions of the dependent variable in levels. While this may be appropriate for a static model (no lagged dependent variable), I am wondering whether there is a way to account for my skewed dependent variable within the System GMM framework so that I can avoid logging it.

Thank you!

Sam

Tags: dynamic-panel, logs, System GMM, xtabond2, xtpoisson

George Ford

Join Date: Aug 2014

Posts: 3120
#2

11 Jun 2024, 09:23

Check this, but I think e(sigma) is the RMSE.

predict yfit, xb
g yfit2 = (exp(yfit)*exp((`e(sigma)'^2)/2))
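A minimal sketch of this adjustment that does not depend on e(sigma) being stored. It rebuilds the levels residuals by hand and computes two candidate retransformation factors: the normal-theory factor used above, and Duan's smearing factor, a swapped-in alternative that drops the normality assumption. It assumes the xtabond2 model from #1 is the active estimation result; the names yfit_ln, res_ln, exp_res, adj_norm, adj_smear, pccons_hat_norm and pccons_hat_smear are hypothetical.

```stata
* Sketch only: assumes the xtabond2 fit from #1 is the active estimation result
* List the stored results to see which e() scalars actually exist
ereturn list

* Fitted log consumption and hand-built levels residual
predict double yfit_ln, xb
generate double res_ln = lnpccons - yfit_ln

* Normal-theory factor exp(s^2/2), with s^2 taken from the residuals
summarize res_ln
scalar adj_norm = exp(r(Var)/2)

* Duan smearing factor: sample mean of exp(residual), no normality assumed
generate double exp_res = exp(res_ln)
summarize exp_res, meanonly
scalar adj_smear = r(mean)

* Predictions in levels under each factor, compared with the observed values
generate double pccons_hat_norm  = exp(yfit_ln) * adj_norm
generate double pccons_hat_smear = exp(yfit_ln) * adj_smear
summarize pccons pccons_hat_norm pccons_hat_smear if !missing(yfit_ln)
```

If the two factors differ noticeably, that may indicate the residuals are not close to normal, in which case the smearing version is probably the safer of the two.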
Sam Murgatroyd

Join Date: Oct 2023

Posts: 33
#3

11 Jun 2024, 14:08

Thank you for your response, George - I appreciate it!

Can I please clarify whether it is correct to use the same RMSE from the fitted model when trying to obtain a prediction for a counterfactual scenario (using “fake” data)?

Sam
George Ford

Join Date: Aug 2014

Posts: 3120
#4

11 Jun 2024, 15:12

Hmm. I suspect so. When you exp a prediction from a log DV, you have to adjust for the RMSE. That's all this does. It's just a transformation of the prediction.
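A rough sketch of how that could look for a counterfactual, assuming the error distribution is unchanged by the scenario: estimate the factor once from the fitted model, then reuse it on the altered data. The scenario (tax incidence 10% higher) and the names yfit0, res0, adj, yfit_cf and pccons_cf are hypothetical. This holds L.lnpccons at its observed values, so it is a one-period-ahead prediction, not a full dynamic simulation.

```stata
* Sketch only: assumes the xtabond2 fit from #1 is the active estimation result
* 1. Adjustment factor from the factual fit
predict double yfit0, xb
generate double res0 = lnpccons - yfit0
summarize res0
scalar adj = exp(r(Var)/2)

* 2. Counterfactual data: hypothetical 10% higher tax incidence
preserve
replace lnTTI = lnTTI + ln(1.10)
predict double yfit_cf, xb
generate double pccons_cf = exp(yfit_cf) * adj
summarize pccons_cf
* restore brings back the original lnTTI and drops yfit_cf and pccons_cf
restore
```

Save pccons_cf to a file before restore if the counterfactual predictions are needed afterwards.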
Sam Murgatroyd

Join Date: Oct 2023

Posts: 33
#5

12 Jun 2024, 07:11

Thank you, George!
George Ford

Join Date: Aug 2014

Posts: 3120
#6

12 Jun 2024, 07:18

You ran [lnY = b*X], then predicted the DV and got predictions of lnY, which you didn't want. You wanted predictions of Y. Typically, to reverse the log you'd just do exp(lnY). But since it's a prediction from a regression, you make the adjustment.

HTML Code:

https://davegiles.blogspot.com/2014/12/s.html
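The reasoning behind the adjustment can be sketched as follows, assuming the error u in the log model is independent of X (the symbols u and \sigma^2 are introduced here for illustration only):

```latex
\begin{aligned}
\ln Y &= X b + u \\
E[Y \mid X] &= \exp(X b)\, E[\exp(u)] \\
E[\exp(u)] &= \exp\!\left(\sigma^2 / 2\right) \quad \text{if } u \sim N(0, \sigma^2)
\end{aligned}
```

Since \exp(\sigma^2/2) > 1 whenever \sigma^2 > 0, plain exp(lnY) from the fitted values underpredicts the mean of Y; under normality it corresponds to the conditional median instead.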
Sam Murgatroyd

Join Date: Oct 2023

Posts: 33
#7

12 Jun 2024, 12:28

Thanks, George. I was hoping there was an alternative, such as glm with a log link or xtpoisson as I have seen elsewhere on this forum, that would also work with an IV approach designed for dynamic models with small T like mine. I'll stick with the adjusted exponential conversion.

Sam
George Ford

Join Date: Aug 2014

Posts: 3120
#8

12 Jun 2024, 12:36

glm or xtpoisson might work. Hard to say without the data.

But if you need xtabond2, probably not.

Last edited by George Ford; 12 Jun 2024, 12:47.