Hello,
I am using panel data from 96 countries over the period 2008-2022 (data are available at two-year intervals). My dependent variable is the log of per capita cigarette consumption.
At a broad level, my issue concerns collinearity arising from controlling for the same variable in two different places in a single regression, and when it is a problem. I have seen a lot on this forum that we shouldn’t be all that worried about multicollinearity; but I am concerned that when the model controls for the same variable twice, it might be a special ‘bad’ case.
I am running a series of models that are the same in every respect with the exception of one variable. I want to compare the coefficients of the ‘unique’ variables in each regression, as well as the model R-squareds to inform a view of which of the unique variables offers more explanatory power than the others. These ‘unique’ variables enter the model in log form so that I can interpret their coefficients as elasticities.
Because I am estimating an equation of cigarette demand, my models need to control for income. In my area of study, researchers do this by including GDP per capita as a control in the regressions. However, in one of my equations, the unique variable of interest includes GDP per capita in its composition. The variable is defined as RIP = (price of a 20-pack of the most-sold cigarette brand * 100) / GDP per capita, or (P*100)/Y.
So, if I estimate this model, with the log of GDP per capita as a separate control, I have:
lncigcons_percapita_jt = B0 + B1*lnRIP_jt + B2*lnY_jt + δ_t + α_j + u_jt
Substituting RIP = (P*100)/Y gives
lncigcons_percapita_jt = B0 + 4.61*B1 + B1*lnP_jt - B1*lnY_jt + B2*lnY_jt + δ_t + α_j + u_jt
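Collecting the two income terms may make my concern more concrete. This is only a sketch. It assumes the income control enters in logs and that the Y inside RIP is the same GDP per capita series as the separate control, which may not hold exactly if RIP is built from nominal GDP per capita while the control is in constant PPP terms. C is my shorthand here for per capita cigarette consumption.

```latex
\begin{aligned}
\ln C_{jt} &= B_0 + B_1 \ln RIP_{jt} + B_2 \ln Y_{jt} + \delta_t + \alpha_j + u_{jt} \\
           &= (B_0 + B_1 \ln 100) + B_1 \ln P_{jt} + (B_2 - B_1) \ln Y_{jt} + \delta_t + \alpha_j + u_{jt}
\end{aligned}
```

Under those assumptions, the model looks like a reparameterization of a regression on lnP and lnY entered separately: B1 would be the coefficient on lnP, and the income coefficient holding price fixed would be B2 - B1 rather than B2. As far as I can tell, nothing is perfectly collinear as long as lnP varies independently of lnY, but the interpretation of B2 changes.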
When I estimate the model in Stata, it runs, and the separate coefficients on RIP and income make sense (the results shown also include the unemployment rate, a lagged composite score for non-price tobacco control policies, and the % of the population that is of working age). However, I am concerned that I am ‘double counting’ GDP per capita in a way that renders my results nonsensical, since the log of GDP per capita appears twice in the model.
I thought that using the GDP growth rate instead of the log of GDP per capita in all versions of the model might let me control adequately for income in the other versions (where the unique variable of interest does not include GDP per capita by definition) without double counting GDP per capita in the RIP model presented here. But I am unsure, since fundamentally I am still relying on GDP twice.
Code:
xtreg lnpccons lnRIP L.highPOWE lnGDPPC_constant_PPP unem wap i.year, robust fe

Fixed-effects (within) regression               Number of obs     =        667
Group variable: id                              Number of groups  =         96

R-squared:                                      Obs per group:
     Within  = 0.5318                                         min =          5
     Between = 0.2836                                         avg =        6.9
     Overall = 0.2907                                         max =          7

                                                F(11,95)          =      16.26
corr(u_i, Xb) = 0.1810                          Prob > F          =     0.0000

                                    (Std. err. adjusted for 96 clusters in id)
--------------------------------------------------------------------------------------
                     |               Robust
            lnpccons | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
---------------------+----------------------------------------------------------------
               lnRIP |  -.1099073   .0526148    -2.09   0.039    -.2143608   -.0054538
                     |
            highPOWE |
                 L1. |  -.0066702   .0301874    -0.22   0.826    -.0665997    .0532593
                     |
lnGDPPC_constant_PPP |   .3626224   .1902978     1.91   0.060    -.0151666    .7404113
                unem |    .003714   .0051901     0.72   0.476    -.0065896    .0140175
                 wap |  -.0163203   .0114393    -1.43   0.157    -.0390303    .0063896
                     |
                year |
               2012  |  -.0591595   .0135867    -4.35   0.000    -.0861325   -.0321865
               2014  |  -.1466417   .0244566    -6.00   0.000    -.1951942   -.0980891
               2016  |  -.2216274   .0335477    -6.61   0.000    -.2882281   -.1550268
               2018  |  -.3272128   .0446997    -7.32   0.000     -.415953   -.2384725
               2020  |  -.4137365   .0488468    -8.47   0.000    -.5107096   -.3167633
               2022  |  -.4948223   .0559023    -8.85   0.000    -.6058023   -.3838423
                     |
               _cons |   4.384663   1.881795     2.33   0.022     .6488282    8.120498
---------------------+----------------------------------------------------------------
             sigma_u |  .84577749
             sigma_e |  .16035323
                 rho |  .96530185   (fraction of variance due to u_i)
--------------------------------------------------------------------------------------
Any advice would be greatly appreciated!
Thank you!
Sam