Including the same variable in two different places in one regression - multicollinearity/double counting

Sam Murgatroyd

Join Date: Oct 2023

Posts: 33
#1

06 Jun 2024, 19:50

Hello,

I am using panel data from 96 countries over the period 2008-2022 (data are available in two year intervals). My primary independent variable is the log of per capita cigarette consumption.

At a broad level, my issue concerns collinearity arising from controlling for the same variable in two different places in a single regression, and when it is a problem. I have seen a lot on this forum that we shouldn’t be all that worried about multicollinearity; but I am concerned that when the model controls for the same variable twice, it might be a special ‘bad’ case.

I am running a series of models that are the same in every respect with the exception of one variable. I want to compare the coefficients of the ‘unique’ variables in each regression, as well as the model R-squareds to inform a view of which of the unique variables offers more explanatory power than the others. These ‘unique’ variables enter the model in log form so that I can interpret their coefficients as elasticities.

Because I am estimating an equation of cigarette demand, my models need to control for income. In my area of study, researchers do this by including GDP per capita as a control in the regressions. However, in one of my equations, the unique variable of interest includes GDP per capita in its composition. The variable is defined as RIP = (price of a 20-pack of the most-sold cigarette brand * 100) / GDP per capita, or (P*100)/Y.

So, if I estimate this model, with the log of GDP per capita as a separate control, I have:

lncigcons_percapita_jt = B₀ + B₁lnRIP_jt + B₂lnY_jt + δ_t + α_j + u_jt

Substituting RIP = (P*100)/Y gives

lncigcons_percapita_jt = B₀ + 4.61B₁ + B₁lnP_jt - B₁lnY_jt + B₂lnY_jt + δ_t + α_j + u_jt

When I estimate the model in Stata, it runs, and the separate coefficients on RIP and income make sense (results shown also include the unemployment rate, a lagged composite score for non-price tobacco control policies, and the % of the population that is of working age). However, I am concerned that I am ‘double counting’ GDP per capita in a way that renders my results nonsensical, since the log of GDP per capita appears twice in the model.

Code:

xtreg lnpccons lnRIP L.highPOWE lnGDPPC_constant_PPP unem wap i.year, robust fe

Fixed-effects (within) regression               Number of obs     =        667
Group variable: id                              Number of groups  =         96

R-squared:                                      Obs per group:
     Within  = 0.5318                                         min =          5
     Between = 0.2836                                         avg =        6.9
     Overall = 0.2907                                         max =          7

                                                F(11,95)          =      16.26
corr(u_i, Xb) = 0.1810                          Prob > F          =     0.0000

                                    (Std. err. adjusted for 96 clusters in id)
--------------------------------------------------------------------------------------
                     |               Robust
            lnpccons | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
---------------------+----------------------------------------------------------------
               lnRIP |  -.1099073   .0526148    -2.09   0.039    -.2143608   -.0054538
                     |
            highPOWE |
                 L1. |  -.0066702   .0301874    -0.22   0.826    -.0665997    .0532593
                     |
lnGDPPC_constant_PPP |   .3626224   .1902978     1.91   0.060    -.0151666    .7404113
                unem |    .003714   .0051901     0.72   0.476    -.0065896    .0140175
                 wap |  -.0163203   .0114393    -1.43   0.157    -.0390303    .0063896
                     |
                year |
               2012  |  -.0591595   .0135867    -4.35   0.000    -.0861325   -.0321865
               2014  |  -.1466417   .0244566    -6.00   0.000    -.1951942   -.0980891
               2016  |  -.2216274   .0335477    -6.61   0.000    -.2882281   -.1550268
               2018  |  -.3272128   .0446997    -7.32   0.000     -.415953   -.2384725
               2020  |  -.4137365   .0488468    -8.47   0.000    -.5107096   -.3167633
               2022  |  -.4948223   .0559023    -8.85   0.000    -.6058023   -.3838423
                     |
               _cons |   4.384663   1.881795     2.33   0.022     .6488282    8.120498
---------------------+----------------------------------------------------------------
             sigma_u |  .84577749
             sigma_e |  .16035323
                 rho |  .96530185   (fraction of variance due to u_i)
--------------------------------------------------------------------------------------

I thought that perhaps using the GDP growth rate instead of the log of GDP per capita in all versions of the model might allow me to adequately control for income in all other versions of the model (where the unique variable of interest does not include GDP per capita by definition), while not double counting GDP per capita in the RIP model presented here. But I am unsure since, fundamentally, I am still relying on GDP twice.
Any advice would be greatly appreciated!

Thank you!

Sam
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2081
#2

06 Jun 2024, 20:50

Sam: I'm not sure why you want to define the price variable as you did. Why not just use a standard constant elasticity demand function, where you include ln(P) and ln(Y)? It's essentially the same model; it's just that the coefficient on ln(Y) is different. The coefficient on ln(P) will still be -.1099 and the coefficient on ln(Y) will be .3626 + .1099 = .4725. This shows an even larger income effect and leaves the price elasticity unchanged.
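For readers following the arithmetic, here is a sketch of the algebra behind those two numbers. The symbol C is shorthand introduced here for per capita cigarette consumption; the other symbols follow the equation in #1, and the remaining controls are omitted for brevity.

```latex
\begin{aligned}
\ln C_{jt} &= \beta_0 + \beta_1 \ln RIP_{jt} + \beta_2 \ln Y_{jt} + \delta_t + \alpha_j + u_{jt},
  \qquad RIP_{jt} = \frac{100\,P_{jt}}{Y_{jt}} \\
           &= (\beta_0 + \beta_1 \ln 100) + \beta_1 \ln P_{jt}
              + (\beta_2 - \beta_1) \ln Y_{jt} + \delta_t + \alpha_j + u_{jt} \\[4pt]
\hat\beta_1 &= -0.1099, \qquad \hat\beta_2 = 0.3626
  \;\Longrightarrow\; \hat\beta_2 - \hat\beta_1 = 0.3626 + 0.1099 = 0.4725
\end{aligned}
```

The two forms are the same model written in different coordinates, so the fit and the price coefficient are identical; only the meaning of the coefficient on ln(Y) changes.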
Sam Murgatroyd

Join Date: Oct 2023

Posts: 33
#3

06 Jun 2024, 21:06

Hi Professor Wooldridge,

Thank you for the response! The RIP is the Relative Income Price: the percentage of GDP per capita required to buy 100 packs of 20 of the most-sold cigarettes. It is used in my field as a measure of cigarette affordability. An increase in the RIP means that cigarettes are becoming less affordable. The WHO advises countries to raise taxes so that the RIP increases over time (i.e. cigarettes become less affordable over time).

Increasing the RIP isn't the only tax policy guidance the WHO provides, and my ultimate goal is to compare the various measures of tax policy performance that the WHO prescribes to understand which one has best explained per capita consumption in my sample of countries. For these non-RIP measures, I need to include income as a control to avoid omitted variable bias in my models. However, to compare fairly between models, I believe that the models should be identical in all respects except the unique tax performance measure of interest. That brings me back to my original question of whether it is okay to include GDP per capita twice in the model: once in the denominator of RIP and once as its own control.

Sam
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2081
#4

06 Jun 2024, 21:52

Okay. There's no problem with your model specification. RIP measures one thing, per capita GDP measures another. However, if you're holding ln(Y) fixed -- as occurs when you include it in the model -- the only way RIP can change is when the price changes. That's why the coefficient on ln(RIP) is the same as the coefficient on ln(price). If it's the RIP variable you care about, and not ln(Y), then it doesn't model. In your model, the coefficient on ln(Y) is the elasticity of consumption with respect to income when RIP is held fixed, which means the price has to increase to offset the increase in income to keep RIP unchanged.
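The equivalence can be checked on simulated data. The following is only a sketch: the data are made up, and every variable name and parameter value (lncons, lnP, lnY, the 0.11 and 0.47 used to generate the outcome) is an assumption for illustration, not taken from the poster's dataset.

```stata
* Simulated panel: 96 units, 7 periods. All names and numbers are illustrative.
clear
set seed 12345
set obs 96
gen id = _n
gen double a_j = rnormal()               // unit fixed effect
expand 7                                 // 7 periods per unit
bysort id: gen t = _n
xtset id t

gen double lnY    = 8 + 0.5*a_j + 0.05*t + rnormal(0, 0.3)
gen double lnP    = 1 + 0.3*lnY + rnormal(0, 0.4)
gen double lnRIP  = ln(100) + lnP - lnY  // ln(100*P/Y)
gen double lncons = 4 - 0.11*lnP + 0.47*lnY + a_j + rnormal(0, 0.15)

* Parameterization A: ln(RIP) and ln(Y)
xtreg lncons lnRIP lnY i.t, fe vce(cluster id)
scalar bA_rip     = _b[lnRIP]
scalar bA_y       = _b[lnY]
scalar bA_implied = bA_y - bA_rip        // implied income coefficient in B

* Parameterization B: ln(P) and ln(Y)
xtreg lncons lnP lnY i.t, fe vce(cluster id)
scalar bB_p = _b[lnP]
scalar bB_y = _b[lnY]

display "price coefficient:  A = " bA_rip "   B = " bB_p
display "income coefficient: B = " bB_y "   implied by A = " bA_implied
```

Up to rounding, the coefficient on lnRIP in A should equal the coefficient on lnP in B, and the income coefficient in B should equal the income coefficient in A minus the lnRIP coefficient.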
Sam Murgatroyd

Join Date: Oct 2023

Posts: 33
#5

07 Jun 2024, 01:20

Thank you for this response, though I am a bit confused by the sentence -

Originally posted by Jeff Wooldridge

If it's the RIP variable you care about, and not ln(Y), then it doesn't model.

It is indeed the RIP that I care about. I am not all that interested in ln(Y) beyond correctly specifying the model. I am not sure what is meant by 'it doesn't model', but it does not sound good.

After comparing the various models, I wanted to use 'predict' to estimate average per capita consumption (the dependent variable) had all countries implemented the WHO's best practice guideline of reducing the RIP by 7.5% each year, and then compare this outcome with the average per capita consumption actually observed (where most countries did not meet the WHO's 7.5% annual average reduction). However, I am now unsure if this is feasible.

Thank you and apologies in advance that I do not understand!
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2081
#6

07 Jun 2024, 15:27

I'd probably start from each country's RIP in the first year and then create a new series where it drops by 7.5% each year. Then I'd insert that series into the equation for each country to get a prediction for the log consumption variable. Obtain the difference between the actual and the counterfactual, and sum those up across time and country. Or, just do it for the last time period if that's more interesting.
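A minimal Stata sketch of those steps, run on simulated stand-in data. The variable names (lncons, lnRIP, lnY, year), the parameter values, and the biennial spacing are assumptions for illustration; the real model would also carry the other controls. The linear prediction (xb) is used for both scenarios because the fixed effect is identical in each and cancels in the difference.

```stata
* Simulated stand-in data; names and numbers are illustrative only.
clear
set seed 2024
set obs 96
gen id = _n
gen double a_j = rnormal()
expand 7
bysort id: gen year = 2008 + 2*_n        // 2010, 2012, ..., 2022
xtset id year, delta(2)
gen double lnY    = 8 + 0.02*(year - 2010) + rnormal(0, 0.3)
gen double lnRIP  = 1 + 0.01*(year - 2010) + rnormal(0, 0.2)
gen double lncons = 4 - 0.11*lnRIP + 0.36*lnY + a_j + rnormal(0, 0.15)

xtreg lncons lnRIP lnY i.year, fe vce(cluster id)

* Linear prediction at the observed RIP
predict double xb_actual, xb

* Counterfactual path: start at each unit's first observed RIP and
* lower it by 7.5% per calendar year elapsed since that first year
bysort id (year): gen double lnRIP_cf = lnRIP[1] + (year - year[1])*ln(1 - 0.075)

* Swap the counterfactual series in, predict again, then restore the data
clonevar lnRIP_obs = lnRIP
replace lnRIP = lnRIP_cf
predict double xb_cf, xb
replace lnRIP = lnRIP_obs

* Counterfactual minus actual predicted log consumption
gen double diff = xb_cf - xb_actual
summarize diff                           // all periods; r(sum) holds the total
summarize diff if year == 2022           // last period only
```

Because only lnRIP differs between the two predictions, diff equals the estimated lnRIP coefficient times (lnRIP_cf - lnRIP), and it is zero in each unit's first period by construction.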
Sam Murgatroyd

Join Date: Oct 2023

Posts: 33
#7

07 Jun 2024, 23:19

Yes, that’s the plan.

However, my concern is that

Originally posted by Jeff Wooldridge

If it's the RIP variable you care about, and not ln(Y), then it doesn't model.

if I include lnRIP and lnGDP_per_capita in the same regression. Please can you clarify what is meant by “it doesn’t model”?

On the one hand, it sounds like I should not include both RIP and per capita GDP in the same regression since my log RIP coefficient in the presence of logged GDP per capita is actually just a price elasticity as opposed to an affordability/RIP elasticity. It is the latter that I am interested in. On the other hand,

Originally posted by Jeff Wooldridge

There's no problem with your model specification.

.

Thank you. I really appreciate your time.
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2081
#8

08 Jun 2024, 07:12

Typed too quickly. I meant to say "it doesn't matter." You need to hold log(Y) fixed in the calculations. You want to hold income fixed and see what happens when you change RIP. Once you've estimated the coefficients, the log(Y) term plays no role because it's the same in the actual and counterfactual scenarios.
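A sketch of why the log(Y) term drops out, writing C for per capita consumption and using a "cf" superscript for the counterfactual scenario (both are notation introduced here). Income, the year effects, the country effects, and the other controls take the same values in both scenarios, so they cancel when the predictions are differenced:

```latex
\begin{aligned}
\widehat{\ln C}^{\,cf}_{jt} - \widehat{\ln C}_{jt}
  &= \hat\beta_1 \ln RIP^{\,cf}_{jt} + \hat\beta_2 \ln Y_{jt} + \hat\delta_t + \hat\alpha_j
   - \left( \hat\beta_1 \ln RIP_{jt} + \hat\beta_2 \ln Y_{jt} + \hat\delta_t + \hat\alpha_j \right) \\
  &= \hat\beta_1 \left( \ln RIP^{\,cf}_{jt} - \ln RIP_{jt} \right)
\end{aligned}
```

As a purely arithmetic illustration with the estimate from #1, an RIP that is 7.5% lower than the observed one gives a difference of about -0.1099 × ln(0.925) ≈ 0.0086 in predicted log consumption.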
Sam Murgatroyd

Join Date: Oct 2023

Posts: 33
#9

09 Jun 2024, 00:44

Thank you for taking the time to respond to my questions, Professor Wooldridge!