variants of same variable appearing on LHS and RHS

Johannes Scharzer

Join Date: Jan 2016

Posts: 8
#1

19 Jan 2016, 07:47

A quick, rather general econometrics question: I am running a regression of the form:
(Y+Z+I)/X=a + bX+Z/X+L/X_epsilon
So I normalize most variables by X, except X itself, which however also appears in the dependent variable (as its denominator). Furthermore, Z also appears on both the LHS and the RHS, if not one-to-one. Any ideas whether that could be problematic and/or what to do about it? Even just knowing the term for such a misspecification would help me with further Google searches; I seem unable to wrap my head around this seemingly simple issue. Many, many thanks for your help!
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

21 Jan 2016, 15:18

It is not unheard of to normalize many variables by the same variable. For example, one might want to deal with per capita variables throughout. That said, this looks very fishy. It seems like you're guaranteed specific results on Z by construction.

This might help you access the relevant literature:

Wiseman, R.M. (2009). On the use and misuse of ratio variables in strategic management research (pp. 75-110). In D. Ketchen & D. Berg (eds.), Research methodology in strategy and management, vol 5. San Diego: Elsevier JAI Press.
Roman Mostazir

Join Date: Apr 2014

Posts: 874
#3

21 Jan 2016, 18:44

I do not know econometrics and can't comment on that, but let me tell you what I don't understand here. I don't understand where the _epsilon term comes from, if you mean it as an error term. More interestingly, how would you interpret the coefficient for Z/X? Because:

Code:

LHS: (Y+Z+I)/X

This is equivalent to writing:

Code:

LHS: (Y/X) + (Z/X) +(I/X)

The exact term Z/X is on both sides of the equation. How does that work in terms of interpretation? I can understand that X' might still have some valid interpretation, but Z/X?

Roman
Steve Samuels

Join Date: Mar 2014

Posts: 1786
#4

22 Jan 2016, 12:38

Another reference is: Kronmal, R.A. 1993. Spurious correlation and the fallacy of the ratio standard revisited. Journal of the Royal Statistical Society. Series A (Statistics in Society) 156, no. 3: 379-392.

Steve Samuels
Statistical Consulting

Stata 14.2
Johannes Scharzer

Join Date: Jan 2016

Posts: 8
#5

16 Feb 2016, 03:55

Thanks all! In fact, I was not very precise in my question. So I am running this regression: (Y+Z+I)/X=a + bX+cZ/X+dL/X+eD+error. So what I am interested in is coefficient e of my dummy D, the rest is in there as controls. Is this some misspecification or can I do that? It's crazy, I cannot find anything on this on almighty google! Thanks a lot guys
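A minimal numerical sketch of what the overlap does under plain OLS, using simulated data (every distribution and value below is an assumption for illustration, not the poster's data). Because Z/X is both part of the dependent variable and a regressor, taking it out of the left-hand side changes only its own coefficient, by exactly 1, and leaves every other coefficient, including e on the dummy D, unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical firm-level data; all ranges are arbitrary assumptions.
X = rng.uniform(10, 200, n)               # employment
Z = rng.uniform(50, 500, n)               # domestic sales
L = rng.uniform(5, 100, n)                # other covariates
D = rng.integers(0, 2, n).astype(float)   # exporter dummy
Y = D * rng.uniform(20, 300, n)           # foreign sales (zero for non-exporters)
I = -rng.uniform(10, 200, n)              # negative inputs

# Regressors in the order: constant, X, Z/X, L/X, D
regressors = np.column_stack([np.ones(n), X, Z / X, L / X, D])

def ols(y, R):
    """Return least-squares coefficients of y on the columns of R."""
    coef, *_ = np.linalg.lstsq(R, y, rcond=None)
    return coef

full = ols((Y + Z + I) / X, regressors)    # Z/X on both sides
reduced = ols((Y + I) / X, regressors)     # Z/X removed from the LHS

# Difference between the two coefficient vectors
diff = full - reduced
print(np.round(diff, 8))
```

Up to rounding, the difference is 0 everywhere except for the Z/X coefficient, where it is 1. This follows from OLS being linear in the dependent variable: regressing Z/X on a set of regressors that contains Z/X returns a coefficient of 1 on that term and 0 on all others.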
daniel klein

Join Date: Mar 2014

Posts: 3850
#6

16 Feb 2016, 05:10

From just quickly scanning the literature cited here, it seems almighty google might not be necessary in the first place. It seems, from a quick glance, that putting ratios in regression type models is problematic even if you did not have the same terms on both sides of the equation - which seems odd, although I have not done the math to pin down precisely why.

Aside from what has been pointed out, it does not seem a good idea to drop the conditional effects from the model. Note that e.g. Z/X can be rewritten as Z*X^(-1), which is just a multiplicative (i.e. interaction) term. Usually when including interactions of predictors you want to include the conditional main effects, i.e. Z and X^(-1), in the model as well. Along this line, what would b in your model represent (similar to Roman's question above)?

I am sure there are better ways to control for whatever it is you want to control for. In this spirit, you might be better off telling us more about the substantive question you are trying to answer with this model. What do Y, Z, I, X, L and D stand for?

Best
Daniel

Last edited by daniel klein; 16 Feb 2016, 05:13.
Johannes Scharzer

Join Date: Jan 2016

Posts: 8
#7

16 Feb 2016, 05:19

Hey Daniel, many thanks for this answer. So this is to assess the effect of exporting on labor productivity, where Y=foreign sales, Z=domestic sales, I=(-Inputs), X=employment. So the LHS represents value added per worker, or labor productivity. L stands for other covariates like investment, salaries etc. and D is a dummy for exporting. I am interested in e, the coefficient on the exporting dummy. The regression is at the firm level, so I want to control for firm size by including those variables. Hope this gives you a better idea of the problem at hand. Many thanks for your time and advice!
Rich Goldstein

Join Date: Mar 2014

Posts: 4464
#8

16 Feb 2016, 06:04

Ratio variables are, at best, tricky; the following is something I originally posted on the old Stata list:

Recently there was an incomplete discussion of the use of ratios in
regression. I submit the following as a form of completion (and in part
because I feel guilty about not completing and then criticizing,
privately, someone who had submitted something incomplete to the list).

Ratios are often used in regression to "adjust" or "standardize" for
some factor such as size. One can divide the ways this is used into two
classes, one of which is acceptable and the other of which is
(generally) not acceptable.

1. Acceptable: If every variable in the regression is divided by the
same factor there is no problem. This is done for example, when
turning everything into a "per capita" measurement; another example
is weighted regression. One needs, however, to be clear regarding
what is meant by "every variable". Say your regression has two
predictors (X and Z) and you want to control for population size
(POP); the basic regression looks like (suppressing the subscript
for individual observations):

Y = b0 + b1X + b2Z + e

When adjusted for population size, the regression should look like:

Y/POP = b0/POP + b1(X/POP) + b2(Z/POP) + e/POP

Leaving out any of these terms will cause problems. See Stata's
write-up on weighted regression for more on this. (Note that
inclusion of a constant in this last model is called for in the case
where the first model includes b3POP.)

2. Unacceptable: Sometimes it makes no sense to divide all variables by
the denominator of the ratio; for example, in many health studies
there is a desire to control for the size of the individual by using
BMI (body mass index: wt/(ht^2)) as a predictor; another example
occurs in the study of strength where the desire is to adjust
strength by the size of the muscle (or muscle fiber); note in the
latter case that the ratio will now be the response variable. If the
set of predictors include any demographic variables (e.g., sex, age),
then clearly one will not want to divide the demographic predictor by
the denominator of the ratio. The issue here is most easily, I
think, seen by observing that the ratio is an interaction term, but
that the regression does not (usually) include the accompanying main
effect terms; this is, among other things, a violation of the
"marginality" principle (fn. 1). In general, one does not want to
automatically include an interaction term without its component
parts. Further, the inclusion of an interaction term has
implications about the form of the adjustment: use of BMI without
either height or weight has implications for the way that size is
adjusted and these implications may be wrong. The answer is to
multiply out the ratio; e.g., if the ratio is in the response
variable, multiply everything on the right by the denominator; if the
ratio is in a predictor, add the component main effects to the model
and see if the interaction (ratio) adds anything. A good discussion
of this case, with explicit advice, can be found in Kronmal, R.A.
(1993), "Spurious correlation and the fallacy of the ratio standard
revisited," _Journal of the Royal Statistical Society, series A_,
156: 379-392.

-----------------------------

1. For example, including one main effect but not the other implies that
the intercept but not the slope is independent of the other main
effect. For more, see Nelder, J.A. (1998), "The selection of terms
in response-surface models -- How strong is the weak-heredity
principle?", _The American Statistician_, 52: 315-8.
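
The link between case 1 and weighted regression can be checked numerically. The following is a minimal sketch on simulated data (the variable ranges, coefficients, and error structure are all assumptions made for illustration): ordinary least squares on the fully divided model, with 1/POP replacing the constant, gives the same coefficients as weighted least squares on the original model with weights 1/POP^2.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Hypothetical data; the error standard deviation is proportional to POP.
POP = rng.uniform(1, 50, n)
X = rng.uniform(0, 100, n)
Z = rng.uniform(0, 50, n)
e = rng.normal(0, 1, n) * POP
Y = 2.0 + 0.5 * X - 1.5 * Z + e

# (a) OLS on the divided model: Y/POP on 1/POP, X/POP, Z/POP, no constant
A = np.column_stack([1 / POP, X / POP, Z / POP])
b_ratio, *_ = np.linalg.lstsq(A, Y / POP, rcond=None)

# (b) WLS on the original model with weights 1/POP^2,
#     solved through the weighted normal equations
W = 1 / POP**2
R = np.column_stack([np.ones(n), X, Z])
b_wls = np.linalg.solve(R.T @ (W[:, None] * R), R.T @ (W * Y))

print(b_ratio)
print(b_wls)
```

The two coefficient vectors agree because dividing every term by POP, including the constant and the error, is the same as multiplying each observation by the square root of its weight. Dropping the 1/POP column, or leaving one predictor undivided, breaks that equivalence.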
daniel klein

Join Date: Mar 2014

Posts: 3850
#9

16 Feb 2016, 06:13

Not my field of interest at all, sorry. Others might have much better advice.

I do not know whether this composite term on the left hand side makes sense for what you are trying to measure. The usual advice would be to look at literature in your field addressing similar problems. Did you ever see a model like the one you propose here?

It might very well be my lack of knowledge but would simply including X (employment) as a predictor not achieve what you want? If not so, then why? Why would the model

Y+Z+I = b0 + b1*D + b2*X + b3*L

not suffice?

You are not very clear about the nature of your data. Is this panel data, meaning multiple firms are observed over a period of time? If so, you might be better off with some kind of fixed-effects estimator, which gets rid of all the time-invariant firm characteristics. Anyway, this is estimation strategy, not model building, so feel free to ignore this last paragraph for the moment.

Maybe someone closer to economics has better contributions.

Best
Daniel

Last edited by daniel klein; 16 Feb 2016, 06:17.
Johannes Scharzer

Join Date: Jan 2016

Posts: 8
#10

16 Feb 2016, 06:52

Thanks Rich, what you say makes total sense. So if I understand you correctly, it does make sense to leave the standardization variable X in if I divide all terms by X, including the constant and the error. So I guess that means that my Z variable should be taken out of there, right? Nevertheless, a dummy as is remains sensible, right?

Thanks Daniel also, it is indeed a panel and I am using individual fixed effects, which does control for time invariant heterogeneity like firm size to some extent.
Rich Goldstein

Join Date: Mar 2014

Posts: 4464
#11

16 Feb 2016, 06:57

I am not an economist and cannot comment on the substance of what you are doing; I notice that there was no citation for my "1."; here is one: Rosenbaum, PR and Rubin, DB (1984), "Difficulties with Regression Analyses of Age-Adjusted Rates," Biometrics, 40: 437-443
Johannes Scharzer

Join Date: Jan 2016

Posts: 8
#12

17 Feb 2016, 06:57

Let me try to explain better, I am running a panel fixed effects regression of the sort:

i is the firm and t is time. So there is some overlap between dependent and independent variables, but what I am looking for is only the coefficient $\phi$ on the dummy variable. The rest is merely controls. Does anyone know whether this is a problem, and what it is called? I include those variables to control for firm size, on top of the time-invariant fixed effect $\pi_{i}$. I'd be super grateful for any hints!! Many thanks!
Sebastian Kripfganz

Join Date: May 2014

Posts: 2594
#13

17 Feb 2016, 07:12

Let's rewrite your equation:

\[
\ln (Y_{it} + D_{it} - I_{it}) = \alpha + \pi_i + \delta \ln X_{it} + \gamma \ln D_{it} + \theta \ln Z_{it} + \phi DUMMY_{it} + \epsilon_{it}
\]

If you set $\delta = (\beta - \gamma - \theta + 1)$ then this is just the same equation as you have set it up. The only thing that changes is the value and interpretation of the coefficient for $\ln X_{it}$. But you can always switch back and forth between the two representations as you know the relationship between $\beta$ and $\delta$. From an econometric perspective, it does not matter which of the two specifications you estimate as it is just a rearrangement of terms.
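
The algebra behind this rearrangement can be sketched as follows, assuming the original specification has the log-ratio form implied by the stated relation between $\delta$ and $\beta$ (that form is inferred here, not quoted), with $\beta$ the coefficient on $\ln X_{it}$:

\[
\begin{aligned}
\ln \frac{Y_{it} + D_{it} - I_{it}}{X_{it}}
  &= \alpha + \pi_i + \beta \ln X_{it} + \gamma \ln \frac{D_{it}}{X_{it}} + \theta \ln \frac{Z_{it}}{X_{it}} + \phi DUMMY_{it} + \epsilon_{it} \\
\ln (Y_{it} + D_{it} - I_{it}) - \ln X_{it}
  &= \alpha + \pi_i + \beta \ln X_{it} + \gamma (\ln D_{it} - \ln X_{it}) + \theta (\ln Z_{it} - \ln X_{it}) + \phi DUMMY_{it} + \epsilon_{it} \\
\ln (Y_{it} + D_{it} - I_{it})
  &= \alpha + \pi_i + (\beta - \gamma - \theta + 1) \ln X_{it} + \gamma \ln D_{it} + \theta \ln Z_{it} + \phi DUMMY_{it} + \epsilon_{it}
\end{aligned}
\]

Only the $\ln X_{it}$ terms are collected in the last step, so $\gamma$, $\theta$, $\phi$, the fixed effect, and the error term carry over untouched.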

https://www.kripfganz.de/stata/
Johannes Scharzer

Join Date: Jan 2016

Posts: 8
#14

17 Feb 2016, 10:26

Thanks Sebastian, I was thinking along similar lines. So this would mean that the interpretation of the coefficient $ \phi $ would remain unchanged, right? In my regressions, I care only about that one; the rest is simply controls (even in the rewritten form, D appears on both sides). Intuitively, I feel like the inclusion of those controls may affect the R^2 and the interpretation of each coefficient, except the one on the dummy variable. But I cannot substantiate that claim in any way.
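
One way to probe that intuition is a small simulation. The sketch below is only an illustration under strong assumptions: the data are simulated, the fixed effect is ignored (pooled OLS), the log of the value-added term is drawn directly rather than built from its components, and the ratio form is assumed to be the log-ratio specification with a coefficient on ln X. It compares the ratio form with the rearranged form.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400

# Hypothetical logged variables; all parameter values are arbitrary.
lnX = rng.normal(3, 0.5, n)
lnD = rng.normal(5, 0.7, n)
lnZ = rng.normal(4, 0.6, n)
dummy = rng.integers(0, 2, n).astype(float)
# lnV stands in for ln(Y + D - I), simulated directly for simplicity
lnV = 1.0 + 0.8 * lnX + 0.3 * lnD + 0.2 * lnZ + 0.1 * dummy + rng.normal(0, 0.3, n)

def fit(y, R):
    """Return OLS coefficients and R-squared of y on the columns of R."""
    coef, *_ = np.linalg.lstsq(R, y, rcond=None)
    resid = y - R @ coef
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return coef, r2

const = np.ones(n)

# Ratio form: ln(V/X) on ln X, ln(D/X), ln(Z/X), dummy
b_ratio, r2_ratio = fit(lnV - lnX,
                        np.column_stack([const, lnX, lnD - lnX, lnZ - lnX, dummy]))

# Rearranged form: ln V on ln X, ln D, ln Z, dummy
b_level, r2_level = fit(lnV,
                        np.column_stack([const, lnX, lnD, lnZ, dummy]))

print("phi (ratio form):     ", b_ratio[4])
print("phi (rearranged form):", b_level[4])
print("delta vs beta - gamma - theta + 1:",
      b_level[1], b_ratio[1] - b_ratio[2] - b_ratio[3] + 1)
print("R^2:", r2_ratio, r2_level)
```

In this setup the dummy coefficient is the same in both forms and the ln X coefficient obeys the stated relation, because the two regressions share the same residuals. The R^2 values differ only because the total variation of the dependent variable differs between ln(V/X) and ln V, so R^2 is not comparable across the two forms.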