  • variants of same variable appearing on LHS and RHS

    A quick, rather general econometrics question: I am running a regression of the form:
    (Y+Z+I)/X=a + bX+Z/X+L/X_epsilon
    So I normalize most variables by X, except X itself, which, however, also appears in the dependent variable. Furthermore, Z also appears on both the LHS and the RHS, if not one-to-one. Any ideas whether that could be problematic and/or what to do about it? Or simply knowing what the term for such a misspecification is would already help me for further Google searches; I seem unable to wrap my head around this seemingly simple issue. Many, many thanks for your help!

  • #2
    It is not unheard of to normalize many variables by the same variable. For example, one might want to deal with per capita variables throughout. That said, this looks very fishy. It seems like you're guaranteed specific results on Z by construction.

    This might help you access the relevant literature:

    Wiseman, R.M. (2009). On the use and misuse of ratio variables in strategic management research (pp. 75-110). In D. Ketchen & D. Berg (eds.), Research methodology in strategy and management, vol 5. San Diego: Elsevier JAI Press.



    • #3
      I do not know about econometrics and can't comment, but let me tell you what I don't understand here. I don't understand where the _epsilon term comes from, if you mean it as an error term. More interestingly, how would you interpret the coefficient for Z/X? Because:


      Code:
      LHS: (Y+Z+I)/X
      This is equivalent to writing:

      Code:
      LHS: (Y/X) + (Z/X) +(I/X)

      The exact term Z/X is on both sides of the equation. How does that work in terms of interpretation? I can understand that X' might still have some valid interpretation, but Z/X?
      Roman



      • #4

        Another reference is: Kronmal, R.A. 1993. Spurious correlation and the fallacy of the ratio standard revisited. Journal of the Royal Statistical Society. Series A (Statistics in Society) 156, no. 3: 379-392.
        Steve Samuels
        Statistical Consulting
        [email protected]

        Stata 14.2



        • #5
          Thanks all! In fact, I was not very precise in my question. So I am running this regression: (Y+Z+I)/X=a + bX+cZ/X+dL/X+eD+error. So what I am interested in is coefficient e of my dummy D, the rest is in there as controls. Is this some misspecification or can I do that? It's crazy, I cannot find anything on this on almighty google! Thanks a lot guys
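One mechanical consequence of having Z/X on both sides can be checked numerically. Because the LHS equals (Y+I)/X + Z/X and Z/X is itself a regressor, OLS should return the same coefficients as a regression of (Y+I)/X on the same regressors, except that c is shifted by exactly 1. The sketch below uses made-up simulated data and a small hand-written OLS helper (`ols` is a hypothetical name, not a library function); the variable ranges are assumptions chosen only for illustration.

```python
import random

def ols(rows, y):
    """OLS coefficients from the normal equations (W'W) b = W'y."""
    k = len(rows[0])
    # Augmented matrix [W'W | W'y]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

# Simulated data (all values are illustrative assumptions)
random.seed(0)
n = 500
X = [random.uniform(5, 50) for _ in range(n)]
Z = [random.uniform(10, 100) for _ in range(n)]
I = [random.uniform(-30, -5) for _ in range(n)]
L = [random.uniform(1, 20) for _ in range(n)]
D = [random.randint(0, 1) for _ in range(n)]
Y = [20 + 15 * d + random.gauss(0, 5) for d in D]

# Regressors in the order a, b, c, d, e: constant, X, Z/X, L/X, D
W = [[1.0, x, z / x, l / x, d] for x, z, l, d in zip(X, Z, L, D)]
lhs_full = [(y + z + i) / x for y, z, i, x in zip(Y, Z, I, X)]
lhs_part = [(y + i) / x for y, i, x in zip(Y, I, X)]

b_full = ols(W, lhs_full)   # LHS = (Y+Z+I)/X
b_part = ols(W, lhs_part)   # LHS = (Y+I)/X

for name, bf, bp in zip("abcde", b_full, b_part):
    print(f"{name}: LHS with Z {bf: .4f}   LHS without Z {bp: .4f}")
```

If this holds, the estimate of e is numerically the same under either LHS, and only the coefficient on Z/X absorbs the overlap. That says nothing about whether the ratio specification itself is sensible, which is the separate issue raised in the references above.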



          • #6
            From just quickly scanning the literature cited here, it seems almighty google might not be necessary in the first place. It seems, from a quick glance, that putting ratios in regression type models is problematic even if you did not have the same terms on both sides of the equation - which seems odd, although I have not done the math to pin down precisely why.

            Aside from what has been pointed out, it does not seem a good idea to drop the conditional effects from the model. Note that e.g. Z/X can be rewritten as Z*X^(-1), which is just a multiplicative (i.e. interaction) term. Usually when including interactions of predictors you want to include the conditional main effects, i.e. Z and X^(-1), in the model as well. Along this line, what would b in your model represent (similar to Roman's question above)?

            I am sure there are better ways to control for whatever it is you want to control for. In this spirit, you might be better off telling us more about the substantive question you are trying to answer with this model. What do Y, Z, I, X, L and D stand for?

            Best
            Daniel
            Last edited by daniel klein; 16 Feb 2016, 05:13.



            • #7
              Hey Daniel, many thanks for this answer. So this is to assess the effect of exporting on labor productivity, where Y=foreign sales, Z=domestic sales, I=(-Inputs), X=employment. So the LHS represents value added per worker, or labor productivity. L stands for other covariates like investment, salaries etc. and D is a dummy for exporting. I am interested in e, the coefficient on the exporting dummy; the regression is at the firm level, so I want to control for firm size by including those variables. Hope this gives you a better idea of the problem at hand. Many thanks for your time and advice!



              • #8
                ratio variables are, at best, tricky; following is something I originally posted on the old Stata list:

                Recently there was an incomplete discussion of the use of ratios in
                regression. I submit the following as a form of completion (and in part
                because I feel guilty about not completing and then criticizing,
                privately, someone who had submitted something incomplete to the list).

                Ratios are often used in regression to "adjust" or "standardize" for
                some factor such as size. One can divide the ways this is used into two
                classes, one of which is acceptable and the other of which is
                (generally) not acceptable.

                1. Acceptable: If every variable in the regression is divided by the
                same factor there is no problem. This is done for example, when
                turning everything into a "per capita" measurement; another example
                is weighted regression. One needs, however, to be clear regarding
                what is meant by "every variable". Say your regression has two
                predictors (X and Z) and you want to control for population size
                (POP); the basic regression looks like (suppressing the subscript
                for individual observations):

                Y = b0 + b1X + b2Z + e

                When adjusted for population size, the regression should look like:

                Y/POP = b0/POP + b1(X/POP) + b2(Z/POP) + e/POP

                Leaving out any of these terms will cause problems. See Stata's
                write-up on weighted regression for more on this. (Note that
                inclusion of a constant in this last model is called for in the case
                where the first model includes b3POP.)

                2. Unacceptable: Sometimes it makes no sense to divide all variables by
                the denominator of the ratio; for example, in many health studies
                there is a desire to control for the size of the individual by using
                BMI (body mass index: wt/(ht^2)) as a predictor; another example
                occurs in the study of strength where the desire is to adjust
                strength by the size of the muscle (or muscle fiber); note in the
                latter case that the ratio will now be the response variable. If the
                set of predictors include any demographic variables (e.g., sex, age),
                then clearly one will not want to divide the demographic predictor by
                the denominator of the ratio. The issue here is most easily, I
                think, seen by observing that the ratio is an interaction term, but
                that the regression does not (usually) include the accompanying main
                effect terms; this is, among other things, a violation of the
                "marginality" principle (fn. 1). In general, one does not want to
                automatically include an interaction term without its component
                parts. Further, the inclusion of an interaction term has
                implications about the form of the adjustment: use of BMI without
                either height or weight has implications for the way that size is
                adjusted and these implications may be wrong. The answer is to
                multiply out the ratio; e.g., if the ratio is in the response
                variable, multiply everything on the right by the denominator; if the
                ratio is in a predictor, add the component main effects to the model
                and see if the interaction (ratio) adds anything. A good discussion
                of this case, with explicit advice, can be found in Kronmal, R.A.
                (1993), "Spurious correlation and the fallacy of the ratio standard
                revisited," _Journal of the Royal Statistical Society, series A_,
                156: 379-392.

                -----------------------------

                1. For example, including one main effect but not the other implies that
                the intercept but not the slope is independent of the other main
                effect. For more, see Nelder, J.A. (1998), "The selection of terms
                in response-surface models -- How strong is the weak-heredity
                principle?", _The American Statistician_, 52: 315-8.
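The spurious-correlation point in the Kronmal reference can be illustrated with a short simulation. In the sketch below, Y, Z and X are drawn independently (the uniform ranges are arbitrary assumptions), yet the two ratios Y/X and Z/X come out strongly correlated simply because they share a denominator. `pearson` is a hand-written helper, not a library call.

```python
import random
from math import sqrt

def pearson(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / sqrt(saa * sbb)

random.seed(1)
n = 5000
# Three mutually independent variables (illustrative ranges)
X = [random.uniform(1, 10) for _ in range(n)]
Y = [random.uniform(5, 15) for _ in range(n)]
Z = [random.uniform(5, 15) for _ in range(n)]

r_raw = pearson(Y, Z)                       # independent by construction
r_ratio = pearson([y / x for y, x in zip(Y, X)],
                  [z / x for z, x in zip(Z, X)])  # shared denominator

print(f"corr(Y, Z)     = {r_raw: .3f}")
print(f"corr(Y/X, Z/X) = {r_ratio: .3f}")
```

The raw correlation should sit near zero while the ratio correlation is large, which is the reason a ratio on the LHS regressed on ratios with the same denominator can show an association that the underlying variables do not have.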



                • #9
                  Not my field of interest at all, sorry. Others might have much better advice.

                  I do not know whether this composite term on the left hand side makes sense for what you are trying to measure. The usual advice would be to look at literature in your field addressing similar problems. Did you ever see a model like the one you propose here?

                  It might very well be my lack of knowledge but would simply including X (employment) as a predictor not achieve what you want? If not so, then why? Why would the model

                  Y+Z+I = b0 + b1*D + b2*X + b3*L

                  not suffice?

                  You are not very clear about the nature of your data. Is this panel data, meaning multiple firms are observed over a period of time? If so, you might be better off with some kind of fixed-effects estimator, getting rid of all the time-invariant firm characteristics. Anyway, this is estimation strategy, not model building, so feel free to ignore this last paragraph for the moment.

                  Maybe someone closer to economics has better contributions.

                  Best
                  Daniel
                  Last edited by daniel klein; 16 Feb 2016, 06:17.



                  • #10
                    Thanks Rich, what you say makes total sense. So if I understand you correctly, it does make sense to leave the standardization variable X in if I divide all terms by X, including the constant and the error. So I guess that means that my Z variable should be taken out of there, right? Nevertheless, a dummy as is remains sensible, right?

                    Thanks Daniel also, it is indeed a panel and I am using individual fixed effects, which does control for time invariant heterogeneity like firm size to some extent.
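The "divide every term, including the constant and the error" point from Rich's case 1 can be sketched with simulated data. The example below uses his notation (Y, X, Z, POP) and assumes an error whose spread is proportional to POP; the true coefficients and ranges are made up. It compares the fully divided regression, which has 1/POP as a regressor and no ordinary constant, with a version that swaps b0/POP for an ordinary constant. `ols` is a hand-written helper, not a library function.

```python
import random

def ols(rows, y):
    """OLS coefficients from the normal equations (W'W) b = W'y."""
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [v - f * p for v, p in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

random.seed(2)
n = 2000
b0, b1, b2 = 20.0, 3.0, -1.0          # assumed true coefficients
POP = [random.uniform(1, 20) for _ in range(n)]
X = [random.uniform(0, 10) for _ in range(n)]
Z = [random.uniform(0, 10) for _ in range(n)]
# Error spread proportional to POP, so e/POP is homoskedastic
Y = [b0 + b1 * x + b2 * z + p * random.gauss(0, 0.1)
     for x, z, p in zip(X, Z, POP)]

y_pc = [y / p for y, p in zip(Y, POP)]

# Every term divided by POP: regressors 1/POP, X/POP, Z/POP, no constant
good = ols([[1 / p, x / p, z / p] for x, z, p in zip(X, Z, POP)], y_pc)

# b0/POP replaced by an ordinary constant: a different model
bad = ols([[1.0, x / p, z / p] for x, z, p in zip(X, Z, POP)], y_pc)

print("all terms divided:  ", [round(v, 3) for v in good])
print("constant not divided:", [round(v, 3) for v in bad])
```

Under these assumptions the fully divided version should recover values close to 20, 3 and -1, while the second version drifts away from them, which is one way to see why leaving out a term "will cause problems".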



                    • #11
                      I am not an economist and cannot comment on the substance of what you are doing; I notice that there was no citation for my "1."; here is one: Rosenbaum, PR and Rubin, DB (1984), "Difficulties with Regression Analyses of Age-Adjusted Rates," Biometrics, 40: 437-443



                      • #12
                        Let me try to explain better, I am running a panel fixed effects regression of the sort:

                        [Image: Screen Shot 2016-02-17 at 14.55.56.png]


                        i is the firm and t is time. So there is some overlap between dependent and independent variables, but what I am looking for is only the coefficient $\phi$ on the dummy variable. The rest is merely controls. Does anyone know whether this is a problem, and what it is called? I include those variables to control for firm size, on top of the time-invariant fixed effect $\pi_{i}$. I'd be super grateful for any hints!! Many thanks!



                        • #13
                          Let's rewrite your equation:

                          \[
                          \ln (Y_{it} + D_{it} - I_{it}) = \alpha + \pi_i + \delta \ln X_{it} + \gamma \ln D_{it} + \theta \ln Z_{it} + \phi DUMMY_{it} + \epsilon_{it}
                          \]

                          If you set \(\delta = (\beta - \gamma - \theta + 1)\) then this is just the same equation as you have set it up. The only thing that changes is the value and interpretation of the coefficient for \(\ln X_{it}\). But you can always switch back and forth between the two representations as you know the relationship between \(\beta\) and \(\delta\). From an econometric perspective, it does not matter which of the two specifications you estimate, as it is just a rearrangement of terms.
                          https://www.kripfganz.de/stata/



                          • #14
                            Thanks Sebastian, I was thinking along similar lines. So this would mean that the interpretation of the coefficient \( \phi \) would remain unchanged, right? In my regressions, I care only about that one; the rest is simply controls (even in the rewritten form, D appears on both sides). Intuitively, I feel like the inclusion of those controls may affect the R^2 and the interpretation of each coefficient, except the one on the dummy variable. But I cannot substantiate that claim in any way.

