  • Change in significance when new variables are added

    Hello everyone,

    I ran my probit regression on five variables of interest and six control variables. My dependent variable is green investment (1 if the investor invested in green funds, 0 otherwise). Three of my variables of interest are 'environmental concerns,' 'environmental behavior,' and 'green vote.'

    The initial regression results show a negative but insignificant coefficient for 'environmental concerns' and positive, significant coefficients for 'environmental behavior' and 'green vote.' Then, when I included three additional variables (attitude, confidence, and influence of green investment) to finalize my model, the significance levels changed: 'environmental concerns' became significant with a negative coefficient, 'environmental behavior' became insignificant, and 'green vote' maintained its high significance.

    Does this indicate an issue with my data, or is there a plausible explanation for the changes in significance?
    Normally, when a variable shifts from being significant to insignificant, it can be attributed to omitted variable bias. However, can omitted variable bias also explain a variable transitioning from insignificant to significant?

    The correlation between environmental concerns and environmental behavior is 0.525.

    Thank you,
    Last edited by Serena Menny; 05 Jun 2024, 10:41. Reason: probit

  • #2
    can omitted variable bias also explain a variable transitioning from insignificant to significant?
    Yes. The addition of a new variable to the model (or the removal of a previously included variable) can result in any and every conceivable change in sign or magnitude of the coefficients of the other variables. Only if the new (or removed) variable is independent of all the other model variables, which rarely happens in real life, can you expect the coefficients of other variables to remain the same. With a high correlation like 0.525, very large changes are possible.
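A minimal sketch of how a coefficient can go from zero to clearly negative once a correlated variable is added. It assumes a linear model fitted by least squares rather than a probit, and the four data points are made up purely for illustration; the names `x1`, `x2`, `u`, and `y` are hypothetical and do not come from the original data.

```python
# Illustration only: linear least squares on tiny made-up data, not a probit.

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Population covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def correlation(a, b):
    return cov(a, b) / (cov(a, a) * cov(b, b)) ** 0.5

def simple_slope(x, y):
    """Slope from regressing y on x alone (with an intercept)."""
    return cov(x, y) / cov(x, x)

def two_regressor_slopes(x1, x2, y):
    """Slopes from regressing y on x1 and x2 together (with an intercept)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

# Hypothetical data: x2 is partly driven by x1; u is the part of x2 unrelated to x1.
x1 = [-1.0, -1.0, 1.0, 1.0]
u = [-1.0, 1.0, -1.0, 1.0]
x2 = [0.5 * a + b for a, b in zip(x1, u)]
# True relationship: x1 lowers y, x2 raises y.
y = [-1.0 * a + 2.0 * b for a, b in zip(x1, x2)]

slope_alone = simple_slope(x1, y)
b1, b2 = two_regressor_slopes(x1, x2, y)
r12 = correlation(x1, x2)

print("corr(x1, x2):", round(r12, 3))
print("slope of x1 when x2 is omitted:", slope_alone)
print("slope of x1 when x2 is included:", b1)
```

Here the correlation between the two regressors is only about 0.45, yet the slope on `x1` is exactly 0 when `x2` is left out and -1 when `x2` is included: the positive effect carried by `x2` was masking the negative effect of `x1`.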

    There is, however, one other thing you should look into before attributing the change to omitted variable bias. Remember that in calculating regressions, Stata (and any other statistical package) will omit from the estimation sample any observation that contains a missing value on any variable in the model. When you add a new variable, you create new "opportunities" for observations to drop. So you should first verify that the sample size has not decreased when you added the new variable. If it has decreased, then part or all of the observed change may be due to the change in the estimation sample. And in this case it would make sense to re-estimate the model without the added variable but restricted to the sample that resulted with the added variable included. That enables you to see how much of the change is a confounding (omitted variable) problem and how much is due to loss of observations from the sample.
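The sample check described above can be sketched as follows. This is an illustration of the logic only, written in plain Python with made-up rows and hypothetical variable names; in Stata the same idea is usually expressed by saving `e(sample)` after the fuller model and refitting the smaller model on that subset.

```python
# Illustration only: which rows survive listwise deletion under each model?

def estimation_sample(rows, variables):
    """Indices of rows with no missing (None) value on any listed variable."""
    return [i for i, row in enumerate(rows)
            if all(row.get(v) is not None for v in variables)]

# Hypothetical data; None marks a missing value.
rows = [
    {"green_inv": 1, "concern": 3, "behavior": 4, "attitude": 5},
    {"green_inv": 0, "concern": 2, "behavior": 1, "attitude": None},
    {"green_inv": 1, "concern": None, "behavior": 5, "attitude": 4},
    {"green_inv": 0, "concern": 4, "behavior": 2, "attitude": 2},
]

base_vars = ["green_inv", "concern", "behavior"]
full_vars = base_vars + ["attitude"]

base_sample = estimation_sample(rows, base_vars)
full_sample = estimation_sample(rows, full_vars)

# Rows used by the smaller model but lost once the new variable is added.
dropped = sorted(set(base_sample) - set(full_sample))

print("N (base model):", len(base_sample))
print("N (full model):", len(full_sample))
print("rows lost by adding the variable:", dropped)
```

If `dropped` is non-empty, refitting the smaller model on `full_sample` only separates the effect of the changed sample from the effect of the added variable.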


    • #3
      Some of those added variables may be measuring something very similar to the others. Look at the correlation matrix of all the variables. Is attitude or confidence highly correlated with environmental concerns? (and so forth)


      • #4
        In reply to Clyde Schechter's post #2:
        Thank you for your response!
        I verified the number of observations, and it remains consistent across all specifications. Is omitted variable bias the only possible explanation in this case? Does this imply that my results are unreliable? Are there any solutions to address this issue?


        • #5
          In reply to George Ford's post #3:
          Thank you for your response.

          All correlations, except for the one between environmental concerns and environmental behavior, are below 0.5. However, there are correlations of 0.44 and 0.42, which I suppose do not count as high. Could they still be large enough to cause the changes?

          Additionally, all mean VIF values are below 2.
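As a rough piece of arithmetic on why a low VIF does not rule out the shifts described: the VIF of a regressor is 1 / (1 - R²), where R² comes from regressing it on all the other regressors. The sketch below assumes the simplest case of only two regressors, where R² is just the squared pairwise correlation; the function name is hypothetical, and the real VIFs in an eleven-variable model will differ.

```python
# Illustration only: VIF in the two-regressor case, where R^2 equals r^2.

def vif_two_predictors(r):
    """VIF implied by a pairwise correlation r when there are only two regressors."""
    return 1.0 / (1.0 - r ** 2)

# Correlations quoted in the thread.
quoted_correlations = [0.42, 0.44, 0.525]
vifs = {r: vif_two_predictors(r) for r in quoted_correlations}

for r, v in vifs.items():
    print(f"r = {r}: VIF = {v:.3f}")
```

All three values come out well below 2 (about 1.38 for r = 0.525), so a VIF under 2 is fully compatible with correlations of the size reported. VIF speaks to how much standard errors are inflated, not to whether a coefficient will move when a correlated variable enters the model.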





          • #6
            That could do it. I suppose you need to think about whether those added variables are just other measures of the same sort of thing. If so, you could arguably exclude them, since they muddy the waters of interpretation.


            • #7
              The risk you run at this point is that you will, consciously or unconsciously, choose the model that gives you the results that you are hoping to see. That isn't science.

              What I suggest you do now is draw a diagram that contains all of your variables, connecting them with arrows indicating (presumed) causal relationships among them: a directed acyclic graph (DAG). Any variable that is diagrammed as a cause of green investment and also has a causal connection to any of your three explanatory variables of interest must be included in your model (or at least in any model that contains one or more of the connected explanatory variables). These are crucial confounding variables, and any analysis that omits them is likely to exhibit omitted variable bias.

              Any variable which is caused by both green investment and one or more of the explanatory variables of interest is a collider variable, and these must be omitted from the model(s).

              If you have a variable that is caused by one of the explanatory variables and in turn causes green investment, then it is a mediator of the association, and you might want to do an analysis that calculates the direct and indirect effects of the explanatory variable(s) involved.
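The three rules above can be sketched as a small classifier over the arrows of a DAG. This is a simplification that looks only at direct arrows, not at longer paths; a full analysis relies on d-separation. The node names and every arrow below are hypothetical, chosen only to exercise each rule, and say nothing about the actual variables in this thread.

```python
# Illustration only: classify a third variable by its direct arrows
# to an explanatory variable ("x") and the outcome ("y").

def classify(edges, variable, exposure, outcome):
    """Return the role of `variable` relative to `exposure` and `outcome`.

    `edges` is a list of (cause, effect) pairs. Only direct arrows are checked.
    """
    edge_set = set(edges)
    causes_exposure = (variable, exposure) in edge_set
    causes_outcome = (variable, outcome) in edge_set
    caused_by_exposure = (exposure, variable) in edge_set
    caused_by_outcome = (outcome, variable) in edge_set

    if causes_exposure and causes_outcome:
        return "confounder"   # include in the model
    if caused_by_exposure and caused_by_outcome:
        return "collider"     # omit from the model
    if caused_by_exposure and causes_outcome:
        return "mediator"     # consider direct vs. indirect effects
    return "other"

# Hypothetical DAG: x = explanatory variable of interest, y = green investment.
edges = [
    ("x", "y"),
    ("c", "x"), ("c", "y"),   # c causes both x and y
    ("x", "m"), ("m", "y"),   # m lies on the path from x to y
    ("x", "k"), ("y", "k"),   # k is caused by both x and y
]

roles = {v: classify(edges, v, "x", "y") for v in ["c", "m", "k"]}
print(roles)
```

Writing the arrows down as explicit (cause, effect) pairs forces each causal assumption to be stated before any model is fitted, which is the point of drawing the diagram first.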


              • #8
                In reply to Clyde Schechter's post #7:
                Thank you so much for this valuable insight! Is there a command that helps draw the directed acyclic graph, or should I use an SEM graph instead?

                I have Stata 14


                • #9
                  In reply to Clyde Schechter's post #7:
                  To rephrase my question more effectively, how can I determine whether a variable has a causal connection to my three explanatory variables? Should I run a series of regressions, changing the dependent variable each time?



                  • #10
                    A directed acyclic graph (DAG) is a conceptual tool for working out how you think your variables (including unobserved ones) relate to each other and to the outcome. If you know the rules for reading it and for determining whether you can estimate a causal effect of interest (called d-separation), you can map a DAG onto a regression model. There are assumptions in a regression model that don't exist in a DAG (e.g., linearity), but it is a very useful tool for thinking about causality. This post on Cross Validated by Robert Long is one of the best short posts I've seen on the ideas behind DAGs and how to apply them.

                    You can use dagitty.net to draw your DAG. It will give you all the causal and non-causal paths, which help you decide whether to include certain variables in your model, as Clyde noted in post #7 above.
                    Last edited by Erik Ruzek; 06 Jun 2024, 08:24. Reason: Edited CV link
