
  • Hi,

    I have the following doubts with regard to dynamic panel estimation.

    1. As per my understanding, a 'predetermined explanatory variable' is one whose current value is influenced by past values of the error term (effectively, by the lag of the dependent variable). Is this understanding correct?

    2. For a predetermined variable, when we use the model(fodev) option, we can start the lag length from 0, i.e. lag(0 X). However, is it necessary to start the lag length from 0 for predetermined variables, or can we start it from 1, 2, 3, and so on?

    3. Can I just establish the reliability of the instruments using the overall Sargan-Hansen test, without paying too much attention to the difference-in-Hansen tests?

    4. In some cases, the Sargan-Hansen test based on the 2-step weighting matrix validates the instruments but the test based on the 3-step weighting matrix does not. Can the results still be considered valid?

    5. If I specify the lag range for an instrument as lag(0 0), what does it imply? I suppose this is the default setting for iv()-style instruments.

    6. If AR(2) and overall Sargan-Hansen tests are satisfied but the number of instruments is marginally below the number of groups, can the model be considered reliable?

    7. Supposing the model results are fine and the Sargan-Hansen test also validates the instruments, but the AR(2) test is not satisfied, what options can be tried to ensure that AR(2) is satisfied?

    8. I understand that when we specify variables in iv() or gmmiv() with the model(level) option, it means that we are assuming the concerned variables to be exogenous. However, since this implies the strong assumption that these variables are uncorrelated with any of the omitted variables, including the fixed effects, I wonder whether we have any exogenous variable in a real setting, as there will always be a possibility of an explanatory variable being correlated with some omitted variable(s). In light of this, should we always avoid treating a variable as exogenous, to be on the safe side?

    Thanks in anticipation!
    Last edited by Prateek Bedi; 28 Apr 2020, 10:28.



      1. Yes.
      2. There is generally no need to start from 0. However, the further you go away from the current period, the weaker the instruments become. Weak instruments can yield imprecise estimates, large standard errors, and unreliable test statistics.
      3. In a world where the Sargan-Hansen test would always give the correct result, you would not need the difference-in-Hansen tests. However, there is the possibility of type-I and type-II errors. Looking at all of the tests might thus provide a more complete picture. The more test results are in your favor, the more trustworthy your results are.
      4. Asymptotically, the tests based on the 2-step or 3-step weighting matrix are equivalent. Large differences typically occur if you have too many instruments or many weak instruments that yield an unstable estimate of the weighting matrix. Personally, I would not feel comfortable with inconsistent results from these two versions of the test. Instead, I would attempt to reduce the number of instruments.
      5. lag(0 0) means that you are just using the contemporaneous value of the specified variable as an instrument. Yes, it is indeed the default setting for the iv() option, but that does not mean that the default is always appropriate.
      6. That the number of instruments should be smaller than the number of groups is just a rule of thumb, not a hard criterion. Personally, I would try to keep the number of instruments much smaller than the number of groups to avoid the "too-many-instruments problem" (see the sketch below for one way to curtail and collapse the instruments).
      7. I would typically try to add some further lags of the dependent variable or some lags of the independent variables as additional regressors. In some cases, the AR(2) test might be less relevant if it can be argued that the instruments are valid nevertheless.
      8. This is a valid concern, but it depends on the particular application. The question is essentially similar to whether a random-effects model could be rationalized instead of a fixed-effects model in a static context. Some people would indeed argue that you should never consider variables to be exogenous (with respect to the combined level error term). However, sometimes people are interested, for example, in the effects of time-invariant variables. In that case, these effects can only be identified by making such strong assumptions.
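      A minimal sketch pulling several of these points together, assuming a panel that is already xtset and purely hypothetical variables y (dependent) and x (predetermined); the instrument lag ranges are illustrative, curtailed and collapsed to keep the instrument count well below the number of groups:
      Code:
      xtdpdgmm L(0/1).y x, model(fodev) collapse ///
          gmm(y, lag(1 4)) gmm(x, lag(0 3)) ///
          teffects twostep vce(robust)
      estat serial, ar(1/2)   // Arellano-Bond tests for serial correlation (point 7)
      estat overid            // Sargan-Hansen overidentification tests (points 3 and 4)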
      https://twitter.com/Kripfganz



      • Dear Sebastian Kripfganz,

        Is there a way to take probability weights into consideration when using your command? To provide some context, here is my main model:
        Code:
        areg rdint maq1 dlev2 dsize cashr l.lev2 i.year [pweight=myscore12], absorb(id) vce(robust)
        I implemented inverse propensity weighting (M&A being the treatment) to reduce selection bias, and I now want to run a GMM model. I know I could use [pweight=myscore12] with xtabond2, but I am asking whether this can somehow be taken into consideration with your command as well, since I also need to account for time effects (which I do with the teffects option of your command).

        Thank you very much in advance!



          • The xtdpdgmm command does not support weights so far, sorry. You would need to use the xtabond2 command instead and manually specify the non-redundant time dummies.
          https://twitter.com/Kripfganz



          • Originally posted by Sebastian Kripfganz View Post
            The xtdpdgmm command does not support weights so far, sorry. You would need to use the xtabond2 command instead and manually specify the non-redundant time dummies.
            Dear Sebastian Kripfganz,

            Thank you very much for your prompt reply! When you say non-redundant time dummies, do you mean to specify all the time dummies that are significant at the 10% significance level? If yes, would the correct command under the following assumptions be:
            • Years 1998, 1999, and 2001 are significant
            • Using a FOD-transformed model because of the unbalanced panel
            • Y being the dep.var
            • x1, x2 .. xT being predetermined variables
            Code:
            xtabond2 Y l.Y x1 l.x2 .. xT 1998.year 1999.year 2001.year, gmm(l2.Y l.x1 l2.x2  ... l.xT, collapse) iv(1998.year 1999.year 2001.year, equation(level)) small robust ortho

            Lastly, is getting similar results from onestep and twostep an indication that my findings are robust?



            Edit: When I run the command I have written above, some of the time dummies that were significant when I was including all the time dummies become insignificant. Is this expected?
            Last edited by Dimitris Apostolopoulos; 02 May 2020, 18:46.



            • Sorry for the confusion. When I said "non-redundant time dummies", I meant all those dummy variables that are not displayed as "omitted" (or "empty") in the xtabond2 regression output when you use the i.year notation. In other words, all those time dummies that are shown in the xtdpdgmm output when you use the teffects option, irrespective of their statistical significance.
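
              As a minimal sketch with purely hypothetical names (variables y and x, weight w, and supposing the non-omitted dummies turn out to be yr3 through yr9 in your sample):
              Code:
              * create explicit year dummies yr1, yr2, ... instead of using i.year
              quietly tabulate year, generate(yr)
              * list only the non-redundant dummies (which ones depends on your sample);
              * the orthogonal option is omitted because of the bug mentioned below
              xtabond2 y L.y x yr3-yr9 [pweight=w], ///
                  gmm(L.y x, lag(1 3) collapse) ///
                  iv(yr3-yr9, equation(level)) robust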

              The reason for not using the i.year notation is that there is a bug in xtabond2 that produces incorrect results for the Sargan/Hansen tests if some of the coefficients are shown as "omitted".

              Unfortunately, there is also another bug in xtabond2 that produces incorrect coefficient estimates when you use the orthogonal option as in your example. I am afraid to say that I am not aware of any feasible solution right now if you want to combine weights and forward-orthogonal deviations (other than maybe reweighting your data manually before estimating the model).

              Examples for both bugs are given in my 2019 London Stata Conference presentation.
              https://twitter.com/Kripfganz



              • Dear Sebastian Kripfganz,

                Thank you for your reply! I have gone through your presentation and seen the bugs in xtabond2; this is why I am asking you.

                I believe GMM would add significant value to my analysis, so I would be willing to reweight my data manually, but I am afraid I am not sure how to do that. Could you explain it to me?



                • David Roodman explains the GMM estimator with observation weights in the appendix of his 2009 Stata Journal article "How to do xtabond2: An Introduction to Difference and System GMM in Stata".

                  Unless I am missing something, weighting can be achieved by simply multiplying all observations (dependent variable, regressors, instruments) with the square root of the respective observation weight.

                  On second thought: according to Roodman's appendix, he (implicitly) first creates lags of the variables (e.g. for the lagged dependent variable in matrix X and the instruments in matrix Z) before applying weights. (This is still possible to do manually, but you would first need to generate new variables for each lag used as a regressor or instrument and subsequently supply all of these new variables, after weighting, with the option lag(0 0) to the estimation command.) However, this is not the same as first applying weights and then taking lags. The way Roodman describes it is the most natural way to implement it in a Stata command because the programmer does not need to worry about all the lags, but I am not sure whether it is econometrically also the most appropriate way of doing it.

                  Spontaneously, I tend to prefer the manual way of applying weights to unlagged variables and then proceeding as usual. While this is easier to do manually, it is harder to automate from a programmer's perspective (because lags are already specified in the variable list by the user before the estimation command can apply weights). This would be a reason for me not to implement weights at all in xtdpdgmm. (Note that weights are also not implemented in the official xtdpd command suite, possibly for a good reason.)
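
                  A minimal sketch of this "weights first" approach, assuming the panel is xtset and using a hypothetical weight variable w and hypothetical variables y and x:
                  Code:
                  * scale the unlagged variables by the square root of the weight
                  generate double wy = sqrt(w) * y
                  generate double wx = sqrt(w) * x
                  * then proceed as usual: lags are taken of the already-weighted variables
                  xtdpdgmm L(0/1).wy wx, model(fodev) collapse ///
                      gmm(wy, lag(1 4)) gmm(wx, lag(0 3)) twostep vce(robust)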
                  https://twitter.com/Kripfganz



                  • Originally posted by Sebastian Kripfganz View Post
                    1. Yes.
                    2. There is generally no need to start from 0. However, the further you go away from the current period, the weaker the instruments become. Weak instruments can yield imprecise estimates, large standard errors, and unreliable test statistics.
                    3. In a world where the Sargan-Hansen test would always give the correct result, you would not need the difference-in-Hansen tests. However, there is the possibility of type-I and type-II errors. Looking at all of the tests might thus provide a more complete picture. The more test results are in your favor, the more trustworthy your results are.
                    4. Asymptotically, the tests based on the 2-step or 3-step weighting matrix are equivalent. Large differences typically occur if you have too many instruments or many weak instruments that yield an unstable estimate of the weighting matrix. Personally, I would not feel comfortable with inconsistent results from these two versions of the test. Instead, I would attempt to reduce the number of instruments.
                    5. lag(0 0) means that you are just using the contemporaneous value of the specified variable as an instrument. Yes, it is indeed the default setting for the iv() option, but that does not mean that the default is always appropriate.
                    6. That the number of instruments should be smaller than the number of groups is just a rule of thumb, not a hard criterion. Personally, I would try to keep the number of instruments much smaller than the number of groups to avoid the "too-many-instruments problem".
                    7. I would typically try to add some further lags of the dependent variable or some lags of the independent variables as additional regressors. In some cases, the AR(2) test might be less relevant if it can be argued that the instruments are valid nevertheless.
                    8. This is a valid concern, but it depends on the particular application. The question is essentially similar to whether a random-effects model could be rationalized instead of a fixed-effects model in a static context. Some people would indeed argue that you should never consider variables to be exogenous (with respect to the combined level error term). However, sometimes people are interested, for example, in the effects of time-invariant variables. In that case, these effects can only be identified by making such strong assumptions.
                    Thanks a lot, Prof. Kripfganz for your superlative guidance, once again. Below are some follow-up queries.

                    1. As you mention, it is indeed advisable to start the lag length from a lag closer to the current period; still, I suppose we are not constrained to start the lag length from any particular lag (say 0, 1, 2, 3, and so on) for either predetermined or endogenous variables. Keeping this in view, and given that we are using the model(fodev) option, what is the operational difference between predetermined and endogenous variables? The specification of both these types of variables and their respective instruments seems the same to me. For instance, say X1 is an endogenous variable and X2 is a predetermined variable. Now I can use gmmiv(X1, lag(1 1) model(fodev)) and gmmiv(X2, lag(1 1) model(fodev)) for X1 and X2 respectively. To this extent, is the categorisation of variables as endogenous/predetermined only a theoretical matter?

                    2. For endogenous variables, can we start the lag length from 0 while using the model(fodev) option?

                    3. I notice in the help file of xtdpdgmm that we can use two options, namely diff and bodev, immediately after the lag() option in iv() and gmmiv(). What do these options imply? When do we use them? What does it imply if we do not use them at all in our command?

                    Thanks and Regards

                    Prateek



                    • 1. The operational difference between an endogenous variable X1 and a predetermined variable X2 is that for the latter you can use the contemporaneous term in addition to the lags as a valid instrument, i.e. gmmiv(X2, lag(0 1) model(fodev)) versus gmmiv(X1, lag(0 1) model(fodev)). If X2 is indeed predetermined, then using this contemporaneous term as an instrument is highly recommended because it is a stronger instrument than its lags. (A short sketch contrasting the two cases follows below.)

                      2. No, you cannot. The contemporaneous term (lag 0) of an endogenous variable is always correlated with the error term, thus not a valid instrument. That is essentially the definition of endogeneity.

                      3. The model(diff) and model(fodev) options imply a transformation of the regressors. The diff and bodev options imply a transformation of the instruments. Thus, gmmiv(X2, lag(0 1) model(fodev)) implies that untransformed instruments are used for the FOD-transformed regressors. gmmiv(X2, lag(0 1) bodev model(fodev)) implies that BOD-transformed instruments are used for the FOD-transformed regressors.
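
                      A minimal sketch contrasting the two cases from point 1, with hypothetical variables y (dependent), x1 (endogenous), and x2 (predetermined) and illustrative lag ranges; only x2 contributes its contemporaneous value (lag 0) as an instrument:
                      Code:
                      xtdpdgmm L(0/1).y x1 x2, model(fodev) collapse ///
                          gmm(y, lag(1 4)) gmm(x1, lag(1 4)) gmm(x2, lag(0 3)) ///
                          twostep vce(robust)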
                      https://twitter.com/Kripfganz



                      • Originally posted by Sebastian Kripfganz View Post
                        1. The operational difference between an endogenous variable X1 and a predetermined variable X2 is that for the latter you can use the contemporaneous term in addition to the lags as a valid instrument, i.e. gmmiv(X2, lag(0 1) model(fodev)) versus gmmiv(X1, lag(0 1) model(fodev)). If X2 is indeed predetermined, then using this contemporaneous term as an instrument is highly recommended because it is a stronger instrument than its lags.

                        2. No, you cannot. The contemporaneous term (lag 0) of an endogenous variable is always correlated with the error term, thus not a valid instrument. That is essentially the definition of endogeneity.

                        3. The model(diff) and model(fodev) options imply a transformation of the regressors. The diff and bodev options imply a transformation of the instruments. Thus, gmmiv(X2, lag(0 1) model(fodev)) implies that untransformed instruments are used for the FOD-transformed regressors. gmmiv(X2, lag(0 1) bodev model(fodev)) implies that BOD-transformed instruments are used for the FOD-transformed regressors.
                        Thanks a lot, Prof. Kripfganz for your helpful answers. I have some more questions.

                        1. In your first point, you mention the same lag lengths for both the endogenous variable (X1) and the predetermined variable (X2), i.e. gmmiv(X2, lag(0 1) model(fodev)) versus gmmiv(X1, lag(0 1) model(fodev)). Is there a typo here, as the lag lengths are the same in both cases?

                        2. You mention that using the contemporaneous term as an instrument is highly recommended for a predetermined variable, but I assume that we may also start the lag length from 1, 2, 3, and so on for a predetermined variable. To this extent, if a researcher uses gmmiv(X1, lag(1 1) model(fodev)) and gmmiv(X2, lag(1 1) model(fodev)) for an endogenous variable X1 and a predetermined variable X2, is there still any operational difference between the endogenous and predetermined variables?

                        3. I understand that starting the lag length from a lag closer to the contemporaneous value of the endogenous and predetermined variables is recommended. However, is there any restriction on starting the lag length from a lag further away from the contemporaneous value?

                        4. Thanks for clarifying that the diff and bodev options imply a transformation of the instruments. Please also guide me on how we decide when we need to transform instruments, and how we choose between the diff and bodev options.

                        Thanks a ton!

                        Prateek



                          1. Sorry. My mistake. It should have been gmmiv(X1, lag(1 1) model(fodev)).
                          2. If you remove valid instruments, the remaining instruments still remain valid. The resulting estimator is thus still consistent, but it becomes less efficient and more vulnerable to a weak-instruments problem. Essentially, you are treating the predetermined variable as if it was an endogenous variable.
                          3. As said in 2., there is no restriction but it is hard to justify. For me as a reader, this would look like fishing for the results that are most in line with your prior beliefs. Unless you provide a good justification for doing this, I would not trust the robustness of your results.
                          4. If you are using the model(diff) or model(fodev) transformation, there is no general guidance on whether to also transform the instruments. It is more a question of whether some of these transformations could improve the (finite-sample) properties of the estimators. This might vary across applications and characteristics of your data set. I would recommend following what has been done in the existing empirical literature of your field. Personally, I am not convinced that the bodev option is of any help in practice because it effectively removes one more observation from the estimation sample. The bodev option is only available with model(fodev), in which case the diff option is basically never used. The diff option is usually applied in combination with model(level) to specify (lagged) first differences of the variables as instruments for the untransformed level regressors in a system GMM estimation, as in the sketch below.
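
                          A minimal system-GMM sketch with hypothetical variables y (dependent) and x (predetermined) and illustrative lag ranges, which additionally requires the usual Blundell-Bond stationarity assumption:
                          Code:
                          * lagged levels instrument the FOD-transformed model; (lagged) first
                          * differences (the diff suboption) instrument the level model
                          xtdpdgmm L(0/1).y x, model(fodev) collapse ///
                              gmm(y, lag(1 4)) gmm(x, lag(0 3)) ///
                              gmm(y, lag(1 1) diff model(level)) ///
                              gmm(x, lag(0 0) diff model(level)) ///
                              teffects twostep vce(robust)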
                          Last edited by Sebastian Kripfganz; 04 May 2020, 11:36. Reason: Point 4. amended.
                          https://twitter.com/Kripfganz



                          • Thank you so very much, Prof. Kripfganz, for your crystal-clear answers. I have one more doubt: is the categorisation of a variable as endogenous/predetermined completely driven by theoretical arguments, or is there an empirical way of figuring out whether a variable should be treated as endogenous or predetermined?



                            • The more you can base your decision on theory, the better. For an empirical approach, please see the section on model selection (slides 90 onwards) in my 2019 London Stata Conference presentation.
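
                              As a minimal sketch of one such empirical check (an illustration under assumptions, not necessarily the exact procedure from the slides), with hypothetical variables y and x: add the extra instrument implied by treating x as predetermined (its lag 0) on top of the endogenous-case instruments, and inspect the corresponding difference-in-Hansen test; a rejection would speak against predeterminedness.
                              Code:
                              xtdpdgmm L(0/1).y x, model(fodev) collapse ///
                                  gmm(y, lag(1 4)) gmm(x, lag(1 4)) gmm(x, lag(0 0)) ///
                                  twostep vce(robust)
                              estat overid, difference   // difference-in-Hansen tests per instrument set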
                              https://twitter.com/Kripfganz



                              • Alright, Prof. Kripfganz. So do the following specifications indicate that the researcher assumes X1 to be predetermined?

                                1. iv(X1, model(fodev))
                                2. iv(X1, lag(0 2) model(fodev))
                                3. gmmiv(X1, lag(0 1) model(fodev))
                                4. gmmiv(X1, lag(0 0) model(fodev))

                                Thanks!

