  • Counterfactual Analysis

    Hello!
    I have some questions about counterfactual analysis.
    I have cross-section data for the year 2015 with 100 observations, and I want to run a counterfactual analysis on my regression.
    My model is as follows:
    lnY = B0 + B1*C + B2*X1 + B3*X2 + B4*X3 + ... + Bn*Xn + Ui
    C is a control variable.
    X1, X2, X3, ..., Xn are policy variables that take continuous values from 0 to 2 (0 is low performance and 2 is the best performance).
    My objective is to analyze the impact of X1, X2, X3, ..., Xn on Y when all countries move to the best performance (2).
    ************Code***************
    * Method 1
    regress lny c x1 x2 x3 ... xn // gives the impact of x on y with the actual performance of countries on each x-variable
    predict yhut, xb
    gen x1new = 2-x1
    gen x2new = 2-x2
    gen x3new = 2-x3
    ...
    gen xnnew = 2-xn
    regress lny c x1new x2new x3new ... xnnew
    predict yhutnew, xb
    sum yhut yhutnew lny
    * yhutnew will be the impact of xi on y when all countries move to best practice
    ************************************
    * Method 2
    regress lny x1 x2 x3 ... xn
    predict yhut, xb
    gen yhutnew = yhut - (coefficient of x1)*(2-x1) - (coefficient of x2)*(2-x2) - (coefficient of x3)*(2-x3) - ... - (coefficient of xn)*(2-xn)

    * yhutnew will be the impact of xi on y when all countries move to best performance (which is 2)
    ****************************************
    I need advice on the following points:
    1) Which method is correct, or is there an alternative method for estimating the impact of xi on y when all countries move to best performance?
    2) Any other advice is welcome.







  • #2
    Let's simplify it. Suppose there were only one policy variable and no "control" variables. Then if we do the regression we get y = a + bx + e, for some intercept a and slope b, and an error term e. So E(y|x) = a + b*x. Now the counterfactual is x = 2. So E(y|x = 2) = a + b*2. So the difference is E(y|x=2) - E(y|x) = b*(2-x). Notice that the constant term a drops out when you subtract. The same thing is true, with just more typing, when you have multiple policy variables: the constant term and the "control variable" term will drop out as they do not change when you change the policy variables. So what you really want to estimate is Sum(coefficient of xi * (2-xi)). So I would do this as:

    Code:
    regress lny C x1 x2 x3...xn
    
    forvalues i = /n {
        replace x`i' = 2-x`i'
    }
    predict change, xb
    
    // REMOVE CONSTANT TERM FROM RESULT
    replace change = change - _b[_cons]
    Note: When actually coding this, "n" has to be replaced by an actual number.
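
    The algebra above can be checked numerically. A minimal sketch, assuming Stata's built-in auto dataset as a stand-in: mpg plays the role of the policy variable, weight the control, and the target value 30 is purely hypothetical (it takes the place of 2).

    ```stata
    sysuse auto, clear
    regress price mpg weight            // mpg: stand-in policy variable; weight: stand-in control

    predict double yhat_obs, xb         // prediction at the observed values

    clonevar mpg_obs = mpg              // keep the observed values
    replace mpg = 30                    // counterfactual: every car at the hypothetical target
    predict double yhat_cf, xb          // prediction at the counterfactual values
    replace mpg = mpg_obs               // restore the data

    gen double diff_pred = yhat_cf - yhat_obs
    gen double diff_coef = _b[mpg]*(30 - mpg)

    // the constant and the control term cancel, so the two columns agree
    assert reldif(diff_pred, diff_coef) < 1e-8
    summarize diff_pred diff_coef
    ```

    If the -assert- passes silently, the difference in predictions equals b*(target - x) for every observation, which is the quantity to be summed over the policy variables.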



    • #3
      How about

      Code:
      sysuse auto , clear
      regress price c.mpg
      margins , at((asobserved) mpg) at(mpg = 2) pwcompare
      assuming mpg is the policy variable. Controls and further predictors are easily added.

      Best
      Daniel



      • #4
        Thank you all for the valuable comments.

        Method I: Clyde Schechter's approach
        First, I would like some clarification on Clyde Schechter's code.
        My code based on your approach is as follows:
        regress logY C1 C2 X1 X2
        forvalues i =/ 2 {
        replace X1= 2-x1
        replace X2 = 2-X2
        }

        invalid syntax

        predict changeiny, xb

        replace changeiny = changeiny - constant value of the regression result [_cons]

        My question
        1) I don't understand the value of "n"; my data are continuous values from 0 to 2.
        Any comments and suggestions are welcome.

        Method II

        I found the following result using Daniel Klein's approach:
        Code
        use "C:\Users\habtamu27\ 2015 full cross-section dataset.dta", clear
        regress logY C1 C2 X1 X2
        margins, at ((asobserved) X1 X2) at (X1 =2) at (X2 =2)

        Result
        Predictive margins                                Number of obs   =         95
        Model VCE    : OLS

        Expression   : Linear prediction, predict()

        2._at        : X1              =           2

        3._at        : X2              =           2

        ------------------------------------------------------------------------------
                     |            Delta-method
                     |     Margin  Std. Err.      z     P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
               _cons |   3.729382   .1023091    36.45   0.000     3.528859    3.929904
                     |
                 _at |
                  2  |   3.172129   .2388294    13.28   0.000     2.704031    3.640226
                  3  |   3.205877   .2342388    13.69   0.000     2.746778    3.664977

        Here is what I understand from the result:
        1) 3.172129 is the mean of predicted logY when all countries move to best performance in X1, i.e. X1 = 2 (when all countries move to best in X1, logY falls on average by 3.172129).
        2) 3.205877 is the mean of predicted logY when all countries move to best performance in X2, i.e. X2 = 2 (when all countries move to best in X2, logY falls on average by 3.205877).
        3) 3.729382 is the mean of predicted logY with the actual data on X1 and X2.
        Do I understand the result correctly? Any comments are welcome.

        Method III: my other approach
        use "C:\Users\habtamu27\ 2015 full cross section dataset.dta", clear
        regress logY lC1 C2 X1 X2
        predict yhut, xb
        gen ynew2 = constant+(coefficient of C1)*C1+(coefficientof C2)*C2+(coefficient of X1)*2+ (coefficient of X2)*2
        gen ychange2 =ynew2-yhut // this will give us the reduction in logY when all countries move to best practice (which is 2)

        Thanks again for any suggestions and comments.







        • #5
          regress logY C1 C2 X1 X2
          forvalues i =/ 2 {
          replace X1= 2-x1
          replace X2 = 2-X2
          }

          invalid syntax

          That's my typographical error. It should be -forvalues i = 1/2 {-.

          My question
          1) I don't understand the value of "n"; my data are continuous values from 0 to 2.
          Any comments and suggestions are welcome.
          The value of n should be the number of x variables that you want to transform. It's the same n that you used for Xn in post #1 of this thread. It has nothing to do with the fact that you want to transform x into 2-x. If you also happen to have 2 X variables, then that's just a coincidence.

          Also, I don't think your adaptation of Dan Klein's suggestion is quite correct. In Dan's example, there were no covariates (corresponding to C). I think you need the -at()- options to be -at((mean) _all (asobserved) x1 x2)- and -at((mean) _all X1 = 2 X2 = 2)-. Also, you omitted the -pwcompare- option. So the results you are looking at are not what you want.



          • #6
            Thanks, Clyde Schechter, for the valuable comments and advice.

            I corrected my use of Dan Klein's suggestion and it works now. In addition, I compared it with my own method and it gives the same result. However, your method still gives me a different result. I explain the three results below.
            1. Dan Klein's suggestion

            Code

            use "C:\Users\habtamu27\Desktop\ 2015 full cross section dataset.dta", clear
            regress logTEDC lnPCGDP lnsqkm FeesandCharges formalitiesdocumentsnew
            margins, at((mean)_all(asobserved) FeesandCharges formalitiesdocumentsnew) at((mean)_all FeesandCharges =2 _all formalitiesdocumentsnew =2)

            Result

            Predictive margins                                Number of obs   =         95
            Model VCE    : OLS

            Expression   : Linear prediction, predict()

            1._at        : lnPCGDP         =    7.771793 (mean)
                           lnsqkm          =    12.29731 (mean)

            2._at        : lnPCGDP         =    7.771793 (mean)
                           lnsqkm          =    12.29731 (mean)
                           FeesandCha~s    =           2
                           formalitie~w    =           2

            ------------------------------------------------------------------------------
                         |            Delta-method
                         |     Margin  Std. Err.      z     P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                     _at |
                      1  |   3.729382   .1023091    36.45   0.000     3.528859    3.929904
                      2  |   2.648624   .3139972     8.44   0.000     2.033201    3.264047
            Interpretation
            3.729382 is the mean predicted value of logTEDC at the observed values.
            2.648624 is the mean predicted value of logTEDC when FeesandCha~s = 2 and formalitie~w = 2.
            3.750489 is the mean of logTEDC.
            Therefore, a move to best performance in FeesandCha~s and formalitie~w reduces logTEDC on average by 1.080758 (3.729382 - 2.648624).

            2. My Approach

            Code
            use "C:\Users\habtamu27\2015 full cross section dataset.dta", clear
            regress logTEDC lnPCGDP lnsqkm FeesandCharges formalitiesdocumentsnew
            predict yhut, xb
            gen ynew2 = 3.160265-.2855342*lnPCGDP+.3245949*lnsqkm -.6240358*2-.518051*2 // the numbers are the coefficient of the above estimation
            gen ychange2 =ynew2-yhut

            Result
            . sum logTEDC yhut ynew2 ychange2

                Variable |        Obs        Mean    Std. Dev.       Min        Max
            -------------+--------------------------------------------------------
                 logTEDC |        104    3.750489    1.208117   .6931472   6.548219
                    yhut |         95    3.729382     .767478   1.417112   5.278478
                   ynew2 |        104    2.536511    .7757382   .1475701   3.877365
                ychange2 |         95   -1.080758    .3974167  -2.076162  -.2590258

            Approaches 1 and 2 give the same result.

            The problem arises when I use your suggestion, and I think I did not understand it correctly. Mainly, why do we need to subtract the constant? I thought it would cancel out.
            Here is the code that I used:

            CODE
            use "C:\Users\habtamu27\2015 full cross section dataset.dta", clear
            regress logTEDC lnPCGDP lnsqkm FeesandCharges formalitiesdocumentsnew

            Result omitted to save space

            CODE

            forvalues i = 1/2 {
            replace FeesandCharges= 2-FeesandCharges
            replace formalitiesdocumentsnew = 2-formalitiesdocumentsnew
            }

            RESULT
            forvalues i = 1/2 {
            2. replace FeesandCharges= 2-FeesandCharges
            3. replace formalitiesdocumentsnew = 2-formalitiesdocumentsnew
            4. }
            (63 real changes made)
            (75 real changes made)
            (63 real changes made)
            (75 real changes made)

            CODE

            . predict change, xb
            (9 missing values generated)

            replace change = change- _b[_cons] // I think the problem is here, you mentioned that I need to subtract _b[_cons ] but what is _b?????
            (95 real changes made)

            CODE
            . sum logTEDC change

            RESULT

                Variable |        Obs        Mean    Std. Dev.       Min        Max
            -------------+--------------------------------------------------------
                 logTEDC |        104    3.750489    1.208117   .6931472   6.548219
                  change |         95    .5691161     .767478  -1.743154   2.118212

            I don't understand the value 0.5691161.

            I'm sorry for the long explanation; I just want to make all the results clear for further comments.
            Thanks again for any comments and suggestions!



            • #7
              Whenever you run any Stata estimation command, the coefficients of the regression are saved in a virtual matrix called _b. So a quick way to refer to, say the coefficient of FeesandCharges would be _b[FeesandCharges]. The same for any other variable. _b[_cons] is the way to get the constant term from the regression.

              The reason you got the wrong results attempting my approach is that you coded it incorrectly. This confusion arose because in #1 you referred to your variables as X1 and X2, so I wrote code assuming that those were the real variable names, or at least that the variable names ended in 1 and 2. Your actual variable names, it turns out, do not end in 1 and 2; they are just two separate names. So the 2 - x transformation should not be applied inside a -forvalues i = 1/2- loop the way you have done it, with a separate command for each variable inside. The intent of that loop was to go over the variables, applying the 2 - x transformation to each one once. Your code applies the transformation twice to each variable, which undoes it (2 - (2 - x) = x). Since your variable names do not end in the numbers 1 and 2, you can't loop over them with a -forvalues i = 1/2- loop; you have to loop over the variable names instead.

              So here's how you can correctly implement the approach I suggested:

              Code:
              regress logTEDC lnPCGDP lnsqkm FeesandCharges formalitiesdocumentsnew
              foreach v of varlist FeesandCharges formalitiesdocumentsnew {
                  replace `v' = 2-`v'
              }
              predict change, xb
              replace change = change - _b[_cons]
              I think that now you will find that this approach produces the correct results (and will agree with the other two.)
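
              Both points can be tried on a built-in dataset. A small sketch, assuming the auto data as a stand-in (the variable m is a hypothetical working copy of mpg, used only for illustration): it displays stored coefficients through _b[] and shows that applying the 2 - x transformation twice restores the original values.

              ```stata
              sysuse auto, clear
              regress price mpg weight

              display _b[mpg]          // coefficient on mpg
              display _b[weight]       // coefficient on weight
              display _b[_cons]        // constant term

              gen double m = mpg       // hypothetical working copy of mpg
              forvalues i = 1/2 {
                  replace m = 2 - m    // runs twice: 2 - (2 - mpg) = mpg
              }
              assert m == mpg          // the second pass undid the first
              ```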




              • #8


                Thank you for the quick response and comment. However, the result is still not the same.
                CODE
                regress logTEDC lnPCGDP lnsqkm FeesandCharges formalitiesdocumentsnew
                RESULT omitted

                CODE

                foreach v of varlist FeesandCharges formalitiesdocumentsnew {
                replace `v' = 2-`v'
                }
                RESULT
                . foreach v of varlist FeesandCharges formalitiesdocumentsnew {
                2. replace `v' = 2-`v'
                3. }
                (63 real changes made)
                (75 real changes made)

                CODE
                . predict change, xb
                (9 missing values generated)

                . replace change = change - _b[_cons]
                (95 real changes made)

                SUMMARY RESULT
                . sum logTEDC change

                    Variable |        Obs        Mean    Std. Dev.       Min        Max
                -------------+--------------------------------------------------------
                     logTEDC |        104    3.750489    1.208117   .6931472   6.548219
                      change |         95    .6917754    .7729636  -2.042141   2.233698

                I don't know where the problem is, but it seems something is missing in the code. The mean of "change" before subtracting the constant is greater than the mean of the predicted logTEDC.



                • #9
                  You are right. I forgot to also subtract out the covariate terms. So the -replace- command should be:

                  Code:
                  replace change = change - _b[_cons] - _b[lnPCGDP]*lnPCGDP - _b[lnsqkm]*lnsqkm
                  Now, at this point, this approach is more typing and more complicated than yours and it defeats the spirit of what I was trying to accomplish. I assumed from your original post that you had a lot of X variables and only one or a couple of covariates. If that were the case, this approach would be the simplest way. But because it's the other way around, the simplest approach is probably this:

                  Code:
                  regress logTEDC lnPCGDP lnsqkm FeesandCharges formalitiesdocumentsnew
                  gen change = _b[FeesandCharges]*(2-FeesandCharges) + _b[formalitiesdocumentsnew]*(2-formalitiesdocumentsnew)
                  Sorry about the mistakes in my earlier code.
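
                  With many policy variables, the same sum can be built in a loop instead of typing each term. A sketch, assuming the auto data as a stand-in: mpg and trunk play the role of the policy variables, weight the control, and a hypothetical target of 30 takes the place of the best-practice value 2.

                  ```stata
                  sysuse auto, clear
                  regress price weight mpg trunk

                  local target = 30                 // hypothetical best-practice value
                  gen double change = 0
                  foreach v of varlist mpg trunk {  // list only the policy variables here
                      replace change = change + _b[`v']*(`target' - `v')
                  }
                  summarize change
                  ```

                  Only the variables named in the -foreach- list enter the sum, so the constant and the control terms never need to be subtracted.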



                  • #10
                    Thank you, Clyde Schechter, for the valuable comments and suggestions.
                    Everything works perfectly and gives the same result as the methods above.



                    • #11
                      Hello everyone,

                      I have a question about the interpretation of the margins, or the change. As I mentioned earlier, my model is log-linear. Suppose that when all countries move to best practice the change is 10. Does this mean that moving to best practice reduces TEDC by 10% on average, holding the other variables constant, or that TEDC falls by 10 hours on average? My outcome variable is measured in hours. Thanks in advance for any comments and suggestions.



                      • #12
                        The output you get will not automatically transform into the original metric; it is the difference in the expected values in the metric in which the model was estimated at best. See Bill Gould's excellent blog entry on linear regression with log-transformed y; especially the pitfalls when you want predicted values in the original metric.

                        In general, the idea that the coefficients from such models can be interpreted as percentage change is wrong, too. It is approximately correct for (very) small values. For 0.1 we get

                        Code:
                        . display exp(0.1)
                        1.1051709
                        which is roughly a 10 percent increase, but for 0.4 we get

                        Code:
                        . display exp(0.4)
                        1.4918247
                        and the approximation no longer works well.

                        Best
                        Daniel
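
                        The exact percentage change implied by a value b in a log-linear model is 100*(exp(b) - 1). A short sketch that tabulates it for a few values of b (the values are arbitrary illustrations; 0.1 and 0.4 are the two used above):

                        ```stata
                        foreach b of numlist 0.01 0.1 0.4 1 {
                            display "b = " %5.2f `b' "   exact percent change = " %7.2f 100*(exp(`b') - 1)
                        }
                        ```

                        For b = 0.1 this gives about 10.5 percent, close to the naive reading of 10 percent; for b = 0.4 it gives about 49.2 percent rather than 40.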



                        • #13
                          Thank you, Daniel, for the quick and valuable advice.
                          I have tried to correct my interpretation as follows, after reading Bill Gould's blog entry on log-linear models. I did not use Poisson regression, but I followed his suggestions for the case where linear regression is used.
                          • For a linear regression of the form ln(yj) = b0 + Xjb + εj,
                          • the steps to obtain predicted yj values after fitting the log regression are:
                          1. Obtain predicted values for ln(yj) = b0 + Xjb.
                          2. Exponentiate the predicted log values.
                          3. Multiply those exponentiated values by exp(σ^2/2), where σ^2 is the square of the root-mean-square error (RMSE) of the regression.
                          CODE
                          regress logTEDC GATT lnPCGDP lnsqkm FeesandCharges formalitiesdocumentsnew i.Geography

                          gen change = _b[FeesandCharges]*(2-FeesandCharges) + _b[formalitiesdocumentsnew]*(2-formalitiesdocumentsnew)
                          gen change2 = -1*change // the change is negative because the two policy variables reduce time to export for DC (TEDC), so I flip the sign; the interpretation is then a reduction in time
                          gen changelevel = exp(change2)
                          gen changefinal = changelevel*exp(e(rmse)^2/2) // the amount of reduction in TEDC if all countries move to best practice in the two policy variables, holding other variables constant
                          RESULT

                          sum TEDC logTEDC change change2 changelevel changefinal

                              Variable |        Obs        Mean    Std. Dev.       Min        Max
                          -------------+--------------------------------------------------------
                                  TEDC |        104    74.95385    92.02677          2        698
                               logTEDC |        104    3.750489    1.208117   .6931472   6.548219
                                change |         95   -.9492753    .3502811  -1.819529  -.2097291
                               change2 |         95    .9492753    .3502811   .2097291   1.819529
                           changelevel |         95    2.752041    1.046726   1.233344   6.168951
                          -------------+--------------------------------------------------------
                           changefinal |         95    4.497189    1.710485   2.015442   10.08086

                          Therefore, when countries move to best practice in FeesandCharges and formalitiesdocumentsnew, holding the other covariates constant, time to export for DC (TEDC) falls by 4.497189 hours on average.

                          Question
                          1. Is this interpretation correct, or do I need to use Poisson regression?
                          Thanks in advance for your cooperation.
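
                          One way to apply the three steps listed above is to retransform both the observed and the counterfactual predictions and then take their difference in levels. This is only a sketch under stated assumptions: the auto data stand in for the real dataset, mpg is the stand-in policy variable with a hypothetical target of 30, and the exp(σ^2/2) factor is appropriate only if the errors are roughly normal with constant variance.

                          ```stata
                          sysuse auto, clear
                          gen double lnprice = ln(price)
                          regress lnprice mpg weight

                          // Step 1: linear predictions of ln(y), observed and counterfactual
                          predict double xb_obs, xb
                          gen double xb_cf = xb_obs + _b[mpg]*(30 - mpg)   // 30 is a hypothetical target

                          // Steps 2 and 3: exponentiate, then multiply by exp(sigma^2/2)
                          scalar adj = exp(e(rmse)^2/2)
                          gen double y_obs = exp(xb_obs)*scalar(adj)
                          gen double y_cf  = exp(xb_cf)*scalar(adj)

                          // change in the original metric (here: dollars)
                          gen double level_change = y_cf - y_obs
                          summarize y_obs y_cf level_change
                          ```

                          The mean of level_change is then expressed in the units of the outcome, because each observation's two predicted levels are differenced after retransformation.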






                          • #14
                            Hello everyone,

                            I am also applying counterfactual analysis, using a probit model in the outcome equation. My main aim is to analyze the counterfactual food insecurity of female-headed households (FHHs), that is, what the food insecurity of FHHs would be if the characteristics of the male heads were swapped into those of the females.
                            First, the model is simplified to separate probit models for male-headed (MHH) and female-headed (FHH) families:
                            MHH_Fd_insec = B_MHH*X_MHH + U_MHH if MHH = 1, for male-headed families
                            FHH_Fd_insec = B_FHH*X_FHH + U_FHH if MHH = 0, for female-headed families
                            I tried the following using Stata 14.1:
                            Code:
                            xtprobit fd_insec yrsch_hd age_hd hhsize occp_hd if MHH==0 // actual estimates for female-headed families
                            predict female_insec
                            xtprobit fd_insec yrsch_hd age_hd hhsize occp_hd if MHH==1 // actual estimates for male-headed families
                            predict male_insec
                            How can I create counterfactuals of the female food insecurity level by letting female-headed families assume the characteristics of the male-headed ones?
                            Your kind help will be appreciated.

                            Ikechukwu
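
                            One common reading of this kind of counterfactual is to fit the model on one group and use its coefficients to predict for the other group, since -predict- applies the stored coefficients to whichever observations are selected. A sketch under that assumption, using the built-in nlsw88 data as a stand-in: union is the binary outcome, south is the grouping variable, and plain -probit- replaces -xtprobit-. None of these come from the original data.

                            ```stata
                            sysuse nlsw88, clear

                            // actual predicted probabilities for the south == 0 group, from its own model
                            probit union age grade tenure if south == 0
                            predict double p_own if south == 0, pr

                            // counterfactual: same characteristics, coefficients from the south == 1 model
                            probit union age grade tenure if south == 1
                            predict double p_cf if south == 0, pr

                            summarize p_own p_cf
                            ```

                            Swapping the two if conditions gives the other direction, that is, one group's coefficients combined with the other group's characteristics.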

