Counterfactual Analysis

Habtesh Amogne

Join Date: Jan 2018

Posts: 9
#1

23 Jan 2018, 08:34

Hello!
I have some questions related to counterfactual analysis.
I have cross-section data for the year 2015 with 100 observations, and I want to run a counterfactual analysis based on my regression.
My model is as follows:
lnY = B0 + B1*C + B2*X1 + B3*X2 + B4*X3 + ... + B(n+1)*Xn + Ui
C is a control variable.
X1, X2, X3, ..., Xn are policy variables taking continuous values from 0 to 2 (0 is the lowest performance and 2 is the best performance).
My objective is to analyze the impact of X1, X2, X3, ..., Xn on Y when all countries move to the best performance (2).
************Code***************
*method 1
regress lny c x1 x2 x3,,,,,,,,,xn // this will give the impact of x on y with the actual performance of countries on each x-variable
predict yhut, xb
gen x1new = 2-x1
gen x2new = 2-x2
gen x3new = 2-x3
.....
gen xnnew = 2-xn
regress lny c x1new x2new x3new ......xnnew
predict yhutnew, xb
sum yhut yhutnew lny
"yhutnew will be the impact of "xi" on "y" when all countries move to best practice
************************************
** Method 2
regress lny x1 x2 x3,,,,,,,,, xn
predict yhut, xb
gen yhutnew = yhut- coefficient of x1*(2-x1)-coefficient of x2*(2- x2) - coefficient of x3*(2-x3)-...........coefficient of xn*(2-xn)

" yhutnew will be the impact of Xi on Y when all countries move to best performance(which is 2) "
****************************************
I need advice on the following points:
1) Which method is correct, or is there an alternative method for estimating the impact of xi on y when all countries move to best performance?
2) Any other advice is welcome.
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#2

23 Jan 2018, 08:52

Let's simplify it. Suppose there were only one policy variable and no "control" variables. Then if we do the regression we get y = a + bx + e, for some intercept a and slope b, and an error term e. So E(y|x) = a + b*x. Now the counterfactual is x = 2. So E(y|x = 2) = a + b*2. So the difference is E(y|x=2) - E(y|x) = b*(2-x). Notice that the constant term a drops out when you subtract. The same thing is true, with just more typing, when you have multiple policy variables: the constant term and the "control variable" term will drop out as they do not change when you change the policy variables. So what you really want to estimate is Sum(coefficient of x_i * (2-x_i)). So I would do this as:

Code:

regress lny C x1 x2 x3...xn
forvalues i = /n {
    replace x`i' = 2-x`i'
}
predict change, xb
// REMOVE CONSTANT TERM FROM RESULT
replace change = change - _b[_cons]

Note: When actually coding this, "n" has to be replaced by an actual number.
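
Written out for n policy variables and one control C, the subtraction described in the derivation above can be sketched as follows. The symbols a, c, and b_i are just labels chosen here for the intercept, the control coefficient, and the policy coefficients of the fitted regression.

```latex
\begin{aligned}
E(y \mid C, x_1, \dots, x_n) &= a + cC + \sum_{i=1}^{n} b_i x_i \\
E(y \mid C, x_1 = 2, \dots, x_n = 2) &= a + cC + \sum_{i=1}^{n} 2\, b_i \\
E(y \mid C, x_i = 2) - E(y \mid C, x_i) &= \sum_{i=1}^{n} b_i\,(2 - x_i)
\end{aligned}
```

The intercept and the control term appear unchanged in the first two lines, so neither survives the subtraction.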
daniel klein

Join Date: Mar 2014

Posts: 3824
#3

23 Jan 2018, 09:04

How about

Code:

sysuse auto , clear
regress price c.mpg
margins , at((asobserved) mpg) at(mpg = 2) pwcompare

assuming mpg is the policy variable. Controls and further predictors are easily added.
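
As a rough cross-check of what the two -at()- specifications compute, the sketch below reproduces both margins by hand for a linear model. It keeps mpg = 2 from the example above and adds weight as a stand-in control; the names xb_obs, xb_cf, mean_obs, mean_cf, and the matrix R are made up for the illustration, and -pwcompare- is left out so that r(b) holds the two margins themselves.

```stata
sysuse auto, clear
regress price c.mpg weight

* Linear predictions at the observed data and with mpg set to 2
predict double xb_obs, xb
generate double xb_cf = xb_obs + _b[mpg]*(2 - mpg)

* The same two quantities from margins
margins, at((asobserved) _all) at(mpg = 2)
matrix R = r(b)

summarize xb_obs, meanonly
scalar mean_obs = r(mean)
summarize xb_cf, meanonly
scalar mean_cf = r(mean)

display "observed:       " mean_obs "   margins: " R[1,1]
display "counterfactual: " mean_cf  "   margins: " R[1,2]
display "difference:     " mean_cf - mean_obs
```

The last line is the average change in the linear prediction, which is the quantity the pairwise comparison is meant to report.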

Best
Daniel
Habtesh Amogne

Join Date: Jan 2018

Posts: 9
#4

24 Jan 2018, 10:45

Thank you all for the valuable comments.

Method I: Clyde Schechter's approach
First, I would like some clarification on Clyde Schechter's code.
My code based on your approach is as follows:
regress logY C1 C2 X1 X2
forvalues i =/ 2 {
replace X1= 2-x1
replace X2 = 2-X2
}

invalid syntax

predict changeiny, xb
replace changeiny = changeiny - constant value of the regression result [_cons]

My question
1) I don't understand the value of "n"; my data take continuous values from 0 to 2.
Any comments and suggestions are welcome.

Method II: Daniel Klein's approach
I obtained the following result using Daniel Klein's approach.
Code
use "C:\Users\habtamu27\ 2015 full cross-section dataset.dta", clear
regress logY C1 C2 X1 X2
margins, at ((asobserved) X1 X2) at (X1 =2) at (X2 =2)

Result
Predictive margins Number of obs = 95
Model VCE : OLS

Expression : Linear prediction, predict()

2._at : X1 = 2

3._at : X2 = 2

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   3.729382   .1023091    36.45   0.000     3.528859    3.929904
             |
         _at |
          2  |   3.172129   .2388294    13.28   0.000     2.704031    3.640226
          3  |   3.205877   .2342388    13.69   0.000     2.746778    3.664977

Here is what I understand from the result:
1) 3.172129 is the mean of the predicted logY when all countries move to best performance in X1, that is, when X1 = 2 (when all countries move to the best in X1, logY falls on average by 3.172129).
2) 3.205877 is the mean of the predicted logY when all countries move to best performance in X2, that is, when X2 = 2 (when all countries move to the best in X2, logY falls on average by 3.205877).
3) 3.729382 is the mean of the predicted logY with the actual data on X1 and X2.
Am I understanding the result correctly? Any comments are welcome.

Method III: My other approach
use "C:\Users\habtamu27\ 2015 full cross section dataset.dta", clear
regress logY C1 C2 X1 X2
predict yhut, xb
gen ynew2 = constant+(coefficient of C1)*C1+(coefficient of C2)*C2+(coefficient of X1)*2+ (coefficient of X2)*2
gen ychange2 =ynew2-yhut // this will give us the reduction in logY when all countries move to best practice (which is 2)

Thanks again for any suggestions and comments.
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#5

24 Jan 2018, 11:53

regress logY C1 C2 X1 X2
forvalues i =/ 2 {
replace X1= 2-x1
replace X2 = 2-X2
}

invalid syntax

That's my typographical error. It should be -forvalues i = 1/2 {-.

My question
1) I don't understand the value of "n"; my data take continuous values from 0 to 2.
Any comments and suggestions are welcome.

The value of n should be number of x variables that you want to transform. It's the same n that you used for Xn in post #1 of this thread. It has nothing to do with the fact that you want to transform x into 2-x. If you also happen to have 2 X variables, then that's just a coincidence.
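
To make the role of n concrete, here is a small sketch with three made-up policy variables named x1, x2, and x3, so that n = 3. The data are simulated purely for illustration, and the orig1 to orig3 copies exist only so the result can be checked.

```stata
clear
set obs 5
set seed 2018

* Three hypothetical policy variables on the 0-2 scale, plus copies for checking
forvalues i = 1/3 {
    generate double x`i' = 2*runiform()
    generate double orig`i' = x`i'
}

* n = 3 here: the loop visits x1, x2, and x3 once each
forvalues i = 1/3 {
    replace x`i' = 2 - x`i'
}

list x1 orig1 x2 orig2 x3 orig3
```

With seven policy variables named x1 through x7, the only change would be 1/7 in place of 1/3.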

Also, I don't think your adaptation of Dan Klein's suggestion is quite correct. In Dan's example, there were no covariates (corresponding to C). I think you need the -at()- options to be -at((mean) _all (asobserved) X1 X2)- and -at((mean) _all X1 = 2 X2 = 2)-. Also, you omitted the -pwcompare- option. So the results you are looking at are not what you want.
Habtesh Amogne

Join Date: Jan 2018

Posts: 9
#6

24 Jan 2018, 22:50

Thanks, Clyde Schechter, for the valuable comments and advice.

I corrected my use of Dan Klein's suggestion and it works now. I also compared it with my own method, and the two give the same result. However, your method still gives me a different result. I explain the three results below.
1. Dan Klein's suggestion

Code
use "C:\Users\habtamu27\Desktop\ 2015 full cross section dataset.dta", clear
regress logTEDC lnPCGDP lnsqkm FeesandCharges formalitiesdocumentsnew
margins, at((mean)_all(asobserved) FeesandCharges formalitiesdocumentsnew) at((mean)_all FeesandCharges =2 _all formalitiesdocumentsnew =2)

Result

Predictive margins Number of obs = 95
Model VCE : OLS

Expression : Linear prediction, predict()

1._at : lnPCGDP = 7.771793 (mean)
lnsqkm = 12.29731 (mean)

2._at : lnPCGDP = 7.771793 (mean)
lnsqkm = 12.29731 (mean)
FeesandCha~s = 2
formalitie~w = 2

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _at |
          1  |   3.729382   .1023091    36.45   0.000     3.528859    3.929904
          2  |   2.648624   .3139972     8.44   0.000     2.033201    3.264047
Interpretation
3.729382 is the mean predicted value of logTEDC at the observed values.
2.648624 is the mean predicted value of logTEDC when FeesandCharges = 2 and formalitiesdocumentsnew = 2.
3.750489 is the mean of logTEDC.
Therefore, a move to best performance in FeesandCharges and formalitiesdocumentsnew reduces logTEDC on average by 1.080758 (3.729382 - 2.648624).

2. My Approach

Code
use "C:\Users\habtamu27\2015 full cross section dataset.dta", clear
regress logTEDC lnPCGDP lnsqkm FeesandCharges formalitiesdocumentsnew
predict yhut, xb
gen ynew2 = 3.160265-.2855342*lnPCGDP+.3245949*lnsqkm -.6240358*2-.518051*2 // the numbers are the coefficient of the above estimation
gen ychange2 =ynew2-yhut

Result
. sum logTEDC yhut ynew2 ychange2

    Variable |        Obs        Mean    Std. Dev.        Min        Max
-------------+----------------------------------------------------------
     logTEDC |        104    3.750489    1.208117    .6931472   6.548219
        yhut |         95    3.729382     .767478    1.417112   5.278478
       ynew2 |        104    2.536511    .7757382    .1475701   3.877365
    ychange2 |         95   -1.080758    .3974167   -2.076162  -.2590258

Approaches 1 and 2 give the same result.
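
This agreement is what one would expect for a linear model: averaging the linear prediction over the sample gives the same number as evaluating it at the sample means. A short sketch, writing \hat{a} for the intercept and \hat{b} for the coefficient vector (notation introduced here for illustration only):

```latex
\frac{1}{N}\sum_{j=1}^{N}\left(\hat{a} + x_j'\hat{b}\right) \;=\; \hat{a} + \bar{x}'\hat{b},
\qquad \bar{x} = \frac{1}{N}\sum_{j=1}^{N} x_j
```

With the numbers reported above, 3.729382 - 2.648624 = 1.080758, which matches the mean of ychange2 in absolute value. The mean of ynew2 itself (2.536511) is computed over 104 observations rather than the 95 used in the regression, which may be why it differs from the second margin.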

The problem arises when I use your suggestion; I think I did not understand it correctly. In particular, why do we need to subtract the constant? I thought it would cancel out.
Here is the code that I used:

CODE
use "C:\Users\habtamu27\2015 full cross section dataset.dta", clear
regress logTEDC lnPCGDP lnsqkm FeesandCharges formalitiesdocumentsnew

Result omitted to save space

CODE
forvalues i = 1/2 {
replace FeesandCharges= 2-FeesandCharges
replace formalitiesdocumentsnew = 2-formalitiesdocumentsnew
}

RESULT
forvalues i = 1/2 {
2. replace FeesandCharges= 2-FeesandCharges
3. replace formalitiesdocumentsnew = 2-formalitiesdocumentsnew
4. }
(63 real changes made)
(75 real changes made)
(63 real changes made)
(75 real changes made)

CODE

. predict change, xb
(9 missing values generated)

replace change = change - _b[_cons] // I think the problem is here: you said I need to subtract _b[_cons], but what is _b?
(95 real changes made)

CODE
. sum logTEDC change

RESULT

    Variable |        Obs        Mean    Std. Dev.        Min        Max
-------------+----------------------------------------------------------
     logTEDC |        104    3.750489    1.208117    .6931472   6.548219
      change |         95    .5691161     .767478   -1.743154   2.118212

I don't understand the value 0.5691161.

I apologize for the long explanation; I just wanted to make all the results clear for further comments.
Thanks again for any comments and suggestions!
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#7

24 Jan 2018, 23:20

Whenever you run any Stata estimation command, the coefficients of the regression are saved in a virtual matrix called _b. So a quick way to refer to, say the coefficient of FeesandCharges would be _b[FeesandCharges]. The same for any other variable. _b[_cons] is the way to get the constant term from the regression.
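
A quick way to see this is with the auto dataset that ships with Stata. The regression below is only an illustration, and the matrix name B is arbitrary.

```stata
sysuse auto, clear
regress price mpg weight

* Individual coefficients, referred to by variable name
display _b[mpg]
display _b[weight]
display _b[_cons]

* The same numbers as stored in the coefficient vector e(b)
matrix B = e(b)
matrix list B
```

The values shown by -display- should match the Coef. column of the regression table, with the constant in the last position of e(b).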

The reason you got the wrong results attempting my approach is that you coded it incorrectly. This confusion arose because you, in #1, referred to your variables as X1 and X2. So I wrote code assuming that those were the real variable names, or at least that the variable names ended in 1 and 2. But your actual variable names, it turns out, are not anything 1 and 2, they are just two separate names. So applying the 2 minus x transformation should not be done in a -forvalues i = 1/2 - loop the way you have done it, with a separate command for each variable inside. The intent of that loop was to loop over the variables applying the 2-x transformation to each one once. Your code applies the 2 minus x transformation twice to each variable, which basically undoes it (2-(2-x) = x). Since your variable names do not actually end in numbers 1 and 2, you can't loop over them with a -forvalues i = 1/2- loop, you have to loop over the variable names instead.

So here's how you can correctly implement the approach I suggested:

Code:

regress logTEDC lnPCGDP lnsqkm FeesandCharges formalitiesdocumentsnew
foreach v of varlist FeesandCharges formalitiesdocumentsnew {
    replace `v' = 2-`v'
}
predict change, xb
replace change = change - _b[_cons]

I think that now you will find that this approach produces the correct results (and will agree with the other two.)
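
The point about applying the transformation twice can be seen in a small simulated example. The variable names polA and polB are hypothetical stand-ins for two policy variables, and polA0 and polB0 are copies kept only for comparison.

```stata
clear
set obs 5
set seed 2018
generate double polA  = 2*runiform()
generate double polB  = 2*runiform()
generate double polA0 = polA    // copies of the original values
generate double polB0 = polB

* Both replaces inside a 1/2 loop: each variable is transformed twice
forvalues i = 1/2 {
    replace polA = 2 - polA
    replace polB = 2 - polB
}
list polA polA0 polB polB0      // back to the original values, up to rounding

* Looping over the names applies 2 - x exactly once per variable
foreach v of varlist polA polB {
    replace `v' = 2 - `v'
}
list polA polA0 polB polB0
```

After the first loop the two columns in each pair agree (up to rounding); after the second loop each transformed value is 2 minus its original.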
Habtesh Amogne

Join Date: Jan 2018

Posts: 9
#8

25 Jan 2018, 00:08

Thank you for the quick response and comment. However, the result is still not the same.
CODE
regress logTEDC lnPCGDP lnsqkm FeesandCharges formalitiesdocumentsnew
RESULT omitted

CODE
foreach v of varlist FeesandCharges formalitiesdocumentsnew {
replace `v' = 2-`v'
}
RESULT
. foreach v of varlist FeesandCharges formalitiesdocumentsnew {
2. replace `v' = 2-`v'
3. }
(63 real changes made)
(75 real changes made)

CODE
. predict change, xb
(9 missing values generated)

. replace change = change - _b[_cons]
(95 real changes made)

SUMMARY RESULT
. sum logTEDC change

    Variable |        Obs        Mean    Std. Dev.        Min        Max
-------------+----------------------------------------------------------
     logTEDC |        104    3.750489    1.208117    .6931472   6.548219
      change |         95    .6917754    .7729636   -2.042141   2.233698

I don't know where the problem is, but it seems something is missing in the code. The mean of "change" before subtracting the constant is greater than the mean of the predicted logTEDC.
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#9

25 Jan 2018, 07:47

You are right. I forgot to also subtract out the covariate terms. So the -replace- command should be:

Code:

replace change = change - _b[_cons] - _b[lnPCGDP]*lnPCGDP - _b[lnsqkm]*lnsqkm

Now, at this point, this approach is more typing and more complicated than yours and it defeats the spirit of what I was trying to accomplish. I assumed from your original post that you had a lot of X variables and only one or a couple of covariates. If that were the case, this approach would be the simplest way. But because it's the other way around, the simplest approach is probably this:

Code:

regress logTEDC lnPCGDP lnsqkm FeesandCharges formalitiesdocumentsnew
gen change = _b[FeesandCharges]*(2-FeesandCharges) + _b[formalitiesdocumentsnew]*(2-formalitiesdocumentsnew)

Sorry about the mistakes in my earlier code.
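
As a sanity check on the formula, the sketch below uses simulated data with hypothetical names (ctrl, x1, x2, lny) and compares the hand-computed change with the difference between the counterfactual and the actual linear predictions. The coefficients in the data-generating line are arbitrary.

```stata
clear
set obs 100
set seed 12345

* Simulated data: one control and two policy variables on the 0-2 scale
generate double ctrl = rnormal()
generate double x1   = 2*runiform()
generate double x2   = 2*runiform()
generate double lny  = 4 + 0.3*ctrl - 0.6*x1 - 0.5*x2 + rnormal()

regress lny ctrl x1 x2

* Change implied by moving every unit to x1 = x2 = 2
generate double change = _b[x1]*(2 - x1) + _b[x2]*(2 - x2)

* Same quantity via two predictions: actual, then counterfactual
predict double yhat, xb
generate double x1_obs = x1
generate double x2_obs = x2
replace x1 = 2
replace x2 = 2
predict double yhat_cf, xb
replace x1 = x1_obs            // restore the observed values
replace x2 = x2_obs
generate double delta = yhat_cf - yhat

summarize change delta
```

The two summarized variables should be identical up to rounding, since the constant and the control term cancel in yhat_cf - yhat.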
Habtesh Amogne

Join Date: Jan 2018

Posts: 9
#10

26 Jan 2018, 05:00

Thank you, Clyde Schechter, for the valuable comments and suggestions.
Everything works perfectly and gives the same result as the methods above.
Habtesh Amogne

Join Date: Jan 2018

Posts: 9
#11

29 Jan 2018, 03:43

Hello Everyone

I have a question about interpreting the margins, or the change. As I mentioned earlier, my model is log-linear. Suppose that when all countries move to best practice the change is 10. Does this mean that moving to best practice reduces TEDC by 10% on average, holding the other variables constant, or that TEDC falls by 10 hours on average? My outcome variable is in hours. Thanks in advance for any comments and suggestions.
daniel klein

Join Date: Mar 2014

Posts: 3824
#12

29 Jan 2018, 04:21

The output you get will not automatically transform into the original metric; it is the difference in the expected values in the metric in which the model was estimated at best. See Bill Gould's excellent blog entry on linear regression with log-transformed y; especially the pitfalls when you want predicted values in the original metric.

In general, the idea that the coefficients from such models can be interpreted as percentage change is wrong, too. It is approximately correct for (very) small values. For 0.1 we get

Code:

. display exp(0.1)
1.1051709

which is roughly a 10 percent increase, but for 0.4 we get

Code:

. display exp(0.4)
1.4918247

and the approximation no longer works well.
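
In formula terms, for a change of size δ on the log scale (δ is a symbol introduced here for illustration), the implied proportional change in y, holding the error term fixed, is exp(δ) - 1 rather than δ itself:

```latex
\frac{y_{\text{new}}}{y_{\text{old}}} = e^{\delta},
\qquad
\%\Delta y = 100\,\bigl(e^{\delta} - 1\bigr)
\\[4pt]
\delta = 0.1:\; 100\,(e^{0.1} - 1) \approx 10.5\%,
\qquad
\delta = 0.4:\; 100\,(e^{0.4} - 1) \approx 49.2\%
```

So the shortcut of reading δ as a percentage is off by about half a percentage point at 0.1 and by more than nine points at 0.4.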

Best
Daniel
Habtesh Amogne

Join Date: Jan 2018

Posts: 9
#13

29 Jan 2018, 07:42

Thank you, Daniel, for the quick and valuable advice.
I have tried to correct my interpretation as follows after reading Bill Gould's blog entry on log-linear models. I did not use Poisson regression, but I followed his suggestions for the case where one wants to use linear regression.
For a linear regression of the form ln(y_j) = b₀ + X_jb + ε_j, the steps to obtain predicted y_j values are:
1) Obtain predicted values for ln(y_j) = b₀ + X_jb.
2) Exponentiate the predicted log values.
3) Multiply those exponentiated values by exp(σ²/2), where σ² is the square of the root-mean-square error (RMSE) of the regression.
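
The three steps above can be sketched in Stata as follows. The example uses the auto dataset with log price as a stand-in outcome; the variable names lnprice, xb_hat, naive, and yhat_level are made up, and the exp(σ²/2) factor in the last step leans on the regression errors being roughly normal.

```stata
sysuse auto, clear
generate double lnprice = ln(price)
regress lnprice mpg weight

* Step 1: predicted values on the log scale
predict double xb_hat, xb

* Step 2: exponentiate the predicted log values
generate double naive = exp(xb_hat)

* Step 3: multiply by exp(sigma^2/2), taking sigma from e(rmse)
generate double yhat_level = naive*exp(e(rmse)^2/2)

summarize price naive yhat_level
```

These are predicted levels of the outcome for each observation, not changes; the correction factor is greater than 1, so yhat_level always exceeds the naive back-transform.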

CODE
regress logTEDC GATT lnPCGDP lnsqkm FeesandCharges formalitiesdocumentsnew i.Geography
gen change = _b[FeesandCharges]*(2-FeesandCharges) + _b[formalitiesdocumentsnew]*(2-formalitiesdocumentsnew)
gen change2 = -1*change // change is negative because the two policy variables reduce time to export for DC (TEDC); I flip the sign so the value is positive, but it is still interpreted as a reduction in time
gen changelevel=exp(change2)
gen changefinal =changelevel*exp(e(rmse)^2/2) // this is the actual amount of reduction in TEDC if all countries move to the best practice in the two policy variables, holding other variables constant
RESULT
sum TEDC logTEDC change change2 changelevel changefinal

    Variable |        Obs        Mean    Std. Dev.        Min        Max
-------------+----------------------------------------------------------
        TEDC |        104    74.95385    92.02677           2        698
     logTEDC |        104    3.750489    1.208117    .6931472   6.548219
      change |         95   -.9492753    .3502811   -1.819529  -.2097291
     change2 |         95    .9492753    .3502811    .2097291   1.819529
 changelevel |         95    2.752041    1.046726    1.233344   6.168951
-------------+----------------------------------------------------------
 changefinal |         95    4.497189    1.710485    2.015442   10.08086

Therefore, when countries move to best practice in FeesandCharges and formalitiesdocumentsnew, holding other covariates constant, time to export for DC (TEDC) falls by 4.497189 hours on average.

Question
1. Is this interpretation correct, or do I need to use Poisson regression?
Thanks in advance for your cooperation.
Ikechukwu Nwaka

Join Date: Oct 2015

Posts: 23
#14

20 Oct 2018, 09:14

Hello Everyone,

I am also applying counterfactual analysis, using a probit model for the outcome equation. My main aim is to analyze the counterfactual food insecurity of female-headed households (FHHs), that is, what the food insecurity of female-headed households would be if they had the characteristics of male-headed households (MHHs).
First, the model is simplified by running separate probit models for MHH and FHH families:
MHH_Fd_insec = B_MHH X_MHH + U_MHH if MHH=1 for male-headed families
FHH_Fd_insec = B_FHH X_FHH + U_FHH if MHH=0 for female-headed families.
I tried the following using Stata 14.1:

Code:

xtprobit fd_insec yrsch_hd age_hd hhsize occp_hd if MHH==0 // presenting the actual estimates for female-headed families
predict male_insec
xtprobit fd_insec yrsch_hd age_hd hhsize occp_hd if MHH==1 // presenting the actual estimates for male-headed families
predict female_insec

How can I create counterfactuals of the female food insecurity level by letting female-headed families take on the characteristics of male-headed families?
Your kind help will be appreciated.

Ikechukwu