Why does Stata omit an additional category when adding time fixed effects in an event study?

Kareman Yassin

Join Date: Dec 2019
Posts: 15

23 Dec 2022, 13:01

Hello,

I have monthly consumption data that I aggregated to annual data. This is the regression I am trying to run for my event study (with leads and lags) that looks at household consumption around an event happening in "post_0_treated_year".

reghdfe aver_logG11A pre_2_treated_year post_0_treated_year post_1_treated_year post_2_treated_year if treat_group==1 & around_treat_1==1, absorb(ID) cluster(postalcode)

I dropped the variable "pre_1_treated_year" as this is my omitted category. This regression works fine.

Now I want to add year of consumption fixed effect:

reghdfe aver_logG11A pre_2_treated_year post_0_treated_year post_1_treated_year post_2_treated_year if treat_group==1 & around_treat_1==1, absorb(ID cons_year) cluster(postalcode)

Now, I don't understand why Stata omits an additional category, "post_2_treated_year", on top of the category I initially dropped, "pre_1_treated_year".

Your help is highly appreciated,
Thank you

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long ID float(cons_year aver_logG11A pre_2_treated_year pre_1_treated_year post_0_treated_year post_1_treated_year post_2_treated_year around_treat_1 treat_group) str11 postalcode
500002 2007   3.00221 1 0 0 0 0 1 1 "T1A3Y8"
500002 2007   3.00221 1 0 0 0 0 1 1 "T1A3Y8"
500002 2007   3.00221 1 0 0 0 0 1 1 "T1A3Y8"
500002 2007   3.00221 1 0 0 0 0 1 1 "T1A3Y8"
500002 2007   3.00221 1 0 0 0 0 1 1 "T1A3Y8"
500002 2007   3.00221 1 0 0 0 0 1 1 "T1A3Y8"
500002 2007   3.00221 1 0 0 0 0 1 1 "T1A3Y8"
500002 2007   3.00221 1 0 0 0 0 1 1 "T1A3Y8"
500002 2007   3.00221 1 0 0 0 0 1 1 "T1A3Y8"
500002 2007   3.00221 1 0 0 0 0 1 1 "T1A3Y8"
500002 2007   3.00221 1 0 0 0 0 1 1 "T1A3Y8"
500002 2007   3.00221 1 0 0 0 0 1 1 "T1A3Y8"
500002 2008 3.6030354 0 1 0 0 0 1 1 "T1A3Y8"
500002 2008 3.6030354 0 1 0 0 0 1 1 "T1A3Y8"
500002 2008 3.6030354 0 1 0 0 0 1 1 "T1A3Y8"
500002 2008 3.6030354 0 1 0 0 0 1 1 "T1A3Y8"
500002 2008 3.6030354 0 1 0 0 0 1 1 "T1A3Y8"
500002 2008 3.6030354 0 1 0 0 0 1 1 "T1A3Y8"
500002 2008 3.6030354 0 1 0 0 0 1 1 "T1A3Y8"
500002 2008 3.6030354 0 1 0 0 0 1 1 "T1A3Y8"
500002 2008 3.6030354 0 1 0 0 0 1 1 "T1A3Y8"
500002 2008 3.6030354 0 1 0 0 0 1 1 "T1A3Y8"
500002 2008 3.6030354 0 1 0 0 0 1 1 "T1A3Y8"
500002 2008 3.6030354 0 1 0 0 0 1 1 "T1A3Y8"
500002 2009         . 0 0 1 0 0 1 1 "T1A3Y8"
500002 2009         . 0 0 1 0 0 1 1 "T1A3Y8"
500002 2009         . 0 0 1 0 0 1 1 "T1A3Y8"
500002 2009         . 0 0 1 0 0 1 1 "T1A3Y8"
500002 2009         . 0 0 1 0 0 1 1 "T1A3Y8"
500002 2009         . 0 0 1 0 0 1 1 "T1A3Y8"
500002 2009         . 0 0 1 0 0 1 1 "T1A3Y8"
500002 2009         . 0 0 1 0 0 1 1 "T1A3Y8"
500002 2009         . 0 0 1 0 0 1 1 "T1A3Y8"
500002 2009         . 0 0 1 0 0 1 1 "T1A3Y8"
500002 2009         . 0 0 1 0 0 1 1 "T1A3Y8"
500002 2009         . 0 0 1 0 0 1 1 "T1A3Y8"
500002 2010         . 0 0 0 1 0 1 1 "T1A3Y8"
500002 2010         . 0 0 0 1 0 1 1 "T1A3Y8"
500002 2010         . 0 0 0 1 0 1 1 "T1A3Y8"
500002 2010         . 0 0 0 1 0 1 1 "T1A3Y8"
500002 2010         . 0 0 0 1 0 1 1 "T1A3Y8"
500002 2010         . 0 0 0 1 0 1 1 "T1A3Y8"
500002 2010         . 0 0 0 1 0 1 1 "T1A3Y8"
500002 2010         . 0 0 0 1 0 1 1 "T1A3Y8"
500002 2010         . 0 0 0 1 0 1 1 "T1A3Y8"
500002 2010         . 0 0 0 1 0 1 1 "T1A3Y8"
500002 2010         . 0 0 0 1 0 1 1 "T1A3Y8"
500002 2010         . 0 0 0 1 0 1 1 "T1A3Y8"
500002 2011         . 0 0 0 0 1 1 1 "T1A3Y8"
500002 2011         . 0 0 0 0 1 1 1 "T1A3Y8"
500002 2011         . 0 0 0 0 1 1 1 "T1A3Y8"
500002 2011         . 0 0 0 0 1 1 1 "T1A3Y8"
500002 2011         . 0 0 0 0 1 1 1 "T1A3Y8"
500002 2011         . 0 0 0 0 1 1 1 "T1A3Y8"
500002 2011         . 0 0 0 0 1 1 1 "T1A3Y8"
500002 2011         . 0 0 0 0 1 1 1 "T1A3Y8"
500002 2011         . 0 0 0 0 1 1 1 "T1A3Y8"
500002 2011         . 0 0 0 0 1 1 1 "T1A3Y8"
500002 2011         . 0 0 0 0 1 1 1 "T1A3Y8"
500002 2011         . 0 0 0 0 1 1 1 "T1A3Y8"
500002 2012         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2012         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2012         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2012         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2012         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2012         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2012         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2012         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2012         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2012         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2012         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2012         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2013         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2013         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2013         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2013         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2013         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2013         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2013         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2013         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2013         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2013         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2013         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2013         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2014         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2014         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2014         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2014         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2014         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2014         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2014         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2014         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2014         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2014         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2014         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2014         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2015         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2015         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2015         . 0 0 0 0 0 0 1 "T1A3Y8"
500002 2015         . 0 0 0 0 0 0 1 "T1A3Y8"
end

Kareman Yassin

Join Date: Dec 2019

Posts: 15
#2

23 Dec 2022, 13:10

Please don't mind the missing consumption observations above. In my complete dataset, I have over 600,000 observations.
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#3

23 Dec 2022, 15:58

Inspecting your data, I notice that the pre_*_treated_year and post_*_treated_year variables are all constant within any given combination of ID and cons_year. This induces a perfect colinearity among the pre_*_treated_year and post_*_treated_year variables and the ID and cons_year fixed effects. In order to break that colinearity, something has to be omitted. In the first go-around, you omitted the cons_year variable. In the second, since you included cons_year, Stata had to delete one of the pre_*_treated_year or post_*_treated_year variables. It (arbitrarily) chose post_2_treated_year.

More generally, in a regression model if you have a series of indicators for years, and then you attempt to add another variable that indicates a particular year, or a particular subset of all the years, you end up with a colinearity that must be broken in order to identify the regression parameters.
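
To see this in miniature, here is a minimal sketch on made-up data (not the poster's): two households observed over the same five years, a full set of year indicators, and one extra indicator that flags a single year. The variable names y and is_2009, the seed, and the toy panel itself are assumptions for illustration only.

```stata
* Toy data (made up): 2 households, each observed 2007-2011
clear
set seed 12345
set obs 10
generate ID = ceil(_n/5)
bysort ID: generate cons_year = 2006 + _n
generate y = rnormal()

* Extra indicator that flags one particular year
generate byte post_0_treated_year = (cons_year == 2009)

* It duplicates the 2009 year indicator exactly
generate byte is_2009 = (cons_year == 2009)
assert post_0_treated_year == is_2009

* Constant + 4 year indicators + the extra indicator = 6 columns requested,
* but they are linearly dependent, so one term is omitted
regress y i.cons_year post_0_treated_year
display e(rank)    // one fewer than the 6 columns requested
```

The note printed above the coefficient table names the term Stata dropped. Which one it picks is an implementation detail, so nothing should be read into that choice.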

Added: By the way, in your data example, there are many, many purely duplicate observations. In fact, there are actually only 9 distinct observations--all the others are exact copies of one of the others. Perhaps there are other variables in your full data set that distinguish these observations. But if you have all these duplicates in the full data set, then it probably means something went wrong in building the data set. It is rarely correct to have any exact duplicate observations in a data set. So, unless you have a good explanation for this, you need to review the data management that created this data set and find and fix whatever errors led to this.

Last edited by Clyde Schechter; 23 Dec 2022, 16:04.
Kareman Yassin

Join Date: Dec 2019

Posts: 15
#4

24 Dec 2022, 06:26

Thank you, Clyde.
You are right. It is because I aggregated the data from monthly to annual observations.
To overcome this, I dropped the duplicates and kept one observation per ID and year with the code "drop if cons_month!=1".

I created the pre_*_treated_year and post_*_treated_year variables (the leads and lags) from the cons_year and treat_year variables:

g pre_1_treated_year=0
replace pre_1_treated_year=1 if treat_year -1 ==cons_year
g post_0_treated_year= 0
replace post_0_treated_year=1 if treat_year==cons_year

Any further advice on how I can incorporate year-fixed effects?

Code:

input long ID float(cons_year cons_month treat_year aver_logG11A pre_2_treated_year pre_1_treated_year post_0_treated_year post_1_treated_year post_2_treated_year) str11 postalcode float around_treat_1
500002 2007 1 2009 3.00221 1 0 0 0 0 "T1A3Y8" 1
500002 2008 1 2009 3.6030354 0 1 0 0 0 "T1A3Y8" 1
500002 2009 1 2009 . 0 0 1 0 0 "T1A3Y8" 1
500002 2010 1 2009 . 0 0 0 1 0 "T1A3Y8" 1
500002 2011 1 2009 . 0 0 0 0 1 "T1A3Y8" 1
500002 2012 1 2009 . 0 0 0 0 0 "T1A3Y8" 0
500002 2013 1 2009 . 0 0 0 0 0 "T1A3Y8" 0
500002 2014 1 2009 . 0 0 0 0 0 "T1A3Y8" 0
500002 2015 1 2009 . 0 0 0 0 0 "T1A3Y8" 0
500002 2016 1 2009 . 0 0 0 0 0 "T1A3Y8" 0
500002 2017 1 2009 3.0575035 0 0 0 0 0 "T1A3Y8" 0
500002 2018 1 2009 3.1234145 0 0 0 0 0 "T1A3Y8" 0
500002 2019 1 2009 3.1065235 0 0 0 0 0 "T1A3Y8" 0
500011 2007 1 2008 2.5467346 0 1 0 0 0 "T1A3Y9" 1
500011 2008 1 2008 2.645737 0 0 1 0 0 "T1A3Y9" 1
500011 2009 1 2008 2.3502724 0 0 0 1 0 "T1A3Y9" 1
500011 2010 1 2008 2.1683176 0 0 0 0 1 "T1A3Y9" 1
500011 2011 1 2008 2.1529005 0 0 0 0 0 "T1A3Y9" 0
500011 2012 1 2008 2.0756474 0 0 0 0 0 "T1A3Y9" 0
500011 2013 1 2008 2.3095813 0 0 0 0 0 "T1A3Y9" 0
500011 2014 1 2008 2.2333722 0 0 0 0 0 "T1A3Y9" 0
500011 2015 1 2008 2.084697 0 0 0 0 0 "T1A3Y9" 0
500011 2016 1 2008 1.9511445 0 0 0 0 0 "T1A3Y9" 0
500011 2017 1 2008 2.0394382 0 0 0 0 0 "T1A3Y9" 0
500011 2018 1 2008 2.1257088 0 0 0 0 0 "T1A3Y9" 0
500011 2019 1 2008 1.9619623 0 0 0 0 0 "T1A3Y9" 0
500012 2007 1 2008 2.827986 0 1 0 0 0 "T1A3Y9" 1
500012 2008 1 2008 2.95071 0 0 1 0 0 "T1A3Y9" 1
500012 2009 1 2008 2.616236 0 0 0 1 0 "T1A3Y9" 1
500012 2010 1 2008 2.2561789 0 0 0 0 1 "T1A3Y9" 1
500012 2011 1 2008 2.567861 0 0 0 0 0 "T1A3Y9" 0
500012 2012 1 2008 2.58276 0 0 0 0 0 "T1A3Y9" 0
500012 2013 1 2008 2.2110295 0 0 0 0 0 "T1A3Y9" 0
500012 2014 1 2008 2.540599 0 0 0 0 0 "T1A3Y9" 0
500012 2015 1 2008 2.207406 0 0 0 0 0 "T1A3Y9" 0
500012 2016 1 2008 2.1315262 0 0 0 0 0 "T1A3Y9" 0
500012 2017 1 2008 2.4301245 0 0 0 0 0 "T1A3Y9" 0
500012 2018 1 2008 2.6724505 0 0 0 0 0 "T1A3Y9" 0
500012 2019 1 2008 2.5272865 0 0 0 0 0 "T1A3Y9" 0
500013 2007 1 2012 2.659758 0 0 0 0 0 "T1A3Z2" 0
500013 2008 1 2012 2.947132 0 0 0 0 0 "T1A3Z2" 0
500013 2009 1 2012 2.703247 0 0 0 0 0 "T1A3Z2" 0
500013 2010 1 2012 2.2754622 1 0 0 0 0 "T1A3Z2" 1
500013 2011 1 2012 2.3566437 0 1 0 0 0 "T1A3Z2" 1
500013 2012 1 2012 2.0968745 0 0 1 0 0 "T1A3Z2" 1
500013 2013 1 2012 2.2199185 0 0 0 1 0 "T1A3Z2" 1
500013 2014 1 2012 2.151134 0 0 0 0 1 "T1A3Z2" 1
500013 2015 1 2012 2.0622241 0 0 0 0 0 "T1A3Z2" 0
500013 2016 1 2012 2.0817647 0 0 0 0 0 "T1A3Z2" 0
500013 2017 1 2012 2.291003 0 0 0 0 0 "T1A3Z2" 0
500013 2018 1 2012 2.3351383 0 0 0 0 0 "T1A3Z2" 0
500013 2019 1 2012 2.3085253 0 0 0 0 0 "T1A3Z2" 0
500015 2007 1 2008 3.0049164 0 1 0 0 0 "T1A3Y9" 1
500015 2008 1 2008 3.118643 0 0 1 0 0 "T1A3Y9" 1
500015 2009 1 2008 3.0627625 0 0 0 1 0 "T1A3Y9" 1
500015 2010 1 2008 2.8997 0 0 0 0 1 "T1A3Y9" 1
500015 2011 1 2008 2.8020704 0 0 0 0 0 "T1A3Y9" 0
500015 2012 1 2008 2.742058 0 0 0 0 0 "T1A3Y9" 0
500015 2013 1 2008 2.610525 0 0 0 0 0 "T1A3Y9" 0
500015 2014 1 2008 2.7011805 0 0 0 0 0 "T1A3Y9" 0
500015 2015 1 2008 2.585472 0 0 0 0 0 "T1A3Y9" 0
500015 2016 1 2008 2.514077 0 0 0 0 0 "T1A3Y9" 0
500015 2017 1 2008 2.554027 0 0 0 0 0 "T1A3Y9" 0
500015 2018 1 2008 2.7253745 0 0 0 0 0 "T1A3Y9" 0
500015 2019 1 2008 2.519378 0 0 0 0 0 "T1A3Y9" 0
500025 2007 1 2010 3.543462 0 0 0 0 0 "T1A3Z1" 0
500025 2008 1 2010 3.118643 1 0 0 0 0 "T1A3Z1" 1
500025 2009 1 2010 4.099472 0 1 0 0 0 "T1A3Z1" 1
500025 2010 1 2010 3.754006 0 0 1 0 0 "T1A3Z1" 1
500025 2011 1 2010 3.7471156 0 0 0 1 0 "T1A3Z1" 1
500025 2012 1 2010 3.423333 0 0 0 0 1 "T1A3Z1" 1
500025 2013 1 2010 3.8096955 0 0 0 0 0 "T1A3Z1" 0
500025 2014 1 2010 4.0888476 0 0 0 0 0 "T1A3Z1" 0
500025 2015 1 2010 3.440675 0 0 0 0 0 "T1A3Z1" 0
500025 2016 1 2010 2.969068 0 0 0 0 0 "T1A3Z1" 0
500025 2017 1 2010 3.022299 0 0 0 0 0 "T1A3Z1" 0
500025 2018 1 2010 3.090267 0 0 0 0 0 "T1A3Z1" 0
500025 2019 1 2010 2.883655 0 0 0 0 0 "T1A3Z1" 0
500028 2007 1 2010 3.0263076 0 0 0 0 0 "T1A3Z1" 0
500028 2008 1 2010 3.1282585 1 0 0 0 0 "T1A3Z1" 1
500028 2009 1 2010 3.074191 0 1 0 0 0 "T1A3Z1" 1
500028 2010 1 2010 2.5715125 0 0 1 0 0 "T1A3Z1" 1
500028 2011 1 2010 2.61623 0 0 0 1 0 "T1A3Z1" 1
500028 2012 1 2010 2.397949 0 0 0 0 1 "T1A3Z1" 1
500028 2013 1 2010 2.76553 0 0 0 0 0 "T1A3Z1" 0
500028 2014 1 2010 2.749447 0 0 0 0 0 "T1A3Z1" 0
500028 2015 1 2010 2.554122 0 0 0 0 0 "T1A3Z1" 0
500028 2016 1 2010 2.473531 0 0 0 0 0 "T1A3Z1" 0
500028 2017 1 2010 2.653307 0 0 0 0 0 "T1A3Z1" 0
500028 2018 1 2010 2.7603786 0 0 0 0 0 "T1A3Z1" 0
500028 2019 1 2010 2.672886 0 0 0 0 0 "T1A3Z1" 0
500031 2007 1 2009 3.31795 1 0 0 0 0 "T1A3Z1" 1
500031 2008 1 2009 3.3417866 0 1 0 0 0 "T1A3Z1" 1
500031 2009 1 2009 2.991304 0 0 1 0 0 "T1A3Z1" 1
500031 2010 1 2009 2.678122 0 0 0 1 0 "T1A3Z1" 1
500031 2011 1 2009 2.805276 0 0 0 0 1 "T1A3Z1" 1
500031 2012 1 2009 2.6316535 0 0 0 0 0 "T1A3Z1" 0
500031 2013 1 2009 2.651586 0 0 0 0 0 "T1A3Z1" 0
500031 2014 1 2009 2.6733234 0 0 0 0 0 "T1A3Z1" 0
500031 2015 1 2009 2.723623 0 0 0 0 0 "T1A3Z1" 0
end
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#5

24 Dec 2022, 11:30

Any further advice on how I can incorporate year-fixed effects?

You can't. It is a mathematical impossibility. The variables you are defining are colinear with the ID and year-fixed effects, so you cannot have all of them in the model. Since I imagine you are much more interested in the pre_ and post_ effects, you have to forgo the year fixed effects.
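
One way to see where the dependence comes from, as a hedged sketch on made-up data: when treat_year is constant within household, event time equals cons_year minus treat_year, so a weighted sum of the event-time indicators equals something the year and ID fixed effects already span. The toy panel, the seed, and the names event_time, lincomb and y are assumptions for illustration; the indicator names follow post #4.

```stata
* Toy panel (made up): 3 households treated in 2010, 2011 and 2012,
* each observed from 2 years before to 2 years after its own event
clear
set seed 2022
set obs 3
generate ID = _n
generate treat_year = 2009 + ID
expand 5
bysort ID: generate cons_year = treat_year - 3 + _n
generate event_time = cons_year - treat_year      // -2, -1, 0, 1, 2
generate y = rnormal()

* Event-time indicators, built from cons_year and treat_year
generate byte pre_2_treated_year  = (event_time == -2)
generate byte pre_1_treated_year  = (event_time == -1)
generate byte post_0_treated_year = (event_time ==  0)
generate byte post_1_treated_year = (event_time ==  1)
generate byte post_2_treated_year = (event_time ==  2)

* Weighted sum of the indicators = cons_year - treat_year.
* cons_year is spanned by the year indicators, treat_year by the ID indicators.
generate lincomb = -2*pre_2_treated_year - pre_1_treated_year ///
    + post_1_treated_year + 2*post_2_treated_year
assert lincomb == cons_year - treat_year

* With explicit ID and year indicators, one extra term is omitted
regress y pre_2_treated_year post_0_treated_year post_1_treated_year ///
    post_2_treated_year i.cons_year i.ID
```

Leaving out pre_1_treated_year as the reference category does not remove this dependence, because within the event window the five indicators sum to one. A second term therefore has to go.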

If you wanted to use a random effects model instead, then I think you can then add i.cons_year, because the random effects modeling would remove the ID fixed effects and break the colinearity. But you will have to be prepared to defend your use of a random effects model--these are frowned upon in economics and finance. And saying you did it so that you could include the year indicators probably won't cut it.
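
A minimal sketch of what such a random-effects specification might look like, assuming the -dataex- example from post #4 is in memory. The numeric cluster identifier pc_id is a hypothetical helper name, and the choice of -xtreg, re- with clustered standard errors is one possible reading of the suggestion, not a recommendation.

```stata
* Assumes the -dataex- example from post #4 is in memory
egen long pc_id = group(postalcode)     // numeric cluster id (hypothetical name)
xtset ID cons_year

* Random-effects model with explicit year indicators
xtreg aver_logG11A pre_2_treated_year post_0_treated_year ///
    post_1_treated_year post_2_treated_year i.cons_year ///
    if around_treat_1 == 1, re vce(cluster pc_id)
```

With only a handful of households and four postal-code clusters in the example, the standard errors from this run are illustrative at best.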

Last edited by Clyde Schechter; 24 Dec 2022, 11:34.

Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#6

24 Dec 2022, 14:37

Since it's a slow day, let me elaborate a bit on this colinearity problem and why you can't, and shouldn't try to, include the cons_year variables if you also need estimates of the pre_* and post_* effects.

From a purely linear algebra perspective, when there is a colinearity, the regression coefficients are unidentified, and something must be done to correct that. Unidentified means that there is no unique set of coefficients that will work: there are many (infinitely many, in fact) equally good solutions. One way to resolve this difficulty is to impose one or more constraints on the coefficients, such as specifying that their sum must equal some specified number, or that some specified pair of them must be equal, or something like that. The most commonly used way, and the one that is built in automatically in Stata, is to constrain one of them to be zero. Constraining a coefficient to be zero is exactly the same thing as omitting that variable. Sometimes more than one constraint is needed to break the colinearity.

When you do impose sufficiently many constraints to identify the model, the choice of constraint will affect the coefficients you get for the variables that participate in the colinearity. Different constraints will, in general, lead to different coefficient estimates. The coefficients of variables that are not involved in the colinearity are not affected by this choice: they will be the same regardless of which constraint is imposed. But the coefficients of all the variables involved in the colinearity will change.

If the colinearity is one that does not involve variables you are really interested in, only variables included to adjust for their nuisance effects (i.e. so-called "control variables"), this is not a problem. The coefficients of those nuisance variables are meaningless artifacts of which constraint was used, but since you don't care about those coefficients anyway, it's not an issue. But when, as in your case, the coefficients for the effects you are really trying to study are also involved in the colinearity, this is disastrous. It means that the key results you get have no real meaning--they, too, are artifacts determined by which particular constraint you use to break the colinearity.

I'll demonstrate how this plays out in your example data. I have to modify your regression command a bit to do this because you did not include the treat_group variable. But nothing changes in principle. We will run

Code:

reghdfe aver_logG11A pre_2_treated_year post_0_treated_year ///
    post_1_treated_year post_2_treated_year ///
    if around_treat_1==1, absorb(ID) cluster(postalcode)

two different ways. In one we will add i.cons_year. In the second we will add ib2011.cons_year. The latter is an instruction to Stata to use 2011 as the base year instead of the default 2007. So all we are doing is changing the base year for the cons_year indicator ("dummy") variables.

Code:

. reghdfe aver_logG11A pre_2_treated_year post_0_treated_year ///
>     post_1_treated_year post_2_treated_year i.cons_year ///
>     if around_treat_1==1, absorb(ID) cluster(postalcode) resid
(MWFE estimator converged in 1 iterations)
warning: missing F statistic; dropped variables due to collinearity or too few clusters
note: 2014.cons_year omitted because of collinearity

HDFE Linear regression                            Number of obs   =         34
Absorbing 1 HDFE group                            F(  10,      3) =          .
Statistics robust to heteroskedasticity           Prob > F        =          .
                                                  R-squared       =     0.8926
                                                  Adj R-squared   =     0.7784
                                                  Within R-sq.    =     0.5371
Number of clusters (postalcode) =          4      Root MSE        =     0.2401

                                    (Std. err. adjusted for 4 clusters in postalcode)
-------------------------------------------------------------------------------------
                    |               Robust
       aver_logG11A | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
--------------------+----------------------------------------------------------------
 pre_2_treated_year |   -.230728   .0951976    -2.42   0.094    -.5336893    .0722333
post_0_treated_year |  -.1446144   .0843869    -1.71   0.185    -.4131712    .1239423
post_1_treated_year |  -.2799469   .0590188    -4.74   0.018    -.4677711   -.0921227
post_2_treated_year |  -.3247856   .0806614    -4.03   0.028    -.5814861    -.068085
                    |
          cons_year |
              2008  |   .1166948   .1818424     0.64   0.567    -.4620089    .6953986
              2009  |   .0921805   .0898297     1.03   0.380    -.1936976    .3780587
              2010  |  -.1034243   .1002867    -1.03   0.378    -.4225814    .2157328
              2011  |  -.0241536   .0537205    -0.45   0.683    -.1951163     .146809
              2012  |   -.195858   .0325272    -6.02   0.009     -.299374    -.092342
              2013  |   .0239458    .027478     0.87   0.448    -.0635014    .1113931
              2014  |          0  (omitted)
                    |
              _cons |   3.047999   .1172532    26.00   0.000     2.674847    3.421151
-------------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
          ID |         8           8           0    *|
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation

.
. reghdfe aver_logG11A pre_2_treated_year post_0_treated_year ///
>     post_1_treated_year post_2_treated_year ib2011.cons_year ///
>     if around_treat_1==1, absorb(ID) cluster(postalcode) resid
(MWFE estimator converged in 1 iterations)
warning: missing F statistic; dropped variables due to collinearity or too few clusters
note: 2014.cons_year omitted because of collinearity

HDFE Linear regression                            Number of obs   =         34
Absorbing 1 HDFE group                            F(  10,      3) =          .
Statistics robust to heteroskedasticity           Prob > F        =          .
                                                  R-squared       =     0.8926
                                                  Adj R-squared   =     0.7784
                                                  Within R-sq.    =     0.5371
Number of clusters (postalcode) =          4      Root MSE        =     0.2401

                                    (Std. err. adjusted for 4 clusters in postalcode)
-------------------------------------------------------------------------------------
                    |               Robust
       aver_logG11A | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
--------------------+----------------------------------------------------------------
 pre_2_treated_year |  -.2387792   .0965798    -2.47   0.090    -.5461394    .0685809
post_0_treated_year |  -.1365632   .0732986    -1.86   0.159     -.369832    .0967055
post_1_treated_year |  -.2638445   .0520788    -5.07   0.015    -.4295826   -.0981064
post_2_treated_year |  -.3006319   .0611413    -4.92   0.016    -.4952109   -.1060529
                    |
          cons_year |
              2007  |   .0563585   .1253479     0.45   0.683    -.3425544    .4552713
              2008  |   .1650021   .1518466     1.09   0.357    -.3182417    .6482458
              2009  |   .1324366    .079922     1.66   0.196    -.1219108    .3867839
              2010  |  -.0712195   .1065313    -0.67   0.552    -.4102497    .2678107
              2012  |  -.1797556   .0287628    -6.25   0.008    -.2712918   -.0882194
              2013  |    .031997   .0147608     2.17   0.119    -.0149785    .0789726
              2014  |          0  (omitted)
                    |
              _cons |    3.00277    .095798    31.34   0.000     2.697898    3.307642
-------------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
          ID |         8           8           0    *|
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation

Notice that the coefficients for the pre_* and post_* variables are all different in the two outputs. The differences are not huge. But, in fact, I could contrive a constraint that would make the differences large if I took the time and trouble to do so. So the point is that neither set of results is actually assessing the effects of the pre_* and post_* variables. Instead, those effects are contaminated by the particular constraint used to eliminate the colinearity.

It is important to note, however, that although they give different coefficient estimates, the models are in fact equivalent, fitting the data identically. If I calculate the predicted values of aver_logG11A from both models, except for tiny rounding errors, they come out exactly the same:

Code:

. reghdfe aver_logG11A pre_2_treated_year post_0_treated_year ///
>     post_1_treated_year post_2_treated_year i.cons_year ///
>     if around_treat_1==1, absorb(ID) cluster(postalcode) resid
(MWFE estimator converged in 1 iterations)
warning: missing F statistic; dropped variables due to collinearity or too few clusters
note: 2014.cons_year omitted because of collinearity

HDFE Linear regression                            Number of obs   =         34
Absorbing 1 HDFE group                            F(  10,      3) =          .
Statistics robust to heteroskedasticity           Prob > F        =          .
                                                  R-squared       =     0.8926
                                                  Adj R-squared   =     0.7784
                                                  Within R-sq.    =     0.5371
Number of clusters (postalcode) =          4      Root MSE        =     0.2401

                                    (Std. err. adjusted for 4 clusters in postalcode)
-------------------------------------------------------------------------------------
                    |               Robust
       aver_logG11A | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
--------------------+----------------------------------------------------------------
 pre_2_treated_year |   -.230728   .0951976    -2.42   0.094    -.5336893    .0722333
post_0_treated_year |  -.1446144   .0843869    -1.71   0.185    -.4131712    .1239423
post_1_treated_year |  -.2799469   .0590188    -4.74   0.018    -.4677711   -.0921227
post_2_treated_year |  -.3247856   .0806614    -4.03   0.028    -.5814861    -.068085
                    |
          cons_year |
              2008  |   .1166948   .1818424     0.64   0.567    -.4620089    .6953986
              2009  |   .0921805   .0898297     1.03   0.380    -.1936976    .3780587
              2010  |  -.1034243   .1002867    -1.03   0.378    -.4225814    .2157328
              2011  |  -.0241536   .0537205    -0.45   0.683    -.1951163     .146809
              2012  |   -.195858   .0325272    -6.02   0.009     -.299374    -.092342
              2013  |   .0239458    .027478     0.87   0.448    -.0635014    .1113931
              2014  |          0  (omitted)
                    |
              _cons |   3.047999   .1172532    26.00   0.000     2.674847    3.421151
-------------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
          ID |         8           8           0    *|
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation

. predict prediction1, xbd
(66 missing values generated)

.
. reghdfe aver_logG11A pre_2_treated_year post_0_treated_year ///
>     post_1_treated_year post_2_treated_year ib2011.cons_year ///
>     if around_treat_1==1, absorb(ID) cluster(postalcode) resid
(MWFE estimator converged in 1 iterations)
warning: missing F statistic; dropped variables due to collinearity or too few clusters
note: 2014.cons_year omitted because of collinearity

HDFE Linear regression                            Number of obs   =         34
Absorbing 1 HDFE group                            F(  10,      3) =          .
Statistics robust to heteroskedasticity           Prob > F        =          .
                                                  R-squared       =     0.8926
                                                  Adj R-squared   =     0.7784
                                                  Within R-sq.    =     0.5371
Number of clusters (postalcode) =          4      Root MSE        =     0.2401

                                    (Std. err. adjusted for 4 clusters in postalcode)
-------------------------------------------------------------------------------------
                    |               Robust
       aver_logG11A | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
--------------------+----------------------------------------------------------------
 pre_2_treated_year |  -.2387792   .0965798    -2.47   0.090    -.5461394    .0685809
post_0_treated_year |  -.1365632   .0732986    -1.86   0.159     -.369832    .0967055
post_1_treated_year |  -.2638445   .0520788    -5.07   0.015    -.4295826   -.0981064
post_2_treated_year |  -.3006319   .0611413    -4.92   0.016    -.4952109   -.1060529
                    |
          cons_year |
              2007  |   .0563585   .1253479     0.45   0.683    -.3425544    .4552713
              2008  |   .1650021   .1518466     1.09   0.357    -.3182417    .6482458
              2009  |   .1324366    .079922     1.66   0.196    -.1219108    .3867839
              2010  |  -.0712195   .1065313    -0.67   0.552    -.4102497    .2678107
              2012  |  -.1797556   .0287628    -6.25   0.008    -.2712918   -.0882194
              2013  |    .031997   .0147608     2.17   0.119    -.0149785    .0789726
              2014  |          0  (omitted)
                    |
              _cons |    3.00277    .095798    31.34   0.000     2.697898    3.307642
-------------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
          ID |         8           8           0    *|
-----------------------------------------------------+
* = FE nested within cluster; treated as redundant for DoF computation

. predict prediction2, xbd
(66 missing values generated)

.
. assert float(prediction1) == float(prediction2)

.

(The assert statement without the float() rounding reports 3 discrepancies. But if you examine those three observations you will see that they agree to at least 6 decimal places. In an idealized computer with infinite-precision floating-point calculations, there would be no discrepancies at all.)
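
The same effect can be reproduced in isolation. The sketch below uses two made-up scalars, a and b (hypothetical names), that are equal on paper but not in double precision, and compares them exactly, after rounding with float(), and with the built-in reldif() function. The tolerance 1e-12 is an arbitrary choice.

```stata
* Two doubles that are equal on paper but differ in the last bits
scalar a = 0.1 + 0.2
scalar b = 0.3

assert a != b                    // the doubles are not exactly equal
assert float(a) == float(b)      // equal once both are rounded to float precision
assert reldif(a, b) < 1e-12      // tolerance-based alternative
```

A tolerance-based check of the two prediction variables could follow the same pattern, for instance assert reldif(prediction1, prediction2) < 1e-6 if !missing(prediction1, prediction2), where the 1e-6 threshold is again only an assumption.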

Kareman Yassin

Join Date: Dec 2019

Posts: 15
#7

04 Jan 2023, 11:21

Thank you, Clyde, for taking the time to explain this thoroughly.
It is clear to me now. And I also learned when it's the best time to ask questions.
Kareman Yassin

Join Date: Dec 2019

Posts: 15
#8

12 Jan 2023, 07:21

Hi Clyde Schechter,

I have a follow-up question about fixed effects versus control variables:

Are these two regressions the same? If not, what are the differences in their interpretation?

"reghdfe aver_logG11A pre_2_treated_year post_0_treated_year post_1_treated_year post_2_treated_year if treat_group==1 & around_treat_1==1, absorb(ID cons_year) cluster(postalcode)"

"reghdfe aver_logG11A pre_2_treated_year post_0_treated_year post_1_treated_year post_2_treated_year i.cons_year if treat_group==1 & around_treat_1==1, absorb(ID) cluster(postalcode)"

As you mentioned before, I can't add year fixed effects ("cons_year") because they are colinear with the variables I am defining (pre_*_treated_year and post_*_treated_year).
But I need to control for time in this specification to capture time-variant conditions that affect all houses equivalently in a given year, such as weather (all houses are located in the same city, so experience similar weather) or energy prices.

Kareman
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#9

12 Jan 2023, 10:27

Are these two regressions the same? If not, what are the differences in their interpretation?

At the risk of sounding obscure, they are the same model, but they may be different analyses.

By a model, I mean a set of equations that provides predictions of the outcome variable from given values of the predictor/explanatory variables. Both of these regressions will give exactly the same predicted values of aver_logG11A in all observations in the data. (Well, depending on issues of rounding you might find small differences in far out decimal places, but none of your data is likely to be precise enough for that to matter anyway.)

By an analysis, I mean an attempt to explain the relationships between the outcome variable and the given values of the predictor/explanatory variables. In this situation the key results of interest are not the predicted values but the regression coefficients themselves. These two regressions may or may not give the same coefficients. The reason for that is, as we have already discussed earlier in the thread, that this model is unidentified due to a colinearity among the post* variables, the i.cons_year indicators, and the ID fixed effects. The two commands may result in the colinearity being broken differently in their corresponding outputs. In that case, you can see different coefficients for the variables that participate in the colinearity. And since, in either case, the coefficients are simply artifacts of the way in which the colinearity was broken, neither analysis can be regarded as providing valid relationships between those variables and aver_logG11A. (The coefficients for the other variables, however, will be the same in both analyses and are valid estimates of the strengths of the associations.)

I am equivocating as to whether you will actually see two different results from those regressions because I do not know how -reghdfe- goes about resolving colinearities, so I can only say that it may or may not handle them differently in these two situations. I wish to emphasize, however, that due to the known lack of identification of the coefficients in this model, even if both of these regressions provide the same coefficients, that does not make those coefficients valid estimates of associations. They are not.

But I need to control for time in this specification to capture time-variant conditions that affect all houses equivalently in a given year, such as weather (all houses are located in the same city, so experience similar weather) or energy prices.

Unfortunately, linear algebra does not care what we humans need to do. It is a mathematical impossibility to analyze the ID, cons_year, and post* variables simultaneously in a fixed-effects regression. As I see it, you have three alternatives:

1. Give up on the cons_year adjustments. Maybe they are not so important.

2. If those time-varying, ID-invariant effects are truly important, you may be able to get an adequate analysis without the cons_year indicators by directly including variables reflecting energy prices, weather, etc. Of course, it will remain possible that there are other such effects that you have missed. In my field, epidemiology, this is a fact of life and we live with it, acknowledging it as a limitation of our methods. We then anticipate other studies which may include more or different adjustments to lessen the gap.

3. Although fixed effects models are held nearly sacrosanct in certain disciplines, this is a good example of a problem that they are inherently incapable of solving. You may need to do a mixed-effects model instead. The latter do have their limitations, but an answer with limitations is often better than no answer at all.
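
As a rough sketch of the option of including year-level covariates directly, assuming the -dataex- example from post #4 is in memory: energy_price is a hypothetical covariate, filled with made-up numbers here only so that the code runs, and it stands in for whatever real price or weather series is available.

```stata
* Assumes the -dataex- example from post #4 is in memory.
* energy_price is a HYPOTHETICAL year-level covariate. It is filled with
* made-up numbers only so that the sketch runs; use real data instead.
set seed 2023
generate double energy_price = .
forvalues yr = 2007/2019 {
    local p = 10 + 2*runiform()          // one made-up value per year
    replace energy_price = `p' if cons_year == `yr'
}

* Event-study regression with the covariate in place of year indicators
reghdfe aver_logG11A pre_2_treated_year post_0_treated_year ///
    post_1_treated_year post_2_treated_year energy_price ///
    if around_treat_1 == 1, absorb(ID) cluster(postalcode)
```

One caveat under the same assumptions: a covariate that is an exact linear function of cons_year (a plain time trend, say) would bring the colinearity back, because within the event window cons_year equals event time plus the household's treat_year.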