Introducing the Forward DID Command

Jared Greathouse

Join Date: Sep 2021
Posts: 2170


12 Jul 2024, 18:01

Hey everyone. Happy to finally be sharing software for Stata once more! I've been using Python these days. Anyways, I've developed the Forward DID command for Stata. The help file, ado file, and two sample datasets are available at my website. Since it's still under development, I won't be sending it to SSC just yet, so you'll need to place it on your ado-path manually (unless there's a way to do this that I'm unaware of). At present, it should work in Stata 16 and later, as it uses frames. It needs no special libraries or additional commands, and it is written entirely in Stata's ado language.

Forward DID comes in handy when we wish to estimate the average treatment effect on the treated (ATT) for one or more units, but we don't know which control units are the most relevant. It uses a variant of the forward selection algorithm (daniel klein was most helpful in giving suggestions for the underlying code) to select the optimal control group for a treated unit, based on the pre-intervention outcome data for the control units. After we select the control group, we estimate the ATT and 95% CIs following the method described in the original paper. At present it is only automated for one treated unit; however, if you know enough about the developments in DD, you can likely extend this to multiple treated units with a little dynamic adjustment for the control group.
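
For readers who want to see the selection idea in code, here is a minimal Mata sketch of a forward selection loop. It is not the code inside fdid. It assumes one reading of the method: at each step, add the control whose inclusion gives the best pre-intervention R2 for a DID fit (intercept plus the simple average of the chosen controls), then keep the step with the highest R2 overall. The function name fsel_sketch and its arguments are hypothetical.

Code:

mata:
// Illustration only. y1 holds the treated unit's pre-intervention outcomes
// (T0 x 1); Y0 holds the same periods for the N candidate controls (T0 x N).
real rowvector fsel_sketch(real colvector y1, real matrix Y0)
{
    real scalar    N, k, j, r2, stepr2, stepj, bestr2, tss
    real rowvector chosen, best, cand
    real colvector gap, res

    N      = cols(Y0)
    tss    = sum((y1 :- mean(y1)):^2)   // total sum of squares of the treated series
    chosen = J(1, 0, .)                 // controls selected so far (none yet)
    best   = chosen
    bestr2 = .
    for (k = 1; k <= N; k++) {
        stepj  = .
        stepr2 = .
        for (j = 1; j <= N; j++) {
            if (anyof(chosen, j)) continue
            cand = (chosen, j)
            // DID fit: treated = intercept + simple average of candidate controls
            gap  = y1 - rowsum(Y0[., cand]) / cols(cand)
            res  = gap :- mean(gap)
            r2   = 1 - sum(res:^2) / tss
            if (stepj == . | r2 > stepr2) {
                stepr2 = r2
                stepj  = j
            }
        }
        chosen = (chosen, stepj)        // keep the best addition at this step
        if (k == 1 | stepr2 > bestr2) {
            bestr2 = stepr2
            best   = chosen             // remember the best set seen so far
        }
    }
    return(best)
}
end

mata: fsel_sketch((1\2\3\4), ((1\2\3\5), (2\2\2\2), (0\1\2\3)))

The function returns the column indices of the selected controls. The toy call at the end uses made-up numbers purely to show the calling convention.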

As usual, feedback and comments are most appreciated. For an example of how it works, we can do

Code:

u "agbasque.dta", clear


qui fdid gdpcap, tr(treat) gr1opts(scheme(sj) name(ag, replace))

cwf cfframe

This returns the following frame (cfframe):

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input double(year gdpcap5) float(cf te)
1955  3.853184630005267   3.75793     .0952546
1956 3.9456582961508766   3.90803    .03762826
1957  4.033561734872626 4.0553446  -.021782847
1958  4.023421896896646  4.097583   -.07416092
1959  4.013781968405232 4.1396422   -.12586027
1960  4.285918396222732  4.401853   -.11593468
1961  4.574336095797406  4.677667   -.10333104
1962  4.898957353563045  4.938842   -.03988494
1963  5.197014981629133  5.187985   .009029562
1964 5.3389029787527225  5.259322    .07958081
1965  5.465153005251848  5.324697    .14045647
1966  5.545915627064143  5.448125    .09779026
1967  5.614895726639487  5.563021    .05187454
1968 5.8521849330715785   5.79924     .0529453
1969 6.0814054173695915    6.0361     .0453055
1970   6.17009424134957  6.171775 -.0016810996
1971  6.283633404546246  6.315913  -.032279797
1972 6.5555553986528405    6.6104    -.0548448
1973  6.810768561103078   6.90096   -.09019189
1974  7.105184302810804  7.055095    .05008958
1975  7.377891682175629   7.20316     .1747319
1976  7.232933621922754   7.27621   -.04327621
1977  7.089831372119127  7.344905   -.25507352
1978  6.786703607144611  7.312414    -.5257106
1979 6.6398173868571035  7.322126    -.6823086
1980  6.562839171369564  7.367006    -.8041667
1981   6.50078545499277  7.436914    -.9361285
1982  6.545058606999563  7.550632   -1.0055734
1983  6.595329801139407  7.669598   -1.0742679
1984  6.761496750091492  7.768819   -1.0073225
1985  6.937160671727721  7.872968    -.9358075
1986  7.332191151300521  8.342334   -1.0101427
1987  7.742788123594152  8.811522   -1.0687335
1988   8.12053664075889  9.270319   -1.1497823
1989  8.509711162324157  9.724476   -1.2147647
1990  8.776777889074104  9.961907   -1.1851295
1991   9.02527866619582 10.199697   -1.1744179
1992  8.873892824706335  9.992613     -1.11872
1993  8.718223539089278  9.781245   -1.0630217
1994  9.018137849286365  10.13043   -1.1122934
1995  9.440873861653367 10.433558    -.9926846
1996   9.68651813767495 10.676703    -.9901853
1997 10.170665872808662  11.12229    -.9516248
end
format %ty year

Here we have the counterfactual for the Basque Country had terrorism not occurred, and we also have the observed values. The counterfactual is a convex, uniform combination of the states Cataluna and Aragon, replicating the original findings from the first paper describing the synthetic control method. Please, do let me know how you like it (if you do!).


Jared Greathouse

Join Date: Sep 2021
Posts: 2170

13 Jul 2024, 09:13

The way you may install fdid and its associated help file is

Code:

copy "https://raw.githubusercontent.com/jgreathouse9/jgreathouse9.github.io/master/stata/fdid/fdid.ado" "C:\ado\plus\f\fdid.ado", replace
copy "https://raw.githubusercontent.com/jgreathouse9/jgreathouse9.github.io/master/stata/fdid/fdid.sthlp" "C:\ado\plus\f\fdid.sthlp", replace

where you copy the files from my github to wherever your ado files are stored. For example, in my case (starting without the files in my directory) I did

Code:

copy "https://raw.githubusercontent.com/jgreathouse9/jgreathouse9.github.io/master/stata/fdid/fdid.ado" "C:\ado\plus\f\fdid.ado", replace
copy "https://raw.githubusercontent.com/jgreathouse9/jgreathouse9.github.io/master/stata/fdid/fdid.sthlp" "C:\ado\plus\f\fdid.sthlp", replace

clear *

u "https://github.com/jgreathouse9/jgreathouse9.github.io/raw/master/stata/fdid/hcw.dta"

fdid gdp, tr(treat) unitnames(state) gr1opts(scheme(sj) name(hcw, replace))

So long as you install it at the specified directory, you should be able to get the same results as I did.
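
If your PLUS directory is not C:\ado\plus, the same idea can be written without a hard-coded path. The sketch below builds the destination from c(sysdir_plus), which holds the PLUS directory Stata reports under sysdir. It assumes that the value ends in a path separator (as the sysdir listing shows) and that the f subfolder may not exist yet. Run it as one block from a do-file so the local macros stay defined.

Code:

* Build the destination from Stata's PLUS directory rather than hard-coding it
local dest "`c(sysdir_plus)'f"
capture mkdir "`dest'"      // ignored if the folder already exists

local src "https://raw.githubusercontent.com/jgreathouse9/jgreathouse9.github.io/master/stata/fdid"
copy "`src'/fdid.ado"   "`dest'/fdid.ado",   replace
copy "`src'/fdid.sthlp" "`dest'/fdid.sthlp", replace

which fdid                  // confirms Stata can find the command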

Comment

George Ford

Join Date: Aug 2014

Posts: 3148
#3

13 Jul 2024, 09:43

Thanks, Jared. I've got it up and running with my own data. Very fast.

A couple of thoughts from early use:

1. I've noticed that e(df_m), e(F), and e(p) are empty. Maybe not useful, but they are empty in both your data and mine.

2. The results are not presented after the algorithm runs. I had to "matrix list e(ATTS)" to see results. It might be nice just to have them automatically presented (like sdid).

3. One thing I like about sdid is that it gives you a coefficient, se, and t, thus making it a fairly typical presentation. The se after fdid can be calculated (assuming 1.96, which is what the ado has). But perhaps a summary presentation of the results like sdid's would be useful.

4. I'm using preserve/restore. Since you're using frames, it might be nice to keep the original dataset and then create two frames for the modified data. After estimation, some text could indicate what's in what frame and their names.
Comment

Jared Greathouse

Join Date: Sep 2021
Posts: 2170

13 Jul 2024, 09:47

Hey thanks for working with it. Try and install it again (from my most recent post). I give the ATT, 95% CI, and the preintervention R2. I make it more similar to sdid, in that regard.

Edit: I believe this addresses the second point too. I make a new frame that copies the original dataset, then uses that to manipulate the data. I drop it at the end, so the user should have their original df and the cfframe. That way the original data isn't destroyed/altered.

In the help file, I think I'll describe what each variable in cfframe is.

That is, when I run

Code:

clear *

u "https://github.com/jgreathouse9/jgreathouse9.github.io/raw/master/stata/fdid/hcw.dta"
cls
fdid gdp, tr(treat) unitnames(state) gr1opts(scheme(sj) name(hcw, replace))

I get

Code:

Forward Difference-in-Differences

-----------------------------------------------------------------------------
         gdp |     ATT     |     [95% Conf. Interval]     | R-Square     
-------------+---------------------------------------------------------------
       treat |   0.02540      0.01738     0.03343           0.84278
-----------------------------------------------------------------------------
FDID selects philippines, singapore, thailand, norway, mexico, korea, indonesia, newzealand, malaysia, as the optimal donors.
Refer to Li (2024) for theoretical derivations.

in return.

Last edited by Jared Greathouse; 13 Jul 2024, 09:59.

Comment

daniel klein

Join Date: Mar 2014

Posts: 3848
#5

13 Jul 2024, 12:00

I have one general critical comment. The program starts with

Code:

cap frame drop cfframe
cap frame drop reshaped

What if the user has data in those frames? It is generally considered bad programming style to wipe objects even if it were documented. Use temporary objects and let the user decide whether to keep any and if so under which name(s).

A minor general comment: drop the

Code:

capture program drop

You never ever need it in a final ado-file. I have encountered this line so often that I was about to write a brief insert for SJ about it. There are hardly any side effects, so I decided not to. But I might write a brief post here on Statalist.
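
To make the temporary-objects suggestion concrete, here is a minimal sketch. The program name fdid_frames_sketch and the cfframe() option are hypothetical and are not fdid's actual syntax. As I understand it, a frame created under a tempname is dropped automatically when the program exits, so nothing the user owns is touched unless they ask for a named copy.

Code:

program define fdid_frames_sketch
    version 16
    syntax [, CFFrame(name)]

    tempname work
    frame copy `c(frame)' `work'        // work on a copy; the user's data stay as they are
    frame `work' {
        quietly count                   // placeholder for the real data manipulation
    }

    * Keep the working frame only if the user asked for it, under the name they chose
    if "`cfframe'" != "" {
        capture confirm frame `cfframe'
        if !_rc {
            display as error "frame `cfframe' already exists"
            exit 110
        }
        frame copy `work' `cfframe'
    }
end

Calling fdid_frames_sketch with no options leaves no new frames behind. Calling it with cfframe(myresults) leaves one frame named myresults, and it refuses to overwrite a frame that already exists.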
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#6

13 Jul 2024, 13:30

daniel klein yeah I think you mentioned the last part to me before. I'll do that.

What I may do for the first point, then, is either give users the option to keep the temporary frames for use after the program is finished, or just force them to specify the name of the returned frame. That way there's no clash with preexisting frames.
Comment
George Ford

Join Date: Aug 2014

Posts: 3148
#7

13 Jul 2024, 16:50

Got the new version and am playing with it. I'm looking forward to studying up on FDID, and I appreciate your willingness to take comments.

Thoughts.

1. Following Daniel, I'd give options on frame names, but still have an odd default (_fdid_frame_1).

2. Like sdid, add the matrix e(series) [including everything you're dropping into cfframe], but have a frame as an option. I think this would be a mkmat from cfframe, so easy. I usually take e(series) from sdid and drop it in a new frame to make a useful graph, so having the option to automatically do so is nice.

3. To the results, I'd add the z and prob level. That stuff is useful, and all the bits are in the ado file.

4. It's unusual for the id to be a string and not a number. I suppose that makes listing them easier in the results, but most programs require a number. The error message was clear, however. It will be a commonly invoked error, given the typical practice of requiring a number.

5. The ereturn could use some modifications.

A. I'd prefer: e(ATT) e(se) e(z) e(p), and maybe e(ATT_lb), e(ATT_ub). Or, maybe post all the results to r(table). Anything you see in the results should be accessible in ereturn or return. The se appears nowhere in ereturn, so someone would have to type that in to use it later, or manipulate the ATT and bounds to calculate it.

B. The e(cmdline) is a background command from "hidden" estimations. You might post the original command there and dump the extra bits [e(cmd) e(predict)].

C. Similarly, I think e(b) and e(V) could confuse people, as these are background results that do not link to the reported results in any obvious way. I think they could be deleted.

D. For e(ATTs), I'd move r2 to e(r2) and drop rmse since it is in e(rmse) already. It might be cleaner to have just e(ATT), e(se), and so forth. I don't mind having the UB/LB in an ATTs matrix, but this requires one to pull from a matrix rather than access directly [but the same is true for r(table)].

E. This isn't necessarily on you, but you still might think about how asdoc/estout/etc. are going to work with this. (I think we both can imagine the stream of posts on statalist about this.) This would require some rewriting of the way things are presented now, which I don't find very "Stata-like" even if informative.

a. The "Successful" line is nice, but un-needed. If it runs, it runs. Error codes appear to work.
b. Treatment_measured/Treated unit/ControlUnits could be under the results (like notes) (You already have chosen units as a note), which would asdoc cleaner.
c. The table asdoc's a bit strange. That may be an easy fix.
d. The list of the control pool could be stored in an e(controllist) and not reported in the results.
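
As a rough illustration of the mkmat idea in point 2: the sketch below assumes a frame named cfframe is still in memory with the variables year, cf, and te shown in the first post. The matrix name series and the frame name fdid_series are hypothetical.

Code:

* Frame contents -> matrix (matrices are not tied to a frame)
frame cfframe: mkmat year cf te, matrix(series)
matrix list series

* Matrix -> a fresh frame, ready for graphing
frame create fdid_series
frame fdid_series: svmat double series, names(col)
frame fdid_series: line cf year

With names(col), svmat names the new variables after the matrix columns, so the round trip keeps year, cf, and te as variable names.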
Comment
George Ford

Join Date: Aug 2014

Posts: 3148
#8

13 Jul 2024, 16:59

I'm not getting the graph in the updated version.
Comment
George Ford

Join Date: Aug 2014

Posts: 3148
#9

13 Jul 2024, 17:05

In some cases (when I change the DV), the optimal donor list is appearing as numbers, even though the list of the pool is in strings. I may have to share some data with you to trace that issue.
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#10

13 Jul 2024, 18:10

Following Daniel, I'd give options on frame names, but still have an odd default (_fdid_frame_1).

Yeah, I'll go with the "odd name method" for now; the tempname was giving me problems.

Like sdid, add the matrix e(series) [including everything you're dropping into cfframe], but have a frame as an option

I'll need to run sdid and see what this refers to, but I think I get what you mean.

To the results, I'd add the z and prob level. That stuff is useful, and all the bits are in the ado file.

Agreed.

It's unusual for the id to be a string and not a number.

Well, technically, the id is a number. It is the number supplied to xtset. The reason I'm creating value labels for them under the hood, though, is so we can know which donors are what. Otherwise, we'd have a control list of, say, "1,45,90,92", and that's less informative. So, the id is technically a number, right?

The ereturn could use some modifications.

I agree, I need to figure out how it will return only the things I define instead of what is also reported by cnsreg.

Or, maybe post all the results to r(table). Anything you see in the results should be accessible in ereturn or return.

I agree, r(table) it is. In the newer version (on my machine), I report the standard error in e(ATTs) (I should really change that name).

Similarly, I think e(b) and e(V) could confuse people, as these are background results that do not link to the reported results in any obvious way. I think they could be deleted.

Yeah, related to the ereturn remark above. I guess this would be a job for ereturn clear?
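
A minimal sketch of that pattern, for what it's worth: an eclass program that clears e() and then posts only the results it wants to expose. The program name fdid_ereturn_sketch and the numbers passed to it are made up for illustration; this is not how fdid itself stores results.

Code:

program define fdid_ereturn_sketch, eclass
    version 16
    args att se                     // hypothetical inputs: point estimate and its standard error

    ereturn clear                   // wipe whatever a background regression left in e()
    ereturn scalar ATT = `att'
    ereturn scalar se  = `se'
    ereturn scalar z   = `att'/`se'
    ereturn scalar p   = 2*(1 - normal(abs(`att'/`se')))
    ereturn local  cmd "fdid_ereturn_sketch"
end

fdid_ereturn_sketch -0.187 0.03
ereturn list

After the call, ereturn list should show only the four scalars and e(cmd), with no leftover e(b) or e(V).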

I think we both can imagine the stream of posts on statalist about this.

Unfortunately, I can.

Treatment_measured/Treated unit/ControlUnits could be under the results (like notes) (You already have chosen units as a note), which would asdoc cleaner.

yeah that's true, it would be cleaner like that. I modeled lots of this off the original synth, so maybe this is a holdout that I could get rid of and put it under the results.

The list of the control pool could be stored in an e(controllist) and not reported in the results.

I agree, but it kind of makes things easier to spot check. It's already reported under e(selected).

I'm not getting the graph in the updated version.

The plot is now optional. If the user specifies gr1opts, then the plot is returned. Else, if they specify nothing, no figure is created.

In some cases (when I change the DV), the optimal donor list is appearing as numbers, even though the list of the pool is in strings. I may have to share some data with you to trace that issue.

Yeah email me the data and code you used. The optimal donor pool shouldn't be numbers. Thanks, by the way, for such detailed remarks.
Comment

Jared Greathouse

Join Date: Sep 2021
Posts: 2170

#11

14 Jul 2024, 06:37

As of now, all of the suggestions above were addressed. All the results are returned, the t, p, and SE stats are there, and all the rest. The help and ado files have also been updated to reflect these changes. Assuming you use the versions of fdid at this link, the following code should run without any errors

Code:

clear *

u "https://github.com/jgreathouse9/jgreathouse9.github.io/raw/master/stata/fdid/hcw.dta"

cls
fdid gdp, tr(treat) unitnames(state) gr2opts(scheme(sj) name(hcwte, replace)) 

cls
clear *

import delim "https://raw.githubusercontent.com/synth-inference/synthdid/master/data/california_prop99.csv", clear delim(";")

egen id = group(state)

xtset id year, y

fdid packspercapita, tr(treated) unitnames(state) // gr1opts(scheme(sj) name(p99, replace))


cls
clear *

u "https://github.com/jgreathouse9/jgreathouse9.github.io/raw/master/stata/fdid/agbasque.dta", clear

fdid gdpcap, tr(treat)

It also runs with set varabbrev off, since SJ and SSC will care about that. Now, I guess I have to worry about how to get it to net install correctly.

Comment

George Ford

Join Date: Aug 2014

Posts: 3148
#12

14 Jul 2024, 07:43

Something isn't kosher. The 95% CI does not match up with the SE/p.

Code:

Forward Difference-in-Differences          T0 R2:     0.892     T0 RMSE:     0.087
-------------------------------------------------------------------------------------------
       linvr |     ATT        t         SE       [95% Conf. Interval]         p
-------------+-----------------------------------------------------------------------------
         did |    -0.187     1.861     0.1006     -0.2466    -0.1277        0.063
-------------------------------------------------------------------------------------------

When you construct the CI, you include /sqrt(`t2') , but you do not when you calculate the t-stat or the probability level.

My reading of Li is:

ATT +- 1.96*sqrt(omegahat/t2).

Looks like line 726 sets omegahat as sqrt(omegahat). But you still need the sqrt(`t2') in there.

Confidence interval from ado:

Code:

745 scalar CILB= scalar(ATT) - ((invnormal(0.975) * scalar(omegahatdid))/sqrt(`t2'))
746
747 scalar CIUB= scalar(ATT) + ((invnormal(0.975) * scalar(omegahatdid))/sqrt(`t2'))

I got mine squared away doing this:

Code:

753 scalar tstat = abs(scalar(ATT)/(scalar(omegahatdid)/sqrt(`t2')))
...
782 di as text %12s abbrev("`treatment'",12) " {c |} " as result %9.3f scalar(ATT) " "%9.3f scalar(tstat) " " %9.4f scalar(omegahatdid)/sqrt(`t2') "

It might be easier to create a scalar se early on, and then use that for all the CI/p/z calculations.
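
A minimal sketch of that single-scalar approach. The three input values are for illustration only (the ATT and the unscaled standard deviation are read off the table above, and t2 = 11 is an assumed number of post-intervention periods). In the ado they would come from the estimation step.

Code:

* Illustrative inputs only
scalar ATT         = -0.187
scalar omegahatdid = 0.1006
local  t2          = 11

* Define the standard error once, then reuse it everywhere
scalar se    = scalar(omegahatdid)/sqrt(`t2')
scalar tstat = abs(scalar(ATT)/scalar(se))
scalar pval  = 2*(1 - normal(scalar(tstat)))
scalar CILB  = scalar(ATT) - invnormal(0.975)*scalar(se)
scalar CIUB  = scalar(ATT) + invnormal(0.975)*scalar(se)

display "se = " %6.4f scalar(se) "  t = " %6.3f scalar(tstat) "  p = " %5.3f scalar(pval)
display "95% CI: [" %7.4f scalar(CILB) ", " %7.4f scalar(CIUB) "]"

Because the t-stat, p-value, and both bounds all use the same se, they cannot drift out of sync the way the table above does.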
Comment
George Ford

Join Date: Aug 2014

Posts: 3148
#13

14 Jul 2024, 07:45

Also, rather than invnormal(0.975), do you want to make this sample size specific? I doubt it would matter much, since if you have less than 30 observations, you are not likely to be using this.

But one of the listed advantages of fdid (by Li) is that you can use smaller time samples than with SC.
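
For a sense of how much this would change things, the two critical values can be compared directly. The degrees of freedom below are a hypothetical choice for illustration; which value is appropriate for FDID is not something settled in this thread.

Code:

local df = 10                         // hypothetical degrees of freedom
display invnormal(0.975)              // normal critical value
display invttail(`df', 0.025)         // Student t critical value for the chosen df

The t critical value is larger than the normal one for any finite df, so intervals built with it would be wider, more noticeably so in short samples.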

Last edited by George Ford; 14 Jul 2024, 07:50.
Comment
George Ford

Join Date: Aug 2014

Posts: 3148
#14

14 Jul 2024, 08:05

What would be cool is adding method(fdid) to sdid.
Comment

Jared Greathouse

Join Date: Sep 2021
Posts: 2170

#15

14 Jul 2024, 08:20

Alright folks, the official way to install the package (along with running the empirical examples) is:

Code:

clear *


cls

net install fdid, from("https://raw.githubusercontent.com/jgreathouse9/FDIDTutorial/main")




clear *

u "https://github.com/jgreathouse9/jgreathouse9.github.io/raw/master/stata/fdid/hcw.dta"

cls
fdid gdp, tr(treat) unitnames(state) gr2opts(scheme(sj) name(hcwte, replace)) 

cls
clear *

import delim "https://raw.githubusercontent.com/synth-inference/synthdid/master/data/california_prop99.csv", clear delim(";")

egen id = group(state)

xtset id year, y

fdid packspercapita, tr(treated) unitnames(state) // gr1opts(scheme(sj) name(p99, replace))


cls
clear *

u "https://github.com/jgreathouse9/jgreathouse9.github.io/raw/master/stata/fdid/agbasque.dta", clear

fdid gdpcap, tr(treat)

I had to switch it to my other repo, since net install did not wish to play nice with my GitHub io site. Anyways, here it is!
