Identifying matched observations

Pantelis Kazakis

Join Date: Aug 2014
Posts: 123

Identifying matched observations

28 Jan 2019, 10:39

Assume that we have a database with the following variables: id, year, treated, x1, and x2.

In this database, we would like to identify observations (id) based on a matching algorithm. For example, for a treated entity (treated = 1), you need to match another firm from the sample such that:

1. the other firm is not treated for a period [-5, 5] years based on the year of the treated company. For example, if a company is treated at t = 1990, the matched company should not be treated in the period [1985, 1995].

2. Based on x1, the matched company should have values on a specific range, say [70, 100].

3. The treated company's x2 is closest to the x2 of the matched company.

See, for example, the following database:

Code:

input id year treated x1 x2
1    1980    0    137    3
1    1981    0    82    3
1    1982    0    67    2
1    1983    0    41    2
1    1984    0    72    3
1    1985    0    37    4
1    1986    0    134    1
1    1987    0    44    4
1    1988    0    52    4
1    1989    0    49    1
1    1990    1    93    2
1    1991    0    121    4
1    1992    0    137    1
1    1993    0    138    4
1    1994    0    56    4
1    1995    0    112    1
1    1996    0    24    3
1    1997    0    35    1
1    1998    0    80    1
1    1999    0    64    1
1    2000    0    47    1
2    1980    0    54    2
2    1981    0    71    1
2    1982    0    110    4
2    1983    0    76    1
2    1984    0    125    2
2    1985    0    121    1
2    1986    0    58    3
2    1987    0    72    3
2    1988    0    87    2
2    1989    0    33    2
2    1990    0    29    4
2    1991    0    129    4
2    1992    1    96    3
2    1993    0    105    1
2    1994    0    91    1
2    1995    0    72    1
2    1996    0    134    3
2    1997    0    68    1
2    1998    0    33    4
2    1999    0    94    4
2    2000    0    77    3
3    1980    0    56    2
3    1981    0    88    2
3    1982    0    73    2
3    1983    0    31    3
3    1984    1    74    4
3    1985    0    133    2
3    1986    0    77    4
3    1987    0    112    4
3    1988    0    22    1
3    1989    0    74    4
3    1990    0    81    2
3    1991    0    31    3
3    1992    0    23    4
3    1993    0    73    3
3    1994    0    54    3
3    1995    0    115    1
3    1996    1    30    1
3    1997    0    61    2
3    1998    1    61    3
3    1999    0    90    3
3    2000    0    60    1
4    1980    0    22    5
4    1981    0    74    3
4    1982    0    81    4
4    1983    0    31    2
4    1984    0    23    8
4    1985    0    73    6
4    1986    0    54    7
4    1987    0    56    1
4    1988    0    130    3
4    1989    0    85    1
4    1990    0    75    9
4    1991    0    62    8
4    1992    0    15    8
4    1993    0    65    7
4    1994    0    93    5
4    1995    0    22    3
4    1996    0    11    2
4    1997    0    10    7
4    1998    0    29    9
4    1999    0    160    8
4    2000    0    5    5
end

If we would like to match entity with id = 1, we have to exclude that one with id = 2 because it has an event at t = 1992, which is in a time frame we do not like. Entities 3 and 4 are potential choices because both have an x1 in [70, 100]. However, we choose entity 4, because its x2 is closer that that of id = 1.

Apparently, this is a simple scenario and one can see that there may be multiple treatment cases within an entity (see for example entity 3).

I am asking whether there is a code that can generate a new variable (new column) that shows the firm that is matched. That is, for id = 1 this column should have value 4.

Thank you.

Last edited by Pantelis Kazakis; 28 Jan 2019, 10:42.

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 29898
#2

28 Jan 2019, 11:28

As I tried to answer your question I became aware that it is ambiguous in at least three ways.

1. Is an observation eligible to serve as a control if it has values of x1 outside the 70-100 range in years that are outside the +5 to -5 year window? Or does any observation of x1 outside the 70-100 range exclude them from ever serving as controls?

2. What do you mean by "the treated company's x2 is closest to the x2 of the matched company?" Each company has many different values of x2. Two companies might be closest on x2 in one year, but not in another year. So do you mean closest on average? Or closest only in the event year? Or closest using, say, an L2 or L-infinity norm? Or what?

3. Assuming some answer to my immediately preceding question, if after eliminating observations outside the year window and outside the x1 window there are two or more observations tied for closest match on x2, which one should be chosen?
Comment
Pantelis Kazakis

Join Date: Aug 2014

Posts: 123
#3

28 Jan 2019, 11:57

Originally posted by Clyde Schechter View Post

As I tried to answer your question I became aware that it is ambiguous in at least three ways.

1. Is an observation eligible to serve as a control if it has values of x1 outside the 70-100 range in years that are outside the +5 to -5 year window? Or does any observation of x1 outside the 70-100 range exclude them from ever serving as controls?

2. What do you mean by "the treated company's x2 is closest to the x2 of the matched company?" Each company has many different values of x2. Two companies might be closest on x2 in one year, but not in another year. So do you mean closest on average? Or closest only in the event year? Or closest using, say, an L2 or L-infinity norm? Or what?

3. Assuming some answer to my immediately preceding question, if after eliminating observations outside the year window and outside the x1 window there are two or more observations tied for closest match on x2, which one should be chosen?

Hi Clyde. Please, let me clarify:

1. A control firm needs to have values in the range specified (say 70-100) and should not be a treatment itself around an event period defined by the treated firm. When treated equals 1, we say that a firm had an event at that specific year. So, we cannot include another firm that had an event around that timeframe (10 years around an event).

2. We have to look at the value of x2 one year before the event and use this value for comparison).

3. The firm with id = 1, has an event in 1990. If we look at x2 in that year we find a value of 2. Now, we need to go and check the other potential firms that can be matched with this one. We do not account firm 2 because it has an event in 1992, which is in the 10 years period sorrounding the event of firm 1 (year 1990). Firms 3 and 4 are OK in terms of x1 (they are both in the range). If we go and check x2 one year before for these firms, we find that for firm with id = 3 the value is 4 and for the other with id = 4 the value is 1. Hence, we say that firm 4 is a better match for firm 1.

I hope this makes sense to you.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29898
#4

28 Jan 2019, 12:24

I still don't understand.

Looking at id #4 as a potential match for id#1's 1990 event, we see that id#4 has x1 = 130 in 1988. 1988 is within the year window for that event, so #4 should be ineligible because 130 is not between 70 and 100. In fact, it looks as if in your example match there are no eligible matches for any of the id's.

Also, you still have not answered my third question in #2.
Comment
Pantelis Kazakis

Join Date: Aug 2014

Posts: 123
#5

28 Jan 2019, 12:49

Originally posted by Clyde Schechter View Post

I still don't understand.

Looking at id #4 as a potential match for id#1's 1990 event, we see that id#4 has x1 = 130 in 1988. 1988 is within the year window for that event, so #4 should be ineligible because 130 is not between 70 and 100. In fact, it looks as if in your example match there are no eligible matches for any of the id's.

Also, you still have not answered my third question in #2.

Hi Clyde,

I am providing an example:

Since the event for id#1 is in 1990, we look at values of x1 (and x2) in 1989 for any eligible firms; one year prior to the event year in the treated id. The value of x1 for id#4 in 1989 is 85 and for id#3 is 74, hence in the range. id#4 and id#3 do not have any event for the period 1985-1995 (treated = 0), while id#2 has in 1992. Hence, id#2 cannot be eligible for matching. The value of x2 in 1989 for id#3 is 4 and for id#4 is 1. We have to compare these x2 values and see which is closer to the value of x2 for id#1 in 1990, which is 2. By comparison, we find the id#4 to be closest to id#1, thus it is a better match.

So, if a potential firm to be matched, has an event in the period [-5, +5] years sorrounding the event of another firm (treated), we do not look at it as a potential match.

Hence, the potential firm to be matched must not be in the [-5, +5] event window, it should have values of L1.x1 in the range, and it should have the value L1.x2 closest to the value of x2 of the treated company at time t.

I hope this makes sense.

Last edited by Pantelis Kazakis; 28 Jan 2019, 12:56.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29898

28 Jan 2019, 13:01

This is much clearer now. You still did not respond to my 3rd question in #2. So, the code below will, in the event of a tie for closest match on x2 break the tie randomly and irreproducibily. (That is, if you run the code more than once you will not necessarily come up with the same match.)

Code:

//    CREATE A COPY OF THE DATA
tempfile copy
save `copy'

//    PAIR UP OBSERVATIONS WITHIN A +5 TO -5 YEAR WINDOW
rangejoin year -5 5 using `copy'

//    ELIMINATE POSSIBLE CONTROLS THAT WERE TREATED IN THE WINDOW
keep if treated == 1
by id year id_U, sort: egen ineligible = max(treated_U == 1)
drop if ineligible
drop ineligible

//    ELIMINATE POSSIBLE CONTROLS THAT HAVE X1 OUT OF [70, 100] RANGE
//    IN THE YEAR BEFORE THE EVENT
by id year id_U, sort: egen lagged_x1 = max(cond(year_U == year - 1, x1_U, .))
drop if !inrange(lagged_x1, 70, 100)



//    FIND BEST MATCH ON ONE-YEAR LAGGED VALUE OF X2
by id year id_U, sort: egen lagged_x2 = max(cond(year_U == year-1, x2_U, .))
gen delta = abs(x2 - lagged_x2)
by id year (delta), sort: keep if _n == 1
rename id_U matched_id
drop *_U delta

The -rangejoin- command is written by Robert Picard and is available from SSC. To use it, you must also have installed the -rangestat- command, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

Added: This code leaves you only with successfully matched pairs. Where no suitable match is available there is no observation. If you need to still have the original data on the unmatched firms, you can follow the above with

Code:

merge 1:m id year treated using `copy', assert(match using) nogenerate
isid id year, sort

Last edited by Clyde Schechter; 28 Jan 2019, 13:06.

Comment

Pantelis Kazakis

Join Date: Aug 2014

Posts: 123
#7

28 Jan 2019, 13:22

^ Thank you Clyde. I'm sorry, I did not notice the tie issue you mentioned at #2.

I understand what you mean, but the data have at least 3 values after comma and cases of ties are difficult to happen. In case of a tie, a potential solution for this would be to use another variable, say x3, and use its distance from the event firm to do the matching when the distance of x2 for treatment-control does not give a decisive answer.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29898
#8

28 Jan 2019, 14:11

OK, well in the absence of something more specific, I guess we'll just leave it the way things are. With a little luck, there won't be any ties on x2 and it's a moot point.
Comment
Pantelis Kazakis

Join Date: Aug 2014

Posts: 123
#9

30 Jan 2019, 04:03

Hello again.

Assume that the search process of the above problem is modified a bit. Based on the previous example, we took for granted the values of x1. However, suppose that you needed to create another variable, called it "z" and use that instead of "x1" to proceed in the matching process.

Variable "z" is created as follows: it is the ratio of x1 of the untreated firm to the value of x1 of the treated firm one year before the event. So, for each potential treated-untreated group, one has to find this ratio.

Hence, in this case the second requirement becomes: "2. Based on z, the matched company should have values on a specific range, say [70, 100]".

How would one tackle that question?

Last edited by Pantelis Kazakis; 30 Jan 2019, 04:06.
Comment

Pantelis Kazakis

Join Date: Aug 2014
Posts: 123

#10

30 Jan 2019, 10:03

Let us assume that we have the following database:

Code:

input id year treated z
1    1980    0    71
1    1981    0    87
1    1982    0    89
1    1983    0    81
1    1984    0    125
1    1985    0    139
1    1986    0    65
1    1987    0    111
1    1988    0    80
1    1989    0    123
1    1990    1    65
1    1991    0    100
1    1992    0    70
1    1993    0    85
1    1994    0    108
1    1995    0    139
1    1996    0    133
1    1997    0    108
1    1998    0    76
1    1999    0    118
1    2000    0    95
2    1980    0    78
2    1981    0    112
2    1982    0    121
2    1983    0    111
2    1984    0    63
2    1985    0    117
2    1986    0    107
2    1987    0    100
2    1988    0    115
2    1989    0    102
2    1990    0    91
2    1991    0    138
2    1992    1    81
2    1993    0    91
2    1994    0    68
2    1995    0    80
2    1996    0    115
2    1997    0    79
2    1998    0    92
2    1999    0    98
2    2000    0    138
3    1980    0    71
3    1981    0    114
3    1982    0    112
3    1983    0    95
3    1984    1    130
3    1985    0    121
3    1986    0    121
3    1987    0    105
3    1988    0    127
3    1989    0    67
3    1990    0    72
3    1991    0    124
3    1992    0    81
3    1993    0    83
3    1994    0    72
3    1995    0    68
3    1996    1    83
3    1997    0    64
3    1998    1    75
3    1999    0    123
3    2000    0    128
4    1980    0    72
4    1981    0    66
4    1982    0    81
4    1983    0    123
4    1984    0    140
4    1985    0    85
4    1986    0    86
4    1987    0    81
4    1988    0    115
4    1989    0    72
4    1990    0    100
4    1991    0    140
4    1992    0    82
4    1993    0    129
4    1994    0    72
4    1995    0    113
4    1996    0    68
4    1997    0    66
4    1998    0    98
4    1999    0    60
4    2000    0    106
end

It appears that the following code does the job (although it may not be that elegant):

Code:

sort id year
egen new_id = group(id year treated)

* Create a pseudo id to mark each different event
gen xz = _n if treated == 1
egen yy = group(xz)

* for yy = 1
egen max_yy = max(yy)
disp max_yy // the maximum number of events in the database

* The following code creates columns of ratios based on each different event.

forvalues i = 1(1)5 {
    gen a = z[_n-1] if treated == 1 & yy == `i'
    egen z_ev`i' = max(a)
 
    gen ratio_ev`i' = z/z_ev`i'
    drop a z_ev`i'

    gen b = year if yy == `i' & treated == 1
    egen YEAR_ev`i' = max(b)
    drop b

    gen diff_ev`i' = year - YEAR_ev`i'
    replace ratio_ev`i' = . if diff_ev`i' != -1
    drop YEAR_ev`i' diff_ev`i'
}

* Work on each different column; each column represents a different event.
* replace "." with zeros.

forvalues i = 1(1)5 {
replace ratio_ev`i' = 0 if ratio_ev`i' == .
}

/* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ */

* generate variable x1 that is the sum of all event columns.
* each entry (except zeros) in x1 is based on a different event.

egen x1 = rowtotal(ratio_ev1-ratio_ev5)

* generate a random variable x2
set seed 1234
generate x2 = runiform()

*******************************************************************
* Embed the code from Statalist to complete the example *
******************************************************************

//    CREATE A COPY OF THE DATA
tempfile copy
save `copy'

//    PAIR UP OBSERVATIONS WITHIN A +5 TO -5 YEAR WINDOW
rangejoin year -5 5 using `copy'

//    ELIMINATE POSSIBLE CONTROLS THAT WERE TREATED IN THE WINDOW
keep if treated == 1
by id year id_U, sort: egen ineligible = max(treated_U == 1)
drop if ineligible
drop ineligible

//    ELIMINATE POSSIBLE CONTROLS THAT HAVE X1 OUT OF [0.7 - 1.3] RANGE
//    IN THE YEAR BEFORE THE EVENT
by id year id_U, sort: egen lagged_x1 = max(cond(year_U == year - 1, x1_U, .))
drop if !inrange(lagged_x1, 0.70, 1.3)


//    FIND BEST MATCH ON ONE-YEAR LAGGED VALUE OF X2
by id year id_U, sort: egen lagged_x2 = max(cond(year_U == year-1, x2_U, .))
gen delta = abs(x2 - lagged_x2)
by id year (delta), sort: keep if _n == 1
rename id_U matched_id
drop *_U delta

* This code adds again the cases where matching was not achieved
*merge 1:m id year treated using `copy', assert(match using) nogenerate
*isid id year, sort
/* ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ */

Comment

Pantelis Kazakis

Join Date: Aug 2014

Posts: 123
#11

30 Jan 2019, 10:06

I use Stata MP, so I hope the number of columns will not be a problem, as the number of events is large enough.
Comment

Announcement

Identifying matched observations

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment