
  • Nearest neighbour matching within industry-year-clusters

    Dear Statalist community,

    I would very much appreciate your help with the following problem that I am struggling with. I am trying to do nearest neighbour matching within industry-year clusters with Stata 17. Unfortunately, I keep receiving the error message "invalid syntax". I have posted the code below in the hope that someone can help me find the coding mistake.

    The variable
    Code:
     N_`a'_`b'_N
    contains the number of observations per industry (`a')-year (`b') cluster. The variable
    Code:
     ceo_power_I
    is the matching indicator. The goal is to match observations with
    Code:
     ceo_power_I ==1
    with observations with
    Code:
     ceo_power_I ==0
    for all 4 x 9 industry-year clusters by minimizing the difference in firm size
    Code:
     ta_C
    . Furthermore,
    Code:
     n1
    contains _n within each industry (`a') by year (`b') cluster.
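
    A within-cluster counter and a cluster size of this kind are usually produced with -bysort- rather than one count variable per cluster; a minimal sketch, with the cluster identifier names assumed from the dataex example further down the thread:
    Code:
    * within-cluster running index and cluster size (identifier names assumed)
    bysort industry_cluster_D year_cluster_D: gen n1 = _n
    by industry_cluster_D year_cluster_D: gen N_cluster = _N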


    Code:
    gen match_firmyr = .
    gen match_delta = .
    gen matched =.
    
    local a = 1
    
    while `a' <= 4 {
        
        local b = 1
    
        while `b' <= 9 {
        
            local c = N_`a'_`b'_N
                
            forval i = 1/`c' {
               
                if n1 == `i' & N_`a'_`b'_N == `c' & ceo_power_I == 1 {
    
                    local dist 1000000000^2
    
                    forval j = 1/`c' {
                        
                        if n1 == `j' & N_`a'_`b'_N == `c' & ceo_power_I == 0 {
                            
                            local ta_match_1_C = ta_C if n1 == `i' & N_`a'_`b'_N == `c'
                            local ta_match_2_C = ta_C if n1 == `j' & N_`a'_`b'_N == `c'
                            
                            local current_dist (`ta_match_1_C' - `ta_match_2_C')^2
    
                            local n_count_D `= _n' if n1 == `j' & N_`a'_`b'_N == `c'
                            
                            if `dist' > `current_dist' {
                                local dist `current_dist' // update smallest distance
                                replace match_firmyr = firmyr_O[`n_count_D'] if n1 == `i' & N_`a'_`b'_N == `c'// write match
                                replace match_delta = `current_dist' if n1 == `i' & N_`a'_`b'_N == `c'// write delta    
    
                            }
    
                        }
                    }
                    
                    replace matched =1 if n1 == `i' & N_`a'_`b'_N == `c'
                    
                }   
    
            }
    
            local b = `b' + 1
        }
    
        local a = `a' + 1
        
    }

    I am not very experienced in Stata coding, so hopefully the issue is something very basic. Thanks so much for your precious time.

    With best regards
    Marc

  • #2
    With just "invalid syntax" and this lengthy code, and no example data to play with, it would take me hours to figure this out, if I could at all. But this code looks like you are trying to emulate a FORTRAN program in Stata, or something like that. I wouldn't do it this way. As you did not provide example data, I will illustrate the general approach using the -grunfeld- data set that you can access from the StataCorp website. I have arbitrarily designated companies 1 through 5 as "cases" and 6 through 10 as "controls." The code below matches each case with a control observation in the same year that has the closest possible match on the variable kstock. (If there are ties, they are broken randomly and irreproducibly.)

    Code:
    webuse grunfeld, clear
    
    gen byte case = (company <= 5)
    label define case_control   0   "control"   1   "case"
    label values case case_control
    
    //  MATCH CASES WITH CONTROLS HAVING SAME YEAR, AND CLOSEST MATCH IN kstock
    
    preserve
    keep if !case
    drop case
    tempfile controls
    ds year, not
    rename (`r(varlist)') =_ctrl
    save `controls'
    
    restore
    keep if case
    drop case
    ds year, not
    rename (`r(varlist)') =_case
    joinby year using `controls'
    gen delta = abs(kstock_case-kstock_ctrl)
    by company_case year (delta), sort: keep if _n == 1
    If you have trouble porting this code to your data set, when you post back, please be sure to use the -dataex- command to show an example of your actual data to develop and test code on. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
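
    For completeness, a minimal sketch of producing such an example (by default, -dataex- with no arguments lists all variables for the first 100 observations):
    Code:
    ssc install dataex    // only needed if -dataex- is not already installed
    dataex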




    • #3
      Dear Clyde,

      thanks so much for your swift reply and for getting me on the right track. I think I have understood the concept of the webuse example and I am working on applying it to my data set. Below you will find a dataex example:

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input int id_N float(ceo_power_I industry_cluster_D year_cluster_D) long ta_C
       1 0 8 1   273012
       1 0 8 1   390895
       1 0 8 1   500018
       1 1 8 2   344955
       1 0 8 2   474808
       1 1 8 3   673725
       1 0 8 3   582507
       1 1 8 4  4591122
       1 0 8 4  5246640
       2 0 3 1   120520
       2 0 3 1   107583
       2 1 3 1   143784
       2 0 3 2   105250
       2 0 3 2    61974
       2 0 3 3    49644
       2 1 3 3    34435
       2 1 3 4    26120
       2 0 3 4    19548
       3 0 3 .    85891
       3 0 3 .    78320
       3 1 3 1    72378
       3 0 3 1    75127
       3 0 3 1    60961
       3 0 3 2    56508
       3 0 3 2    85240
       3 0 3 3    76314
       3 0 3 3    93929
       3 0 3 4    79779
       3 0 3 4    72829
       4 0 5 1    34731
       4 0 5 1    31838
       4 0 5 1    29067
       4 1 5 2    17705
       4 0 5 2    14934
       4 0 5 3    33492
       4 1 5 3    19055
       4 1 5 4    47913
       4 1 5 4    31256
       7 0 4 1   161068
       7 1 4 1   156550
       7 1 4 1   160843
       7 1 4 2   157918
       7 1 4 2   145650
       7 1 4 3   139893
       7 1 4 3   139792
       7 1 4 4   130043
       7 1 4 4   127857
       8 1 5 .    26007
       8 1 5 .    65295
       8 1 5 .    64710
       8 1 5 .    62528
       8 1 5 1    63591
       8 0 5 1    66151
       8 1 5 1    68590
       8 1 5 2    65213
       8 0 5 2    57845
       8 1 5 3    54361
       8 1 5 3    62537
       8 1 5 4    49064
       8 0 5 4    40632
      10 0 9 1   151377
      10 0 9 1   143220
      10 0 9 2   201315
      10 1 9 2        0
      10 1 9 3   394740
      10 1 9 3   300138
      10 0 9 4   347592
      10 0 9 4   473513
      11 1 4 .  9189000
      11 1 4 .  8463000
      11 1 4 1 10110000
      11 1 4 1 10887000
      11 1 4 1 11123000
      11 1 4 2 11113000
      11 0 4 2 11840000
      11 1 4 3 12706000
      11 1 4 3 14444000
      11 1 4 4 13892000
      11 1 4 4 14961000
      12 0 4 1   202621
      12 1 4 2   219840
      12 1 4 2   232412
      12 0 4 3   232906
      12 1 4 3   212529
      12 1 4 4   233735
      12 1 4 4   217851
      14 1 7 1   229418
      14 0 7 1   248659
      14 1 7 1   271604
      14 1 7 2   278081
      14 0 7 2   299160
      14 1 7 3   362402
      14 1 7 3   431893
      14 1 7 4   453405
      14 1 7 4   479223
      17 1 4 1   187600
      17 1 4 1   188657
      17 1 4 1   179534
      17 1 4 2   181016
      17 0 4 2   189007
      end
      Since I need to cluster the matching by year and industry, I decided to run the webuse procedure for each industry cluster (I tried to solve it with a loop, but that didn't work). Therefore, I thought it might be a good idea to save the original sample once with the -preserve- command and then get back to it with the -restore- command for each industry cluster. However, when Stata gets to the "Industry cluster 2" section of the code shown below, it produces the error message "nothing to restore". Maybe I did not understand correctly what the preserve/restore commands actually do?

      Do you have any idea what I am doing wrong?

      Code:
      *SAMPLE MATCHING BY NEAREST NEIGHBOUR
      
      preserve
      
      *Industry cluster 1
      keep if ceo_power_I == 1 & industry_cluster_D == 1
      tempfile controls_ic_1
      ds year_cluster_D, not
      rename (`r(varlist)') =_case
      save `controls_ic_1'
      
      restore
      keep if ceo_power_I == 0 & industry_cluster_D == 1
      tempfile matched_ic_1
      ds year_cluster_D, not
      rename (`r(varlist)') =_ctrl
      joinby year_cluster_D using `controls_ic_1'
      gen delta = abs(ta_C_ctrl - ta_C_case)
      by id_N_ctrl year_cluster_D (delta), sort: keep if _n ==1
      save `matched_ic_1'
      
      *Industry cluster 2
      restore
      keep if ceo_power_I == 1 & industry_cluster_D == 2
      tempfile controls_ic_2
      ds year_cluster_D, not
      rename (`r(varlist)') =_case
      save `controls_ic_2'
      
      restore
      keep if ceo_power_I == 0 & industry_cluster_D == 2
      tempfile matched_ic_2
      ds year_cluster_D, not
      rename (`r(varlist)') =_ctrl
      joinby year_cluster_D using `controls_ic_2'
      gen delta = abs(ta_C_ctrl - ta_C_case)
      by id_N_ctrl year_cluster_D (delta), sort: keep if _n ==1
      save `matched_ic_2'
      Another question would be how to integrate the matched data into one sample for the regression analysis. My naive idea here is to save all matched data sets (save `matched_ic_2') and then somehow merge them all together. The current procedure eliminates all non-case observations from the sample, right? Do you maybe know a way of adding the matched non-case observations back to the sample to get a balanced sample for the regression analysis?
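
      For illustration, a minimal sketch of that "merge them all together" idea, assuming the per-cluster tempfiles above were all saved within the same do-file run:
      Code:
      use `matched_ic_1', clear
      append using `matched_ic_2'
      * ... and so on for the remaining industry clusters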

      Again thanks so much for your valuable time and help.

      Best regards
      Marc



      • #4
        Well, you're getting closer, but you're still thinking of this in FORTRAN-like terms.* There is no need to separately code each industry cluster's match, separately save them, and then put them back together. Here's how I would do this:
        Code:
        preserve
        keep if ceo_power_I == 0
        drop ceo_power_I
        ds year_cluster_D  industry_cluster_D, not
        rename (`r(varlist)') =_ctrl
        tempfile controls
        save `controls'
        
        restore
        keep if ceo_power_I == 1
        drop ceo_power_I
        ds year_cluster_D  industry_cluster_D, not
        rename (`r(varlist)') =_case
        
        joinby year_cluster_D industry_cluster_D using `controls'
        gen delta = abs(ta_C_case - ta_C_ctrl)
        by industry_cluster_D year_cluster_D id_N_case (delta), sort: keep if _n == 1
        By including industry_cluster_D in the -joinby- command's key variable list, and in the final -by ..., sort: keep if _n == 1- command, you get the match on industry "for free."
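
        As an optional sanity check (a sketch, not part of the matching itself), you can confirm that each case now appears in exactly one matched row:
        Code:
        isid industry_cluster_D year_cluster_D id_N_case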

        Note that this code matches each id_N with ceo_power_I = 1 to another id_N having ceo_power_I = 0 in the same year and same industry, selecting, from among those, the one with the nearest value of ta_C, which is what I understand to be your goal. Selection is done with replacement. That is, if the same CEO with ceo_power_I = 0 is the best possible match for two different CEOs in the same year and industry, then that same CEO is matched to both of them. If you prefer to match without replacement, that can be done with slightly more complicated, and considerably slower-running, code. I should point out that there is no statistical reason to prefer sampling without replacement. (In fact, there is a very slight reason to prefer sampling with replacement, though as a practical matter, there is no real difference.) But if you want that, post back and I will show you how it is done.

        As for what is happening with -preserve- and -restore-, the semantics of these commands is that once -restore- is invoked, the -preserve-d copy of the data is destroyed. So you can't -restore- it again. If you find yourself in a situation where you need to repeatedly restore the same data set, use -restore, preserve- instead of naked -restore- each time you restore it (except the last). Then Stata will restore the data to active memory but still retain the preserved copy for subsequent restoration.
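
        A minimal sketch of that pattern, with the actual processing left as placeholders:
        Code:
        preserve
        * ... work on some subset of the data ...
        restore, preserve    // bring the full data back, but keep the preserved copy
        * ... work on another subset ...
        restore              // final restoration; the preserved copy is released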

        Added: *Stata makes minimal use of explicit loop structures. They are needed in some circumstances, particularly when you want to iteratively process a list of different variables. But, in general, before you write loops in Stata, stop and ask yourself if it can't be done better with -by-. The answer is often, perhaps usually, yes. In particular, only in unusual, complicated situations is it necessary to write loops to iteratively process groups of observations defined by the values of variables. That is, one only infrequently needs to loop over industries, or firms, or years, etc. in Stata. Learning to think this way is perhaps the most difficult adjustment that people accustomed to other programming languages have to make when working in Stata.
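
        For instance, a task that might tempt a newcomer to loop over clusters, such as tagging the smallest firm in each industry-year cell, is a one-liner with -by- (a sketch using the variable names from this thread):
        Code:
        by industry_cluster_D year_cluster_D (ta_C), sort: gen byte smallest_in_cell = (_n == 1)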
        Last edited by Clyde Schechter; 10 Feb 2022, 12:59.



        • #5
          Clyde...

          thank you so so much for your precious time and effort. Amazing how you guys are supporting the community with your valuable advice. Not only is the code working but on top of that I have learned some very helpful things for future Stata endeavours.

          I have two more questions, if that's possible. My goal is to create a size-matched sample and thereby control for the effect of size (ta_C) on my IV of interest, ceo_power_I. So in order to run a regression on a balanced subsample, I need it to contain both the treated observations (ceo_power_I = 1) and the matched untreated observations (ceo_power_I = 0) with the nearest value of ta_C.

          Right now the data editor shows 377 treated observations. I was hoping that there is a way to increase _N to 377*2 (an equal number of treated and size-matched untreated observations). For that, I guess I need to code a selection process without replacement, right?

          Thank you so much!

          Best regards
          Marc



          • #6
            I'm not sure I understand your present situation. The sentence "Right now the data editor shows 377 treated observations" makes me wonder what you are looking at and thinking about.

            Are you saying that there are some observations with ceo_power_I == 1 that went unmatched? You were expecting twice as many matches? After the code in #4 runs, every observation with ceo_power_I == 1 will be paired with an observation having ceo_power_I == 0 that has the same values for industry and year. The only way an observation with ceo_power_I == 1 can go unpaired in this code is if the data simply does not contain any ceo_power_I == 0 observations in that same industry and year. If there is even a single such observation, it will be picked as the match--even if its size is a terrible match, because as the only possible match, it is still the best possible match. So any unpaired observations must be the result of gaps in the data. No change in the code can correct that. If there are some industry-year combinations that have observations with ceo_power_I == 1 but none with ceo_power_I == 0, then it is impossible to match them. But that is the only way something can go unmatched.

            As for matching without replacement, that would only create more unmatched observations. If you match without replacement, even though there might be some suitable ceo_power_I == 0 matches for some unlucky ceo_power_I == 1 observation, if other ceo_power_I == 1 observations from the same industry and year have already been paired with them, there just might be nothing left for this one. That is actually the biggest drawback of matching without replacement: it can leave observations unmatched even though suitable matches could be found for them with replacement.

            If you are looking at the data as it stands after the code from #4, then you may be misinterpreting what you are seeing. The data editor will be showing you 377 matched pairs. Notice that every variable, except industry and year, comes in two "flavors": _ctrl and _case. (Industry and year do not have two flavors because the matching mandates that they be the same for both members of the pair.) Each observation has data from two of your original observations, one of which (_case) had ceo_power_I == 1, and the other (_ctrl) had ceo_power_I == 0. Perhaps what you are saying is that you want this to be reorganized so that you have 377*2 observations, created by separating the current observations into the cases and controls, and creating variables identifying which is which, and which ones pair with which. That's just a few more lines of code.

            Code:
            gen long pair_num = _n
            reshape long id_N ta_C, i(pair_num) j(case_ctrl) string
            label define ceo_power_I  0   "_ctrl" 1   "_case"
            encode case_ctrl, gen(ceo_power_I)
            drop case_ctrl
            The cases and controls are now in separate observations, as when you first started. The variable ceo_power_I has been brought back to identify them (although it is now value-labeled as "_ctrl" for 0 and "_case" for 1). And the variable pair_num shows you which observations form matched pairs.
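
            To eyeball the result, something like this (a sketch) lists the first few pairs side by side:
            Code:
            sort pair_num ceo_power_I
            list pair_num ceo_power_I id_N ta_C in 1/10, sepby(pair_num)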



            • #7
              Dear Clyde,

              sorry for being too unspecific/unclear. Exactly, I wanted to reorganize the observations by separating the matched pairs into treated and controls. Thank you so so much for your code. It worked perfectly and I got a sample of 754 observations.

              However, there is another issue. The matching-with-replacement approach resulted in multiple observations for the same firmyears. I guess this happened for controls that got matched multiple times to different cases. I am afraid that those duplicates would then get lost when I run a panel regression. Am I right that a matching-without-replacement procedure could solve that problem? I should mention that my sample is unbalanced, meaning that there are more controls than cases, so we would not risk unmatched cases without replacement.

              Again thanks so much for your time and effort.

              Best
              Marc



              • #8
                The matching-with-replacement approach resulted in multiple observations for the same firmyears.
                Well, at least in the example data, there is no firm variable. I suppose that the firm is implicit in CEO. Do you mean that the same CEO (with ceo_power_I == 0) got matched multiple times in the same year? If so, yes, that is what can happen with matching with replacement. This is usually not a problem. I don't know exactly what panel regression you plan to do, so it's hard to be more specific. With matched-pair data like this, the variable pair_num would be the usual choice for the "panel." And there would be no time variable used in the -xtset- command, because the observations in each pair were chosen to match on year. So I'm not sure what your concern actually is. Perhaps you can clarify what you have in mind.

                That said, if you want to use matching without replacement, rerun the match from the start, with the following changes:
                Code:
                //  REMOVE THE FOLLOWING COMMAND
                by industry_cluster_D year_cluster_D id_N_case (delta), sort: keep if _n == 1
                
                //  AND RUN THESE INSTEAD
                sort id_N_case delta
                local current = 1
                while `current' < _N {
                    drop if id_N_case == id_N_case[`current'] & _n > `current'
                    drop if id_N_ctrl == id_N_ctrl[`current'] & _n > `current'
                    local ++current
                }
                And the code in #6 can be rerun without modification to give you separated observations.



                • #9
                  Clyde, sorry for the inaccurate description. The variable id_N is the firm identifier. Thanks so much for the code. I am beginning to understand the power of the by/sort commands. Thanks to you I also found out that my industry-year clusters are so unbalanced in terms of ceo_power_I that _N shrinks from 754 (matching with replacement) to 368 (matching without replacement).

                  I wanted to create a firm size (ta_C) matched sample with an equal number of treated and non-treated firm year observations and then run a panel regression (-xtset id_N yr_O-). The idea behind this robustness test was to rule out that the correlation between the IV of interest (ceo_power_I) and the DV in the original model is actually driven by size.

                  I should also mention that the variable yr_O is different from year_cluster_D. The reason is that for some single(!) industry-year clusters the number of observations seemed too small for size matching, so I merged some years together. Accordingly, there is some variance in yr_O within a given pair_num. Is it possible that this leads to redundant id_N/pair_num combinations for different yr_O? I found those for both procedures (with and without replacement). I am wondering whether that would change anything with regard to your advice to use pair_num instead of yr_O in the -xtset- command.

                  Thank you so much for your support!

                  Best
                  Marc



                  • #10
                    Once you have created matched pairs, running a panel data regression after -xtset id_N yr_O- (or -xtset id_N year_cluster_D-, or just plain -xtset id_N-) is not valid. You cannot match the data and then analyze the data as if it were unmatched. The standard errors will almost surely be wrong, and you can't even say whether they will be too large or too small. Anything can happen. The one thing that almost surely won't happen is getting the standard errors right. And, of course, if the standard errors are wrong, then so are all the test statistics and confidence intervals.

                    If you run -xtset pair_num- (with or without a time variable) and then -xtreg DV i.ceo_power_I, fe- you will get an analysis that controls as completely for size as is possible. You can also include other covariates in the model (except for year_cluster_D and industry_cluster_D, which you also matched on) if you wish to further adjust the comparison for confounding variables. You can try including yr_O as a covariate. I'm not sure how it will work out. Because you created year_cluster_D by collapsing a few values of yr_O, yr_O will vary between the case and control observations in at least some of the pairs. If that happens frequently enough, there's no problem.

                    But if it only happens in a few cases, then it is almost collinear with the pair_num fixed effect, and its coefficient will have a very large standard error--which will be difficult to interpret. Now, if you don't really care about the effect of yr_O and only want to include it because it may be a confounding variable, or to reduce outcome variance, then this isn't a problem anyway. But if estimating the effect of yr_O is part of your research goals and you have this problem, then this may just be a limitation of your study that you'll have to live with. The data are simply not adequate to answer that question.
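
                    Putting that together, a minimal sketch, with dv_C standing in for whatever your dependent variable is called:
                    Code:
                    xtset pair_num
                    xtreg dv_C i.ceo_power_I, fe     // pair fixed effects reflect the matched design
                    * optionally add covariates you did not match on, e.g.:
                    * xtreg dv_C i.ceo_power_I i.yr_O, fe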
                    Last edited by Clyde Schechter; 11 Feb 2022, 18:01.



                    • #11
                      Thanks a lot, Clyde, this was of great help. Not only with regard to coding but also in terms of my rather limited understanding of panel analysis with matched data.

                      I only had to adjust the data structure a little, since the L. operator no longer worked without -xtset firm time-. I ended up coding lagged variables with [_n-1] as a workaround.
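
                      For instance, a sketch of such a subscript-based lag (x_C is a hypothetical variable; this assumes each firm-year appears only once after the adjustment):
                      Code:
                      by id_N (yr_O), sort: gen lag_x_C = x_C[_n-1]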

                      I hope our discussion here will be of some help for other Stata beginners too.

                      All the best
                      Marc

