Using Propensity Score Matching to Select a Sample

Daniel Zaas

Join Date: Jan 2019

Posts: 13
#1


01 Nov 2024, 10:51

I have a list of several hundred schools, 65 of which are treatment schools. I want to use propensity score matching to choose the 35 treatment and 35 control schools that are most alike based on some school-level variables. However, the -teffects psmatch- command requires an outcome variable, which we do not have because we have not yet collected data.

I would appreciate any advice on how we can accomplish this without an outcome variable.
Tags: matching, psm, sampling
Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#2

01 Nov 2024, 11:47

You can't do this with the -teffects- command. You will have to set up the propensity score calculation first, then do the matching. Fortunately, that is very easy to do. Just do a logistic (or probit, if you prefer) regression of the treatment variable on whatever variables you think are relevant to predicting the treatment group. Then use the -predict- command to get predicted probabilities. Then match on those.
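
A minimal sketch of those first two steps, using Stata's shipped auto data as a stand-in since no example data has been posted yet. Here foreign plays the role of the treatment indicator, and weight and mpg are placeholder covariates chosen only for illustration; pscore is a made-up variable name.

```stata
* Sketch only: auto.dta stands in for the real data
sysuse auto, clear

* Step 1: model treatment status (here, foreign) on the chosen covariates
logit foreign weight mpg

* Step 2: the predicted probability of treatment is the propensity score
predict double pscore, pr

* Look at how the scores are distributed in each group before matching
tabstat pscore, by(foreign) statistics(n mean min max)
```

If the two groups barely overlap in pscore, close matches will be hard to find no matter how the matching is coded.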

If you need help with coding for any of those steps, you need to supply example data. Use the -dataex- command to do that. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Daniel Zaas

Join Date: Jan 2019
Posts: 13
#3

01 Nov 2024, 14:10

Hi Clyde,
Thanks for your help. Here is the dataex output. nw_2025 is the treatment variable.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(school_id department_id urban nw_2025 tot_num_garcons tot_num_filles sample_prob)

1566   71 0 1 204 219 .005097572
 710 2246 1 1 525 540   .4063488
 690 2269 1 1 413 388   .2613704
 674 2290 1 1 247 433  .06658466
1355 2471 1 1 170 388  .04623476
2436 1742 0 0  58  54  .01461437
2415 1769 0 0  87  83 .018400732
 653 2234 1 0 248 284  .08593264
 506 2311 1 0 451 726  .19807333
2300 2453 0 0 248 283   .1077471

end
label values urban urban
label def urban 0 "Rural", modify
label def urban 1 "Rural", modify
label values nw_2025 nw_20
label def nw_20 0 "No" 1 "Yes", modify

Here is the code I ran for the logit. We are hoping to get some additional data about the schools to add to this but this is all I have for now.

Code:

logit nw_2025 department_id urban tot_num_garcons tot_num_filles
predict sample_prob

Now, I'm unclear how to match based on the probabilities. The probabilities are the likelihood of selection into the intervention. Is that correct? So do I just select the 35 T and 35 C schools with the highest probability?

Thanks in advance.


Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#4

01 Nov 2024, 14:56

Code:

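* Collect every variable except the treatment indicator
ds nw_2025, not
local vbles `r(varlist)'

* Controls: suffix all variables with 0 and save to a temporary file
preserve
keep if nw_2025 == 0
rename (`vbles') =0
drop nw_2025
tempfile controls
save `controls'

* Treated: suffix all variables with 1
restore
keep if nw_2025 == 1
rename (`vbles') =1
drop nw_2025
* school_id here abbreviates school_id1 (needs varabbrev on, the default)
isid school_id, sort

set seed 1234 // OR WHATEVER INTEGER YOU LIKE
* Form every treated-control pairing, then keep the closest control per treated school
cross using `controls'
gen delta = abs(sample_prob1 - sample_prob0)
gen double shuffle = runiform()
by school_id1 (delta shuffle), sort: keep if _n == 1

* Number the pairs and go to long layout, one row per school
gen `c(obs_t)' pair_num = _n
reshape long `vbles', i(pair_num) j(nw_2025)
drop delta shuffle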

This will assign each of the nw_2025 == 1 observations to a single nw_2025 == 0 observation, the one which is closest to it in the value of sample_prob. If there are two or more observations tied for that criterion, one is selected (reproducibly) at random. The final data set is put into long layout, with the observations that are matched to each other sharing a common value of pair_num. This setup is usually the most convenient for further analysis of the data.
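
For anyone unsure what the -by ... (delta shuffle), sort: keep if _n == 1- step is doing, here is a small self-contained sketch on made-up numbers. The names treat_id and ctrl_id and all the delta values are invented for illustration; they are not from the school data.

```stata
* Toy candidate pairings: one row per (treated, control) combination
clear
input treat_id ctrl_id delta
1 10 .02
1 11 .02
1 12 .30
2 10 .05
2 12 .01
end

set seed 1234
gen double shuffle = runiform()

* Within each treated unit, sort by distance, then by the random number,
* and keep the first row: the nearest control, with ties broken at random
by treat_id (delta shuffle), sort: keep if _n == 1

list, noobs
```

Treated unit 1 ends up with control 10 or 11 (they are tied, so the seed decides), and treated unit 2 gets control 12. Nothing here stops the same control being picked for more than one treated unit.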
Daniel Zaas

Join Date: Jan 2019

Posts: 13
#5

04 Nov 2024, 09:19

Hi Clyde,
Thanks a lot for your help. This worked perfectly.
Kamran Khan Niazi

Join Date: Apr 2020

Posts: 6
#6

19 Dec 2024, 01:54

@Clyde, thank you - the code is great. I am doing a similar exercise, but at the household (HH) level, where I would like to match around 800 treatment HHs with 1200 control HHs such that they balance in characteristics. I am facing one small challenge with your code: I would NOT want the same HH/observation to be used for another treatment HH pairing. At the moment, your code matches the control observation that has the smallest absolute difference in predicted value, regardless of whether it has already been used for another pairing. Any idea how we can go about doing this?

Thank you,
Clyde Schechter

Join Date: Apr 2014

Posts: 29953
#7

19 Dec 2024, 08:37

Code:

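* Collect every variable except the treatment indicator
ds nw_2025, not
local vbles `r(varlist)'

* Controls: suffix all variables with 0 and save to a temporary file
preserve
keep if nw_2025 == 0
rename (`vbles') =0
drop nw_2025
tempfile controls
save `controls'

* Treated: suffix all variables with 1
restore
keep if nw_2025 == 1
rename (`vbles') =1
drop nw_2025
* school_id here abbreviates school_id1 (needs varabbrev on, the default)
isid school_id, sort

set seed 1234 // OR WHATEVER INTEGER YOU LIKE
cross using `controls'
gen delta = abs(sample_prob1 - sample_prob0)
gen double shuffle = runiform()
// by school_id1 (delta shuffle), sort: keep if _n == 1
sort school_id1 delta shuffle
* Row `current' is the best remaining match for its treated unit: keep it,
* drop that unit's other candidate rows, and drop every other row that
* uses the same control
local current 1
while `current' <= c(N) {
    drop if school_id1 == school_id1[`current'] ///
        & school_id0 != school_id0[`current']
    drop if school_id0 == school_id0[`current'] ///
        & school_id1 != school_id1[`current']
    local ++current
}

gen `c(obs_t)' pair_num = _n
reshape long `vbles', i(pair_num) j(nw_2025)
drop /*delta*/ shuffle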

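Changes from the code in #4: the -by school_id1 (delta shuffle), sort: keep if _n == 1- line is commented out and replaced by the -sort- command and the -while- loop, and delta is no longer dropped at the end.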

Given that you are simply matching nearest neighbor on a score (among the controls not already taken by another case), and you have more controls than cases, there is no danger of a case going without a match. But it is entirely possible that some of the last cases to be matched (those with the highest-numbered school_ids) will have terrible matches. (I have not -drop-ped delta in this code, so you can check for yourself whether this has happened.) Let me point out that there is no statistical advantage to avoiding the re-use of controls in matching. I know that many people prefer it, but it has no justification other than aesthetic. And if you do end up with some of the pairs being badly matched, it is all downside with no compensating upside.
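
One possible way to do that check, sketched on a made-up long-layout data set shaped like the output of the code above. The school_id and delta values are invented, and the 0.1 cutoff (a caliper, which is not part of the code above) is an arbitrary illustration rather than a recommendation.

```stata
* Toy matched data: two rows per pair, delta constant within a pair
clear
input pair_num nw_2025 school_id delta
1 0 201 .004
1 1 101 .004
2 0 202 .012
2 1 102 .012
3 0 203 .31
3 1 103 .31
end

* One row per pair is enough to summarize the match distances
summarize delta if nw_2025 == 1, detail

* Hypothetical cutoff for calling a match "bad"
local caliper 0.1
list pair_num school_id nw_2025 delta if delta > `caliper', sepby(pair_num) noobs

* One option: discard both members of any pair beyond the cutoff
drop if delta > `caliper'
```

Dropping badly matched pairs costs you treated units, so whether to drop them, re-match them with re-use of controls allowed, or keep them is a judgment call for your own data.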
Kamran Khan Niazi

Join Date: Apr 2020

Posts: 6
#8

26 Dec 2024, 06:40

Thank you