  • Using Propensity Score Matching to Select a Sample

    I have a list of several hundred schools, 65 of which are treatment schools. I want to use propensity score matching to choose the 35 treatment and 35 control schools that are most alike based on some school-level variables. However, the -teffects psmatch- command requires an outcome variable, which we do not have because we have not collected data yet.

    I would appreciate any advice on how we can accomplish this without an outcome variable.

  • #2
    You can't do this with the -teffects- command. You will have to set up the propensity score calculation first, then do the matching. Fortunately, that is very easy to do. Just do a logistic (or probit, if you prefer) regression of the treatment variable on whatever variables you think are relevant to predicting the treatment group. Then use the -predict- command to get predicted probabilities. Then match on those.
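
    A minimal sketch of those first two steps, using hypothetical variable names (treated for the 0/1 treatment indicator, x1-x3 for the school-level covariates, pscore for the new variable); substitute your own:

    Code:
    * Propensity model: treatment indicator on covariates (no outcome variable needed)
    logit treated x1 x2 x3
    * Predicted probability of treatment; pr is the default statistic after -logit-
    predict double pscore, pr
    * Compare the score distributions in the two groups before matching
    bysort treated: summarize pscore, detail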

    If you need help with coding for any of those steps, you need to supply example data. Use the -dataex- command to do that. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    • #3
      Hi Clyde,
      Thanks for your help. Here is the dataex output. nw_2025 is the treatment variable.

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input float(school_id department_id urban nw_2025 tot_num_garcons tot_num_filles sample_prob)
      
      1566   71 0 1 204 219 .005097572
       710 2246 1 1 525 540   .4063488
       690 2269 1 1 413 388   .2613704
       674 2290 1 1 247 433  .06658466
      1355 2471 1 1 170 388  .04623476
      2436 1742 0 0  58  54  .01461437
      2415 1769 0 0  87  83 .018400732
       653 2234 1 0 248 284  .08593264
       506 2311 1 0 451 726  .19807333
      2300 2453 0 0 248 283   .1077471
      
      end
      label values urban urban
      label def urban 0 "Rural", modify
      label def urban 1 "Rural", modify
      label values nw_2025 nw_20
      label def nw_20 0 "No" 1 "Yes", modify

      Here is the code I ran for the logit. We are hoping to get some additional data about the schools to add to this but this is all I have for now.

      Code:
      logit nw_2025 department_id urban tot_num_garcons tot_num_filles
      predict sample_prob

      Now, I'm unclear how to match based on the probabilities. The probabilities are the likelihood of selection into the intervention. Is that correct? So do I just select the 35 T and 35 C schools with the highest probability?

      Thanks in advance.

      • #4
        Code:
        ds nw_2025, not
        local vbles `r(varlist)'
        
        preserve
        keep if nw_2025 == 0
        rename (`vbles') =0
        drop nw_2025
        tempfile controls
        save `controls'
        
        restore
        keep if nw_2025 == 1
        rename (`vbles') =1
        drop nw_2025
        isid school_id1, sort    // full name after the rename; avoids relying on variable abbreviation
        
        set seed 1234    // OR WHATEVER INTEGER YOU LIKE
        cross using `controls'
        gen delta = abs(sample_prob1 - sample_prob0)
        gen double shuffle = runiform()
        by school_id1 (delta shuffle), sort: keep if _n == 1
        gen `c(obs_t)' pair_num = _n
        reshape long `vbles', i(pair_num) j(nw_2025)
        drop delta shuffle

        This will assign each of the nw_2025 == 1 observations to a single nw_2025 == 0 observation, the one which is closest to it in the value of sample_prob. If there are two or more observations tied for that criterion, one is selected (reproducibly) at random. The final data set is put into long layout, with the observations that are matched to each other sharing a common value of pair_num. This set up is usually the most convenient for further analysis of the data.
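
        If only a fixed number of pairs is wanted, as in #1 (35 of the 65 treatment schools), one possible extension is to keep the pairs with the smallest delta. This is only a sketch and not part of the code above; it assumes that "most alike" means the smallest difference in sample_prob. These lines would replace the last three lines of the code above, run in the same do-file so that the local macro vbles is still defined:

        Code:
        * Rank pairs from best to worst match; shuffle breaks ties reproducibly
        sort delta shuffle
        * 35 is the number of pairs asked for in #1; change as needed
        keep if _n <= 35
        gen `c(obs_t)' pair_num = _n
        reshape long `vbles', i(pair_num) j(nw_2025)
        drop delta shuffle
        * Controls may be re-used across pairs, so check for repeated control schools
        duplicates report school_id if nw_2025 == 0

        Because the matching above allows one control to serve in more than one pair, the result can contain fewer than 35 distinct control schools; the -duplicates report- line shows whether that has happened.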

        • #5
          Hi Clyde,
          Thanks a lot for your help. This worked perfectly.

          • #6
            @Clyde, thank you - code is great. I am doing a similar exercise at the household (HH) level, where I would like to match around 800 treatment HHs with 1200 control HHs such that they balance in characteristics. I am facing one small challenge with your code: I do NOT want the same control HH/observation to be used in more than one pairing. At the moment, your code matches each treatment observation to the control observation with the smallest absolute difference in predicted value, regardless of whether that control has already been used for another pairing. Any idea how we can go about doing this?

            Thank you,

            • #7
              Code:
              ds nw_2025, not
              local vbles `r(varlist)'
              
              preserve
              keep if nw_2025 == 0
              rename (`vbles') =0
              drop nw_2025
              tempfile controls
              save `controls'
              
              restore
              keep if nw_2025 == 1
              rename (`vbles') =1
              drop nw_2025
              isid school_id1, sort    // full name after the rename; avoids relying on variable abbreviation
              
              set seed 1234    // OR WHATEVER INTEGER YOU LIKE
              cross using `controls'
              gen delta = abs(sample_prob1 - sample_prob0)
              gen double shuffle = runiform()
              // by school_id1 (delta shuffle), sort: keep if _n == 1
              sort school_id1 delta shuffle
              local current 1
              while `current' <= c(N) {    
                  drop if school_id1 == school_id1[`current'] ///
                      & school_id0 != school_id0[`current']
                  drop if school_id0 == school_id0[`current'] ///
                      & school_id1 != school_id1[`current']
                  local ++current
              }
              gen `c(obs_t)' pair_num = _n
              reshape long `vbles', i(pair_num) j(nw_2025)
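              drop /*delta*/ shuffle

              Changes from the code in #4: the -by school_id1 (delta shuffle), sort: keep if _n == 1- line is commented out and replaced by the -sort- command and the -while- loop, and delta is no longer dropped at the end.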

              Given that you are simply matching nearest neighbor on a score (among the controls not already taken by another case), and you have more controls than cases, there is no danger of a case going without a match. But it is entirely possible that some of the last cases to be matched (those with the highest-numbered school_ids) will have terrible matches. (I have not -drop-ped delta in this code so that you can check for yourself whether this has happened.) Let me point out that there is no statistical advantage to avoiding the re-use of controls in matching. I know that many people prefer it, but it has no justification other than aesthetic. And if you do end up with some of the pairs being badly matched, it is all downside with no compensating upside.
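
              A rough sketch of such a check, run after the code above. The variable first and the 0.05 tolerance are arbitrary illustrations, not recommendations, and the covariates in the last line are those from the example data in #3:

              Code:
              * delta is constant within pair_num, so flag one row per pair
              egen byte first = tag(pair_num)
              * Distribution of the within-pair differences in sample_prob
              summarize delta if first, detail
              * List the pairs whose difference exceeds the (arbitrary) tolerance
              list pair_num delta if first & delta > 0.05, noobs
              * Compare covariate means and SDs between matched cases and controls
              tabstat urban tot_num_garcons tot_num_filles, by(nw_2025) statistics(mean sd)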

              • #8
                Thank you
