
  • Creating observation pairs in Stata - balanced sample, no replacement

    Hello,

    I am trying to create a balanced sample of paired observations in Stata based on a treated and a control group. Each firm-year observation in my treated group should match with one firm-year observation in my control group. The observations must match exactly on industry and year, and then on assets by closest value (so perhaps a nearest-neighbor match?).

    How should I go about creating this sample without replacement? I have many more control firms than treated firms. The dataset looks something like this:
    Firm Year Treatment Industry Assets
    1 2020 0 1 140
    2 2019 0 2 50
    3 2019 1 2 100
    4 2020 1 1 150
    5 2020 0 1 200
    6 2019 0 2 90
    7 2018 0 2 25
    8 2020 0 2 300

    In this example, I would expect firm #3 to match with firm #6 and firm #4 to match with firm #1, giving me a sample of 4 observations (2 treated and 2 control).

    Thank you in advance for the help.

  • #2
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear*
    input byte firm int year byte(treatment industry) int assets
    1 2020 0 1 140
    2 2019 0 2  50
    3 2019 1 2 100
    4 2020 1 1 150
    5 2020 0 1 200
    6 2019 0 2  90
    7 2018 0 2  25
    8 2020 0 2 300
    end
    
    isid firm
    
    preserve
    keep if treatment == 0
    drop treatment
    rename (firm assets) =_control
    tempfile controls
    save `controls'
    
    restore
    keep if treatment == 1
    drop treatment
    
    joinby industry year using `controls', unmatched(master)
    
    gen delta = abs(assets - assets_control)
    sort firm delta
    
    local i = 1
    while `i' <= _N {
        drop if firm == firm[`i'] & _n > `i'
        drop if firm_control == firm_control[`i'] & _n > `i'
        local ++i
    }

    Because you requested code for sampling without replacement, I have provided it. Be aware that there is no statistical reason to prefer sampling without replacement (and there are reasons to prefer sampling with replacement, though they are not compelling) and the preference many people have for sampling without replacement is purely esthetic.

    There is one strong reason, however, to prefer sampling with replacement for your setup. If there are two firms in the same year and industry that have the same amount of assets, they are competing for the same best-matched control and one of them must lose and settle for an inferior match (unless there is also a pair of equally good matching controls). In sampling with replacement, both could receive an optimal match.

    If your data set is very large, this is going to be fairly slow, so be patient. (Sampling with replacement would be faster.)
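
    For comparison, here is a minimal sketch of the with-replacement variant described above, run on the same example data. It is an illustration, not a tested solution for the full data: it assumes firm uniquely identifies observations, as in this example, and it breaks ties on delta by the lower firm_control number, which is an arbitrary choice.

    ```stata
    clear
    input byte firm int year byte(treatment industry) int assets
    1 2020 0 1 140
    2 2019 0 2  50
    3 2019 1 2 100
    4 2020 1 1 150
    5 2020 0 1 200
    6 2019 0 2  90
    7 2018 0 2  25
    8 2020 0 2 300
    end

    * Put the controls in their own file, with suffixed variable names
    preserve
    keep if treatment == 0
    drop treatment
    rename (firm assets) =_control
    tempfile controls
    save `controls'
    restore

    * Pair every treated firm with every control in the same industry and year
    keep if treatment == 1
    drop treatment
    joinby industry year using `controls', unmatched(master)
    gen delta = abs(assets - assets_control)

    * With replacement: keep each treated firm's closest control, whether or
    * not another treated firm also uses it.
    * Ties on delta are broken by firm_control (an arbitrary assumption).
    bysort firm (delta firm_control): keep if _n == 1

    list firm assets firm_control assets_control delta
    ```

    On the example data this pairs firm 3 with firm 6 and firm 4 with firm 1, the same pairs expected in #1. A treated firm with no control in its industry and year keeps a single row with missing control values.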

    In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.



    • #3
      Thank you, Clyde, for the help here. Unfortunately, I made a mistake in the sample dataset. This is panel data with repeating firm numbers: I have firm-year observations, which I think is what prevents the code from working.

      The code also drops my treatment variable, which I need to run my regressions. I am expecting a balanced sample with a control observation (treatment = 0) for each treated observation (treatment = 1). Then I can run a logit regression to estimate the treatment.

      I will use the -dataex- feature for future posts as you mention.



      • #4
        If you have panel data, then you have not specified how you want the matching to work. Firms that match in one year may not match well in other years. And it is usually completely incoherent to match each firm with a different control in each year. So you need to spell out how you want the matching to work in this instance.

        As for the loss of the treatment variable and having a balanced sample, you can create that easily from what the code in #2 produces by -reshape-ing the data to long layout.



        • #5
          Clyde, I've added new sample data below. I would like to match each treated firm-year observation to one firm-year observation from the control group.

          The match would be an exact match on industry and year, and a nearest neighbor match on assets. Once a control firm-year observation is used as a match, it cannot be used again (i.e. no replacement). So the sample should be balanced.

          The treatment variable is the main independent variable in my regressions.

          *data*
          input byte firm int year byte(treatment industry) int assets

          7620 2017 0 13 9861
          170750 2018 0 13 21596
          20548 2019 0 13 4487
          162459 2019 0 13 2156
          14359 2018 1 13 17903
          11923 2018 0 13 40376
          26069 2018 1 13 6051
          10581 2017 0 35 3402
          2444 2018 1 35 4286
          6737 2017 0 35 2407
          5087 2019 0 35 2135
          7991 2018 0 35 3486
          14898 2018 0 35 826
          1704 2017 0 35 19419
          6509 2017 0 35 1171
          5878 2017 0 35 16780
          142260 2017 0 35 1302
          137602 2018 0 35 548
          5252 2019 0 35 1692
          12262 2019 0 35 800
          4058 2017 1 35 10658
          2817 2019 0 35 78453
          15267 2019 0 35 3814
          11399 2019 0 35 26370
          Last edited by Jason Damm; 02 Dec 2020, 17:21.



          • #6
            Your new example was not produced with -dataex-. You hand-edited an imitation. So go back and run that yourself so you can see that it doesn't work: it produces all missing values for the firm variable. ALWAYS USE -dataex- TO SHOW EXAMPLE DATA. ALWAYS. YES, I'M SHOUTING AT YOU. It is a waste of your time to type out and send me data that can't be used, and it is a waste of my time to fix it so it becomes usable.

            After fixing that problem, I notice that there are a couple of observations in your new data for which the value of assets is missing. I'm assuming that such observations should never be matched to anything, and I'm deleting them. Since the possibility of missing data has been raised, I have also asked Stata to delete any observations with missing firm, year, treatment, or industry. There aren't any such in your example, but perhaps there are in the full data.

            I've extended the code to give you the layout you want with the case-control pairs in two observations rather than side-by-side. Other than that, it's really a very minor modification of the code in #2.

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input long firm int year byte(treatment industry) int assets
              7620 2017 0 13  9861
            170750 2018 0 13 21596
             20548 2019 0 13  4487
            162459 2019 0 13  2156
             14359 2018 1 13 17903
             11923 2018 0 13     .
             26069 2018 1 13  6051
             10581 2017 0 35  3402
              2444 2018 1 35  4286
              6737 2017 0 35  2407
              5087 2019 0 35  2135
              7991 2018 0 35  3486
             14898 2018 0 35   826
              1704 2017 0 35 19419
              6509 2017 0 35  1171
              5878 2017 0 35 16780
            142260 2017 0 35  1302
            137602 2018 0 35   548
              5252 2019 0 35  1692
             12262 2019 0 35   800
              4058 2017 1 35 10658
              2817 2019 0 35     .
             15267 2019 0 35  3814
             11399 2019 0 35 26370
            end
            
            isid firm year
            drop if missing(firm, year, industry, assets, treatment)
            
            preserve
            keep if treatment == 0
            drop treatment
            rename (firm assets) =_control
            tempfile controls
            save `controls'
            
            restore
            keep if treatment == 1
            drop treatment
            
            joinby industry year using `controls', unmatched(master)
            
            gen delta = abs(assets - assets_control)
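            * Within each treated firm-year, put the closest control first
            sort firm year delta
            
            * Greedy matching without replacement. Observations are firm-years,
            * so both the case and the control are identified by firm AND year.
            local i = 1
            while `i' <= _N {
                drop if firm == firm[`i'] & year == year[`i'] & _n > `i'
                drop if firm_control == firm_control[`i'] & year == year[`i'] & _n > `i'
                local ++i
            }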
            
            rename (firm assets) =_case
            gen long pair = _n
            drop _merge delta
            
            reshape long firm assets, i(pair) j(case_control) string
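
            Since the treatment variable is needed for the regressions (#3, #5), one way to rebuild it after the -reshape- is sketched below. It assumes that case_control holds the suffixes "_case" and "_control", which is what the stub names firm and assets should leave in j(); the -assert- stops the code if that assumption fails. The three pairs typed in here are only a stand-in for the reshaped data, using firms from the example above.

            ```stata
            clear
            input long pair str8 case_control long firm int year byte industry int assets
            1 "_case"      2444 2018 35  4286
            1 "_control"   7991 2018 35  3486
            2 "_case"      4058 2017 35 10658
            2 "_control"   5878 2017 35 16780
            3 "_case"     14359 2018 13 17903
            3 "_control" 170750 2018 13 21596
            end

            * Check the assumed coding of case_control before using it
            assert inlist(case_control, "_case", "_control")

            * 1 = treated observation, 0 = its matched control
            gen byte treatment = (case_control == "_case")

            * A balanced sample has equal counts of 0 and 1
            tabulate treatment
            ```

            With one case and one control per pair, -tabulate- should show the same number of observations in each group.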



            • #7
              Hi, Clyde, I tried your code on my test data, but it seems that it cannot match all of the treated firms with control firms.



              • #8
                Originally posted by Clyde Schechter
                Hi, Clyde, I tried your code on my test data, but it seems that it cannot match all of the treated firms with control firms.



                • #9
                  Well, that is one of the limitations of matching: sometimes there are cases for which no appropriately matching control exists.

                  I would strongly suggest you abandon the "no replacement" requirement. It may well be that some of the unmatched firms could be matched with a control in your data, but that control is already "taken" by another firm. Abandoning the "no replacement" requirement will resolve such cases. And as I pointed out in #2, there is no statistical reason to insist on no replacement.

                  Given the nature of your matching, that should enable you to match every case with a control, unless there are some years in which you have data for cases but no data for controls in some industry. In that case, consider using a coarser grouping of firms into industries, that is, treat as a single industry a small group of industries that are, for purposes of your problem, sufficiently similar.
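
                  To see which unmatched cases are caused by the "no replacement" rule and which by a missing control group, one can count cases and controls within each industry-year cell. This is a rough sketch on a few rows from the example in #6, not part of the matching code; the variable names n_case, n_control, and cell are made up for the illustration.

                  ```stata
                  clear
                  input long firm int year byte(treatment industry) int assets
                  170750 2018 0 13 21596
                   14359 2018 1 13 17903
                   11923 2018 0 13     .
                   26069 2018 1 13  6051
                    2444 2018 1 35  4286
                    7991 2018 0 35  3486
                   14898 2018 0 35   826
                  137602 2018 0 35   548
                  end

                  * Observations with missing values are never matched, so drop them first
                  drop if missing(firm, year, industry, assets, treatment)

                  * Count cases and usable controls in each industry-year cell
                  bysort industry year: egen n_case    = total(treatment == 1)
                  bysort industry year: egen n_control = total(treatment == 0)

                  * Flag one row per cell, then list the cells that cannot be
                  * fully matched without replacement
                  egen cell = tag(industry year)
                  list industry year n_case n_control if cell & n_case > n_control
                  ```

                  Here industry 13 in 2018 is listed, with two cases and one usable control: without replacement one of the two cases must go unmatched, whereas with replacement both can use firm 170750. A cell with n_control equal to 0 cannot be matched either way, which is where a coarser industry grouping comes in.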

