
  • Finding matches based on exact country and industry criteria, and the closest propensity score

    Dear Statalist,

    I'm currently working on my master's dissertation, and I'm having difficulty finding the appropriate code for what seems like a relatively simple problem. Given the pressing nature of my project and the limited time available, I kindly request the assistance of anyone who has experience with this topic. Your support would be very meaningful to me.

    In my dataset, I have 1,141 treated firms and 37,000 control firms. I aim to match each treated firm with a control firm based on an exact match of country and industry codes, as well as the nearest propensity score (which I have already computed). My desired outcome is a table that displays the matched pairs. Despite searching through the forum and attempting different codes, I have been unsuccessful in achieving this. I would greatly appreciate any assistance with this matter.

    Regarding the country and industry codes, I am unsure whether they should be set as "string" or "numeric" data types. I'm also curious about the potential impact on the execution of the code.

    If I want to consider two scenarios: (1) one control firm for each treated firm; and (2) one control firm can be matched with multiple treated firms, what should the codes be for these two particular scenarios?


    Thank you in advance for any help you can provide.

  • #2
    I doubt anyone can provide more than vague, general advice without having example data to work with.

    Please post back, using the -dataex- command to do so. If you are running version 18, 17, 16, or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    In choosing the example data to show, please be sure to include all of the variables needed for the matching. Also include both some treated and some control firms, and be sure that some of the included ones are potential matches.

    Regarding the data types of the country and industry codes, it does not matter for purposes of code development which you choose--the code will be the same. The use of numeric variables would be somewhat faster in a very large data set, but I doubt it would make a noticeable difference in one of the size you describe.
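
    That said, if you prefer numeric codes, one common approach is -encode-, which creates a labeled numeric variable from a string one. A minimal sketch, assuming a string variable named -country- as in your description:
    Code:
    * create a labeled numeric version of a string code variable
    encode country, generate(country_num)
    * the original text is preserved as the value label
    label list country_num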

    Comment


    • #3
      Hi Mr. Schechter,

      I am so glad to get a reply from you. Thank you very much for your guidance.

      Below is an illustration of my dataset. I have ISIN, treatment (=1 for treated firms and 0 for control firms), country code, industry code, market capitalisation, and price-to-book value. I am aiming to match treated firms and control firms in pairs on exact country and industry and on the nearest market_cap and price_to_book: that is, I am supposed to compute propensity scores from market_cap and price_to_book, then match on the nearest propensity score. Could you please help me with this?

      Code:
      clear
      input byte id str12 isin byte treatment str2 country byte industry float(market_cap price_to_book)
       1 "DE0007664005" 1 "DE" 29   43540.5   .244
       2 "DE000UNSE018" 1 "DE" 35   951.496   .215
       3 "DE0007100000" 1 "DE" 29   65805.7    .76
       4 "IT0003128367" 1 "IT" 35   51138.4  1.215
       5 "IT0003132476" 1 "IT"  6  47903.93   .867
       6 "DE0005557508" 1 "DE" 61   93236.8  1.068
       7 "DE000ENAG999" 1 "DE" 35 24664.637  1.128
       8 "DE0005552004" 1 "DE" 53  43794.55  1.848
       9 "DE0008404005" 1 "DE" 65  81287.94  1.579
      10 "DE000BASF111" 1 "DE" 20  41541.91  1.015
      12 "FR0010208488" 1 "FR" 35 32603.596    .83
      14 "FR0010208488" 1 "FR" 35 32603.596    .83
      16 "FR0014003U94" 1 "FR" 45   354.671  1.683
      18 "FR0000073272" 1 "FR" 30  49953.41  4.597
      20 "FR0000035164" 1 "FR" 30    1135.5    1.8
      22 "IT0005037210" 1 "IT" 70  1076.322  2.677
      24 "DE0005810055" 1 "DE" 66     31027  3.424
      26 "DE000A0ETBQ4" 1 "DE" 66   541.202   .703
      28 "DE0005493365" 1 "DE" 66   639.922  2.346
      30 "DE000FTG1111" 1 "DE" 66   703.092  1.156
      32 "IT0003097257" 1 "IT" 28     347.5    1.3
      34 "GB00BYM8GJ06" 1 "GB" 73  1004.122   1.18
      36 "GB00BGDT3G23" 1 "GB" 73  4774.876 61.913
      38 "GB00B5NR1S72" 1 "GB" 74   511.243  1.654
      40 "GB00B61TVQ02" 1 "GB" 45  3474.502   1.96
      42 "IT0005452658" 0 "IT" 70     572.2    5.5
      44 "NL0015000N33" 0 "IT" 70  1003.061   .991
      46 "FR0010220475" 0 "FR" 30  9541.772  1.048
      48 "FR0000032278" 0 "FR" 30     196.5    1.7
      50 "FR0014007LQ2" 0 "FR" 45  44.93958  1.969
      52 "FR0013030152" 0 "FR" 35 264.84802  4.198
      54 "FR0012532810" 0 "FR" 35       643    5.2
      56 "DE0006095003" 0 "DE" 66  2997.577  3.133
      58 "DE000A161077" 0 "DE" 66 142.38539   .686
      60 "DE000A0B9N37" 0 "DE" 66  228.9467  6.193
      62 "DE000A2GSU42" 0 "DE" 66 239.75325    .76
      64 "DE0005408686" 0 "DE" 66     377.9      2
      66 "DE0008148206" 0 "DE" 66 103.45737   .395
      68 "IT0001237053" 0 "IT" 28 191.14803    .69
      70 "IT0005107492" 0 "IT" 28 223.59818  1.057
      72 "JE00B8KF9B49" 0 "GB" 73  9938.426  2.111
      74 "IM00BQ8NYV14" 0 "GB" 73 1316.8152  1.197
      76 "GB00B01F7T14" 0 "GB" 73 208.87483  4.834
      78 "GB00BDVZYZ77" 0 "GB" 74    2108.1     .4
      80 "GB00B19NLV48" 0 "GB" 74   27868.4    8.8
      82 "GB0009697037" 0 "GB" 74    1601.4    2.1
      84 "GB00BLGXWY71" 0 "GB" 74  618.8959  1.955
      86 "GB00BYQB9V88" 0 "GB" 45   786.445  1.162
      88 "GB00BD0SFR60" 0 "GB" 45     183.5    3.2
      90 "GB00BVYVFW23" 0 "GB" 45    5523.5    9.3
      end

      Comment


      • #4
        Thanks for the good example data.


        To start with, we match up every case (treated) with all controls that are exact matches on country and industry. Then we calculate the absolute difference in propensity score, and sort the data so that the first control for each case is the one with the closest propensity score. (If there are two or more controls tied for closest propensity score, the tie is broken by a random choice.)
        Code:
        //    VERIFY ID IS AN IDENTIFIER
        isid id
        
        //    CALCULATE PROPENSITY SCORE
        assert inlist(treatment, 0, 1)
        logistic treatment market_cap price_to_book
        predict pscore
        
        //    MATCHING
        local match_vars country industry
        ds treatment `match_vars', not
        local non_match_vars `r(varlist)'
        
        preserve
        keep if !treatment
        rename (`non_match_vars') =_ctrl
        drop treatment
        tempfile controls
        save `controls'
        
        restore
        keep if treatment
        rename (`non_match_vars') =_case
        drop treatment
        joinby `match_vars' using `controls'
        
        //    NOW SELECT CLOSEST PROPENSITY SCORE MATCH, BREAKING TIES AT RANDOM BUT REPRODUCIBLY
        gen delta = abs(pscore_case - pscore_ctrl)
        set seed 1234 // OR WHATEVER RANDOM NUMBER SEED YOU LIKE
        gen double shuffle = runiform()
        From this point, the code differs in the two scenarios you outlined in #1. The following shows the way to do your second scenario:
        Code:
        //    THIS WAY ALLOWS THE SAME CONTROL TO MATCH TO MULTIPLE TREATED CASES
        by id_case (delta shuffle), sort: keep if _n == 1

        And the following shows the way you would do it for your first scenario.
        Code:
        //    THIS WAY RESTRICTS CONTROLS TO MATCHING ONLY ONE CASE
        sort id_case delta shuffle
        local i = 1
        while `i' < _N {
            drop if id_case == id_case[`i'] & _n > `i'
            drop if id_ctrl == id_ctrl[`i'] & _n > `i'
            local ++i
        }
        The end result is a set of matched pairs that agree on country and industry and are, as closely as possible, matched on propensity score. (Any cases or controls that found no match will have been eliminated.)

        As between the two scenarios, allowing controls to match only one case seems to be very popular, I suppose for aesthetic reasons. But the statistical reality is that this restriction offers no advantages and has an important drawback. When you restrict each control to being only used once, it may be that some case that has multiple potential matches draws away the only good control for some other case. That other case may be either left unmatched altogether, or end up with a substantially inferior match. (This doesn't actually happen in your example data, but that doesn't mean it won't in your full data set.)

        Comment


        • #5
          Thank you very much for your detailed instructions. I have successfully matched the firms both ways following your code, and that is a big step completed before further analysis. Your assistance means a lot to me. I hope you have a great weekend.

          Comment


          • #6
            Originally posted by Clyde Schechter View Post
            Thanks for the good example data. [...]

            Dear Clyde,


            I appreciate your previous assistance with propensity score matching. Could I (again) kindly request your help with the code for entropy balancing using the same database? I have searched around and learned how to reweight the variables, but I am uncertain about the subsequent steps to obtain pairs of firms similar to those obtained through propensity score matching. Additionally, I am unsure how to require that the firms in each pair have the same country and industry before reweighting the two variables: market capitalisation and price-to-book ratio. Thank you very much, and I really hope to have your support on this matter.

            Comment


            • #7
              Sorry to disappoint, but I don't know anything about this technique. Hopefully somebody else following this thread who does will chime in. If nobody provides a helpful response within, say, 24 hours, I suggest you repost as a new thread.

              Comment


              • #8
                Originally posted by Clyde Schechter View Post
                Sorry to disappoint, but I don't know anything about this technique. Hopefully somebody else following this thread who does will chime in. If nobody provides a helpful response within, say, 24 hours, I suggest you repost as a new thread.
                Thanks a lot for getting back to me. I will wait to see if someone knows about it. Wish you a pleasant day.

                Comment


                • #9
                  Originally posted by Clyde Schechter View Post
                  Thanks for the good example data. [...]

                  Dear Clyde Schechter,

                  I would be grateful if you could take some time once more to help me with propensity score matching. I am working on another dataset that requires propensity score matching, which is a bit different from the previous one.

                  In my dataset, I have treated and control funds. The propensity score is computed based on fund_age, net_asset, and return. The treatment values take 1 for treated funds and 0 for control funds. My desired outcome is a table that displays the matched pairs. If possible, could you please help me with the codes to match: (1) one control fund for each treated fund; and (2) one control fund matched with multiple treated funds?

                  I have attached a part of my dataset below.

                   Also, I would like to add a small question. My variable net_asset is read as string type by Stata, which is strange because it is supposed to be numeric. I have tried to change the data format, but it did not work, so I have to encode it into numeric every time I use this database. Could you please share your experience with this issue?

                  Thank you very much for your time reading this post. Your assistance is very meaningful to me.

                  Code:
                   clear
                   input str44 name str10 fund_id byte fund_age str18 net_asset float return byte treatment
                  "Harris Associates Kokusai S/A USD"         "FS00008KNR" 10 "23,539,027.31"        6.84 1
                  "BL-Equities Japan B EUR Hedged"            "FS00008KQC"  8 "2,114,354.39"         5.33 1
                  "ODDO BHF Emerging ConDmd CIW EUR Acc"      "FS00008KRV" 10 "154,587,573.00"        .98 1
                  "SPDR® MSCI ACWI ETF"                      "FS00008KT6" 10 "2,166,734,385.00"      6.1 1
                  "SPDR® MSCI ACWI IMI UCITS ETF"            "FS00008KT7" 10 "283,345,104.80"        6.2 1
                  "SPDR® MSCI EM Asia ETF"                   "FS00008KT8" 10 "1,267,964,232.00"      .19 1
                  "SPDR® MSCI Emerging Markets ETF"          "FS00008KTB" 10 "457,344,740.70"       1.78 1
                  "SPDR® MSCI Emerging Markets SmallCap ETF" "FS00008KTC" 10 "135,820,252.10"       4.78 1
                  "Invesco Pan European Focus Eq A EUR Acc"   "FS00008L1N" 10 "7,556,821.00"         8.45 1
                  "Neuberger Berman US Sm Cap EUR A Acc"      "FS00008L2Q"  0 "729,826.12"           1.45 1
                  "UBS FS MSCI Emerg Mkts SF USD A acc ETF"   "FS00008LJO" 10 "481,120,764.60"       1.69 1
                  "Pictet-China Index I EUR"                  "FS00008LK2"  3 "93,395,833.00"       -3.26 1
                  "UBS FS S&P 500 SF USD A acc ETF"           "FS00008LMV" 10 "126,892,677.00"       7.77 1
                  "Vontobel Fd II mtxEmMktsSstbyChampNGEUR"   "FS00008LOW" -1 "158,589,108.70"       1.54 1
                  "Sextant Tech A"                            "FS00008LRJ" 10 "8,745,000.00"         5.37 1
                  "Templeton European Div A(acc)EUR"          "FS00008MD9" 10 "8,541,244.00"         3.25 1
                  "Mirabaud Eqs Swiss Sm & Mid I EUR Acc"     "FS00008MDY"  6 "187,499,218.50"       4.83 1
                  "iShares Europe Index (IE) D Acc EUR"       "FS00008MO0"  4 "4,497,979.00"         6.52 1
                  "Abeille Capital Planète"                  "FS00008MVC" 10 "9,429,000.00"       1.6996 1
                  "TA-ITA Azioni"                             "FS00008MVH" 13 "73,915,000.00"        7.65 1
                  "Pharus SICAV EOS A1 EUR Acc"               "FS00008N1A" 10 "12,480,828.00"         8.5 1
                  "Portfolio Wachstum ZKB Oe I T"             "FS00008N1D"  5 "19,544,953.00"        1.44 1
                  "Portfolio Wachstum (Euro) Alt ZKB Oe I T"  "FS00008N1E"  5 "31,397,515.00"        1.24 1
                  "LBPAM ISR Actions Emergents L"             "FS00008N2O" -1 "40,524,000.00"        1.18 1
                  "GAM Sustainable Emerg Eq EUR Acc"          "FS00008N8J" 10 "2,300,000.00"         1.71 1
                  "GAM Star Capital Apprec US Eq GBP Acc"     "FS00008NDX"  2 "76,544.80"            6.75 1
                  "William Blair EM Leaders D USD Acc"        "FS00008NEV" 10 "4,280,819.49"          -.7 1
                  "MFS Meridian Blnd Rsrch Eurp Eq A1 EUR"    "FS00008NHD" 10 "3,584,479.00"         7.64 1
                  "Mirae Asset ESG Asia Grt Cnsmr Eq A EUR"   "FS00008NMU"  8 "3,045,012.38"        -5.23 1
                  "ACATIS Global Value Total Return"          "FS00008NOQ" 10 "45,163,062.00"     5.49962 1
                  "Norron Active RC SEK"                      "FS00008NOT" 10 "16,404.07"            5.34 1
                  "Sands Capital Global Growth A EUR Acc"     "FS00008NQV"  6 "25,815,530.00"        -.13 1
                  "Artisan Global Value I EUR Acc"            "FS00008NS0"  5 "15,915,930.33"        9.13 1
                  "Handelsbanken USA Ind Crit A1 EUR"         "FS00008NSF"  6 "183,611,600.00"       6.74 1
                  "Handelsbanken Sverige 100 Ind Cri A1 SEK"  "FS00008NSG" 10 "670,488,902.30"       7.81 1
                  "HSBC FTSE EPRA NAREIT Dev ETF USD (Acc)"   "FS00008NTM" -1 "150,160,539.30"        6.3 1
                  "HSBC MSCI Russia Capped ETF"               "FS00008NTO" 10 "109,564,435.20"       8.61 1
                  "Global Diversification Fund FI"            "FS00008NXB" 10 "4,729,779.00"          -.4 1
                  "BGF Emerging Markets Eq Inc A2"            "FS00008O3B" 10 "33,485,106.45"        2.13 1
                  "Dutch Darlings Fund"                       "FS00008O49" 13 "18,733,856.00"        6.71 1
                  "TT Emerging Markets Equity C2 EUR Acc"     "FS00008OBH"  4 "3,921,421.30"         -1.4 1
                  "Nuveen Global Clean Infras Imp A EUR Acc"  "FS00008OFJ" 10 "35,835.00"            8.18 1
                  "Mirova Europe Sust Eq I/C EUR"             "FS00008OGO"  9 "8,344,810.00"         4.79 1
                  "HSBC MSCI Emerg Mkts ETF"                  "FS00008ORA"  9 "850,667,978.50"       1.77 1
                  "Invesco Global Equity Income A EUR Acc"    "FS00008P06" -2 "3,393,928.11"         5.45 1
                  "Invesco Dev Sm and MidCap Eq A EURHAcc"    "FS00008P07"  9 "3,985,001.00"         1.74 1
                  "Invesco US Value Equity E EUR Acc"         "FS00008P0O"  9 "55,353,881.18"        9.58 1
                  "Invesco Rspnb Jpn Eq Val Discv A EUR Acc"  "FS00008P0P"  0 "11,540.24"            5.07 1
                  "Invesco Japanese Eq Adv A Ann EURH Inc"    "FS00008P0Q"  2 "48,013,863.00"           5 1
                  "Mercer Low Volatility Eq A1 H 0.0200 EUR"  "FS00008P4X"  1 "1,786,815.87"         4.56 1
                  "Didner & Gerge Global"                     "FS00008R0F"  9 "596,043,057.30"       7.05 1
                  "Aktia Europe Small Cap K"                  "FS00008R0Q" -1 "4,995,783.00"      3.34855 1
                  "Arc Actions Rendement"                     "FS00008R14"  9 "16,173,000.00"        6.53 1
                  "Dorval Manageurs Europe I C"               "FS00008R5D" 10 "108,057,539.00"          6 1
                  "JPM Euroland Dynamic A perf (acc) EUR"     "FS00008R6R"  9 "47,900,591.00"        7.89 1
                  "Comgest Growth Europe S EUR S Acc"         "FS00008R7W" 10 "20,082,244.00"        6.04 1
                  "Robeco BP US Select Opports Eqs D €"     "FS00008R9A"  7 "174,206,852.70"          9 1
                  "JOHCM Asia ex-Japan A EUR Inc"             "FS00008TMC"  9 "3,017,773.00"         1.55 1
                  "JOHCM Asia ex-Japan Sm & Md-Cp A € I"    "FS00008TMD"  9 "592,092.90"          -1.27 1
                  "UBS(Lux)FS MSCI EMU SRI EUR Aacc"          "FS00008VBC"  3 "24,197,175.92"        4.55 1
                  "UBS(Lux)FS MSCI USA SRI EURH Adis"         "FS00008VBD"  5 "16,093,119.65"        4.35 1
                  "UBS(Lux)FS MSCI Pacific SRI USD Aacc"      "FS00008VBE"  1 "10,992,309.83"        3.62 1
                  "UBS(Lux)FS MSCI World SRI USD Aacc"        "FS00008VBF"  3 "611,970,067.70"       6.77 1
                  "Nordea 1 - Stable Emerg Mkts Eq AX EUR"    "FS00008VJB"  6 "22,580,521.00"        4.29 1
                  "HANSAsmart Select E A"                     "FS00008WGQ"  9 "97,659,000.00"        4.31 1
                  "CT (Lux) US Contr Core Equities AEC"       "FS00008XAW"  5 "5,460.00"              3.9 1
                  "Polar Capital North American I"            "FS00008XB6"  9 "450,854,207.60"       8.05 1
                  "Xtrackers MSCI Pakistan Swap ETF 1C"       "FS00008XHT"  9 "11,305,342.65"        2.88 1
                  "Xtrackers MSCI Singapore ETF 1C"           "FS00008XHU"  9 "35,202,941.97"        7.88 1
                  "Fidelity FAST Emerging Markets A-ACC-EUR"  "FS00008XZD"  6 "1,014,034.00"         1.44 1
                  "Espiria SDG Solutions A"                   "FS00008XZS"  6 "9,136,402.27"         5.19 1
                  "Enh Index Sust EQ Fund NL-T"               "FS00008YXT" 10 "371,810,637.00"       6.44 1
                  "iShares Gold Producers ETF USD Acc"        "FS00008ZAJ"  9 "1,504,549,455.00"      7.3 1
                  "iShares Oil & Gas Explr&Prod ETF USD Acc"  "FS00008ZAK"  9 "169,210,465.80"       8.57 1
                  "iShares Agribusiness ETF USD Acc"          "FS00008ZAL"  9 "175,759,799.50"       7.08 1
                  "Apus Capital Revalue Fonds I"              "FS00008ZCK"  4 "21,421,000.00"       -5.31 1
                  "Core Series - Core Emg Mkts Eq B EUR ND"   "FS00008ZGB"  1 "2,020,314.00"          .02 1
                  "SPDR® S&P US Dividend Aristocrats ETFDis" "FS00008ZI5"  9 "2,492,569,237.00"    10.64 1
                  "SPDR S&P EmMks Dividend Aristocrats ETF"   "FS00008ZI6"  9 "128,106,951.40"       4.12 1
                  "Tocqueville Value Euro ISR GP"             "FS00008ZID"  4 "51,027,960.00"        7.55 1
                  "Groupama Europe Actions Immobilier G"      "FS00008ZLW" 10 "8,288,000.00"         3.42 1
                  "Nykredit Invest Globale Fokusaktier KL"    "FS00008ZNF"  9 "201,330,699.20"       5.71 1
                  "Nykredit Invest Bæredygtige Aktier KL"    "FS00008ZNH"  9 "515,190,252.50"       6.22 1
                  "iShares MSCI ACWI ETF USD Acc"             "FS00008ZQ1"  9 "1,766,443,332.00"     6.05 1
                  "Amundi Russell 1000 Growth ETF Acc"        "FS00008ZQ6"  9 "162,997,909.70"       5.03 1
                  "Wealth Invest Strategi Aktier"             "FS0000900P" 10 "50,406,422.11"        4.74 1
                  "DB Advisors Emerg Mkts Eqs Passv ID EUR"   "FS0000901L"  9 "260,794,851.00"       -.48 1
                  "LO Funds Emerging High Convict SH EUR IA"  "FS0000902S"  4 "6,504,401.32"        -3.28 1
                  "Spiltan Aktiefond Investmentbolag"         "FS0000903F"  9 "2,378,392,105.00"     9.91 1
                  "Ofi Invest France Equity I"                "FS0000905L"  6 "8,077,000.00"          7.2 1
                  "Carnegie Fastighetsfond Norden A"          "FS000090DJ"  9 "195,156,301.40"       4.42 1
                  "Nordea 1 - Global Real Estate BC EUR"      "FS000090GG"  3 "822.77"               5.69 1
                  "AB Select US Equity A EUR"                 "FS0000913J"  9 "9,335,473.92"     6.632853 1
                  "Abacus Tech For Good I"                    "FS0000914T"  9 "8,165,000.00"      6.38704 1
                  "Robeco Instl Emerging Markets Fund"        "FS00009159" 27 "1,261,555,945.00"     2.54 1
                  "Robeco QI Instl EM Enhanced Index Eqs Fd"  "FS0000915A" 14 "1,820,868,366.00"     2.95 1
                  "Nuveen Global Dividend Growth A Eur Acc"   "FS0000917U"  9 "158,004.00"           8.13 1
                  "Amundi Fds EurEq Sust Inc A2 EUR C"        "FS000091CZ"  7 "1,371,179.89"         9.58 1
                  "Ninety One GSF Glb Value Eq A Acc EURH"    "FS000091F6"  1 "11,370.00"            3.57 1
                  "Espiria Global A"                          "FS000091GV"  8 "100,091,856.90"       6.59 1
                  end

                  Comment


                  • #10
                    Your example data contains only treatment cases, no controls. Please post back with a data example that includes both cases and controls.

                    Also please clarify what you are asking for help with. Is it the calculation of the propensity score? Or is it the formation of the matched pairs? If the latter, once you calculate the propensity score, it works almost the same way as the code you quote in #9. The only difference is that local match_vars will be empty, and where it says -joinby `match_vars' using `controls'- you have to replace that with -cross using `controls'-.
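
                    Concretely, the modification amounts to the following (a sketch; it assumes the rest of the earlier matching code is unchanged and the propensity score has already been computed):
                    Code:
                    * no exact-match variables here: pair every treated fund with every control
                    * (replaces -joinby `match_vars' using `controls'- in the earlier code)
                    cross using `controls'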

                    Regarding your variable net_asset, Stata holds it as a string because it contains commas. Stata will only accept as numeric a series of digits, optionally preceded by a sign, and optionally containing one decimal point, and optionally containing one exponent. Commas are not permitted. The simplest way to resolve this is:
                    Code:
                    destring net_asset, ignore(",") replace
                    Assuming that -destring- issues no error messages from that, then save this as a new data set, and use the new one from now on. (Archive the old one for reference.)

                    If -destring- issues an error message, then there is some additional problem with the data, one that is not evident in your example. You can find the observations causing the trouble easily enough by running -list net_asset if missing(real(net_asset)) & !missing(net_asset)-. Then you have to scan that output to see what is wrong with those values of net_asset and either fix them, if possible, or delete them if not.

                    Comment


                    • #11
                      Originally posted by Clyde Schechter View Post
                      Your example data contains only treatment cases, no controls. [...]

                      Thank you very much for your guidance. Yes, I was confused by the local match_vars and -joinby-, but I have solved it thanks to your explanation. I have also followed the -destring- code and it works.

                      Thank you for taking the time to reply to my question. I do appreciate your support. Hope you enjoy the rest of the day.

                      Comment


                      • #12
                        Dear Clyde,
                        Regarding your answer to the first question and following the same example, could you please guide me on how to decide between:
                        1. Estimating the propensity score on the pooled sample, then matching on the exact country and industry (as in the initial question).
                        2. Running the full matching procedure (both PS and matching) separately for each country and industry.
                        3. Adding dummies for industry and country in the PS calculation and then match on the pooled sample.
                        I understand that if a variable strongly influences participation, the second approach might be necessary. I am analysing the effect of finding a job through an employment agency on salaries. Since wages are influenced by the economic cycle (quarter of the year) and by gender, I'm unsure of the best way to incorporate these two variables into the matching process. Could you advise, please? Thank you!

                        Comment


                        • #13
                          This is a very good question, one I've never run across before. As I think about it, the three approaches differ only in the way that the propensity score is calculated. In 1, the propensity score is unaware of country and industry. In 2, we go to the opposite extreme: the propensity model itself is country#industry specific. In 3, we have an intermediate approach in which i.industry and i.country figure in the propensity score calculation--so there is some adjustment of PS for country and industry, but it is modestly done.

                          I don't grasp what you are saying in "I understand that if a variable strongly influences participation, the second approach might be necessary." And I can't find anything in my understanding that says that strength of influence on participation is important in this context. But perhaps I'm just missing something in this regard.

                          Here's my intuition on how to approach this trilemma. Propensity score matching's effectiveness depends on the estimated PS being a good approximation to the actual probability of being in the group exposed to the intervention/exposure. So I think it's a question of which PS model is the most valid model of the real data generating process. On the one hand, there is likely to be value in representing country and industry in the propensity score if these are indeed relevant to the probability of being in the exposed group. On the other hand, including them as indicator variables adds a large number of degrees of freedom to the model and risks overfitting the noise in the data. And allowing a tailored propensity model for each country#industry group of observations explodes that risk of overfitting to an extreme level.

                          So what I might do is first try fitting all three PS models: one without country and industry using the entire sample, one with indicators for industry and country, and then another where each country#industry group gets its own PS model (which can be done by including the interaction of industry#country with all of the other variables in the propensity model). Then look at AIC or BIC to get a sense of which of the models gives the best tradeoff between fitting and overfitting.
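As a hedged sketch of that comparison, with exposed as the treatment indicator and x1, x2 as placeholder covariates (substitute the actual propensity-model variables):
Code:
* placeholders: exposed, x1, x2
logit exposed c.x1 c.x2                            // model 1: PS unaware of country/industry
estimates store m1
logit exposed c.x1 c.x2 i.country i.industry       // model 3: country/industry indicators
estimates store m3
logit exposed i.country##i.industry##(c.x1 c.x2)   // model 2: PS tailored per country#industry
estimates store m2
estimates stats m1 m3 m2                           // side-by-side AIC/BIC table
The -estimates stats- command tabulates AIC and BIC for the stored models, which is one convenient way to weigh fit against overfitting here.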

                          Comment


                          • #14
                            Dear Clyde,

                            Thank you for your response.

                            Could you clarify the difference between the second approach, "with indicators for industry and country," and the third, "including the interaction of industry#country with other variables"? Or do you mean including -i.industry- and -i.country- separately for the second option?

                            In any case, I tried the third approach by adding the interaction of industry#country as a covariate in the PS, but I’m still seeing matches across different industries and countries. I guess I need to split the samples by industry#country before calculating the propensity score.

                            I was citing "Some Practical Guidance for the Implementation of Propensity Score Matching" by Caliendo & Kopeinig when referring to variables that strongly influence participation. They indicate that running the PS on specific sub-samples is "especially recommendable if one expects the effects to be heterogeneous between certain groups".

                            Thanks again for your help!

                            Comment


                            • #15
                              The third approach looks like this:
                              Code:
                              logit exposed_group relevant_variables i.country i.industry
                              The second approach looks like this:
                              Code:
                              logit exposed_group i.country##i.industry##(relevant_variables)
                              (Don't forget to use appropriate c. and i. prefixes for the relevant_variables.)
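For instance, with a continuous covariate age and a categorical covariate education (placeholder names, not from the thread), the second approach would read:
Code:
logit exposed_group i.country##i.industry##(c.age i.education)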

                              They indicate that running the PS on specific sub-samples is "especially recommendable if one expects the effects to be heterogeneous between certain groups"
                              Yes, that makes sense. That is different from "if a variable strongly influences participation."



                              Comment
