
  • Finding matches based on exact country and industry criteria, and the closest propensity score

    Dear Statalist,

    I'm currently working on my master's dissertation, and I'm having difficulty finding the appropriate code for what seems like a relatively simple problem. Given the pressing nature of my project and the limited time available, I kindly request the assistance of anyone who has experience with this topic. Your support would be very meaningful to me.

    In my dataset, I have 1,141 treated firms and 37,000 control firms. I aim to match each treated firm with a control firm based on an exact match of country and industry codes, as well as the nearest propensity score (which I have already computed). My desired outcome is a table that displays the matched pairs. Despite searching through the forum and attempting different codes, I have been unsuccessful in achieving this. I would greatly appreciate any assistance with this matter.

    Regarding the country and industry codes, I am unsure whether they should be set as "string" or "numeric" data types. I'm also curious about the potential impact on the execution of the code.

    If I want to consider two scenarios: (1) one control firm for each treated firm; and (2) one control firm can be matched with multiple treated firms, what should the codes be for these two particular scenarios?


    Thank you in advance for any help you can provide.

  • #2
    I doubt anyone can provide more than vague, general advice without having example data to work with.

    Please post back, using the -dataex- command to do so. If you are running version 18, 17, 16, or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    In choosing the example data to show, please be sure to include all of the variables needed for the matching. Also include both some treated and some control firms, and be sure that some of the included ones are potential matches.

    Regarding the data types of the country and industry codes, it does not matter for purposes of code development which you choose--the code will be the same. The use of numeric variables would be somewhat faster in a very large data set, but I doubt it would make a noticeable difference in one of the size you describe.
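
    That said, if you prefer numeric codes, one common approach is -encode-, which creates a labeled numeric variable from a string one. A minimal sketch, assuming a string variable named -country- as in your description:
    Code:
    * create a labeled numeric version of a string code variable
    encode country, generate(country_num)
    * the original text is preserved as the value label
    label list country_num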

    Comment


    • #3
      Hi Mr. Schechter,

      I am so glad to get a reply from you. Thank you very much for your guidance.

      Below is an illustration of my dataset. I have ISIN, treatment (=1 for treated firms and 0 for control firms), country code, industry code, market capitalisation, and price-to-book value. I am aiming to match treated firms and control firms in pairs on exact country and industry and on the nearest market_cap and price_to_book: that is, I am supposed to compute propensity scores from market_cap and price_to_book, then match on the nearest propensity score. Could you please help me with this?

      Code:
      clear
      input byte id str12 isin byte treatment str2 country byte industry float(market_cap price_to_book)
       1 "DE0007664005" 1 "DE" 29   43540.5   .244
       2 "DE000UNSE018" 1 "DE" 35   951.496   .215
       3 "DE0007100000" 1 "DE" 29   65805.7    .76
       4 "IT0003128367" 1 "IT" 35   51138.4  1.215
       5 "IT0003132476" 1 "IT"  6  47903.93   .867
       6 "DE0005557508" 1 "DE" 61   93236.8  1.068
       7 "DE000ENAG999" 1 "DE" 35 24664.637  1.128
       8 "DE0005552004" 1 "DE" 53  43794.55  1.848
       9 "DE0008404005" 1 "DE" 65  81287.94  1.579
      10 "DE000BASF111" 1 "DE" 20  41541.91  1.015
      12 "FR0010208488" 1 "FR" 35 32603.596    .83
      14 "FR0010208488" 1 "FR" 35 32603.596    .83
      16 "FR0014003U94" 1 "FR" 45   354.671  1.683
      18 "FR0000073272" 1 "FR" 30  49953.41  4.597
      20 "FR0000035164" 1 "FR" 30    1135.5    1.8
      22 "IT0005037210" 1 "IT" 70  1076.322  2.677
      24 "DE0005810055" 1 "DE" 66     31027  3.424
      26 "DE000A0ETBQ4" 1 "DE" 66   541.202   .703
      28 "DE0005493365" 1 "DE" 66   639.922  2.346
      30 "DE000FTG1111" 1 "DE" 66   703.092  1.156
      32 "IT0003097257" 1 "IT" 28     347.5    1.3
      34 "GB00BYM8GJ06" 1 "GB" 73  1004.122   1.18
      36 "GB00BGDT3G23" 1 "GB" 73  4774.876 61.913
      38 "GB00B5NR1S72" 1 "GB" 74   511.243  1.654
      40 "GB00B61TVQ02" 1 "GB" 45  3474.502   1.96
      42 "IT0005452658" 0 "IT" 70     572.2    5.5
      44 "NL0015000N33" 0 "IT" 70  1003.061   .991
      46 "FR0010220475" 0 "FR" 30  9541.772  1.048
      48 "FR0000032278" 0 "FR" 30     196.5    1.7
      50 "FR0014007LQ2" 0 "FR" 45  44.93958  1.969
      52 "FR0013030152" 0 "FR" 35 264.84802  4.198
      54 "FR0012532810" 0 "FR" 35       643    5.2
      56 "DE0006095003" 0 "DE" 66  2997.577  3.133
      58 "DE000A161077" 0 "DE" 66 142.38539   .686
      60 "DE000A0B9N37" 0 "DE" 66  228.9467  6.193
      62 "DE000A2GSU42" 0 "DE" 66 239.75325    .76
      64 "DE0005408686" 0 "DE" 66     377.9      2
      66 "DE0008148206" 0 "DE" 66 103.45737   .395
      68 "IT0001237053" 0 "IT" 28 191.14803    .69
      70 "IT0005107492" 0 "IT" 28 223.59818  1.057
      72 "JE00B8KF9B49" 0 "GB" 73  9938.426  2.111
      74 "IM00BQ8NYV14" 0 "GB" 73 1316.8152  1.197
      76 "GB00B01F7T14" 0 "GB" 73 208.87483  4.834
      78 "GB00BDVZYZ77" 0 "GB" 74    2108.1     .4
      80 "GB00B19NLV48" 0 "GB" 74   27868.4    8.8
      82 "GB0009697037" 0 "GB" 74    1601.4    2.1
      84 "GB00BLGXWY71" 0 "GB" 74  618.8959  1.955
      86 "GB00BYQB9V88" 0 "GB" 45   786.445  1.162
      88 "GB00BD0SFR60" 0 "GB" 45     183.5    3.2
      90 "GB00BVYVFW23" 0 "GB" 45    5523.5    9.3
      end

      Comment


      • #4
        Thanks for the good example data.


        To start with, we match up every case (treated) with all controls that are exact matches on country and industry. Then we calculate the absolute difference in propensity score, and sort the data so that the first control for each case is the one with the closest propensity score. (If there are two or more controls tied for closest propensity score, the tie is broken by a random choice.)
        Code:
        //    VERIFY ID IS AN IDENTIFIER
        isid id
        
        //    CALCULATE PROPENSITY SCORE
        assert inlist(treatment, 0, 1)
        logistic treatment market_cap price_to_book
        predict pscore
        
        //    MATCHING
        local match_vars country industry
        ds treatment `match_vars', not
        local non_match_vars `r(varlist)'
        
        preserve
        keep if !treatment
        rename (`non_match_vars') =_ctrl
        drop treatment
        tempfile controls
        save `controls'
        
        restore
        keep if treatment
        rename (`non_match_vars') =_case
        drop treatment
        joinby `match_vars' using `controls'
        
        //    NOW SELECT CLOSEST PROPENSITY SCORE MATCH, BREAKING TIES AT RANDOM BUT REPRODUCIBLY
        gen delta = abs(pscore_case - pscore_ctrl)
        set seed 1234 // OR WHATEVER RANDOM NUMBER SEED YOU LIKE
        gen double shuffle = runiform()
        From this point, the code differs in the two scenarios you outlined in #1. The following shows the way to do your second scenario:
        Code:
        //    THIS WAY ALLOWS THE SAME CONTROL TO MATCH TO MULTIPLE TREATED CASES
        by id_case (delta shuffle), sort: keep if _n == 1

        And the following shows the way you would do it for your first scenario.
        Code:
        //    THIS WAY RESTRICTS CONTROLS TO MATCHING ONLY ONE CASE
        sort id_case delta shuffle
        local i = 1
        while `i' < _N {
            drop if id_case == id_case[`i'] & _n > `i'
            drop if id_ctrl == id_ctrl[`i'] & _n > `i'
            local ++i
        }
        The end result is a set of matched pairs that agree on country and industry and are, as closely as possible, matched on propensity score. (Any cases or controls that found no match will have been eliminated.)

        As between the two scenarios, allowing controls to match only one case seems to be very popular, I suppose for aesthetic reasons. But the statistical reality is that this restriction offers no advantages and has an important drawback. When you restrict each control to being only used once, it may be that some case that has multiple potential matches draws away the only good control for some other case. That other case may be either left unmatched altogether, or end up with a substantially inferior match. (This doesn't actually happen in your example data, but that doesn't mean it won't in your full data set.)

        Comment


        • #5
          Thank you very much for your detailed instructions. I have successfully matched the firms both ways following your code, and that is a big step completed before further analysis. Your assistance means a lot to me. I hope you have a great weekend.

          Comment


          • #6
            Originally posted by Clyde Schechter View Post
            Thanks for the good example data. [...]

            Dear Clyde,


            I appreciate your previous assistance with propensity score matching. Could I (again) kindly request your help with the code for entropy balancing using the same database? I have searched around and learned how to reweight the variables, but I am uncertain about the subsequent steps to obtain pairs of firms similar to those obtained through propensity score matching. Additionally, I am unsure how to require that the firms in each pair have the same country and industry before reweighting the two variables: market capitalisation and price-to-book ratio. Thank you very much, and I really hope to have your support on this matter.

            Comment


            • #7
              Sorry to disappoint, but I don't know anything about this technique. Hopefully somebody else following this thread who does will chime in. If nobody provides a helpful response within, say, 24 hours, I suggest you repost as a new thread.

              Comment


              • #8
                Originally posted by Clyde Schechter View Post
                Sorry to disappoint, but I don't know anything about this technique. Hopefully somebody else following this thread who does will chime in. If nobody provides a helpful response within, say, 24 hours, I suggest you repost as a new thread.
                Thanks a lot for getting back to me. I will wait to see if someone knows about it. Wish you a pleasant day.

                Comment


                • #9
                  Originally posted by Clyde Schechter View Post
                  Thanks for the good example data. [...]

                  Dear Clyde Schechter,

                  I would be grateful if you could take some time once more to help me with propensity score matching. I am working on another dataset that requires propensity score matching, which is a bit different from the previous one.

                  In my dataset, I have treated and control funds. The propensity score is computed based on fund_age, net_asset, and return. The treatment values take 1 for treated funds and 0 for control funds. My desired outcome is a table that displays the matched pairs. If possible, could you please help me with the codes to match: (1) one control fund for each treated fund; and (2) one control fund matched with multiple treated funds?

                  I have attached a part of my dataset below.

                   Also, I would like to add a small question. My variable net_asset is read as string type by Stata, which is strange because it is supposed to be numeric. I have tried to change the data format, but it did not work, so I have to encode it into numeric every time I use this database. Could you please share your experience with this issue?

                  Thank you very much for your time reading this post. Your assistance is very meaningful to me.

                  Code:
                   clear
                   input str44 name str10 fund_id byte fund_age str18 net_asset float return byte treatment
                  "Harris Associates Kokusai S/A USD"         "FS00008KNR" 10 "23,539,027.31"        6.84 1
                  "BL-Equities Japan B EUR Hedged"            "FS00008KQC"  8 "2,114,354.39"         5.33 1
                  "ODDO BHF Emerging ConDmd CIW EUR Acc"      "FS00008KRV" 10 "154,587,573.00"        .98 1
                  "SPDR® MSCI ACWI ETF"                      "FS00008KT6" 10 "2,166,734,385.00"      6.1 1
                  "SPDR® MSCI ACWI IMI UCITS ETF"            "FS00008KT7" 10 "283,345,104.80"        6.2 1
                  "SPDR® MSCI EM Asia ETF"                   "FS00008KT8" 10 "1,267,964,232.00"      .19 1
                  "SPDR® MSCI Emerging Markets ETF"          "FS00008KTB" 10 "457,344,740.70"       1.78 1
                  "SPDR® MSCI Emerging Markets SmallCap ETF" "FS00008KTC" 10 "135,820,252.10"       4.78 1
                  "Invesco Pan European Focus Eq A EUR Acc"   "FS00008L1N" 10 "7,556,821.00"         8.45 1
                  "Neuberger Berman US Sm Cap EUR A Acc"      "FS00008L2Q"  0 "729,826.12"           1.45 1
                  "UBS FS MSCI Emerg Mkts SF USD A acc ETF"   "FS00008LJO" 10 "481,120,764.60"       1.69 1
                  "Pictet-China Index I EUR"                  "FS00008LK2"  3 "93,395,833.00"       -3.26 1
                  "UBS FS S&P 500 SF USD A acc ETF"           "FS00008LMV" 10 "126,892,677.00"       7.77 1
                  "Vontobel Fd II mtxEmMktsSstbyChampNGEUR"   "FS00008LOW" -1 "158,589,108.70"       1.54 1
                  "Sextant Tech A"                            "FS00008LRJ" 10 "8,745,000.00"         5.37 1
                  "Templeton European Div A(acc)EUR"          "FS00008MD9" 10 "8,541,244.00"         3.25 1
                  "Mirabaud Eqs Swiss Sm & Mid I EUR Acc"     "FS00008MDY"  6 "187,499,218.50"       4.83 1
                  "iShares Europe Index (IE) D Acc EUR"       "FS00008MO0"  4 "4,497,979.00"         6.52 1
                  "Abeille Capital Planète"                  "FS00008MVC" 10 "9,429,000.00"       1.6996 1
                  "TA-ITA Azioni"                             "FS00008MVH" 13 "73,915,000.00"        7.65 1
                  "Pharus SICAV EOS A1 EUR Acc"               "FS00008N1A" 10 "12,480,828.00"         8.5 1
                  "Portfolio Wachstum ZKB Oe I T"             "FS00008N1D"  5 "19,544,953.00"        1.44 1
                  "Portfolio Wachstum (Euro) Alt ZKB Oe I T"  "FS00008N1E"  5 "31,397,515.00"        1.24 1
                  "LBPAM ISR Actions Emergents L"             "FS00008N2O" -1 "40,524,000.00"        1.18 1
                  "GAM Sustainable Emerg Eq EUR Acc"          "FS00008N8J" 10 "2,300,000.00"         1.71 1
                  "GAM Star Capital Apprec US Eq GBP Acc"     "FS00008NDX"  2 "76,544.80"            6.75 1
                  "William Blair EM Leaders D USD Acc"        "FS00008NEV" 10 "4,280,819.49"          -.7 1
                  "MFS Meridian Blnd Rsrch Eurp Eq A1 EUR"    "FS00008NHD" 10 "3,584,479.00"         7.64 1
                  "Mirae Asset ESG Asia Grt Cnsmr Eq A EUR"   "FS00008NMU"  8 "3,045,012.38"        -5.23 1
                  "ACATIS Global Value Total Return"          "FS00008NOQ" 10 "45,163,062.00"     5.49962 1
                  "Norron Active RC SEK"                      "FS00008NOT" 10 "16,404.07"            5.34 1
                  "Sands Capital Global Growth A EUR Acc"     "FS00008NQV"  6 "25,815,530.00"        -.13 1
                  "Artisan Global Value I EUR Acc"            "FS00008NS0"  5 "15,915,930.33"        9.13 1
                  "Handelsbanken USA Ind Crit A1 EUR"         "FS00008NSF"  6 "183,611,600.00"       6.74 1
                  "Handelsbanken Sverige 100 Ind Cri A1 SEK"  "FS00008NSG" 10 "670,488,902.30"       7.81 1
                  "HSBC FTSE EPRA NAREIT Dev ETF USD (Acc)"   "FS00008NTM" -1 "150,160,539.30"        6.3 1
                  "HSBC MSCI Russia Capped ETF"               "FS00008NTO" 10 "109,564,435.20"       8.61 1
                  "Global Diversification Fund FI"            "FS00008NXB" 10 "4,729,779.00"          -.4 1
                  "BGF Emerging Markets Eq Inc A2"            "FS00008O3B" 10 "33,485,106.45"        2.13 1
                  "Dutch Darlings Fund"                       "FS00008O49" 13 "18,733,856.00"        6.71 1
                  "TT Emerging Markets Equity C2 EUR Acc"     "FS00008OBH"  4 "3,921,421.30"         -1.4 1
                  "Nuveen Global Clean Infras Imp A EUR Acc"  "FS00008OFJ" 10 "35,835.00"            8.18 1
                  "Mirova Europe Sust Eq I/C EUR"             "FS00008OGO"  9 "8,344,810.00"         4.79 1
                  "HSBC MSCI Emerg Mkts ETF"                  "FS00008ORA"  9 "850,667,978.50"       1.77 1
                  "Invesco Global Equity Income A EUR Acc"    "FS00008P06" -2 "3,393,928.11"         5.45 1
                  "Invesco Dev Sm and MidCap Eq A EURHAcc"    "FS00008P07"  9 "3,985,001.00"         1.74 1
                  "Invesco US Value Equity E EUR Acc"         "FS00008P0O"  9 "55,353,881.18"        9.58 1
                  "Invesco Rspnb Jpn Eq Val Discv A EUR Acc"  "FS00008P0P"  0 "11,540.24"            5.07 1
                  "Invesco Japanese Eq Adv A Ann EURH Inc"    "FS00008P0Q"  2 "48,013,863.00"           5 1
                  "Mercer Low Volatility Eq A1 H 0.0200 EUR"  "FS00008P4X"  1 "1,786,815.87"         4.56 1
                  "Didner & Gerge Global"                     "FS00008R0F"  9 "596,043,057.30"       7.05 1
                  "Aktia Europe Small Cap K"                  "FS00008R0Q" -1 "4,995,783.00"      3.34855 1
                  "Arc Actions Rendement"                     "FS00008R14"  9 "16,173,000.00"        6.53 1
                  "Dorval Manageurs Europe I C"               "FS00008R5D" 10 "108,057,539.00"          6 1
                  "JPM Euroland Dynamic A perf (acc) EUR"     "FS00008R6R"  9 "47,900,591.00"        7.89 1
                  "Comgest Growth Europe S EUR S Acc"         "FS00008R7W" 10 "20,082,244.00"        6.04 1
                  "Robeco BP US Select Opports Eqs D €"     "FS00008R9A"  7 "174,206,852.70"          9 1
                  "JOHCM Asia ex-Japan A EUR Inc"             "FS00008TMC"  9 "3,017,773.00"         1.55 1
                  "JOHCM Asia ex-Japan Sm & Md-Cp A € I"    "FS00008TMD"  9 "592,092.90"          -1.27 1
                  "UBS(Lux)FS MSCI EMU SRI EUR Aacc"          "FS00008VBC"  3 "24,197,175.92"        4.55 1
                  "UBS(Lux)FS MSCI USA SRI EURH Adis"         "FS00008VBD"  5 "16,093,119.65"        4.35 1
                  "UBS(Lux)FS MSCI Pacific SRI USD Aacc"      "FS00008VBE"  1 "10,992,309.83"        3.62 1
                  "UBS(Lux)FS MSCI World SRI USD Aacc"        "FS00008VBF"  3 "611,970,067.70"       6.77 1
                  "Nordea 1 - Stable Emerg Mkts Eq AX EUR"    "FS00008VJB"  6 "22,580,521.00"        4.29 1
                  "HANSAsmart Select E A"                     "FS00008WGQ"  9 "97,659,000.00"        4.31 1
                  "CT (Lux) US Contr Core Equities AEC"       "FS00008XAW"  5 "5,460.00"              3.9 1
                  "Polar Capital North American I"            "FS00008XB6"  9 "450,854,207.60"       8.05 1
                  "Xtrackers MSCI Pakistan Swap ETF 1C"       "FS00008XHT"  9 "11,305,342.65"        2.88 1
                  "Xtrackers MSCI Singapore ETF 1C"           "FS00008XHU"  9 "35,202,941.97"        7.88 1
                  "Fidelity FAST Emerging Markets A-ACC-EUR"  "FS00008XZD"  6 "1,014,034.00"         1.44 1
                  "Espiria SDG Solutions A"                   "FS00008XZS"  6 "9,136,402.27"         5.19 1
                  "Enh Index Sust EQ Fund NL-T"               "FS00008YXT" 10 "371,810,637.00"       6.44 1
                  "iShares Gold Producers ETF USD Acc"        "FS00008ZAJ"  9 "1,504,549,455.00"      7.3 1
                  "iShares Oil & Gas Explr&Prod ETF USD Acc"  "FS00008ZAK"  9 "169,210,465.80"       8.57 1
                  "iShares Agribusiness ETF USD Acc"          "FS00008ZAL"  9 "175,759,799.50"       7.08 1
                  "Apus Capital Revalue Fonds I"              "FS00008ZCK"  4 "21,421,000.00"       -5.31 1
                  "Core Series - Core Emg Mkts Eq B EUR ND"   "FS00008ZGB"  1 "2,020,314.00"          .02 1
                  "SPDR® S&P US Dividend Aristocrats ETFDis" "FS00008ZI5"  9 "2,492,569,237.00"    10.64 1
                  "SPDR S&P EmMks Dividend Aristocrats ETF"   "FS00008ZI6"  9 "128,106,951.40"       4.12 1
                  "Tocqueville Value Euro ISR GP"             "FS00008ZID"  4 "51,027,960.00"        7.55 1
                  "Groupama Europe Actions Immobilier G"      "FS00008ZLW" 10 "8,288,000.00"         3.42 1
                  "Nykredit Invest Globale Fokusaktier KL"    "FS00008ZNF"  9 "201,330,699.20"       5.71 1
                  "Nykredit Invest Bæredygtige Aktier KL"    "FS00008ZNH"  9 "515,190,252.50"       6.22 1
                  "iShares MSCI ACWI ETF USD Acc"             "FS00008ZQ1"  9 "1,766,443,332.00"     6.05 1
                  "Amundi Russell 1000 Growth ETF Acc"        "FS00008ZQ6"  9 "162,997,909.70"       5.03 1
                  "Wealth Invest Strategi Aktier"             "FS0000900P" 10 "50,406,422.11"        4.74 1
                  "DB Advisors Emerg Mkts Eqs Passv ID EUR"   "FS0000901L"  9 "260,794,851.00"       -.48 1
                  "LO Funds Emerging High Convict SH EUR IA"  "FS0000902S"  4 "6,504,401.32"        -3.28 1
                  "Spiltan Aktiefond Investmentbolag"         "FS0000903F"  9 "2,378,392,105.00"     9.91 1
                  "Ofi Invest France Equity I"                "FS0000905L"  6 "8,077,000.00"          7.2 1
                  "Carnegie Fastighetsfond Norden A"          "FS000090DJ"  9 "195,156,301.40"       4.42 1
                  "Nordea 1 - Global Real Estate BC EUR"      "FS000090GG"  3 "822.77"               5.69 1
                  "AB Select US Equity A EUR"                 "FS0000913J"  9 "9,335,473.92"     6.632853 1
                  "Abacus Tech For Good I"                    "FS0000914T"  9 "8,165,000.00"      6.38704 1
                  "Robeco Instl Emerging Markets Fund"        "FS00009159" 27 "1,261,555,945.00"     2.54 1
                  "Robeco QI Instl EM Enhanced Index Eqs Fd"  "FS0000915A" 14 "1,820,868,366.00"     2.95 1
                  "Nuveen Global Dividend Growth A Eur Acc"   "FS0000917U"  9 "158,004.00"           8.13 1
                  "Amundi Fds EurEq Sust Inc A2 EUR C"        "FS000091CZ"  7 "1,371,179.89"         9.58 1
                  "Ninety One GSF Glb Value Eq A Acc EURH"    "FS000091F6"  1 "11,370.00"            3.57 1
                  "Espiria Global A"                          "FS000091GV"  8 "100,091,856.90"       6.59 1
                  end

                  Comment


                  • #10
                    Your example data contains only treatment cases, no controls. Please post back with a data example that includes both cases and controls.

                    Also please clarify what you are asking for help with. Is it the calculation of the propensity score? Or is it the formation of the matched pairs? If the latter, once you calculate the propensity score, it works almost the same way as the code you quote in #9. The only difference is that local match_vars will be empty, and where it says -joinby `match_vars' using `controls'- you have to replace that with -cross using `controls'-.
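
                    Concretely, the modification amounts to the following (a sketch; it assumes the rest of the earlier matching code is unchanged and the propensity score has already been computed):
                    Code:
                    * no exact-match variables here: pair every treated fund with every control
                    * (replaces -joinby `match_vars' using `controls'- in the earlier code)
                    cross using `controls'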

                    Regarding your variable net_asset, Stata holds it as a string because it contains commas. Stata will only accept as numeric a series of digits, optionally preceded by a sign, and optionally containing one decimal point, and optionally containing one exponent. Commas are not permitted. The simplest way to resolve this is:
                    Code:
                    destring net_asset, ignore(",") replace
                    Assuming that -destring- issues no error messages from that, then save this as a new data set, and use the new one from now on. (Archive the old one for reference.)

                    If -destring- issues an error message, then there is some additional problem with the data, one that is not evident in your example. You can find the observations causing the trouble easily enough by running -list net_asset if missing(real(net_asset)) & !missing(net_asset)-. Then you have to scan that output to see what is wrong with those values of net_asset and either fix them, if possible, or delete them if not.

                    Comment


                    • #11
                      Originally posted by Clyde Schechter View Post
                      Your example data contains only treatment cases, no controls. [...]

                      Thank you very much for your guidance. Yes, I was confused by the local match_vars and -joinby-, but I have solved it thanks to your explanation. I have also followed the -destring- code and it works.

                      Thank you for taking the time to reply to my question. I do appreciate your support. Hope you enjoy the rest of the day.

                      Comment


                      • #12
                        Dear Clyde,
                        Regarding your answer to the first question and following the same example, could you please guide me on how to decide between:
                        1. Estimating the propensity score on the pooled sample, then matching on the exact country and industry (as in the initial question).
                        2. Running the full matching procedure (both PS and matching) separately for each country and industry.
                        3. Adding dummies for industry and country in the PS calculation and then match on the pooled sample.
                        I understand that if a variable strongly influences participation, the second approach might be necessary. I am analysing the effect of finding a job through an employment agency on salaries. Since wages are influenced by the economic cycle (quarter of the year) and by gender, I'm unsure of the best way to incorporate these two variables into the matching process. Could you advise, please? Thank you!

                        Comment


                        • #13
                          This is a very good question, one I've never run across before. As I think about it, the three approaches differ only in the way that the propensity score is calculated. In 1, the propensity score is unaware of country and industry. In 2, we go to the opposite extreme: the propensity model itself is country#industry specific. In 3, we have an intermediate approach in which i.industry and i.country figure in the propensity score calculation--so there is some adjustment of PS for country and industry, but it is modestly done.

                          I don't grasp what you are saying in "I understand that if a variable strongly influences participation, the second approach might be necessary." And I can't find anything in my understanding that says that strength of influence on participation is important in this context. But perhaps I'm just missing something in this regard.

                          Here's my intuition on how to approach this trilemma. Propensity score matching's effectiveness depends on the estimated PS being a good approximation to the actual probability of being in the group exposed to the intervention/exposure. So I think it's a question of which PS model is the most valid model of the real data generating process. On the one hand, there is likely to be value in representing country and industry in the propensity score if these are indeed relevant to the probability of being in the exposed group. On the other hand, including them as indicator variables adds a large number of degrees of freedom to the model and risks overfitting the noise in the data. And allowing a tailored propensity model for each country#industry group of observations explodes that risk of overfitting to an extreme level.

                          So what I might do is first try fitting all three PS models: one without country and industry using the entire sample, one with indicators for industry and country, and then another where each country#industry group gets its own PS model (which can be done by including the interaction of industry#country with all of the other variables in the propensity model). Then look at AIC or BIC to get a sense of which of the models gives the best tradeoff between fitting and overfitting.
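As a hedged sketch of that comparison, with exposed as the treatment indicator and x1, x2 as placeholder covariates (substitute the actual propensity-model variables):
Code:
* placeholders: exposed, x1, x2
logit exposed c.x1 c.x2                            // model 1: PS unaware of country/industry
estimates store m1
logit exposed c.x1 c.x2 i.country i.industry       // model 3: country/industry indicators
estimates store m3
logit exposed i.country##i.industry##(c.x1 c.x2)   // model 2: PS tailored per country#industry
estimates store m2
estimates stats m1 m3 m2                           // side-by-side AIC/BIC table
The -estimates stats- command tabulates AIC and BIC for the stored models, which is one convenient way to weigh fit against overfitting here.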

                          Comment


                          • #14
                            Dear Clyde,

                            Thank you for your response.

                            Could you clarify the difference between the second approach, "with indicators for industry and country," and the third, "including the interaction of industry#country with other variables"? Or do you mean including -i.industry- and -i.country- separately for the second option?

                            In any case, I tried the third approach by adding the interaction of industry#country as a covariate in the PS, but I’m still seeing matches across different industries and countries. I guess I need to split the samples by industry#country before calculating the propensity score.

                            I was citing "Some Practical Guidance for the Implementation of Propensity Score Matching" by Caliendo & Kopeinig when referring to variables that strongly influence participation. They indicate that running the PS on specific sub-samples is "especially recommendable if one expects the effects to be heterogeneous between certain groups".

                            Thanks again for your help!

                            Comment


                            • #15
                              The third approach looks like this:
                              Code:
                              logit exposed_group relevant_variables i.country i.industry
                              The second approach looks like this:
                              Code:
                              logit exposed_group i.country##i.industry##(relevant_variables)
                              (Don't forget to use appropriate c. and i. prefixes for the relevant_variables.)
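For instance, with a continuous covariate age and a categorical covariate education (placeholder names, not from the thread), the second approach would read:
Code:
logit exposed_group i.country##i.industry##(c.age i.education)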

                              They indicate that running the PS on specific sub-samples is "especially recommendable if one expects the effects to be heterogeneous between certain groups"
                              Yes, that makes sense. That is different from "if a variable strongly influences participation."



                              Comment
