Dear all,
I would like to ask for your advice on creating a dataset of counterfactuals based on several criteria.
Some background information on the data structure: The initial dataset contains investors (variable: investor_id) who invested in certain companies (variable: company_id) in different industries (variable: industry) at a specific point of time (variable: date). An investment into a specific company at a specific time is marked with an investment_id. Each line in the dataset represents a unique combination of one investor with a specific company. However, there are always multiple co-investors investing together in a specific company (several investors having the same investment_id indicates that they are co-investing together).
I would like to create a set of counterfactual co-investors for each actual co-investment (i.e. other investors who could have co-invested but did not) based on the following criteria:
(1) both co-investors do not work at the same firm (variable: investor_firm)
(2) the counterfactual co-investor must have invested in the same industry within 30 days (variable: date) of the actual co-investment
(3) the counterfactual co-investors must not have co-invested with the other investor previously
The structure of the final dataset would then look like this (example for investment_id 190 from above):
How would you recommend to go about this?
Conceptually, I was thinking to create a matrix of all possible investor - co-investor combinations (both actual and counterfactual) and generate a dummy that indicates if it is an actual co-investment (i.e. same investment_id) (however, this step would be computationally intensive since the overall dataset is quite large). Next I was thinking to generate a dummy variable for each of the 3 conditions above, indicating if the condition is met. Only if all 3 conditions are met, then the observations qualifies as a counterfactual.
I would very much appreciate your ideas and help with implementing the Stata code. Many thanks in advance!
I would like to ask for your advice on creating a dataset of counterfactuals based on several criteria.
Some background information on the data structure: The initial dataset contains investors (variable: investor_id) who invested in certain companies (variable: company_id) in different industries (variable: industry) at a specific point of time (variable: date). An investment into a specific company at a specific time is marked with an investment_id. Each line in the dataset represents a unique combination of one investor with a specific company. However, there are always multiple co-investors investing together in a specific company (several investors having the same investment_id indicates that they are co-investing together).
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input int investment_id str2 company int(investor_id date) str18 industry str2 investor_firm 190 "AA" 317 15414 "Tech" "II" 190 "AA" 479 15414 "Tech" "II" 268 "BB" 169 19330 "Consumer Goods" "JJ" 268 "BB" 106 19330 "Consumer Goods" "KK" 533 "CC" 837 18067 "Consumer Goods" "LL" 533 "CC" 2400 18067 "Consumer Goods" "MM" 542 "DD" 1460 15400 "Tech" "NN" 542 "DD" 279 15400 "Tech" "OO" 747 "EE" 2020 19299 "Tech" "PP" 747 "EE" 45 19299 "Tech" "QQ" 889 "FF" 318 14728 "Financial Services" "II" 889 "FF" 479 14728 "Financial Services" "II" 889 "FF" 317 14728 "Financial Services" "II" 927 "GG" 2026 20114 "Healthcare" "XX" 927 "GG" 4129 20114 "Healthcare" "XX" 1107 "HH" 72 17439 "Healthcare" "II" 1107 "HH" 418 17439 "Healthcare" "NN" end format %tdnn/dd/CCYY date
(1) both co-investors do not work at the same firm (variable: investor_firm)
(2) the counterfactual co-investor must have invested in the same industry within 30 days (variable: date) of the actual co-investment
(3) the counterfactual co-investors must not have co-invested with the other investor previously
The structure of the final dataset would then look like this (example for investment_id 190 from above):
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input int investment_id str2 company int(investor_id date) str4 industry str2 investor_firm int coinvestor_id byte actual_coinvestor 190 "AA" 317 15414 "Tech" "II" 479 1 190 "AA" 317 15414 "Tech" "NN" 1460 0 190 "AA" 317 15414 "Tech" "OO" 279 0 end format %tdnn/dd/CCYY date
How would you recommend to go about this?
Conceptually, I was thinking to create a matrix of all possible investor - co-investor combinations (both actual and counterfactual) and generate a dummy that indicates if it is an actual co-investment (i.e. same investment_id) (however, this step would be computationally intensive since the overall dataset is quite large). Next I was thinking to generate a dummy variable for each of the 3 conditions above, indicating if the condition is met. Only if all 3 conditions are met, then the observations qualifies as a counterfactual.
I would very much appreciate your ideas and help with implementing the Stata code. Many thanks in advance!
Comment