Dear Stata users,
I have a loop for selecting control observations without replacement for my case observations in a case-control study. Unfortunately, it runs slowly because the dataset is quite large (around 5,000 cases and 8 million controls to select from). I am therefore looking to recreate the following loop in Mata, with any tweaks that may improve the execution time, as it has to be run for multiple case-control configurations. Currently, a single run is an overnight endeavor using Stata 18 on Windows Server 2012 R2.
For legal reasons, I cannot disclose the dataset. Instead, I will try my best to illustrate the loop with the auto dataset.
I am aware of the excellent calipmatch package, which addresses 95% of my issue. However, one of the matching criteria is that the control observations must have a date value that is larger than that of the case observation. To my understanding this is not possible with calipmatch, because it only allows exact matches or caliper widths extending above and below a certain score, not 'greater than' criteria. I tried to understand the calipmatch source code and modify it to my needs, but it is beyond my Mata skill level.
Part of the reason the current loop runs slowly is that it first identifies all possible matches for a case and only then keeps the first X of them. In this example, X is two. In my actual data, there will be 30 matches per case, and there will be two additional exact-matching criteria, i.e., having the same value on a categorical variable. In the following example, imagine that the variable 'price' corresponds to the date variable in the actual dataset. I have annotated the code to illustrate my intention.
Please let me know if I can help by elaborating anything further.
Code:
sysuse auto
generate byte case = strpos(make, "Buick") > 0 // Make a case group, in this case it is the Buick cars.
count if case == 0 // 67 control observations to select from.
count if case == 1 // 7 case observations to find controls for.
local ncases = r(N)
gen float tmp = runiform()
sort tmp // Setting the order of the observations at random.
gen tmp2 = . // Making a temporary variable that will be rewritten throughout the loop.
gen sto = . // Making a storage variable that will assign matched observations the same ID as each case.
bysort sto case: replace sto = _n if case == 1 // Adding IDs to the cases.
forvalues i = 1/`ncases' { // One iteration per case.
    levelsof price if sto == `i', local(A) // Saving the value of price in a local.
    replace tmp2 = `i' if price > `A' & missing(sto) // Identifying all observations with prices higher than the ith case.
    replace sto = `i' if price > `A' & missing(sto) & sum(tmp2 == `i') <= 2 // Saving the first two matches as controls.
    replace tmp2 = . // Resets the temporary storage.
}
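To make the intended speed-up concrete, here is a minimal Mata sketch of the same greedy matching that stops scanning as soon as a case has its controls, instead of flagging every eligible observation first. It is only an illustration on the auto data, not a tested solution for the real dataset. The function name greedy_match(), its arguments, the variable u, and the seed are all hypothetical names made up for this sketch. It also assumes that cases may be processed in their sorted order and that a missing value of the key variable counts as "larger", as it does in the Stata loop above.

```stata
sysuse auto, clear
set seed 12345                        // hypothetical seed, only for reproducibility
generate byte case = strpos(make, "Buick") > 0
generate double u = runiform()
sort u                                // random order, as in the loop above
generate long sto = .

mata:
// greedy_match() is a hypothetical helper, not part of any package.
// For each case it scans the controls in their current (random) order and
// takes the first k unused controls whose key exceeds the case's key.
void greedy_match(string scalar casevar, string scalar keyvar,
                  string scalar idvar, real scalar k)
{
    real colvector iscase, key, id, cases, ctrl
    real scalar    i, j, found

    iscase = st_data(., casevar)
    key    = st_data(., keyvar)
    id     = J(st_nobs(), 1, .)       // match ID, missing = not yet used

    cases = selectindex(iscase :== 1) // row numbers of the cases
    ctrl  = selectindex(iscase :== 0) // row numbers of the controls

    for (i = 1; i <= rows(cases); i++) {
        id[cases[i]] = i
        found = 0
        // Stop scanning as soon as k controls have been found.
        for (j = 1; j <= rows(ctrl) & found < k; j++) {
            if (id[ctrl[j]] >= . & key[ctrl[j]] > key[cases[i]]) {
                id[ctrl[j]] = i
                found++
            }
        }
    }
    st_store(., idvar, id)            // write the match IDs back to Stata
}

greedy_match("case", "price", "sto", 2)
end
```

With the real data, the call would presumably use the date variable and k = 30, and the two exact-matching criteria could be added as further equality checks inside the if condition. Removing used controls from ctrl from time to time might shorten later scans, but that is left out here to keep the sketch short.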
Best regards Soeren