Calipmatch: every time it runs, the number of matches and matched pairs changes

Sakshi Rajatbhai Tewari

Join Date: Apr 2022

Posts: 53
#1

29 Jan 2024, 14:16

Is there a seed option or something similar for this? I am not sure what I am doing wrong. I was under the assumption that it was an efficient process of finding the exact matches and then slowly increasing the caliper width up to the maximum.

Code:

calipmatch, generate(matchid) casevar(pt_type2) maxmatches(1) calipermatch(c_age_tracking) caliperwidth(5) exactmatch(gndr c_race)
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#2

29 Jan 2024, 15:05

I am not really familiar with the -calipmatch- program, which is not an official Stata command. But reading its -help- file, two things are clear. First, there is no seed-setting option in -calipmatch- itself. Second, the process of matching is done by random selection without replacement. This implies that which observations get matched, and to how many controls, is going to vary from one run of the code to the next.

It is likely, though I cannot guarantee it, that if you use the general -set seed- and -set sortseed- Stata commands you can get reproducible results. If you are not familiar with these commands, see their respective -help- files.
Sakshi Rajatbhai Tewari

Join Date: Apr 2022

Posts: 53
#3

30 Jan 2024, 06:57

This fixed it! Running -set seed 1234- (or any other seed) ensures that I recreate the same result every time. I think the authors of calipmatch should consider adding a seed option to their command.
Sakshi Rajatbhai Tewari

Join Date: Apr 2022

Posts: 53
#4

30 Jan 2024, 11:57

Whoops, I misread! It turns out the problem persists in my dataset; it only shows up after running it a few times. I see 138 matches, then 140 or 139. I wonder what the solution could be.
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#5

30 Jan 2024, 12:39

Again, having no idea what is going on within -calipmatch-, I can only say that the general principle remains: irreproducible results in Stata typically arise from indeterminate sorts, which Stata randomizes.

So if merely setting the random number generator and sort seeds has not resolved the problem, there are a few possibilities:
If you have set those seeds just before invoking -calipmatch-, it is possible that your code is already producing different results on each re-run before it even gets to those -set- commands. This could arise if your code either explicitly or implicitly sorts the data on a sort key that does not uniquely identify observations: Stata randomizes the order within the sort key when you do this. So you need to check whether you are actually starting with the identical data set (including its sort order) each time immediately before the -set- commands. If the data are already scrambled, tweaking the seeds won't help you.

If you set those seeds at a distance before invoking -calipmatch-, it is possible that between then and the invocation of -calipmatch- something in your code is scrambling the data. Again, the usual source is a command that sorts the data (explicitly or implicitly) on a sort key that does not uniquely identify observations. So, you need to check whether you are actually starting with the exact same data (including its sort order) just before you call -calipmatch- each time.

The third possibility I see is that -calipmatch- itself is indeterminate (again, this would most likely arise from sorting on a non-identifying sort key). You could test for that by setting up a "standard" data set that you save to disk. Then write a simple loop that goes something like this:

Code:

tempfile previous
forvalues i = 1/10 {
    use standard_data_set, clear
    set seed 1234
    set sortseed 56789
    calipmatch ... // CHOOSE SOME REALISTIC MATCHING PARAMETER VALUES FOR THE OPTIONS
    if `i' > 1 {
        cf _all using `previous'
    }
    save `"`previous'"', replace
}

This way you are "force feeding" -calipmatch- the same data set and the same seeds 10 times, and after all but the first, comparing the results with the previous results. If the code makes it all the way through without -cf- complaining about a discrepancy, then we can be reasonably confident that -calipmatch- itself is not the source of the problem.
Sakshi Rajatbhai Tewari

Join Date: Apr 2022

Posts: 53
#6

30 Jan 2024, 13:02

Something I noticed: when I run a section of my do-file starting from

Code:

use "data", clear
set seed 1896
calipmatch.........

it gives me a constant 140 no matter how many times I run it. But if I run the do-file from the top through this calipmatch call, the number keeps changing?
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#7

30 Jan 2024, 13:29

So something between the top of your code and calipmatch is scrambling your data. Look for anything that sorts the data and then scrutinize whether the sort key uniquely identifies observations in the data set. Somewhere in there you will almost certainly find a sort key that does not--and that, or those, (there could be more than one) are your culprits.
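
As a rough sketch of that check, the built-in -isid- and -duplicates report- commands can tell you whether a sort key uniquely identifies observations. The example below uses the auto.dta data set shipped with Stata purely as a stand-in; the variables foreign and make are placeholders for whatever keys your own do-file sorts on, not anything taken from your data.

```stata
* Sketch only: auto.dta and its variables stand in for your own data and sort keys.
sysuse auto, clear

* -foreign- takes only two values, so it cannot uniquely identify observations.
* -isid- exits with an error when the key is not unique; -capture- traps it.
capture isid foreign
display "isid return code for foreign: " _rc

* Show how many observations share each value of the key.
duplicates report foreign

* Adding -make- (unique in auto.dta) gives a key that identifies every observation,
* so the resulting sort order is the same on every run.
isid foreign make
sort foreign make
```

A nonzero return code from -isid- marks a key that leaves ties, and any -sort- on that key alone is a candidate culprit. Where no identifying key is available, -sort ..., stable- keeps tied observations in their existing order, which helps only if that existing order is itself reproducible.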
Sakshi Rajatbhai Tewari

Join Date: Apr 2022

Posts: 53
#8

31 Jan 2024, 12:11

Awesome, okay! I will investigate and report back here if I find something. Thank you for your insights!
Sakshi Rajatbhai Tewari

Join Date: Apr 2022

Posts: 53
#9

05 Feb 2024, 13:01

The only solution to this is to run it once, freeze the control matches as a "final dataset", and use that thereafter. I don't think it is inherently reproducible, even with set seed.