  • calipmatch: every time it runs, the number of matches and matched pairs changes

    Is there a seed option or something similar for this? I am not sure what I am doing wrong. I was under the assumption that it was an efficient process of finding the exact matches and then slowly increasing the caliper width up to the maximum.
    Code:
    calipmatch, generate(matchid) casevar(pt_type2 ) maxmatches(1) calipermatch(c_age_tracking ) caliperwidth(5) exactmatch(gndr c_race)

  • #2
    I am not really familiar with the -calipmatch- program, which is not an official Stata command. But reading its -help- file, two things are clear. First, there is no seed-setting option in -calipmatch- itself. Second, the process of matching is done by random selection without replacement. This implies that which observations get matched, and to how many controls, is going to vary from one run of the code to the next.

    It is likely, though I cannot guarantee it, that if you use the general -set seed- and -set sortseed- Stata commands you can get reproducible results. If you are not familiar with these commands, see their respective -help- files.
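
    A minimal sketch of that suggestion, assuming both seeds are set immediately before the matching call. The seed values are arbitrary placeholders, and the -calipmatch- options are copied from the original post:

    ```stata
    * Arbitrary seed values; any fixed integers will do
    set seed 1234
    set sortseed 56789
    calipmatch, generate(matchid) casevar(pt_type2) maxmatches(1) calipermatch(c_age_tracking) caliperwidth(5) exactmatch(gndr c_race)
    ```

    Whether this is enough depends on the data being in the same state, including sort order, each time these lines are reached.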

    • #3
      Originally posted by Clyde Schechter
      I am not really familiar with the -calipmatch- program, which is not an official Stata command. But reading its -help- file, two things are clear. First, there is no seed-setting option in -calipmatch- itself. Second, the process of matching is done by random selection without replacement. This implies that which observations get matched, and to how many controls, is going to vary from one run of the code to the next.

      It is likely, though I cannot guarantee it, that if you use the general -set seed- and -set sortseed- Stata commands you can get reproducible results. If you are not familiar with these commands, see their respective -help- files.
      This fixed it! Running set seed 1234 (or any other fixed value) ensures that I recreate the same thing every time. I think the authors of calipmatch should consider adding a seed option to their command.

      • #4
        Whoops, I misread the output! It turns out the problem persists in my dataset; it just takes running it a few times to see it. I get 138 matches, then 140 or 139. I wonder what the solution could be.

        • #5
          Again, I have no idea what is going on within -calipmatch-, but the general principle remains that irreproducible results in Stata typically arise from indeterminate sorts, which Stata randomizes.

          So if merely setting the random number generator and sort seeds has not resolved the problem, there are a few possibilities:
          1. If you have set those seeds just before invoking -calipmatch-, it is possible that your code is already producing different results on each re-run before it even gets to those -set- commands. This could arise if your code either explicitly or implicitly sorts the data on a sort key that does not uniquely identify observations: Stata randomizes the order within the sort key when you do this. So you need to check whether you are actually starting with the identical data set (including its sort order) each time immediately before the -set- commands. If the data are already scrambled, tweaking the seeds won't help you.
          2. If you set those seeds at a distance before invoking -calipmatch-, it is possible that between then and the invocation of -calipmatch- something in your code is scrambling the data. Again, the usual source is a command that sorts the data (explicitly or implicitly) on a sort key that does not uniquely identify observations. So, you need to check whether you are actually starting with the exact same data (including its sort order) just before you call -calipmatch- each time.
          3. The third possibility I see is that -calipmatch- itself is indeterminate (again, this would most likely arise from sorting on a non-identifying sort key). You could test for that by setting up a "standard" data set that you save to disk. Then write a simple loop that goes something like this:
          Code:
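          tempfile previous
          forvalues i = 1/10 {
              // reload the identical starting data and reset both seeds on every pass
              use standard_data_set, clear
              set seed 1234
              set sortseed 56789
              calipmatch ...   // CHOOSE SOME REALISTIC MATCHING PARAMETER VALUES FOR THE OPTIONS
              // from the second pass on, compare this result with the previous one;
              // -cf- exits with an error at the first discrepancy
              if `i' > 1 {
                  cf _all using `previous'
              }
              // keep this pass's result for the next comparison
              save `"`previous'"', replace
          }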
          This way you are "force feeding" -calipmatch- the same data set and the same seeds 10 times, and after all but the first, comparing the results with the previous results. If the code makes it all the way through without -cf- complaining about a discrepancy, then we can be reasonably confident that -calipmatch- itself is not the source of the problem.

          • #6
            Something I noticed: when I run a section of my do-file starting from
            Code:
            use "data", clear
            set seed 1896
            calipmatch.........
            it gives me a constant 140 no matter how many times I run it. But if I run it from the top of the do-file down to this calipmatch call, the number keeps changing.

            • #7
              So something between the top of your code and calipmatch is scrambling your data. Look for anything that sorts the data and then scrutinize whether the sort key uniquely identifies observations in the data set. Somewhere in there you will almost certainly find a sort key that does not--and that, or those, (there could be more than one) are your culprits.
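
              One way to run that check is sketched below. The variable names are hypothetical: site stands in for whatever key an earlier command sorts on, and patient_id for a variable that is unique per observation.

              ```stata
              * Hypothetical variables: site (the sort key under suspicion), patient_id (unique per observation)
              capture noisily isid site      // reports an error if site does not uniquely identify observations
              duplicates report site         // tabulates how many observations share a value of site

              * Make the sort deterministic by extending the key until it is unique
              sort site patient_id
              isid site patient_id           // stops with an error if the extended key is still not unique
              ```

              If no unique identifier exists, -sort site, stable- keeps tied observations in their existing order, which is reproducible only when the data arrive in the same order every time.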

              • #8
                Originally posted by Clyde Schechter
                So something between the top of your code and calipmatch is scrambling your data. Look for anything that sorts the data and then scrutinize whether the sort key uniquely identifies observations in the data set. Somewhere in there you will almost certainly find a sort key that does not--and that, or those, (there could be more than one) are your culprits.
                Awesome, okay! I will investigate and report back here if I find something. Thank you for your insights!

                • #9
                  The only solution to this is to run it once, freeze the control matches as a "final dataset", and use that thereafter. I don't think it is inherently reproducible, even with set seed.
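
                  A rough sketch of that workflow, assuming hypothetical file names (analysis_data.dta for the input, matched_final.dta for the frozen result) and the -calipmatch- options from the original post:

                  ```stata
                  * Hypothetical file names: analysis_data.dta (input), matched_final.dta (frozen output)
                  capture confirm file "matched_final.dta"
                  if _rc {
                      * First run only: do the matching and freeze the result on disk
                      use "analysis_data.dta", clear
                      calipmatch, generate(matchid) casevar(pt_type2) maxmatches(1) calipermatch(c_age_tracking) caliperwidth(5) exactmatch(gndr c_race)
                      save "matched_final.dta"
                  }
                  * Every later run starts from the frozen matches
                  use "matched_final.dta", clear
                  ```

                  Because -save- is called without -replace-, the frozen file cannot be overwritten by accident; delete it by hand to redo the matching.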
