  • Matching Case-Control Code Needed

    Hello Everyone! I'm wondering if someone could help me with the code for matching a case-control population or point me in the direction of literature/existing code. We have 122 cases and 85 controls (recruitment out of a cancer clinic). We would like to assess two scenarios for matching to find the best fit for the dataset.

    Matching variables:
    -gender
    -cotinine (+/- 100 pg)
    -years_smoked (+/- 5 years) - (desired but likely won't be able to match with the third variable due to sample size).

    Scenarios:
    1. N 1:1
    2. N 1:3 with repeated cases in the pool. We would like to keep the ratio favoring controls because the number of matches will be low to keep the power in check.

    Any help would be much appreciated!
    Erin

  • #2
    There is a community-contributed program -calipmatch- from SSC that will at least get you the exact match on gender and one of the two range matches, from which you could then keep only those that also fall in range on the third variable. (Or maybe -calipmatch- can do multiple range matches. I'm not sure because I don't use it myself, and right now I can't check because the SSC website seems to be down as I write this.)

    So here's how to do it with just official Stata commands:

    Code:
    use dataset, clear
    preserve
    keep if case
    drop case
    tempfile cases
    save `cases'
    restore
    drop if case
    drop case
    ds gender, not
    rename (`r(varlist)') =_ctrl
    tempfile controls
    save `controls'
    
    use `cases'
    joinby gender using `controls', unmatched(master)
    keep if abs(cotinine - cotinine_ctrl) <= 100 ///
        & abs(years_smoked - years_smoked_ctrl) <= 5
    
    set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
    gen double shuffle = runiform()
    duplicates drop
    by id (shuffle), sort: keep if _n <= 3
    drop shuffle
    In addition to the matching variables mentioned, I assume the data set has an ID number of each patient, called id, and a variable case which is coded 1 for cases and 0 for controls.

    At the end of this code, the data in memory will have up to three observations for each of the original cases. Each observation has the case paired with a control who meets the three matching criteria. The variables describing the control will all have _ctrl suffixed to their names. For the matchup with just 1 control, just retain the first control matched to each case.
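As a minimal sketch of that last step, assuming toy data (the id values below are invented for illustration and are not from the thread's dataset), keeping one randomly chosen control per case amounts to changing the selection condition from _n <= 3 to _n == 1:

```stata
* Toy case-control pairs as they might look after the joinby/keep steps
* (all values are made up for illustration)
clear
input int id int id_ctrl
1 11
1 12
1 13
2 12
2 14
3 15
end

set seed 1234
gen double shuffle = runiform()
* Within each case, put the candidate pairs in random order and keep the first
by id (shuffle), sort: keep if _n == 1
drop shuffle
list, noobs
```

Because each case draws independently, the same control can still be selected for more than one case, so this remains matching with replacement.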

    Note: No example data was provided, so beware of typos or other errors as this code is untested. In the future, please provide example data when asking for help with code, and use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use -dataex-.



    • #3
      -calipmatch- does allow calipers on more than one variable at a time; however, I believe it only does matching without replacement. The user-written command -vmatch- also allows calipers on more than one variable (and the caliper widths can differ across variables) and does matching with replacement. Each can be found, along with installation instructions, by using the -search- command.



      • #4
        Thank you both for your responses. I greatly appreciate them! So far I've used the Stata commands to match, with success. I adjusted the variable names to match my dataset (I was using cleaner names). I want to try the -calipmatch- and -vmatch- commands next, to try matching with replacement. I am working with an MD's dataset for my MPH thesis, so this is a learning process for me. I may have more questions once I look into these further. Thank you again!

        Here is the adjusted code I used and my -dataex-.

        Code:
        use tobacco_biomarkers, clear
        preserve
        keep if case
        drop case
        tempfile cases
        save `cases'
        restore
        drop if case
        drop case
        ds gender, not
        rename (`r(varlist)') =_ctrl
        tempfile controls
        save `controls'

        use `cases'
        joinby gender using `controls', unmatched(master)
        keep if abs(total_urinary_cotinine - total_urinary_cotinine_ctrl) <= 100 & abs(smoke_duration - smoke_duration_ctrl) <= 5

        set seed 1234
        gen double shuffle = runiform()
        duplicates drop
        by record_id (shuffle), sort: keep if _n <= 3
        drop shuffle

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str10 record_id byte case int total_urinary_cotinine byte gender float smoke_duration
        "101" 1 3022 1 40
        "102" 1 1818 1 20
        "103" 1 3629 1 35
        "104" 1 3343 1 10
        "105" 1  802 1 60
        "106" 1 7954 1  0
        "107" 1    . 1 45
        "108" 1    . 1 60
        "109" 1 4780 1 40
        "110" 1 4610 1  0
        "111" 1 4227 2 22
        "112" 1 1367 1 35
        "113" 1 6540 1  0
        "114" 1 5047 1 24
        "115" 1 2672 1 44
        "116" 1 2170 1  5
        "117" 1   83 1 20
        "118" 1 4128 2 28
        "119" 1  889 2 50
        "120" 1 1412 2  0
        end



        • #5
          Hello again, I am wondering how you would rewrite the code to match only on cotinine (without first matching on gender). Is there an easy modification to the code you listed, or would it be best to use -calipmatch- or -vmatch-?

          Thank you - I greatly appreciate your help!



          • #6
            Replace -joinby gender using `controls', unmatched(master)- with -cross using `controls'-. The rest of the code would be unchanged.
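A minimal sketch of what that substitution does, on made-up data (the ids and cotinine values are hypothetical): -cross- forms every pairwise combination of cases and controls, and the caliper condition then keeps only the pairs within range.

```stata
* Hypothetical controls, already renamed with the _ctrl suffix
clear
input int id_ctrl int cotinine_ctrl
201  950
202 1500
203 1040
end
tempfile controls
save `controls'

* Hypothetical cases
clear
input int id int cotinine
101 1000
102 1450
end

* Every case is paired with every control: 2 x 3 = 6 observations
cross using `controls'

* Keep only the pairs within the +/- 100 cotinine caliper
keep if abs(cotinine - cotinine_ctrl) <= 100
list, noobs
```

Since the crossed data set has (number of cases) x (number of controls) observations before the -keep-, this is practical for a few hundred subjects but grows quickly with larger files.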



            • #7
              Thank you so much!



              • #8
                Hello again, I am wondering now how it would be to match by creating a new variable with a match id instead of relocating matched data into the same row. I ask because I would like to do the ranksum test, and it seems like I need to have one variable to test and one grouping variable (case/control). Any help is much appreciated!



                • #9
                  The -ranksum- test is not valid for paired data. The closest thing to that is the -signrank- test, which would use the data in wide layout as you currently have it.
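As a rough illustration with 1:1 matched data in wide layout, assuming a hypothetical outcome variable biomarker (with biomarker_ctrl holding the matched control's value; neither name nor the values come from the thread's dataset):

```stata
* Toy 1:1 matched pairs in wide layout (values are made up)
clear
input byte pair float(biomarker biomarker_ctrl)
1 12.1 10.4
2  9.8 11.0
3 14.3 12.9
4 11.7 11.2
5 13.5 10.8
6 10.2 10.9
end

* Wilcoxon matched-pairs signed-rank test of case vs. matched control
signrank biomarker = biomarker_ctrl
```

Each observation is one matched pair, so no separate grouping variable or match id is needed for this test.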





                  • #10
                    Hi, I know this is an old post but I have used the script provided in #2 several times before in different projects. Now my problem is the size of the dataset (millions of observations). I want to match cases with controls on the variables age and sex and then randomly choose 5 of the matching controls for each case.
                    The line
                    Code:
                    joinby gender using `controls', unmatched(master)
                    takes days to run, most likely because there are so many potential controls for each case (i.e., there are several matches based on age and sex for each case).

                    Is there a solution that randomly matches the controls without needing to join all available matches, joining just the first five controls? I hope you understand my question.

                    Best regards,

                    Jesper Eriksson



                    • #11
                      I'm actually surprised that the problem is that -joinby- is taking too long to run because the data set is so large. The usual problem in that circumstance is that the resulting intermediate data set would exceed available memory, so usually it just aborts with an "Op. sys. refuses to provide memory" message. I don't think I've ever seen your situation before.

                      Be that as it may, I don't know of any way to join only the first five matching controls. But, there is a way to do this faster than -joinby- will allow, by using, instead, the -rangejoin- command. It is written by Robert Picard and available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also from SSC.

                      Code:
                      use dataset, clear
                      preserve
                      keep if case
                      drop case
                      tempfile cases
                      save `cases'
                      restore
                      drop if case
                      drop case
                      ds gender, not
                      rename (`r(varlist)') =_ctrl
                      tempfile controls
                      save `controls'
                      
                      use `cases'
                      gen lb = cotinine - 100
                      gen ub = cotinine + 100
                      rangejoin cotinine_ctrl lb ub using `controls', by(gender)
                      keep if abs(years_smoked - years_smoked_ctrl) <= 5
                      
                      set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
                      gen double shuffle1 = runiform()
                      gen double shuffle2 = runiform()
                      duplicates drop
                      by id (shuffle*), sort: keep if _n <= 5
                      drop shuffle1 shuffle2
                      Notes: I'm assuming here that getting the match on cotinine is harder than getting the match on years smoked, so that applying -rangejoin- on the cotinine match will produce a smaller resulting data set. But if the match on years smoked is easier, you should reverse the roles of those variables in the above code. -rangejoin- is faster than -joinby- and also produces an intermediate data set that includes only allowable matches on the cotinine variable. Note that I have also modified the random selection, using two random variables. The reason is that in a data set with several million observations, a single double-precision random variable may have some repeated values, so that the sorting, and hence the selection of controls, would be indeterminate and irreproducible. The use of two double-precision random variables overcomes this potential difficulty. It is only necessary to do this when working with very large data sets (several million observations or more).
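One way to confirm that the sort order is fully determined is to check that the id together with the two random variables uniquely identifies the observations before selecting. The sketch below uses simulated data (100 hypothetical cases with 10 candidate controls each) and the official -isid- command, which exits with an error if the listed variables do not uniquely identify observations:

```stata
* Simulated candidate pairs: 100 cases x 10 candidate controls each
clear
set seed 1234
set obs 1000
gen long id = ceil(_n/10)
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()

* Stops with an error if any (id, shuffle1, shuffle2) combination repeats,
* i.e. if the random sort order would be ambiguous
isid id shuffle1 shuffle2

* Keep 5 randomly chosen candidate controls per case
by id (shuffle1 shuffle2), sort: keep if _n <= 5
drop shuffle1 shuffle2
```

If -isid- passes, rerunning the code with the same seed should select the same controls.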



                      • #12
                        Thank you (again!) Clyde. Works like a charm and testing using only small parts of my dataset seems to give a great time decrease. Thank you!



                        • #13
                          A final question: the code you provided in #11 results in matching with replacement. Is there a twist that would give matching without replacement?



                          • #14
                            With the caveat that this is untested due to absence of a data example to work with:
                            Code:
                            use dataset, clear
                            preserve
                            keep if case
                            drop case
                            tempfile cases
                            save `cases'
                            restore
                            drop if case
                            drop case
                            ds gender, not
                            rename (`r(varlist)') =_ctrl
                            tempfile controls
                            save `controls'
                            
                            use `cases'
                            gen lb = cotinine - 100
                            gen ub = cotinine + 100
                            rangejoin cotinine_ctrl lb ub using `controls', by(gender)
                            keep if abs(years_smoked - years_smoked_ctrl) <= 5
                            
                            set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
                            gen double shuffle1 = runiform()
                            gen double shuffle2 = runiform()
                            duplicates drop
                            sort id shuffle1 shuffle2
                            
                            local cc_ratio 5
                            
                            local i = 1
                            while `i' < _N {
                                quietly count if id == id[`i']
                                local npicks = min(`r(N)', `cc_ratio')
                                local next = `i' + `npicks'
                                // The -in- range is valid only while observations remain past this case's picks
                                forvalues ii = 0/`=`npicks'-1' {
                                    if `next' <= _N {
                                        drop if id_ctrl == id_ctrl[`i'+`ii'] in `next'/L
                                    }
                                }
                                if `next' <= _N {
                                    drop if id == id[`i'] in `next'/L
                                }
                                local i = `next'
                            }
                            A couple of remarks: this is going to be very slow in a large data set because it crawls through the data set observation by observation and the commands inside the loop have -if- conditions that must be evaluated on every observation in the entire subset of observations that meet the -in- condition. In addition to the computational drawbacks, the inability to reuse control observations typically results in some cases not getting their full complement of matches, or even getting none at all. The exclusion of those unmatchable cases may then lead to a biased analytic sample because it is going to be cases with less common values of the match variables that are selectively removed. If there were some major statistical advantage to sampling without replacement, that might warrant its use. But there isn't. It just seems to satisfy some people's esthetic preferences. So I advise against doing this.
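To see how many cases fall short of their full complement after matching without replacement, something along these lines could be run on the matched pairs. This is only a sketch on made-up data with a hypothetical ratio of 3; cases left with no controls at all are already absent from the matched file, so they would have to be counted against the original case file.

```stata
* Toy matched pairs after matching without replacement (values are made up)
clear
input int id int id_ctrl
1 11
1 12
2 13
3 14
3 15
3 16
end

local cc_ratio 3

* Number of controls each case ended up with
by id, sort: gen n_controls = _N
* Flag one observation per case so each case is counted once
by id: gen byte first = (_n == 1)

tabulate n_controls if first
* Cases with fewer than the desired number of controls
count if first & n_controls < `cc_ratio'
```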
