Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Matching sample

    I am doing a matched case control, I have my cases (27964) in one dataset and my possible controls (2900000) in another one.
    I would like to match the cases based on date (dnoutcome) and place of birth (id). both files have these information and an unique identifier. I would need 10 matching controls for each case.
    How can I find a matching sample? My controls sample have to be unique for each case.
    Could I do it using Stata.
    Thank you

  • #2
    Do you really have a place of birth variable named id, and then some other variable for a unique identifier? I'd lose my mind trying to keep that straight through a code file. So I'm going to use your dnoutcome variable, and call the place of birth variable birth_place. I'll call the unique identifier in the file of cases uid_case, and that in the file of possible controls uid_control. (They must have different names: if they currently don't, rename at least one of them.)

    Code:
    use cases, clear
    joinby dnoutcome birth_place using `controls', unmatched(master)
    At this point, your data in memory will have each case matched up with every potential control that has the same values for dnoutcome and birth_place. Now you need to weed it down to 10 per case. Often people will have other criteria for selecting the "best 10" matches, such as nearest possible in age, or same gender, or something like that. If you want to restrict further in this way, you can add gender to the variable list in the -joinby- command, and you can restrict on age by following the code above with -keep if abs(age_control - age_case) < my_age_range_threshold-. If there are no such criteria, and you just want to pick them at random, you can do:

    Code:
    set seed my_lucky_number_here
    gen double shuffle = runiform() 
    by case_id (shuffle), sort: keep if _n <= 10
    drop shuffle
    Notes:
    1. It is possible that some cases will not have 10 satisfactory matches, and even possible that some will not have any.
    2. This assignment is sampling with replacement, that is, different cases may have some (or all) the same controls matched to them if the cases agree on birth_place and dnoutcome. If you don't want to "re-use" controls in this way, it can be done but the code is much more complicated.
    3. If your data sets are very large (several millions of observations), a single double-precision random number may be insufficient to determine a unique sort order in the second block of code. In that case, also do -gen double shuffle2 = runiform()- and change -by case_id (shuffle), sort:...- to -by case_id (shuffle shuffle2), sort:...-

    Comment


    • #3
      Hi
      Thank you very much for your support
      I do not want to re-use my controls. Could you help me with the more complicated code?
      Please

      Comment


      • #4
        Try this:

        Code:
        use cases, clear
        joinby dnoutcome birth_place using `controls', unmatched(master)
        set seed my_lucky_number_here
        gen double shuffle = runiform() 
        by case_id (shuffle), sort: gen control_num = _n
        
        // LOOP OVER CASES
        levelsof case_id, local(cases)
        gen byte surviving = 1
        foreach c of local cases {
            // RETAIN FIRST 10 SURVIVING CONTROLS FOR THIS CASE
            replace surviving = 0 if case_id == `c' & control_num > 10
            
            // ELIMINATE THOSE CONTROLS FROM CONSIDERATION FOR LATER CASES
            levelsof control_id if case_id == `c' & surviving, local(to_kill)
            foreach tk of local to_kill {
                replace surviving = 0 if case_id != `c' & control_id == `tk'
            }
            drop if surviving == 0
            
            // RE-NUMBER REMAINING CONTROLS
            by case_id (shuffle), sort: replace control_num = _n
        }
        Note: I don't have a useful set of data available to test this on, so I may have left some typos in there. I hope not.

        As before, it is possible that there will be some cases with fewer than 10 controls available, or even with none. In fact, because you are not re-using controls, this is even more likely than before.

        Comment


        • #5
          I am trying the code, but it still running since yesterday.
          Thank you very much.
          I let you know as soon as finish.
          regards

          Comment


          • #6
            Here's an alternate solution that uses rangejoin (from SSC). To install it, type in Stata's Command window:

            Code:
            ssc install rangejoin
            I assume that cases and controls are to be matched only if they have the same value for dnoutcome and place_id. Also, once a control is picked, it cannot be picked again. If both cases and controls are randomly ordered within dnoutcome place_id groups, then the first case in the group can be matched to the first 10 controls in the group. The second case can be matched to controls #11 to 20, and so on.

            Here's a quick sketch of what it would look like. The following code runs in about 12 seconds on my computer:

            Code:
            set seed 1234321
            
            * generate cases with 5 possible outcomes from about 300 places
            clear
            set obs 27964
            gen case_id = _n
            gen dnoutcome =  runiformint(1,5)
            gen place_id = int(1 / runiform())
            save "cases.dta", replace
            
            * repeat for controls
            clear
            set obs 2900000
            gen control_id = _n
            gen dnoutcome =  runiformint(1,5)
            gen place_id = int(1 / runiform())
            
            * mix up the controls randomly within groups of dnoutcome place_id
            * and assign a pick order
            gen mix_it_up = runiform()
            bysort dnoutcome place_id (mix_it_up): gen pick_order = _n
            save "controls.dta", replace
            
            * mix up cases randomly within groups of dnoutcome place_id and
            * assign pick numbers using indexes (e.g. the first obs picks 1 to 10;
            * the second obs picks 11 to 20, etc)
            use "cases.dta", clear
            gen mix_it_up = runiform()
            bysort dnoutcome place_id (mix_it_up): gen low = (_n - 1) * 10 + 1
            gen high = low + 10 - 1
            
            * pair each case with the controls picked at random
            rangejoin pick_order low high using "controls.dta", by(dnoutcome place_id)
            
            sort case_id pick_order

            Comment


            • #7
              Wow, that is outstanding, Robert! A very clever application of -rangejoin-.

              Comment


              • #8
                I agree! Bravo!
                Last edited by Steve Samuels; 16 Jun 2016, 18:29.
                Steve Samuels
                Statistical Consulting
                [email protected]

                Stata 14.2

                Comment


                • #9
                  Thank you!!!! it worked! Thank you very much!

                  Comment


                  • #10
                    I was reading my solution in #6 and it dawned on me that the rangejoin part is not necessary. You can combine the cases and controls using a simple merge. All you need is to create a key merge variable in both datasets that follows the matching order, once the data has been jumbled in both dataset. In other words, within groups of dnoutcome place_id, the first random case is matched with the first 10 random controls, the second case with the next 20 controls, and so on.

                    Here's an example using the same fake data

                    Code:
                    set seed 1234321
                    
                    * generate cases with 5 possible outcomes from about 300 places
                    clear
                    set obs 27964
                    gen long case_id = _n
                    gen dnoutcome =  runiformint(1,5)
                    gen place_id = int(1 / runiform())
                    gen case_var1 = runiform()
                    save "cases.dta", replace
                    
                    * repeat for controls
                    clear
                    set obs 2900000
                    gen long control_id = _n
                    gen dnoutcome =  runiformint(1,5)
                    gen place_id = int(1 / runiform())
                    gen control_var1 = runiform()
                    save "controls.dta", replace
                    
                    * mix up the controls randomly within groups of dnoutcome place_id
                    * and assign a pick order and a -pick- var to use with -merge-
                    gen double mix_it_up = runiform()
                    bysort dnoutcome place_id (mix_it_up): gen pick_order = _n
                    gen long pick = int((pick_order - 1) / 10) + 1
                    save "controls2pick.dta", replace
                    
                    * mix up cases randomly within groups of dnoutcome place_id and
                    * assign a pick order
                    use "cases.dta", clear
                    gen double mix_it_up = runiform()
                    bysort dnoutcome place_id (mix_it_up): gen long pick = _n
                    merge 1:m dnoutcome place_id pick using "controls2pick.dta", keep(master match) nogen
                    sort case_id pick_order mix_it_up
                    
                    save "results_merge.dta", replace
                    This merge solution requires just a little over 3 seconds on my computer to jumble the order and produce the results. The rangejoin solution adds an extra 13 seconds because of the extra work needed to find observations within the specified range.

                    Both solution produce exactly the same results if the observations are jumbled the same way. I noticed that the example in #6 is not deterministic because of duplicate values in the mix_it_up variable. The fix is to create the mix_it_up variables as doubles. Here's the same example using rangejoin and a check that results match the results from the merge example above.

                    Code:
                    set seed 1234321
                    
                    * generate cases with 5 possible outcomes from about 300 places
                    clear
                    set obs 27964
                    gen long case_id = _n
                    gen dnoutcome =  runiformint(1,5)
                    gen place_id = int(1 / runiform())
                    gen case_var1 = runiform()
                    save "cases.dta", replace
                    
                    * repeat for controls
                    clear
                    set obs 2900000
                    gen long control_id = _n
                    gen dnoutcome =  runiformint(1,5)
                    gen place_id = int(1 / runiform())
                    gen control_var1 = runiform()
                    save "controls.dta", replace
                    
                    * mix up the controls randomly within groups of dnoutcome place_id
                    * and assign a pick order
                    gen double mix_it_up = runiform()
                    bysort dnoutcome place_id (mix_it_up): gen pick_order = _n
                    save "controls2pick.dta", replace
                    
                    * mix up cases randomly within groups of dnoutcome place_id and
                    * assign pick numbers using indexes (e.g. the first obs picks 1 to 10;
                    * the second obs picks 11 to 20, etc)
                    use "cases.dta", clear
                    gen double mix_it_up = runiform()
                    bysort dnoutcome place_id (mix_it_up): gen long low = (_n - 1) * 10 + 1
                    gen long high = low + 10 - 1
                    
                    * pair each case with the controls picked at random
                    rangejoin pick_order low high using "controls2pick.dta", by(dnoutcome place_id)
                    
                    sort case_id pick_order mix_it_up
                    save "results_rangestat.dta", replace
                    
                    cf dnoutcome place_id case_id case_var1 control_id control_var1 using "results_merge.dta", all

                    Comment


                    • #11
                      Hi
                      I just tested this last code in my file and I got 294 duplicates in the unique identifier for the controls.
                      I would like to know why.
                      And once more thank you for your help.
                      BW

                      Comment


                      • #12
                        I'm traveling so I don't have time to think much about this. It might have to do with precision. Make sure that all variables are double or long. I note that I forgot one case in the examples above. You should change:

                        Code:
                        bysort dnoutcome place_id (mix_it_up): gen pick_order = _n
                        to

                        Code:
                        bysort dnoutcome place_id (mix_it_up): gen long pick_order = _n

                        Comment


                        • #13
                          Hi, Not sure if this works for me. But I have a data set on a set of households with durable goods. I am trying to match each household (if they own a durable good) to to the sum of the (amount of durable goods they own). i.e creating an index. After doing this, I am restricting n to 1. This is the code I have used so far

                          bys case_id: egen index1=sum( hh_l03 )
                          . br case_id hh_l01 index1
                          . bys case_id: gen n=_n
                          by case_id, sort: keep if _n<=1

                          What is the purpose of the latter part of the command above. I will post it on here just to make it easier
                          mix up the controls randomly within groups of dnoutcome place_id * and assign a pick order and a -pick- var to use with -merge- gen double mix_it_up = runiform() bysort dnoutcome place_id (mix_it_up): gen pick_order = _n gen long pick = int((pick_order - 1) / 10) + 1 save "controls2pick.dta", replace * mix up cases randomly within groups of dnoutcome place_id and * assign a pick order use "cases.dta", clear gen double mix_it_up = runiform() bysort dnoutcome place_id (mix_it_up): gen long pick = _n merge 1:m dnoutcome place_id pick using "controls2pick.dta", keep(master match) nogen sort case_id pick_order mix_it_up

                          Comment


                          • #14
                            sorry this data looks rundown. very new to stata

                            Comment


                            • #15
                              Hi! Enny was helping me to match a case control analysis, and I used the codes mentioned above by Robert (using rangejoin), but unfortunately the code error "r(198) invalid syntax" appeared. Would you have any idea what might have happened?

                              I have my cases (8,676) in one dataset and my controls (916,405) in another one. I would like to match the cases based on birth date (birth_date) and municipality of residence (mun_code). Both files have these information and an unique identifier. I also will match 10 controls for each case and my controls sample have to be unique for each case.

                              Code:
                              use "controls.dta, clear"
                              gen double mix_it_up = runiform()
                              bysort birth_date mun_code (mix_it_up): gen pick_order = _n
                              save "controls_pick.dta"
                              
                              use "cases.dta, clear"
                              gen double mix_it_up = runiform()
                              bysort birth_date mun_code (mix_it_up): gen long low = (_n-1)*10+1
                              gen long high = low+10-1
                              
                              *kept using "cases.dta":
                              rangejoin pick_order low high using "controls_pick.dta", by(birth_date mun_code)
                                 (using rangestat version 1.1.1)
                              invalid syntax
                              r(198);

                              Comment

                              Working...
                              X