
  • #16
    Greetings,

    In the above code for rangejoin, if I want to do 1:4 matching (1 case and 4 controls), what do I have to do? I used the following code to create 1:1 matching. If I change the highlighted line (the one beginning by id (shuffle)) to by id (shuffle), sort: keep if _n < 5, will that give me 1:4 matching? Thanks.

    preserve
    keep if Group == 0
    tempfile controls
    save `controls'
    restore
    keep if Group ==1
    rangejoin age -10 10 using `controls', by(Sex)
    set seed 1234
    gen double shuffle = runiform()
    by id (shuffle), sort: keep if _n == 1
    drop shuffle



    • #17
      Yes, it will. But you might want to do a bit better. That will simply select a random four out of all of the controls that agree with the case on Sex and come within 10 years on age. But since I gather you are trying to match as tightly as possible, you might want to pick the four closest age matches rather than just four at random. For that you could do:

      Code:
      gen delta_age = abs(age-U_age)
      by id (delta_age), sort: keep if _n < 5
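
      A minimal sketch of the nearest-four selection on made-up data, for illustration only. The -input- block is a hypothetical stand-in for the paired data that -rangejoin- leaves in memory; the names id, age and U_age follow the code above, and the shuffle tie-breaker is an added assumption so that controls equally distant in age are chosen at random rather than arbitrarily.

      ```stata
      // Hypothetical stand-in for the case-control pairs left in memory after -rangejoin-
      clear
      input id age U_age
      1 50 44
      1 50 49
      1 50 51
      1 50 53
      1 50 58
      1 50 60
      2 35 30
      2 35 36
      end

      set seed 1234
      gen double shuffle = runiform()          // random tie-breaker (an addition, not in the code above)
      gen delta_age = abs(age - U_age)

      // Keep the (up to) four controls closest in age to each case
      by id (delta_age shuffle), sort: keep if _n < 5

      // Record how many controls each case actually ended up with
      by id: gen n_controls = _N
      drop shuffle
      list, sepby(id)
      ```

      Cases with fewer than four eligible controls simply keep however many they have, so n_controls is worth inspecting before analysis.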



      • #18
        Thank you.

        Also, is there a way to do either 1:1 or 1:4 matching such that controls are selected without replacement based on age range and sex? That is, I want each patient to be counted only once, whether as a case or as a control.



        • #19
          Yes, it is possible. A downside is that you may end up with more cases that go unmatched altogether.

          Another downside is that the code to do this is somewhat complicated. Since there are very few analyses in which matching without replacement is statistically preferable to matching with replacement, I don't want to take the trouble to write it out unless you are 150% sure you really need it. It's a bigger deal in terms of code complexity, and also execution time. If you are being pressed by an advisor to do matching without replacement, another possibility to consider is matching without replacement on sex and age groups (not age range). The code for that is actually quite simple and runs very quickly compared to age-range matching (with or without replacement). Though again, you may find you have more unmatched cases in the end.

          From the perspective of controlling confounder bias in your analysis, you are probably best off with what you already have: matching on sex and nearest age with replacement.

          Added: Looking back at your earlier posts, it seems you have a very large separation between the cases and controls in the age distribution. This implies that, no matter what you do, many cases are going to fail to find any good match. Even if you got a perfect match for everybody that could be matched, the number of cases and controls who are simply excluded by the matching could be as big a problem for the interpretability of your study findings as the confounding bias due to the age difference. Unless age is a very sensitive predictor of your outcome variable (which, if it's blood loss during some procedures, would surprise me), I think you should err on the side of getting more matches even if they are not so close. That would, in turn, argue in favor of matching with replacement.
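
          A minimal sketch of how the age overlap between the groups might be inspected before committing to a matching scheme. The data are simulated and purely hypothetical: Group is assumed to be coded 1 = case and 0 = control as in #16, and the means and spreads are invented for illustration.

          ```stata
          // Simulated data with poorly overlapping age distributions (hypothetical values)
          clear
          set seed 1234
          set obs 300
          gen byte Group = _n <= 100                                   // 1 = case, 0 = control
          gen double age = cond(Group == 1, rnormal(65, 8), rnormal(45, 10))

          // Compare the age distributions of the two groups side by side
          tabstat age, by(Group) statistics(n min p25 p50 p75 max)

          // Count cases that cannot have ANY control within 10 years,
          // because they lie more than 10 years outside the controls' age range
          summarize age if Group == 0, meanonly
          local lo = r(min) - 10
          local hi = r(max) + 10
          count if Group == 1 & !inrange(age, `lo', `hi')
          ```

          The final count is only a lower bound on the number of unmatched cases, since gaps inside the controls' age range are not detected by this check.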
          Last edited by Clyde Schechter; 22 Sep 2016, 15:45.



          • #20
            Thank you so much for that explanation. I agree on matching with replacement but, as you said, I do have a PI bugging me to do matching without replacement. Just for my knowledge and for future reference, what would the code look like if I used sex and age groups instead and tried to do matching without replacement?

            Once again I am very grateful for your insight.



            • #21
              For 1:1 age-group sex matching without replacement:
              Code:
              //    READ IN DATA FILE OF COMBINED CASES & CONTROLS
              use combined_cases_and_controls, clear
              set seed 1234 // OR YOUR FAVORITE SEED
              
              //    GENERATE AGE GROUPS (MODIFY LIMITS AS APPROPRIATE TO DATA)
              gen byte age_group = 1 if inrange(age, 25, 39)
              replace age_group = 2 if inrange(age, 40, 49)
              replace age_group = 3 if inrange(age, 50, 54)
              replace age_group = 4 if inrange(age, 55, 59)
              replace age_group = 5 if inrange(age, 60, 64)
              replace age_group = 6 if inrange(age, 65, 69)
              replace age_group = 7 if inrange(age, 70, .)
              
              gen double shuffle = runiform() // TO RANDOMIZE MATCH SELECTIONS
              
              //    FORM A FILE OF CONTROLS ONLY
              preserve
              keep if Group == 2 // CONTROLS; ADJUST TO YOUR OWN CODING (THE CODE IN #16 HAS CONTROLS AS Group == 0)
              //    ASSIGN A PRIORITY FOR MATCHING WITHIN EACH AGE_GROUP SEX COMBINATION
              by age_group Sex (shuffle), sort: gen int priority = _n
              drop shuffle
              //    RENAME VARIABLES TO AVOID CLASH
              rename * control_*
              foreach x in age_group Sex priority {
                  rename control_`x' `x'
              }
              tempfile controls
              save `controls'
              
              //    NOW MAKE A FILE OF CASES
              restore
              keep if Group == 1
              //    AGAIN PRIORITIZE FOR MATCHING
              by age_group Sex (shuffle), sort: gen int priority = _n
              drop shuffle
              //    MERGE WITH CONTROLS
              merge 1:1 age_group Sex priority using `controls', keep(master match)
              Note: you may need to use different limits to define your age groups so that you get decent numbers of matches in these categories. You need to look at the distributions of ages in both groups. For the range of ages that shows the most overlap between the groups, you can use narrower age bands, and for those ages with little overlap, use wide ones. This will give you a decent tradeoff between closeness of matching and getting matches at all.
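
              One way to judge candidate age bands is to count cases and controls in every age_group by Sex cell. The sketch below does that on simulated, purely hypothetical data; Group is assumed to be coded 1 = case and 2 = control as in the code above, and is_case, is_control and shortfall are names introduced here for illustration.

              ```stata
              // Simulated data (hypothetical values) with the same variable names as above
              clear
              set seed 1234
              set obs 400
              gen byte Group = cond(_n <= 100, 1, 2)                       // 1 = case, 2 = control
              gen byte Sex = runiform() < 0.5
              gen double age = cond(Group == 1, rnormal(62, 8), rnormal(48, 10))

              gen byte age_group = 1 if inrange(age, 25, 39)
              replace age_group = 2 if inrange(age, 40, 49)
              replace age_group = 3 if inrange(age, 50, 54)
              replace age_group = 4 if inrange(age, 55, 59)
              replace age_group = 5 if inrange(age, 60, 64)
              replace age_group = 6 if inrange(age, 65, 69)
              replace age_group = 7 if inrange(age, 70, .)

              // Count cases and controls in each matching cell
              gen byte is_case = Group == 1
              gen byte is_control = Group == 2
              preserve
              collapse (sum) n_cases = is_case n_controls = is_control, by(age_group Sex)

              // Cases that would be left without a control in 1:1 matching without replacement
              gen shortfall = max(n_cases - n_controls, 0)
              list, sepby(age_group)
              restore
              ```

              Cells with a large shortfall are candidates for widening or merging with a neighboring band. For 1:4 matching the comparison would be against 4*n_cases instead.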

              The above code is not tested: it may contain typos, punctuation errors, etc.

              Now, if you want 1:4 matching on age-groups and sex without replacement, it's only a bit more complicated. The difference is that in the controls, instead of assigning a unique priority for matching to each observation, you do that in batches of four. And the final merge becomes 1:m instead of 1:1.
              Code:
              //    READ IN DATA FILE OF COMBINED CASES & CONTROLS
              use combined_cases_and_controls, clear
              set seed 1234 // OR YOUR FAVORITE SEED
              
              //    GENERATE AGE GROUPS (MODIFY LIMITS AS APPROPRIATE TO DATA)
              gen byte age_group = 1 if inrange(age, 25, 39)
              replace age_group = 2 if inrange(age, 40, 49)
              replace age_group = 3 if inrange(age, 50, 54)
              replace age_group = 4 if inrange(age, 55, 59)
              replace age_group = 5 if inrange(age, 60, 64)
              replace age_group = 6 if inrange(age, 65, 69)
              replace age_group = 7 if inrange(age, 70, .)
              
              gen double shuffle = runiform() // TO RANDOMIZE MATCH SELECTIONS
              
              //    FORM A FILE OF CONTROLS ONLY
              preserve
              keep if Group == 2 // CONTROLS; ADJUST TO YOUR OWN CODING (THE CODE IN #16 HAS CONTROLS AS Group == 0)
              //    ASSIGN A PRIORITY FOR MATCHING WITHIN EACH AGE_GROUP SEX COMBINATION
              //    IN BATCHES OF (UP TO) FOUR
              by age_group Sex (shuffle), sort: gen int priority = floor((_n-1)/4) + 1
              drop shuffle
              //    RENAME VARIABLES TO AVOID CLASH
              rename * control_*
              foreach x in age_group Sex priority {
                  rename control_`x' `x'
              }
              tempfile controls
              save `controls'
              
              //    NOW MAKE A FILE OF CASES
              restore
              keep if Group == 1
              //    AGAIN PRIORITIZE FOR MATCHING
              by age_group Sex (shuffle), sort: gen int priority = _n
              drop shuffle
              //    MERGE WITH CONTROLS
              merge 1:m age_group Sex priority using `controls', keep(master match)
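
              To see how the batching expression works, here is a small self-contained illustration on ten dummy observations (not real controls). It shows only the arithmetic of floor((_n-1)/4) + 1.

              ```stata
              // Ten dummy observations standing in for the shuffled controls of one age_group/Sex cell
              clear
              set obs 10
              gen int priority = floor((_n - 1)/4) + 1

              // Observations 1-4 get priority 1, 5-8 get priority 2, and 9-10 get priority 3
              list, sepby(priority)
              ```

              The case with priority k in a cell is merged with every control carrying priority k in that cell, so the last batch in a cell may supply fewer than four controls.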




              • #22
                Thank you.



                • #23
                  Hi Clyde,
                  I'm reading through the code you posted in this thread on creating controls and comparing it to a response on the old Statalist, which I had used in the past for some analysis. I have reproduced the code below for your reference. The original post can be found here: http://www.stata.com/statalist/archi.../msg00326.html

                  Code:
                  clear
                  // mock up control data
                  set seed 846
                  set obs 500  // don't know how many controls you have
                  gen byte case = 0
                  gen byte age = 20 +ceil(65*runiform())  // broad age range assumed
                  tempfile controls
                  sort age
                  save `controls'
                  clear
                  // mock up cases
                  set obs 63
                  gen byte case = 1
                  gen byte age = 20 +ceil(65*runiform())
                  //
                  // The real stuff starts here; you have an existing control file you can append to your cases.
                  append using `controls'
                  gen rand = runiform()
                  sort age case rand
                  by age: egen ncases = sum(case)
                  keep if (ncases >=1) // age groups with no cases are irrelevant
                  //
                  // The following keeps the first 2 controls  for each case within each age group
                  by age: keep if (case ==1) | ((_n <= 2*ncases) & (case == 0))
                  tab2 age case
                  by age: egen ncontrols = sum(case == 0)
                  count if (ncontrols < 2*ncases)
                  I was wondering if you could comment on the differences between the method you posted earlier and the code I copied above as far as matching is concerned. Or are they just two different ways of getting to the same thing? Thanks.





                  • #24
                    At the detailed level, the code you show matches only on age, whereas the code I wrote matches on age group and sex. But, from a broader perspective, either approach will give a matching on the specified variables, randomly selecting from the controls without replacement. In my code (the 1:4 version), each case appears in several observations, each of them linked to one matched control. In the code in #23, the linkage of a case to its matched controls is implicit in the order of the observations, but is not explicit in the data; if the data sort order is changed, the linkage will be lost.
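
                    If an explicit link is wanted with the approach in #23, one possibility is to number the cases and the retained controls within each age and derive a matched-set identifier from those numbers. The following is only a sketch built on the mock data from #23; seq, match_set and set_id are names introduced here, and -egen, total()- is used in place of the older sum().

                    ```stata
                    // Mock data as in #23
                    clear
                    set seed 846
                    set obs 500
                    gen byte case = 0
                    gen byte age = 20 + ceil(65*runiform())
                    tempfile controls
                    save `controls'
                    clear
                    set obs 63
                    gen byte case = 1
                    gen byte age = 20 + ceil(65*runiform())
                    append using `controls'

                    // Selection of up to 2 controls per case within each age, as in #23
                    gen double rand = runiform()
                    sort age case rand
                    by age: egen ncases = total(case)
                    keep if ncases >= 1
                    by age: keep if case == 1 | (_n <= 2*ncases & case == 0)

                    // Number cases and controls separately within each age
                    by age case (rand), sort: gen seq = _n

                    // Case k forms a set with controls 2k-1 and 2k of the same age
                    gen match_set = cond(case == 1, seq, ceil(seq/2))
                    egen set_id = group(age match_set)

                    sort set_id case
                    list set_id age case in 1/9, sepby(set_id)
                    ```

                    Because set_id is stored in the data, the link between a case and its controls survives any later re-sorting. Sets at ages with too few controls will contain fewer than two controls, or none.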



                    • #25
                      Thanks, Clyde



                      • #26
                        For 1:1 matches, you may wish to take a look at the program - ccmatch - , written by Daniel Cook.
                        Best regards,

                        Marcos



                        • #27
                          Hi Clyde and Priyanka,

                          I am a novice Stata user (I have used SAS for years), and have been using the code provided in this feed to link controls to my cases by age and sex. Here is my code:

                          preserve
                          keep if diagnosis==0
                          tempfile controls
                          save controls

                          restore
                          keep if diagnosis==3

                          rangejoin age -0.5 0.5 using controls, by(sex)

                          set seed 217
                          gen double shuffle = runiform()
                          by id_num (shuffle), sort: keep if _n == 1
                          drop shuffle

                          My issue right now is that I can't see whether the matching has worked. If I look at the "Data Editor" I am only seeing my cases. Does this mean it didn't work? Does this code produce a new dataset?

                          Thank you!
                          Last edited by Shannon Lange; 06 Sep 2017, 17:14.



                          • #28
                            Well, for one thing, your tempfile isn't being referred to properly. See corrections in code below. But I don't think that's the issue, because what you have shown should still work, the difference being that your working directory will have a file called controls.dta that you didn't intend to create.

                            Code:
                            preserve
                            keep if diagnosis==0
                            tempfile controls
                            save `controls'
                            
                            restore
                            keep if diagnosis==3
                            
                            rangejoin age -0.5 0.5 using `controls', by(sex)
                            
                            set seed 217
                            gen double shuffle = runiform()
                            by id_num (shuffle), sort: keep if _n == 1
                            drop shuffle
                            The code should leave you with the data in memory consisting of the original cases, and then with each case for which a match could be found there will be additional variables that are just like the variables in the cases, but prefixed with U_. These are the values of the variables for the matched control.

                            If you are not seeing that, then I think there may be a problem with your data. Are you sure that your starting data set has both cases and controls in it? Try -count-ing them at the beginning, or just -tab diagnosis-, and make sure that there really are 0's and 3's. Another possibility is that in your data a 0.5 year radius for age is too stringent and there just aren't any controls to be found that fulfill that criterion. (This would particularly be the case if your age variable is an integer.)
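
                            These checks could look roughly like the following. The data are a tiny made-up example using the variable names from #27; in practice the checks would be run on the full dataset before the -preserve-.

                            ```stata
                            // Tiny hypothetical dataset using the variable names from #27
                            clear
                            input id_num diagnosis sex age
                            1 3 1 10
                            2 3 2 12
                            3 0 1 10
                            4 0 1 11
                            5 0 2 13
                            end

                            // Are both cases (3) and controls (0) actually present?
                            tabulate diagnosis, missing

                            // Is age recorded in whole years? A count of 0 means every age is an integer
                            count if age != floor(age) & !missing(age)
                            ```

                            If age turns out to be integer-valued, a window of -0.5 to 0.5 can only pair a case with controls of exactly the same recorded age.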

                            If this advice doesn't help you solve the problem, then I think you should post back, using the -dataex- command (see FAQ #12 if you are not familiar with it) to show example data. Be sure your example includes some controls that you think should actually match to some of the cases in the example.



                            • #29
                              That worked! Thank you so much for your quick reply Clyde.



                              • #30
                                Another question...

                                Can you add another variable to this code that you also wish to match on? For instance, IQ scores within a range of 10. Can you add it to the line of code below as I have (which isn't quite working), or do you need an additional line of code?

                                rangejoin age -0.5 0.5 IQ -10 10 using `controls', by(sex)
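
                                One possible approach, sketched under stated assumptions: -rangejoin- takes a single key variable with its low and high bounds, so a second range cannot be added on the same line. The sketch below swaps in a different technique, Stata's built-in -joinby-, to form all case-control pairs within sex and then filters on both ranges. The data are made up, and the U_ prefix is applied here by -rename- to mimic the naming described in #28.

                                ```stata
                                // Tiny hypothetical dataset using the variable names from #27 plus IQ
                                clear
                                input id_num diagnosis sex age IQ
                                1 3 1 10.2 95
                                2 3 2 12.0 110
                                3 0 1 10.4 101
                                4 0 1 10.5 120
                                5 0 2 11.8 104
                                6 0 2 14.0 108
                                end

                                // Controls: prefix every variable with U_, except the join key sex
                                preserve
                                keep if diagnosis == 0
                                rename * U_*
                                rename U_sex sex
                                tempfile controls
                                save `controls'
                                restore
                                keep if diagnosis == 3

                                // All case-control pairs of the same sex, then apply both range criteria
                                joinby sex using `controls'
                                keep if abs(age - U_age) <= 0.5 & abs(IQ - U_IQ) <= 10

                                // Pick one eligible control at random per case
                                set seed 217
                                gen double shuffle = runiform()
                                by id_num (shuffle), sort: keep if _n == 1
                                drop shuffle
                                list id_num age IQ U_id_num U_age U_IQ
                                ```

                                Cases with no control satisfying both criteria drop out of the result. Forming every same-sex pair can use a lot of memory in large datasets, so an alternative along the same lines is to keep the -rangejoin- on age and add only the IQ filter afterwards, using whatever name the control's IQ variable receives.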

