  • Guidance on procedures/commands for frequency matching

    Dear Statalisters,

    I have looked at the Stata Journal archives, the Statalist forum, and other sources and have not been able to find specific guidance on the above. I believe there have been similar questions (e.g. http://www.stata.com/statalist/archi.../msg00009.html), but I don't think a response providing the commands has been given. I must be missing something, as this sampling method is used quite often in the literature, including in studies that use Stata.

    Apologies for this basic question, but I would be grateful if anyone could share or point me to suitable Stata commands to perform frequency matching to sample controls or comparator groups (specifically in a cohort study, but I'd be grateful for advice on any study design).

    Thank you in advance for your time and kind consideration.

    Yours truly,

    Kareem

  • #2
    not sure what you mean by "frequency matching"; please define or give a citation



    • #3
      Thanks for the reply.

      In response:

      Some definitions could be that of

      Rothman, KJ, et al. Modern Epidemiology. Third Edition. Lippincott, Williams and Wilkins. 2008. pg 171

      "Matching refers to the selection of a reference series - unexposed subjects in a cohort study or controls in a case-control study - that is identical, or nearly so, to the index series with respect to the distribution of one or more potentially confounding factors".

      or

      Aschengrau, A, Seage III, GR. Essentials of Epidemiology in Public Health. Third Edition. Jones and Bartlett Publishers. 2014. pg 302.

      "..Frequency matching is a type of category matching that balances the proportion of people with a confounding factor in the compared groups. For example, consider a cohort study in which exposed subjects had the following age distribution: 20% were aged 40 to 49 years, 40% were aged 50-59 years, 20% were 60-69 years and 20% were 70 years and older. Frequency matching would ensure that 20% of the unexposed subjects were aged 40-49 years, 40% were aged 50-59 years, and so on. Once investigators filled a particular age category, they would select no more unexposed individuals in that category. Frequency matching is less exact than individual matching, which makes it easier to find matches".

      Uses of this sampling approach in the recent health literature include

      Schmidt, S., et al. Genes and Environmental Exposures in Veterans with Amyotrophic Lateral Sclerosis: The GENEVA Study Rationale, Study Design and Demographic Characteristics. Neuroepidemiology. 2008 May; 30(3): 191–204. (URL: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2645711/) :

      "The GENEVA study design aimed to frequency-match cases and controls on age, sex, race/ethnicity and use of the VA system for health care (for cases, prior to the date of their first ALS diagnosis)"

      or Karami, S., et al. Family history of cancer and renal cell cancer risk in Caucasians and African Americans. British Journal of Cancer (2010) 102, 1676–1680. doi:10.1038/sj.bjc.6605680

      "Controls selected from the general population were frequency matched to cases on age, race, sex, and study centre".

      Thank you.

      Kareem



      • #4
        Well, as the reference you cite points out, frequency matching is typically part of the design of data collection, not a technique for analyzing data that has already been collected.

        Nevertheless, you could emulate it in the following way. Let's suppose you have two data sets: one is the exposed sample, and the other is the unexposed. I'll assume they are in two separate data files, exposed.dta and unexposed.dta. On the assumption that the exposed population is smaller, you will probably use it as the reference standard and try to select a sample of unexposed people whose distribution on some important covariates matches it (I'll use age, suitably grouped, and sex as an illustration here, but they could be any discrete variables).

        Code:
        // CREATE REFERENCE JOINT DISTRIBUTION OF AGE AND SEX
        use exposed, clear
        contract age sex, freq(desired)
        
        // NOW BRING IN UNEXPOSED GROUP
        merge 1:m age sex using unexposed, keep(match master) nogenerate
        
        // NOW RANDOMLY SELECT desired NUMBER OF UNEXPOSED FROM EACH
        // AGE SEX COMBINATION
        set seed 1234 // OR ANY OTHER NATURAL NUMBER YOU LIKE
        gen double shuffle = runiform()
        by age sex (shuffle), sort: keep if _n <= desired
        by age sex: gen byte incomplete = (_N < desired)
        
        // CLEAN UP
        drop shuffle
        save frequency_matched_unexposed, replace
        Notes:

        1. If your data set has millions of observations, you should generate two random numbers, shuffle1 and shuffle2 for sorting the data, to assure a unique sort order. But if, as in most epidemiology studies, you are dealing with thousands or tens of thousands of observations, a single double precision random number will suffice.

        2. It is possible that for some combinations of age group and sex there will be too few matches (or even no matches) available in the unexposed data set. Those age-sex categories will show incomplete = 1 in the results above. You can then decide whether to accept that, or perhaps try again using coarser age matching or some other adjustment to the process.

        3. If you want to have, say, 3 unexposed for each exposed (instead of the 1:1 ratio produced by the above), just multiply desired by 3 before you do the random selection.

        4. Untested. Beware of typos.
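
For anyone who wants to see the steps above run end to end, here is a minimal sketch on made-up data. Everything about the toy data (sizes, the `id` variable, four age groups) is an assumption for illustration only. It applies note 3 (3 unexposed per exposed) and note 1 (two random numbers for the sort). Unlike the code above, it uses keep(match), so strata with no unexposed candidates are simply dropped rather than kept as a placeholder row.

```stata
clear all
set seed 2016

// Toy unexposed pool (hypothetical): 2,000 people, 4 age groups x 2 sexes
set obs 2000
gen long id = _n
gen byte age = ceil(4*runiform())
gen byte sex = runiform() < 0.5
tempfile unexposed
save `unexposed'

// Toy exposed sample (hypothetical): 100 people
clear
set obs 100
gen byte age = ceil(4*runiform())
gen byte sex = runiform() < 0.5

// Reference joint distribution, scaled to 3 unexposed per exposed
contract age sex, freq(desired)
replace desired = 3*desired

// Bring in the unexposed candidates for each age-sex stratum
merge 1:m age sex using `unexposed', keep(match) nogenerate

// Shuffle within stratum and keep the first desired candidates
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()
by age sex (shuffle1 shuffle2), sort: keep if _n <= desired
by age sex: gen byte incomplete = (_N < desired)

// Inspect the result
tabulate age sex
count if incomplete
```

With these sizes every stratum should have plenty of candidates, so the count of incomplete observations is expected to be zero; on real data that is the number to check first.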



        • #5
          Thank you Clyde for your reply.

          Yes, frequency matching is part of the data collection stage.

          I'll certainly use the proposed commands to start with. Thank you.

          Kareem



          • #6
            Dear Clyde,

            Am I correct to understand that the approach given will match at the individual level?

            e.g.
            if there are 100 male 30-35y.o. exposed CMs, (given a matching ratio of 1:1) 100 male, 30-35y.o. un-exposed CMs would be matched to the exposed CMs? i.e. the end result is to generate matching numbers of unexposed?

            Might it be possible to match distributions (percentages or fractions) instead?

            e.g.
            % of exposed male 30-35y: 20; % of un-exposed male 30-35y: 20 (randomly selected)

            I believe this approach is supported in the literature:
            e.g. Szklo, M. Epidemiology: Beyond the basics. Third edition. Jones and Bartlett Learning.2014. pg 34:

            ".. For example, if matching is to be done according to gender and age (classified in two age groups, < 45 years and ≥ 45 years), four strata would be defined: females younger than 45 years, females aged 45 years or older, males younger than 45 years, and males aged 45 years or older. After the proportion of cases in each of these four groups is obtained, the number of controls to be selected from each gender-age stratum is chosen so that it is proportional to the distribution in the case group."

            Szklo suggests using "... stratified random sampling with the desirable stratum specific sampling fractions".

            I think it might be possible to achieve this in part with the following amendment to the early part of your commands above:

            Code:
            contract age sex, percent(desired)
            I would be grateful for additional advice in adjusting the remainder of your commands such that matching distributions are obtained.

            Thank you again.

            Kareem
            Last edited by Abdul-Kareem Abdul-Rahman; 13 Aug 2016, 15:08.



            • #7
              As I said in note 3 of #4, the code as written produces a 1:1 exposed:unexposed result. If, however, you don't want the total numbers to be the same, you can modify the code by simply multiplying the variable desired by the appropriate proportion. So if you have 1000 exposed and you want to get 2000 unexposed, just add -replace desired = 2*desired- immediately after the -contract- command. If you only want 500 unexposed, -replace desired = 0.5*desired- will get you there. Well, almost: it may be that some of the age-sex strata in the exposed data set have an odd number of people. When you multiply that by 0.5 you will not get an integer, and so the code will, in effect, round down the number. But that should be just a minor deviation, and you can't create an exact match for the distribution no matter what you do in that situation.

              Your suggestion of using the -percent()- option in -contract- is not compatible with the approach I have given. Ultimately, a random sample of some integer number of people must be selected, and the mechanism I have set out there will land you with a total sample of 100 people (more or less--there will be truncation error just as with my example in the preceding paragraph) if you were to use the percent option in -contract-. Try it and you'll see. That's probably not what you want.

              The method I have outlined allows you complete freedom to set the total sample size and assures that the resulting sample will match the distribution of the reference group as closely as any sample of that size can.
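
To make the arithmetic concrete, here is a small sketch of one way to turn stratum percentages into integer counts for a chosen total. It is an illustration on made-up data, not part of the code in #4: the local macro `target` and the variables `n_exposed`, `pct`, and `total_desired` are names assumed here. It uses round() rather than truncation, so the allocated total can land slightly above or below the requested total.

```stata
clear all
set seed 2016

// Toy exposed sample (hypothetical): 1,000 people in 4 age groups x 2 sexes
set obs 1000
gen byte age = ceil(4*runiform())
gen byte sex = runiform() < 0.5

// Total number of unexposed wanted (an assumed value for illustration)
local target 500

// Stratum counts and percentages in the exposed sample
contract age sex, freq(n_exposed) percent(pct)

// Convert each percentage into an integer number to sample
gen long desired = round(`target' * pct / 100)

// Compare the allocated total with the requested total
egen long total_desired = total(desired)
list age sex n_exposed pct desired, sepby(age)
display "Requested: `target'   allocated: " total_desired[1]
```

The variable desired produced this way can be used in the shuffle-and-keep step exactly as before; it is the same as multiplying the frequencies by target divided by the number of exposed and rounding.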
              Last edited by Clyde Schechter; 13 Aug 2016, 15:11.



              • #8
                Dear colleagues,

                I have developed a working set of commands to perform the desired tasks based on the helpful suggestions above.

                However, I note that the
                Code:
                runiform()
                function, as used above (sorting on the random numbers and keeping the first observations in each stratum), selects our random sample without replacement.

                I'd now like to explore selecting a frequency-matched random sample as before, but this time with replacement.

                I have used the information on http://blog.stata.com/2012/08/29/usi...h-replacement/ on doing this, exemplified as follows:

                Code:
                    use cohorts/exp_contracted, clear // this is the exposed cohort
                
                    merge 1:m ageindexcat gender using cohorts/nonexp_cohort, keep(3) nogenerate // merge with the nonexp cohort
                   
                    set seed 1234 // the seed number
                    gen double shuffle1 = floor(_N*runiform()+1) // generate random number w replacement as per the guidance in the webpage cited above
                    gen double shuffle2 = floor(_N*runiform()+1)
                    
                    by ageindexcat gender (shuffle1), sort: keep if _n <= desired
                    
                    by ageindexcat gender: gen byte incomplete = (_N < desired) // combinations of age-groups and sex for which there are insufficiently many matches (or even no matches) available in the unexposed/comparator cohort
                However, with this method, although the 'shuffle' variables are provided with a random number with replacement, as per the example below:
                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input byte gender float ageindexcat long(desired patid) float cohort double(shuffle1 shuffle2) byte incomplete
                1 15 41460  1 1  25 374720 0
                1 15 41460  2 1  34 266503 0
                1 15 41460  3 1  72 397096 0
                1 15 41460  4 1  64 310321 0
                1 15 41460  5 1  74 304733 0
                1 15 41460  6 1  93  95311 0
                1 15 41460  7 1  34 266503 0
                1 15 41460  8 1 110 266333 0
                1 15 41460  9 1 113 178576 0
                1 15 41460 10 1 150 615659 0
                end
                label values cohort cohortlab
                label def cohortlab 1 "lifetime_abstainer", modify
                (where the shuffle variables 1 & 2 have repeated numbers assigned (ie 34 and 266503)), this same replacement is not performed on the patient, per se: that is, patient 1 is not repeated.

                What I would like is for the repeated sampling to be reflected in terms of the patients, such that the dataset might appear like this:
                Code:
                clear
                input byte gender float ageindexcat long(desired patid) float cohort double(shuffle1 shuffle2) byte incomplete
                1 15 41460 1 1  25 374720 0
                1 15 41460 2 1  34 266503 0
                1 15 41460 2 1  34 266503 0
                1 15 41460 3 1  72 397096 0
                1 15 41460 4 1  64 310321 0
                1 15 41460 5 1  74 304733 0
                1 15 41460 6 1  93  95311 0
                1 15 41460 7 1 110 266333 0
                1 15 41460 8 1 113 178576 0
                1 15 41460 9 1 150 615659 0
                end
                label values cohort cohortlab
                label def cohortlab 1 "lifetime_abstainer", modify
                Is there a way, with runiform() or with some other command or combination of commands, to assign a repeated shuffle number to a duplicate of the patient to whom that number was first assigned (patient 2 in the example above), and then to pull the set of shuffle numbers up to the location of the patient where the repeated shuffle number was last at (patient 7), and so on?
                I would be grateful for advice on how best to achieve this, or any other method to achieve a random sample of patients (and not only assignments of random numbers) with replacement.

                Thank you in advance for your continued support of this community.

                Yours truly,

                Kareem



                • #9
                  If you want the same values of shuffle1 and shuffle2 to be assigned to all observations for the same patid, then you can do that as follows:

                  Code:
                  local N = _N // total number of observations; under -by-, _N would be the size of each patid group
                  by patid, sort: gen double shuffle1 = floor(`N'*runiform()+1) if _n == 1
                  by patid: gen double shuffle2 = floor(`N'*runiform()+1) if _n == 1
                  by patid: replace shuffle1 = shuffle1[1]
                  by patid: replace shuffle2 = shuffle2[1]
                  "and then to pull the set of shuffle numbers up to the location of the patient where the repeated shuffle number was last at (patient 7), and so on"
                  I don't understand what this means.




                  • #10
                    I apologise if I have been unclear.

                    What I had meant is the reverse of your suggestion: i.e. I am thinking of how to generate duplicate/same observations for the same repeated sets of shuffle 1 and 2.

                    that is, instead of this, which happens currently:
                    Code:
                    clear
                    input byte gender float ageindexcat long(desired patid) float cohort double(shuffle1 shuffle2) byte incomplete
                    1 15 41460  2 1  34 266503 0
                    1 15 41460  7 1  34 266503 0
                    end
                    label values cohort cohortlab
                    label def cohortlab 1 "lifetime_abstainer", modify
                    we can get this:
                    Code:
                    clear
                    input byte gender float ageindexcat long(desired patid) float cohort double(shuffle1 shuffle2) byte incomplete
                    1 15 41460  2 1  34 266503 0
                    1 15 41460  2 1  34 266503 0
                    end
                    label values cohort cohortlab
                    label def cohortlab 1 "lifetime_abstainer", modify
                    Thank you.
                    Last edited by Abdul-Kareem Abdul-Rahman; 01 Dec 2016, 11:09.



                    • #11
                      Still not sure what you mean, but perhaps the following:

                      Code:
                      by shuffle1 shuffle2, sort: replace patid = patid[1]
                      The problem with this is that if you currently have multiple patid's for repeated values of shuffle1 and shuffle2, how do you know which value of patid you want? The above code does it arbitrarily, "randomly," and irreproducibly. If you had a rule, such as "use the smallest value of patid" then you could get that with:

                      Code:
                      by shuffle1 shuffle2 (patid), sort: replace patid = patid[1]



                      • #12
                        Can such replacement commands be applied 'during' a runiform() call (rather than after runiform() has been applied to the whole dataset)?

                        That is, when runiform() assigns a repeated set of shuffle1 and shuffle2 values to an observation, runiform() is 'paused', the observation is replaced with the observation that previously received the same set of shuffle1 and shuffle2 values, and once this is done, runiform() proceeds again?



                        • #13
                          There is no way I can think of to do what you are proposing here. -runiform()- generates a random number for every observation in the data set (unless constrained by -if- or -in- qualifiers). It cannot stop itself and examine whether the same values have come up before and then make adjustments to the data set.

                          But if it could do that, I don't see how the end result would be any different from the result of what I proposed in #11.
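
If the underlying goal is a stratified sample of patients drawn with replacement, one alternative is to draw random positions within each stratum and then look up the patient sitting at each position. The following is only a sketch on made-up data, not a tested solution to the question as posed: the variables `pos` and `n_avail`, the tempfile `pool`, and the fixed value desired = 30 are all assumptions for illustration (in practice desired would come from -contract- on the exposed cohort).

```stata
clear all
set seed 1234

// Toy unexposed pool (hypothetical): 200 patients in 2 x 2 strata
set obs 200
gen long patid = _n
gen byte ageindexcat = 1 + (runiform() < 0.5)
gen byte gender = runiform() < 0.5

// Number each candidate within its stratum and record the stratum size
by ageindexcat gender (patid), sort: gen long pos = _n
by ageindexcat gender: gen long n_avail = _N
tempfile pool
save `pool'

// Reduce to one row per stratum, carrying the stratum size along
contract ageindexcat gender n_avail
drop _freq
gen long desired = 30 // assumed; normally taken from the exposed cohort

// One row per draw, each pointing at a random position in its stratum
expand desired
gen long pos = floor(n_avail*runiform()) + 1

// Look up the patient at each drawn position;
// a patient whose position is drawn twice appears twice
merge m:1 ageindexcat gender pos using `pool', keep(match) nogenerate
duplicates report patid
```

Because each draw is made independently, the same patid can appear more than once in the result, which is what sampling with replacement means at the patient level.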



                          • #14
                            Dear Clyde,
                            I have a similar situation. If I want to do frequency matching on race, sex, age +/- 5 years, and diagnosis time +/- 5 years, which Stata commands should I use? I don't think the -contract- command can handle that.
                            Thanks for your help.
                            Best,



                            • #15
                              I'm not sure what you mean by "frequency matching": the match criteria you show are exact for some variables and within a caliper for others. I know of two user-written commands that might do what you want. -calipmatch- is for matching without replacement, while -vmatch- is for matching with replacement. Use -search- to find and download them:
                              Code:
                              search calipmatch
                              search vmatch

