  • Selecting Controls based on time from admission to event

    Hello,

    I have a dataset of unique observations that have been grouped according to whether the participant had an event (Event = Yes) or did not have an event (Event = No). I have far fewer events than controls, so I would like to keep all the cases (N = 5 in the dataset below). For each case, I would like to select only the 2 closest controls based on the time from admission to event. For example, the closest matches for case 1 (time = 17) would be controls 16 and 23 (time = 19 and 15). I tried to use Psmatch for this, but it drops my cases or requires that I match on other factors. Thank you in advance.


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte id int(Admitdate eventdate) str3 Event float time
     1 20089 20106 "Yes"  17
     2 20577 20665 "No"   88
     3 20226 20228 "Yes"   2
     4 20227 20229 "Yes"   2
     5 20972 20982 "No"   10
     6 20281 20315 "No"   34
     7 20820 20911 "No"   91
     8 20211 20240 "No"   29
     9 20226 20237 "No"   11
    10 20241 20254 "No"   13
    11 20270 20272 "Yes"   2
    12 20298 20344 "No"   46
    13 20089 20097 "No"    8
    14 20211 20342 "Yes" 131
    15 20226 20230 "No"    4
    16 20089 20108 "No"   19
    17 20577 20590 "No"   13
    18 20226 20233 "No"    7
    19 20227 20228 "No"    1
    20 20607 20633 "No"   26
    21 20089 20152 "No"   63
    22 20211 20278 "No"   67
    23 20592 20607 "No"   15
    24 20241 20350 "No"  109
    end
    format %tdnn/dd/CCYY Admitdate
    format %tdnn/dd/CCYY eventdate

  • #2
    Code:
    //  SEPARATE THE CANDIDATE CONTROLS INTO A SEPARATE FILE
    preserve
    keep if Event == "No"
    rename (*) control_=
    tempfile controls
    save `controls'
    
    //  BRING BACK THE ORIGINAL DATA AND KEEP ONLY THE CASES
    restore
    keep if Event == "Yes"
    
    //  CROSS THE CASES WITH THE CONTROLS
    cross using `controls'
    
    //  KEEP THE TWO NEAREST FOR EACH CASE
    gen delta = abs(time-control_time)
    by id (delta), sort: keep if _n <= 2
    The code above is a short, direct way to do this.
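
    One detail about the final -keep- step: when two candidate controls are equally distant from a case, Stata does not guarantee the order of tied observations after a sort, so which of them is kept may vary between runs. A minimal sketch of one way to make the choice reproducible, assuming control_id (created by the rename) uniquely identifies each control, is to add it as a secondary sort key. The three candidate rows below are a hypothetical subset chosen to produce a tie.

    ```stata
    * Hypothetical toy data: one case (time = 17) crossed with three candidate controls
    clear
    input byte id float time byte control_id float control_time
    1 17 16 19
    1 17 10 13
    1 17 17 13
    end

    gen delta = abs(time - control_time)

    * Controls 10 and 17 tie at delta = 4; sorting on control_id as well
    * breaks the tie the same way every time (lowest control_id wins)
    by id (delta control_id), sort: keep if _n <= 2
    list id control_id delta
    ```

    Any other variable that uniquely orders the tied controls would serve equally well as the tie-breaker.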

    Now, if your real data set is very large, this approach may fail due to memory limitations (and may also take a very long time). The reason is that if you have M cases and N controls, the -cross- command will create a data set containing M*N pairs of cases and candidate controls. So if your data set is large enough that this will push the limits of your hardware, the following approach is more memory-sparing, never expanding the data set beyond max(N, 2*M) observations.

    Code:
    // SEPARATE THE CANDIDATE CONTROLS INTO A SEPARATE FILE
    preserve
    keep if Event == "No"
    rename (*) control_=
    tempfile controls
    save `controls'
    
    // BRING BACK THE ORIGINAL DATA AND KEEP ONLY THE CASES
    restore
    keep if Event == "Yes"
    gen control_file = `"`controls'"'
    
    // PROGRAM TO MATCH ONE CASE
    capture program drop match_one
    program define match_one
    local filename = control_file[1]
    cross using `filename'
    gen delta = abs(time - control_time)
    sort delta
    keep in 1/2
    exit
    end
    
    runby match_one, by(id)
    drop control_file
    To use this code, you must install -runby-, written by Robert Picard and me, available from SSC.
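
    For reference, a community-contributed command hosted on SSC is normally installed as sketched below; the -which- line is only an optional check that Stata can find the command afterwards.

    ```stata
    * Install -runby- from SSC (needs an internet connection)
    ssc install runby

    * Optional: confirm that the command is now on the ado-path
    which runby
    ```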

    Finally, I will point out that there can be situations where the same two controls are the closest ones for different cases, and so get used as controls repeatedly. Depending on what you plan to do analytically, this usually isn't a problem, but it could be. If you need a different algorithm that prohibits reuse of the same controls, post back. Bear in mind that if you ban the reuse of controls, you will start to have cases matched to controls that are no longer the closest, and, if the problem arises often enough, they may in fact be very poor matches.
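
    To see how much reuse actually occurs, the matched pairs can be tallied by control. The sketch below assumes one row per case-control pair with the control's identifier in control_id, as either approach above leaves it. The pairs listed are an illustrative set worked out by hand from the example data in #1, not verified output; n_uses and first_use are names made up for this sketch.

    ```stata
    * Illustrative matched pairs (assumed, worked out by hand from the data in #1)
    clear
    input byte id byte control_id
     1 16
     1 23
     3 15
     3 19
     4 15
     4 19
    11 15
    11 19
    14  7
    14 24
    end

    * Number of cases each control is matched to
    bysort control_id: gen n_uses = _N

    * Flag one row per control so that each control is counted once
    egen byte first_use = tag(control_id)

    tabulate n_uses if first_use
    list control_id n_uses if first_use & n_uses > 1
    ```

    In this toy set, controls 15 and 19 each serve three cases, because three of the cases have the same time value.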


    • #3
      Thank you, Clyde, for the fast response and for the solution. I am having issues with this code because I had previously truncated the data, but it now seems I have to use the original dataset below. Both you and Nick helped me with the code for the previous issue. The dataset below is the original form of the data: it has duplicates of both the cases and the controls, which were classified as Event = Yes/No based on having two consecutive result values of >= 20, after which I dropped the duplicates.

      For the second part, matching the controls, can you show me a modified version of your second code block that handles duplicates of both the cases and the controls? (The first code block did not work because of the memory issue you mentioned; I have some 270,000 observations with duplicates.)

      I am okay with the same controls being used for more than one case; I do not need unique controls for every case.

      Thank you in advance.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte id int(Admitdate eventdate) byte Result str3 Event
      1 20089 20106 20 "Yes"
      1 20577 20665 20 "Yes"
      1 20226 20228 35 "Yes"
      1 20227 20229 28 "Yes"
      2 20972 20982 33 "Yes"
      2 20281 20315 32 "Yes"
      2 20820 20911 12 "Yes"
      2 20211 20240 11 "Yes"
      2 20226 20237 13 "Yes"
      3 20241 20254 12 "No" 
      3 20270 20272 12 "No" 
      3 20298 20344 12 "No" 
      3 20089 20097 10 "No" 
      3 20211 20342 25 "No" 
      4 20226 20230 28 "Yes"
      4 20089 20108 31 "Yes"
      4 20577 20590 12 "Yes"
      4 20226 20233 14 "Yes"
      4 20227 20228 27 "Yes"
      5 20607 20633 10 "No" 
      5 20089 20152 20 "No" 
      5 20211 20278 12 "No" 
      5 20592 20607 13 "No" 
      5 20241 20350 16 "No" 
      end
      format %tdnn/dd/CCYY Admitdate
      format %tdnn/dd/CCYY eventdate


      • #4
        I don't think I understand what the problem is. The example data shown in #3 differs from that in #1 in that there are now multiple observations per ID, and I don't know what you want to do about that. Also, at least in the example in #3, any ID that is ever marked as a case is always marked as a case, and likewise for the controls. But it isn't clear what kind of matching you now want to do. The various observations for any given case or control differ on the dates shown. Do you now want to match each observation for a given case with some controls, or do you want to somehow select just one observation for a case and find the two matches for that? If the latter, by what criterion do you decide which observation to choose? Or maybe it is something else altogether? I need a clearer explanation of what you want.


        • #5
          Thank you, Clyde. I was able to resolve this by keeping one observation per ID, since we were more interested in the first result date. I then dropped all duplicates and applied the code given above, and it all worked great. Thank you.
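
          The reduction described here might look something like the sketch below. It assumes that "first result date" means the earliest eventdate within each id, and that time is then the number of days from Admitdate to that eventdate; the post does not show the actual code, so both are assumptions. The rows are a subset of the example in #3.

          ```stata
          * Subset of the example data in #3, for illustration only
          clear
          input byte id int(Admitdate eventdate) byte Result str3 Event
          1 20089 20106 20 "Yes"
          1 20577 20665 20 "Yes"
          1 20226 20228 35 "Yes"
          3 20241 20254 12 "No"
          3 20089 20097 10 "No"
          end

          * Keep the observation with the earliest eventdate for each id
          * (assumption: this is what "first result date" refers to)
          bysort id (eventdate): keep if _n == 1

          * Confirm that id now uniquely identifies observations
          isid id

          * Days from admission to event, as in the data in #1
          gen time = eventdate - Admitdate
          list
          ```

          After this step the data have one row per id, the same shape as in #1, so either matching approach from #2 can be applied unchanged.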
