Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Keep groups of data when the 1st ob of the group meets criteria

    Dear all,
    Been learning so much from this forum and finally have my own inquire to share.
    My objective is to select (use -keep- or -drop- or whatever works) a subset of data that: 1) find==1 and there is no duplicates according to id (like observation no.5 with id==3) ; 2)find ==1 and there are duplicates according to id, then I want to keep that "group" (like the obs with id==1, they have find==0 | find==1, but as long as one id==1 have id==1 I need to keep all the obs with id==1. Same goes for obs with id==4). I want to drop id==2| id==5 | id==6 |id==7|id==8 because obs with these ids are either single and with find ==0 or have dups but none in the id group have find==1.
    So I hope there is something that can help me keep ob no.1-3, 5-7 and dump the others. I know this may sound unnecessarily complicated but hopefully there is a way. Thanks!

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(find id)
    0 1
    1 1
    1 1
    0 2
    1 3
    1 4
    0 4
    0 4
    0 5
    0 6
    0 7
    0 8
    0 8
    end
    Best,
    Ginny

  • #2
    I find your explanation very confusing. I have seized on

    I want to drop id==2| id==5 | id==6 |id==7|id==8 because obs with these ids are either single and with find ==0 or have dups but none in the id group have find==1.
    and written code that will do that. I really can't figure out whether this is what you want or not.

    Code:
    assert inlist(find, 0, 1)
    by id, sort: drop if _N == 1 & find == 0 // SINGLETONS WITH find == 0
    by id: egen has_find_1 = max(find) // IDENTIFY ID's WHERE SOME find == 1
    by id: drop if _N > 1 & !has_find_1
    It does have the effect of dropping id's 2, 5, 6, 7, and 8, and only those.

    Note: Assumes that find is always 0 or 1, and verifies this assumption in the first line.

    If this is not what you intend, please post back with a different explanation.

    Comment


    • #3
      Hi Mr Schechter,
      Thanks very much for your reply! That's an elegant solution and it did help with my problem. I am sorry for the confusing description but you interpreted it perfectly. You are also correct to assume that find is binary.
      I realized what I wrote in the yesterday's post did not correspond to my title since I didn't say anything regarding "the 1st observation in the group". Well, I put together another data example and hopefully I can give you the full picture of my issue.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(find id time stay)
      1  1  1 1
      1  2  2 1
      1  2  3 1
      1  3  4 1
      0  3  5 1
      0  4  6 0
      1  4  7 1
      0  5  8 0
      0  5  9 0
      1  6 10 1
      1  6 11 1
      1  6 12 1
      1  7 13 1
      0  7 14 1
      0  7 15 0
      1  8 16 1
      0  8 17 1
      1  8 18 1
      0  9 19 0
      0  9 20 0
      0  9 21 0
      0 10 22 0
      1 10 23 1
      0 10 24 1
      0 11 25 0
      1 11 26 1
      1 11 27 1
      1 12 28 1
      1 12 29 1
      0 12 30 1
      0 13 31 0
      0 13 32 0
      1 13 33 1
      0 14 34 0
      end
      Code:
      * stay=1 means we should keep the observation. I created this variable to test if the codes are doing their job.
      tab stay
      * This part is written by Mr Schechter. These lines keep the singletons with find==1 and the groups (_N>1) with find==1 observation(s).
      by id, sort: drop if _N == 1 & find == 0
      by id: egen has_find_1 = max(find)  // This is a great line. I never know one can generate variables with -max()- like this.
      by id: drop if _N > 1 & !has_find_1
      
      *I edited the following lines. The basic idea is that singletons with find=1 stays all the time. But for groups things are different. For detailed explanation please see after the code.
      by id: gen first = find - find[_n-1]
      by id: egen has_find_0 = min (find)
      sort id time
      gen first1=.
      by id: replace first1= 1 if find== 1 
      by id: replace first1= 1 if has_find_0 == 0 & find[1]==1 & _N>1
      replace first1 =1 if has_find_0 == 0 & first==-1
      replace first1 =. if has_find_0 == 0 & first==0 & find== 0
      drop if first1 ==.
      tab stay
      The short version of my intention is to find the codes that can differentiate the observations like the variable stay does.
      The long version is here: Imagine we are calculating readmission rate. Patients are identified through ids. If an id appears more than once in a data set with 3-month of inpatient data, the patient with that id got re-admitted in the 3 month (some people can get hospitalized more than 2, 3, 4 times). "Find=0/1" is an indicator of whether the patient got into the hospital due to a problem of interest (find=1) for example heart failure, or other issues (find=0). Now I am calculating the readmission rate for heart failure. That means I need to find the patient admitted for heart failure and the ones followed with a re-admission because of whatever reason (which include heart failure, with find=1). I started with find all the patients got hospitalized due to heart failure and all the patients who got into the hospital for more than once in a database with all the admissions in the 3 months. Notice that patient that got into the hospital for more than 1 times may have nothing to do with heart failure. So yesterday I asked for codes that "keep the singletons with find==1 and the groups (_N>1) with find==1 observation(s)". In that way I located all the patients got admitted once for heart failure and patient with repeated admissions but at least one admission concerns with heart failure. Yesterday's codes have done that.
      But now the new problem is to find the re-admissions that followed an admission for heart failure. In order to show the sense of order, I introduced a new variable called time. It's actually the time when patient got into the hospital. The patient might get hospitalized for many times but I try to show a maximum of 3 hospitalizations. So if patient got in twice, they could got in as id 2-4 showed. Patient with id 3 got in due to heart failure the first time and then got in for something else the second time. We would then need 2 obs of patient 3 because we are looking for readmission due to whatever reason after an admission with find=1). But patient with id 4 got in the first time due to something else so we can't use this ob when we calculate the "readmission rate for heart failure". But we do need the patient's second admission for the denominator of the readmission rate (I am trying to explain the logic to code stay (0/1)). So if patient got in for 3 times, they could got in as id 6-13 showed. I wish you can get the idea of that by looking at the value for the stay variable.
      It took me like hours to write up and test the lines and as far as I can see they worked. But I think it is not a very efficient solution. Plus it's based on my very limited knowledge of STATA and data analysis. So I took the time to explain my logic and objectives hoping to get some advice on how to improve the current codes and what the other possibles ways are to approach a problem like this.
      Thank you all for bearing with me.

      Best,
      Ginny

      Comment


      • #4
        The process of calculating the proportion of patients readmitted (for any reason) after an admission with find == 1 can be done as follows:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float(find id time stay)
        1  1  1 1
        1  2  2 1
        1  2  3 1
        1  3  4 1
        0  3  5 1
        0  4  6 0
        1  4  7 1
        0  5  8 0
        0  5  9 0
        1  6 10 1
        1  6 11 1
        1  6 12 1
        1  7 13 1
        0  7 14 1
        0  7 15 0
        1  8 16 1
        0  8 17 1
        1  8 18 1
        0  9 19 0
        0  9 20 0
        0  9 21 0
        0 10 22 0
        1 10 23 1
        0 10 24 1
        0 11 25 0
        1 11 26 1
        1 11 27 1
        1 12 28 1
        1 12 29 1
        0 12 30 1
        0 13 31 0
        0 13 32 0
        1 13 33 1
        0 14 34 0
        end
        
        //    IDENTIFY READMISSIONS AFTER A FIND == 1 ADMISSION
        xtset id time
        by id (time): gen byte temp = sum(find)
        by id (time): gen byte readmitted_after_find = cond(_n > 1, !!L1.temp, 0)
        drop temp
        
        //    IDENTIFY EVER BEING ADMITTED WITH FIND == 1
        by id (time): egen byte ever_find_1 = max(find)
        
        //    IDENTIFY EVER BEING READMITTED AFTER FIND == 1
        by id (time): egen byte ever_readmitted_after_find = max(readmitted_after_find)
        
        //    FLAG ONE OBSERVATION PER ID
        by id (time): gen byte id_flag = (_n == 1)
        
        //    CALCULATE NUMERATOR, DENOMINATOR, AND PROBABILITY
        summ ever_find_1 if id_flag, meanonly
        local denominator `r(sum)'
        summ ever_readmitted_after_find if id_flag, meanonly
        local numerator `r(sum)'
        display "Probability or admission after find: " =`numerator'/`denominator'
        Now, in your explanation above, you refer on one occasion to readmissions within three months. I don't see a variable in your data that indicates how many months have elapsed: your time variable is simply a sequence number and does not appear to convey any information about the amount of time between successive admissions.

        Comment


        • #5
          Thank you very much Mr. Schechter. Sorry that I did not reply sooner.
          1. I have difficulty understanding this line you wrote above:
          Code:
          by id (time): gen byte readmitted_after_find = cond(_n > 1, !!L1.temp, 0)
          Not sure how it worked.
          2. I conferred with my colleagues and realized that I made a mistake in understanding readmission rate for specific disease/procedure. The denominator is all the admissions caused by a diagnosis/ procedure and the numerator is the admissions followed by a re-admission within the denominator admissions. So for this specific question, what I should be looking for a variable that can flag whether an admission is followed by an readmission in the entire database and then keep the obs with the right disease/ procedure code.
          3. I also realized later that time is actually important. Since I can't display the real database here, I actually composed the data used above as examples from ground up to illustrate my questions. For simplicity I made time a sequence number and the data set appear to be the result of "sort id time". In real situation the database have the date of the patient admission and discharge, thus can "convey information about the amount of time between successive admissions" like you said.
          4. After all we thought the code for readmission rate for specific disease/procedure should be like:
          Code:
          egen id = group( id)
          * if the period of time between first discharge and second admission is less than or equal to 31 days, readm31=1
          bys id:gen readm31=1 if DATE_INHOSPITAL[_n+1]-DATE_DISCHARGE<=31
          recode readm31 .=0
          keep if find==1
          I tried to use your way to store the results and calculate the rate but it doesn't work for me. But the idea is:
          Code:
          * The denominator is all the obs with find==1 (after "keep if find==1", that means all the obs)
          count if find==1
          local denominator `r(count)'  
          count if readm31==1
          local numerator `r(count)'
          display "readmission rate of find within 31 days:" =`numerator'/`denominator'
          (I don't know if it's because r(count) is different from r(sum)? )
          5. But the above code only worked when one wants to know the "admission rate of specific disease within 31 days". Went back a couple of days, I wanted to know what the re-admissions are like (how many days after the first discharge and why did they get in the second time), so these lines won't help much. The "keep if find==1" basically eliminate all the observations that come back after the first admission with find==1 as well as other irrelevant obs. I took a detour in trying to find out "what are the patient getting readmitted for (whether it's the previous disease or other diseases)?". Honestly I haven't thought through how I should approach that question and how to interpret the answer.
          6. I apologize for any unnecessary troubles and my not-so-clear questions, and thanks for all the instructive feedback.

          Comment


          • #6
            Let me explain the code -by id (time): gen byte readmitted_after_find = cond(_n > 1, !!L1.temp, 0)-.

            First, the data must be sorted by id and by time within id. Now, in general -cond(logical_expression, expression_t, expression_f)- is a function which tests whether the logical expression ins true, and if so it returns expression_t, and otherwise it returns expression_f. In this case, the condition is _n > 1. So in the first observation of each group of observations for an id, the condition is false and the readmitted_after_find will be set to 0. In all other observations in the id group, it will be set to !!L1.temp. Now, what is !!L1.temp? The data had previously been -xtset id time-, so L1.temp refers to the value of temp at the preceding time. That is, if this observation has time = 2, L1.temp means the value of temp in the same id's observation with time = 1. And what is !! ? The value of temp had been created as a running sum of the value find within the id group. That is, in any given observation, temp is the count of observations up to and including the present one which have find = 1 (sum = count of 1's because find is a 0/1 variable). ! is Stata's logical not operator. Remember that in Stata, logical values are represented by 0 = false and anything else = true. So if you work it through you will see that !!x will be 0 when x = 0, and 1 whenever x != 0 (including x missing). So !!L1.temp will be 0 if there have been no observations with find = 1 up to and including the previous time period, and it will be 1 otherwise (i.e. if there has been 1 or more observations with find = 1 up to and including the previous time period).

            The first lien of code in paragraph 4 of post #5 is incorrect, and I find it hard to believe it did not trigger an error message. You can't have -egen id = group(id)- because to appear left of the equals sign, variable id must not yet exist, but to appear right of the equals sign it must already exist. The creation of variable readm31 can be reduced to a single line:
            Code:
            bys id:gen readm31 = (DATE_INHOSPITAL[_n+1]-DATE_DISCHARGE<=31)
            If find is a 0/1 variable, `r(N)' after -count if find == 1- and `r(sum)' after -summ find- will always be the same. If they disagree, then there must be some observation where find is something other than 0 or 1 (or missing value). And if that is the case, you have uncovered a problem that you must fix because so much of your code depends on the integrity of that variable. Run -browse if !inlist(find, 0, 1) & !missing(find)- to see the offending observations.


            Comment


            • #7
              Thanks for the detailed explanation Mr. Schechter.
              1. -egen id = group(id)- is wrong. Sorry. It should be like -egen id = group(patient_id)-.
              2. It really should be -r(N)- instead of -r(count)-. Guess -r(count)- is an invalid syntax.
              3. I was inspired by your solution to my question in Post #3 and wrote some new lines to tackle this whole "why did they get in the second time" problem.
              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input float(find id time stay)
              1  1  1 1
              1  2  2 1
              1  2  3 1
              1  3  4 1
              0  3  5 1
              0  4  6 0
              1  4  7 1
              0  5  8 0
              0  5  9 0
              1  6 10 1
              1  6 11 1
              1  6 12 1
              1  7 13 1
              0  7 14 1
              0  7 15 0
              1  8 16 1
              0  8 17 1
              1  8 18 1
              0  9 19 0
              0  9 20 0
              0  9 21 0
              0 10 22 0
              1 10 23 1
              0 10 24 1
              0 11 25 0
              1 11 26 1
              1 11 27 1
              1 12 28 1
              1 12 29 1
              0 12 30 1
              0 13 31 0
              0 13 32 0
              1 13 33 1
              0 14 34 0
              end
              *Find the obs with find==1 and the obs have a previous ob with find==1
              sort id time
              by id: gen true=1 if find[_n]== 1
              by id: replace true=1 if find[_n-1]== 1
              drop if true==.
              
              *Differentiate readmission for the same problem (find[_n+1]== 1) or a different problem (find[_n+1]== 0) among the find==1 observations.
              by id: gen readm=1 if find[_n+1]== 1 & find==1
              by id: replace readm=2 if find[_n+1]== 0 & find==1
              recode readm .=0
              
              count if find==1
              local denominator `r(N)'
              count if find==1 & readm==1
              local numerator `r(N)'
              display "admission rate of readmitted for the same problem" =`numerator'/`denominator'
              
              count if find==1
              local denominator `r(N)'
              count if find==1 & readm==2
              local numerator `r(N)'
              display "admission rate of readmitted for a different problem" =`numerator'/`denominator'
              I guess this can be the solution to my initial problem. Feel free to point out anything redundant or wrong. It's quite difficult to fully understand and use the system variables but they are immensely useful.

              Comment


              • #8
                This all looks right, and I don't see any opportunities to improve it.

                Comment


                • #9
                  Thank you very much for your help sir!

                  Comment

                  Working...
                  X