Estimating hospital mortality with recurring patients and duplicates

Gianfranco Di Gennaro

Join Date: Nov 2020

Posts: 135
#1

Estimating hospital mortality with recurring patients and duplicates

22 Feb 2025, 01:42

Dear all,

I have a data management issue in an epidemiological study where the goal is to estimate mortality rate in a hospital across different time periods, specifically from May to December for the years 2020, 2021, 2022, 2023, and 2024.
For each of these years, I have a list of patients who passed through the hospital, organized in different Excel sheets.

To do this, for each patient I would have to calculate the days of hospitalization in each department and his/her status upon discharge (alive or dead).

There are some complications:
A patient may appear in multiple years. For example, a patient admitted on November 17, 2018, and discharged on March 3, 2023, appears in the data for 2020, 2021, 2022, and 2023.

A patient might have been admitted and discharged multiple times. For instance, the aforementioned patient, discharged on March 3, 2023, could have been readmitted on June 7, 2023, or in the following year (e.g., February 2, 2024), and so on.

In addition to the patient ID, admission and discharge dates, I also know the discharge status for each admission (deceased or NOT deceased).

When I combine the excel sheets, the data has the following structure:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input int patient_id str14 hospital_unit str10(admission_date discharge_date) byte deceased int excelyearsheet 101 "Day Hospital" "15/06/2020" "25/06/2020" 0 2020 102 "Hospice" "10/08/2021" "30/08/2021" 0 2021 103 "Cardiology" "20/11/2022" "15/01/2023" 1 2022 103 "Cardiology" "20/11/2022" "15/01/2023" 1 2023 104 "Unit A" "05/06/2022" "15/06/2022" 0 2022 104 "Unit A" "20/10/2022" "30/10/2022" 1 2022 105 "Oncology" "10/07/2020" "20/07/2020" 0 2020 105 "Oncology" "15/12/2022" "05/01/2023" 0 2022 105 "Oncology" "15/12/2022" "05/01/2023" 0 2023 106 "Day Hospital" "12/05/2023" "22/05/2023" 0 2023 107 "Unit B" "25/11/2021" "05/01/2022" 0 2021 107 "Unit B" "25/11/2021" "05/01/2022" 0 2022 107 "Unit B" "10/07/2023" "20/07/2023" 0 2023 108 "Day Hospital" "05/06/2024" "15/06/2024" 0 2024 109 "Intensive care" "18/08/2020" "30/08/2020" 1 2020 110 "Cardiology" "28/12/2023" "10/01/2024" 0 2023 110 "Cardiology" "28/12/2023" "10/01/2024" 0 2024 111 "Day Hospital" "12/09/2021" "22/09/2021" 0 2021 112 "Unit C" "07/05/2022" "17/05/2022" 0 2022 112 "Unit C" "12/11/2022" "22/11/2022" 0 2022 113 "Hospice" "15/07/2024" "25/07/2024" 1 2024 114 "Oncology" "20/10/2023" "30/10/2023" 0 2023 115 "Unit D" "05/08/2021" "15/08/2021" 0 2021 115 "Unit D" "02/02/2024" "12/02/2024" 0 2024 221 "Day Hospital" "23/11/2023" "06/02/2024" 0 2023 221 "Day Hospital" "23/11/2023" "06/02/2024" 0 2024 221 "Hospice" "12/2/2023" "23/04/2024" 1 2024 end

In a previous study, the invaluable @ClydeSchechter provided me with the tools to obtain one record per patient, per month, and per department, in order to estimate monthly incidences. The method worked perfectly. It was like this:

Code:

gen start_month = mofd(admission_date) gen end_month = mofd(discharge_date) expand end_month-start_month + 1 by patient_id, sort: gen mdate = dofm(start_month + _n - 1) format *month %tm format mdate %td gen admission_days = min(floor(discharge_date), lastdayofmonth(mdate)) /// - max(admission_date, firstdayofmonth(mdate)) by patient_id (mdate), sort: replace deceased = 0 if _n < _N // CAN ONLY DIE IN LAST MONTH

I would like to apply this method again.
However, in that case, the situation was less complex because the data was relative to a single year, 2020. In this case, I have the problem of all the duplicates that are created when a patient appears with the same admissions in the Excel sheets of different years (e.g., patients who "died twice," etc.). Since we are talking about thousands of patients, checking them individually is impossible.

I wonder if you have any suggestions regarding this issue. I am unsure whether it would be better to remove admissions that appear in multiple sheets beforehand or to eliminate the duplicates that arise after processing.

Thank you all.
Gianfranco
Tags: data management, epidemiology
Clyde Schechter

Join Date: Apr 2014

Posts: 29818
#2

22 Feb 2025, 10:04

Based on the example data shown, it seems that the duplicate records are exact duplicates with the sole exception of the variable excelyearsheet. If that is true in the entire data set, then you could do this as:
[code]
ds excelyearsheet, not
by `r(varlist)', sort: keep if _n == 1
drop excelyearsheet
[code]

This is a conservative approach: if there are variables other than id, hospital_unit, admission_date, discharge_date, and deceased, and if some of them differ according to which excelyearsheet they come from, then all of those versions will be preserved. What is really needed is to first determine whether the full data set does contain other variables that can and should vary across excelyearsheet for the same admission (e.g. billing or administrative variables might be in this category). In that case you should -drop- those variables first and then proceed with the above code.

I would do this before attempting to calculate monthly incidence rates.
Comment

Gianfranco Di Gennaro

Join Date: Nov 2020
Posts: 135

22 Feb 2025, 11:49

Thank you, Clyde Schechter , for your help. I have the following variables: name, surname, sex, Hospital_Area (which is directly linked to hospital_unit), and born_date (which I will use to calculate age in the final dataset before computing the incidence rate) (see below).

The objective is to calculate the incidence across different hospital areas or units, stratified by three age groups. However, I repeat, in my opinion, I can also create the age groups AFTER removing duplicates (correct?).

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int patient_id str14 hospital_unit str10(admission_date discharge_date) byte deceased int excelyearsheet str1 sex str10 born_date str5 hospital_area str10 name str11 surname
101 "Day Hospital"   "15/06/2020" "25/06/2020" 0 2020 "M" "12/03/1965" "North" "Mickey"     "Mouse"      
102 "Hospice"        "10/08/2021" "30/08/2021" 0 2021 "F" "22/07/1958" "South" "Minnie"     "Mouse"      
103 "Cardiology"     "20/11/2022" "15/01/2023" 1 2022 "M" "05/11/1947" "East"  "Donald"     "Duck"       
103 "Cardiology"     "20/11/2022" "15/01/2023" 1 2023 "M" "05/11/1947" "East"  "Donald"     "Duck"       
104 "Unit A"         "05/06/2022" "15/06/2022" 0 2022 "F" "18/09/1972" "West"  "Daisy"      "Duck"       
104 "Unit A"         "20/10/2022" "30/10/2022" 1 2022 "F" "18/09/1972" "West"  "Daisy"      "Duck"       
105 "Oncology"       "10/07/2020" "20/07/2020" 0 2020 "M" "30/01/1980" "North" "Goofy"      "Goof"       
105 "Oncology"       "15/12/2022" "05/01/2023" 0 2022 "M" "30/01/1980" "North" "Goofy"      "Goof"       
105 "Oncology"       "15/12/2022" "05/01/2023" 0 2023 "M" "30/01/1980" "North" "Goofy"      "Goof"       
106 "Day Hospital"   "12/05/2023" "22/05/2023" 0 2023 "F" "14/04/1995" "South" "Pluto"      "Dog"        
107 "Unit B"         "25/11/2021" "05/01/2022" 0 2021 "M" "08/12/1969" "East"  "Scrooge"    "McDuck"     
107 "Unit B"         "25/11/2021" "05/01/2022" 0 2022 "M" "08/12/1969" "East"  "Scrooge"    "McDuck"     
107 "Unit B"         "10/07/2023" "20/07/2023" 0 2023 "M" "08/12/1969" "East"  "Scrooge"    "McDuck"     
108 "Day Hospital"   "05/06/2024" "15/06/2024" 0 2024 "F" "25/05/1988" "West"  "Clarabelle" "Cow"        
109 "Intensive care" "18/08/2020" "30/08/2020" 1 2020 "M" "03/02/1950" "North" "Pete"       "Black"      
110 "Cardiology"     "28/12/2023" "10/01/2024" 0 2023 "F" "19/06/1975" "South" "Huey"       "Duck"       
110 "Cardiology"     "28/12/2023" "10/01/2024" 0 2024 "F" "19/06/1975" "South" "Huey"       "Duck"       
111 "Day Hospital"   "12/09/2021" "22/09/2021" 0 2021 "M" "11/10/1983" "East"  "Dewey"      "Duck"       
112 "Unit C"         "07/05/2022" "17/05/2022" 0 2022 "F" "27/04/1990" "West"  "Louie"      "Duck"       
112 "Unit C"         "12/11/2022" "22/11/2022" 0 2022 "F" "27/04/1990" "West"  "Louie"      "Duck"       
113 "Hospice"        "15/07/2024" "25/07/2024" 1 2024 "M" "09/08/1955" "North" "Max"        "Goof"       
114 "Oncology"       "20/10/2023" "30/10/2023" 0 2023 "F" "01/03/1987" "South" "Webby"      "Vanderquack"
115 "Unit D"         "05/08/2021" "15/08/2021" 0 2021 "M" "15/07/1978" "East"  "Launchpad"  "McQuack"    
115 "Unit D"         "02/02/2024" "12/02/2024" 0 2024 "M" "15/07/1978" "East"  "Launchpad"  "McQuack"    
221 "Day Hospital"   "23/11/2023" "06/02/2024" 0 2023 "F" "20/12/1962" "West"  "Magica"     "De Spell"   
221 "Day Hospital"   "23/11/2023" "06/02/2024" 0 2024 "F" "20/12/1962" "West"  "Magica"     "De Spell"   
221 "Hospice"        "12/2/2023"  "23/04/2024" 1 2024 "F" "20/12/1962" "West"  "Magica"     "De Spell"   
end

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29818
#4

22 Feb 2025, 12:05

So, it sounds like all of the variables other than excelyearsheet should be exact duplicates, although it is also true that patient_id, hospital_unit, admission_date, and discharge_date should uniquely determine all of the other variables. Before attempting any age-grouping or other analysis, removing the duplicates should be the first things you do.

You should do this carefully: don't just assume that the variables that should agree across the "duplicates" do. Anticipate the possibility of errors, and check for them before proceeding:

Code:

// STEP 1: VERIFY CONSISTENCY OF VARIABLES ds excelyearsheet, not local all_but_excelyearsheet `r(varlist)' local key_vars patient_id hospital_unit admission_date discharge_date local other_vars: list all_but_excelyearsheet - key_vars foreach v of varlist `other_vars' { display "Verifying consistency of `v'" by `key_vars' (`v'), sort: assert `v'[1] == `v'[_N] } // STEP 2: REMOVE DUPLICATES by `key_vars', sort: keep if _n == 1 drop excelyearsheet

If the first loop generates any error messages, then you have found inconsistencies on the "same" admission in different excelyearsheets. Then you need to go back and reconcile the discrepancies in some way. This probably requires the participation of whoever created those spreadsheets in the first place.

Once the above code runs without errors, you are ready to define age-groups, compute incidence rates, etc.
Comment
Gianfranco Di Gennaro

Join Date: Nov 2020

Posts: 135
#5

22 Feb 2025, 12:38

Dear Clyde Schechter ,

Thank you for your help. I’ve reviewed the dataset, and it contains many errors that cannot be resolved. Here’s what I plan to do:
Analyze the Excel sheets separately, removing all periods before and after the start of the year corresponding to the sheet being analyzed.

Keep only one record per patient: the one with the latest discharge. I will remove duplicates and perform a logistic regression to evaluate the odds of death at discharge.

I will assume that:
The patient case-mix is similar across the dataset.

The length of hospital stays is comparable.

I’m not sure if these are good ideas, but at the moment, it’s the only approach I can think of.

Thank you, as always, for your support!
Gianfranco
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29818
#6

22 Feb 2025, 13:05

My experience working with data sets extracted from clinical or administrative data bases is similar to what you are experiencing: there are usually inconsistencies that are difficult or impossible to rectify. So one must make some assumptions in order to work with such data. The assumptions need to be reasonable in the context of the workings of the data source and the analyses you plan to do. As I am unfamiliar with your context, I cannot judge whether these assumptions are reasonable in your situation or not--if you are not confident about them, you would be better off checking them with a colleague who is perhaps more experienced or more senior. Once you are confident, proceed.

Do remember that in a final report of what you do, these difficulties, and the assumptions made and steps taken to overcome them, should be set out clearly.

Added: Notwithstanding my disclaimer about context and research aims, I should point out that I am skeptical about the plan to retain only the final admission for each patient. If your goal is still to calculate monthly mortality rates for each hospital, this will bias your estimates. The problem is that during the months of the non-final admissions, the patient should be counted in the denominator for that month, but your method will omit them. And, of course anyone who has a non-final admission necessarily survived that admission, so this process is a differential exclusion of survivors from some months' calculations. Duplicate records of the same admission, of course, should be removed to avoid double (multiple)-counting. But I do think you need to retain one (and only one) record for each admission.

Last edited by Clyde Schechter; 22 Feb 2025, 13:15.
Comment
Gianfranco Di Gennaro

Join Date: Nov 2020

Posts: 135
#7

22 Feb 2025, 14:01

Dear Clyde Schechter
It's really difficult to make assumptions. Essentially, we want to see if the mortality incidence rate changes in the period from May 2020 to December 2020 compared to the period from May to December, at least in 2021 and 2022, in various elderly care departments, where, in general, patient characteristics are fairly constant and homogeneous. I'm using the monthly incidence for those months because it seems to me that the monthly rate is interpretable and because I'm not able to program the extraction of the number of hospitalization days for that specific May-December period. I understand what you're saying about the risk of keeping the last record.
Comment

Announcement