Dear all,
I have a data management issue in an epidemiological study where the goal is to estimate mortality rate in a hospital across different time periods, specifically from May to December for the years 2020, 2021, 2022, 2023, and 2024.
For each of these years, I have a list of patients who passed through the hospital, organized in different Excel sheets.
To do this, for each patient I would have to calculate the days of hospitalization in each department and his/her status upon discharge (alive or dead).
There are some complications:
When I combine the excel sheets, the data has the following structure:
In a previous study, the invaluable @ClydeSchechter provided me with the tools to obtain one record per patient, per month, and per department, in order to estimate monthly incidences. The method worked perfectly. It was like this:
I would like to apply this method again.
However, in that case, the situation was less complex because the data was relative to a single year, 2020. In this case, I have the problem of all the duplicates that are created when a patient appears with the same admissions in the Excel sheets of different years (e.g., patients who "died twice," etc.). Since we are talking about thousands of patients, checking them individually is impossible.
I wonder if you have any suggestions regarding this issue. I am unsure whether it would be better to remove admissions that appear in multiple sheets beforehand or to eliminate the duplicates that arise after processing.
Thank you all.
Gianfranco
I have a data management issue in an epidemiological study where the goal is to estimate mortality rate in a hospital across different time periods, specifically from May to December for the years 2020, 2021, 2022, 2023, and 2024.
For each of these years, I have a list of patients who passed through the hospital, organized in different Excel sheets.
To do this, for each patient I would have to calculate the days of hospitalization in each department and his/her status upon discharge (alive or dead).
There are some complications:
- A patient may appear in multiple years. For example, a patient admitted on November 17, 2018, and discharged on March 3, 2023, appears in the data for 2020, 2021, 2022, and 2023.
- A patient might have been admitted and discharged multiple times. For instance, the aforementioned patient, discharged on March 3, 2023, could have been readmitted on June 7, 2023, or in the following year (e.g., February 2, 2024), and so on.
When I combine the excel sheets, the data has the following structure:
Code:
* Example generated by -dataex-. For more info, type help dataex clear input int patient_id str14 hospital_unit str10(admission_date discharge_date) byte deceased int excelyearsheet 101 "Day Hospital" "15/06/2020" "25/06/2020" 0 2020 102 "Hospice" "10/08/2021" "30/08/2021" 0 2021 103 "Cardiology" "20/11/2022" "15/01/2023" 1 2022 103 "Cardiology" "20/11/2022" "15/01/2023" 1 2023 104 "Unit A" "05/06/2022" "15/06/2022" 0 2022 104 "Unit A" "20/10/2022" "30/10/2022" 1 2022 105 "Oncology" "10/07/2020" "20/07/2020" 0 2020 105 "Oncology" "15/12/2022" "05/01/2023" 0 2022 105 "Oncology" "15/12/2022" "05/01/2023" 0 2023 106 "Day Hospital" "12/05/2023" "22/05/2023" 0 2023 107 "Unit B" "25/11/2021" "05/01/2022" 0 2021 107 "Unit B" "25/11/2021" "05/01/2022" 0 2022 107 "Unit B" "10/07/2023" "20/07/2023" 0 2023 108 "Day Hospital" "05/06/2024" "15/06/2024" 0 2024 109 "Intensive care" "18/08/2020" "30/08/2020" 1 2020 110 "Cardiology" "28/12/2023" "10/01/2024" 0 2023 110 "Cardiology" "28/12/2023" "10/01/2024" 0 2024 111 "Day Hospital" "12/09/2021" "22/09/2021" 0 2021 112 "Unit C" "07/05/2022" "17/05/2022" 0 2022 112 "Unit C" "12/11/2022" "22/11/2022" 0 2022 113 "Hospice" "15/07/2024" "25/07/2024" 1 2024 114 "Oncology" "20/10/2023" "30/10/2023" 0 2023 115 "Unit D" "05/08/2021" "15/08/2021" 0 2021 115 "Unit D" "02/02/2024" "12/02/2024" 0 2024 221 "Day Hospital" "23/11/2023" "06/02/2024" 0 2023 221 "Day Hospital" "23/11/2023" "06/02/2024" 0 2024 221 "Hospice" "12/2/2023" "23/04/2024" 1 2024 end
Code:
gen start_month = mofd(admission_date) gen end_month = mofd(discharge_date) expand end_month-start_month + 1 by patient_id, sort: gen mdate = dofm(start_month + _n - 1) format *month %tm format mdate %td gen admission_days = min(floor(discharge_date), lastdayofmonth(mdate)) /// - max(admission_date, firstdayofmonth(mdate)) by patient_id (mdate), sort: replace deceased = 0 if _n < _N // CAN ONLY DIE IN LAST MONTH
However, in that case, the situation was less complex because the data was relative to a single year, 2020. In this case, I have the problem of all the duplicates that are created when a patient appears with the same admissions in the Excel sheets of different years (e.g., patients who "died twice," etc.). Since we are talking about thousands of patients, checking them individually is impossible.
I wonder if you have any suggestions regarding this issue. I am unsure whether it would be better to remove admissions that appear in multiple sheets beforehand or to eliminate the duplicates that arise after processing.
Thank you all.
Gianfranco
Comment