Help!! Data cleaning, creating new variables with observations based on complicated conditions

Nishan Lamichhane

Join Date: Nov 2021
Posts: 36

Help!! Data cleaning, creating new variables with observations based on complicated conditions

14 Mar 2022, 14:25

Please help me with cleaning this data:

below is my data from excel file and I have 1000s of rows more like this.

patientid and patientid2 are same patient id number. (a patient has many measurements of iron & calcium)
startdate is the date we started treatment-baseline date
testdate is the laboratory test date for iron or calcium
value is the measured value of calcium or iron concentration

Now I need concentration values at baseline for each patient for calcium and iron which is the closest date after treatment or the same date as treatment started.
& also calcium_final iron_final as the final measurements which is the latest tested measurement
calcium_final_date and iron_final_date are basically the latest test dates.

what could be the best possible approach to solve this mess?

patientid	startdate	patientid2	testdate	parameter	value	calcium_base	calcium_final	calcium_final_date	iron_base	iron_final	iron_final_date
1122	10/07/2016	1122	9/26/2018	iron	227
		1122	11/08/2017	iron	194
		1122	9/29/2017	iron	4520
		1122	12/29/2016	iron	416
		1122	9/30/2016	iron	3238
		1122	11/21/2017	calcium	4
		1122	9/28/2017	calcium	11
		1122	01/09/2017	calcium	22
		1122	9/29/2016	calcium	33
1127	10/20/2016	1127	10/10/2018	iron	25
		1127	10/23/2017	iron	121
		1127	01/09/2017	iron	94
		1127	10/07/2016	iron	152
		1127	10/09/2017	calcium	24
		1127	01/09/2017	calcium	22
		1127	10/11/2016	calcium	222
1199	4/17/2018	1199	3/28/2018	calcium	5000
		1199	1/22/2018	iron	669
1289	08/11/2017	1289	6/27/2018	iron	219
		1289	01/08/2018	iron	221
		1289	9/21/2017	calcium	6000
		1289	6/27/2018	calcium	5
1450	11/13/2015	1450	11/08/2021	iron	509
		1450	5/17/2021	iron	30
		1450	5/29/2020	iron	117
		1450	01/08/2020	calcium	3457
		1450	07/02/2019	calcium	502

Tags: data cleaning, date, merging, newvar, variables

Nishan Lamichhane

Join Date: Nov 2021

Posts: 36
#2

14 Mar 2022, 14:33

copying and pasting the data above in the excel file should help to explore the possibilities.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#3

14 Mar 2022, 15:04

There is no point in writing Stata code for data that does not exist in Stata. If you want to do this in Excel (which I strongly recommend against unless this is just for amusement) then you should seek help from an Excel forum.

If you want to do this in Stata, first import the data to Stata. Then use the -dataex- command to show an example of your Stata data. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

If you do that, I'm confident you will get a timely and helpful response.
1 like
Comment

Nishan Lamichhane

Join Date: Nov 2021
Posts: 36

14 Mar 2022, 15:17

Thank you Clyde. I am new to stata and didn't know about this dataex command ( really so useful) . below is the data in stata

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int patientid float startdate int patientid2 float testdate str7 parameter int value byte(calcium_base calcium_final calcium_final_date iron_base iron_final iron_final_date)
1122 20645 1122 21453 "iron"     227 . . . . . .
   .     . 1122 21042 "iron"     194 . . . . . .
   .     . 1122 21091 "iron"    4520 . . . . . .
   .     . 1122 20817 "iron"     416 . . . . . .
   .     . 1122 20727 "iron"    3238 . . . . . .
   .     . 1122 21144 "calcium"    4 . . . . . .
   .     . 1122 21090 "calcium"   11 . . . . . .
   .     . 1122 21063 "calcium"   22 . . . . . .
   .     . 1122 20726 "calcium"   33 . . . . . .
1127 20747 1127 21467 "iron"      25 . . . . . .
   .     . 1127 21115 "iron"     121 . . . . . .
   .     . 1127 21063 "iron"      94 . . . . . .
   .     . 1127 20645 "iron"     152 . . . . . .
   .     . 1127 21072 "calcium"   24 . . . . . .
   .     . 1127 21063 "calcium"   22 . . . . . .
   .     . 1127 20768 "calcium"  222 . . . . . .
1199 21291 1199 21271 "calcium" 5000 . . . . . .
   .     . 1199 21206 "iron"     669 . . . . . .
1289 21131 1289 21362 "iron"     219 . . . . . .
   .     . 1289 21397 "iron"     221 . . . . . .
   .     . 1289 21083 "calcium" 6000 . . . . . .
   .     . 1289 21362 "calcium"    5 . . . . . .
1450 20405 1450 22503 "iron"     509 . . . . . .
   .     . 1450 22417 "iron"      30 . . . . . .
   .     . 1450 22064 "iron"     117 . . . . . .
   .     . 1450 22128 "calcium" 3457 . . . . . .
   .     . 1450 21587 "calcium"  502 . . . . . .
end
format %td startdate
format %td testdate

Last edited by Nishan Lamichhane; 14 Mar 2022, 15:54.

Comment

William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

14 Mar 2022, 15:23

copying and pasting the data above in the excel file should help to explore the possibilities.

Not really, because it is possible that anomalies in the "1000s of rows more like this" that we don't see will cause Stata to import your full spreadsheet differently than it does these few rows. So then someone develops code that works on these 27 observations but it then fails when you apply it to the full dataset. And then someone has to figure out what needs to be done to prepare the full dataset to meet the needs of this analysis.

One step at a time. The first step is to create the Stata data you need to work with. Then and only then do you invest the time in solving your calculation problem.

Added in edit: This post crossed with your response in post #4. I note that your dates were imported as string variables rather than as Stata dates, as was the case when I imported your 27 observations in post #1. This suggests there are cells in the date columns of your spreadsheet that are neither empty nor have Excel dates in them. We don't know what specific import excel command you used to import your data, but one common cause of this sort of problem is that you have left it to Stata to infer the limits of the data on your Excel spreadsheet, and it has picked up some extra rows at the end. You should use Stata's Data Editor Browse window to look at your Stata data and scroll to the bottom and see if there are extra rows, and if so, rerun the import excel using the cellrange() option to select only those rows and columns that contain your data.

If my guess isn't right, then you have some junk in your Excel date columns that might be easier for you to sort out in Excel rather than in Stata. Stata does a nice job of importing and formatting the dates in the 27 example observations, it would be a shame to have to waste the time to reproduce that by starting with string variables.

Last edited by William Lisowski; 14 Mar 2022, 15:36.
Comment
Nishan Lamichhane

Join Date: Nov 2021

Posts: 36
#6

14 Mar 2022, 15:31

yes, I got your point William. The command is basically for each unique patientid2 we need a value with parameter calcium that has the difference difference between startdate and testdate zero or smallest
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

14 Mar 2022, 15:38

I edited post #5 after post #6 went up, just posting this to be sure my edits are noticed.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#8

14 Mar 2022, 16:11

Crossed with #5-#7. The following code takes your -dataex- as being correctly imported. Wiliam Lisowski is right that there is something suspicious about the testdate variable being imported as a string. Be that as it may, the code below converts it to a Stata internal format date variable. If a better import of the data makes that unnecessary, you can just remove that part of the code.

Code:

assert patientid == patientid2 if !missing(patientid) drop patientid replace startdate = startdate[_n-1] if missing(startdate) by patientid2 testdate parameter, sort: assert _N == 1 foreach v of varlist *date { gen _`v' = daily(`v', "MDY") assert !missing(_`v') format _`v' %td drop `v' rename _`v' `v' } by patientid2 parameter (testdate), sort: egen base_date = /// min(cond(testdate >= startdate, testdate, .)) by patientid2 parameter (testdate): egen base_value = /// max(cond(testdate == base_date, value, .)) by patientid2 parameter (testdate): gen final_date = testdate[_N] by patientid2 parameter (testdate): gen final_value = value[_N] drop testdate value by patientid2 parameter: keep if _n == 1 rename (base* final*) =_ reshape wide base* final*, i(patientid2) j(parameter) string

Notes: You have two variables patientid, and patientid2, that appear to be the same, except that patient id is missing in all but the first observation for each patient. That makes patient id superfluous, so the code verifies this assumption and then drops patientid.

Patientid 1199 has no tests that precede the treatment start date, so baseline values are missing. You didn't say what to do about the final value if all values precede startdate, so I just let Stata pick up the last of the pre-startdate values as the "final" one.

It is possible in real life, though perhaps not in your data, for a patient to have two different test results on the same date. In this case, both the baseline value and the final value for either analyte could be undefined. The code tests for this possibility and it will halt with an "assertion is false" error message if it finds violations. You will then have to explore those cases and fix the data.
1 like
Comment
Nishan Lamichhane

Join Date: Nov 2021

Posts: 36
#9

20 Mar 2022, 13:09

Thank you Clyde . your solution made my work easier.
Comment

Announcement

Help!! Data cleaning, creating new variables with observations based on complicated conditions

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment