Hi All -
I am hoping to get some input on a concatenated strings issue I can't seem to figure out. Using STATA MP 14.2, fully updated.
My data are patient-level data, but include the text of operative notes from patient charts, and the original file is .xlsx. Because of size limitations, the operative notes spilled over into additional rows. The person who created the dataset was lovely and each row repeats the patient's information with an identifier for LINE which identifies which part of the operative note. I don't know which program was used to create the original text of the operative notes.
Example data:
PatientID EncounterID NoteID LINE NOTE_TEXT
XXXX YYYY ZZZZ 1 Blah blah blah
XXXX YYYY ZZZZ 2 More of the same note
XXXX YYYY AAAA 1 New operative note during same hospital stay
XXXX YYYY AAAA 2 that keeps going into this line...
Occasionally, a patient will go to the OR more than once during the same hospital stay (encounter ID), and so there may be two NoteIDs, with multiple values of LINE for each NoteID. Some patients also have more than one encounter in the system over the time during which we have taken our data.
WIth STATAs new-ish strL function, I was very much hoping I could get all of the note text for each op note into one cell for each patient...
I started by reshaping the data to wide based on Note ID so that for each noteID, i have the text of that note in NoteText1, NoteText2, NoteText3, etc.
This all seems to work fine. The issue comes when I try to concatenate the Note text one to another and get them all into one cell. For some reason, the string data are truncated at what seem to be very random points in the concatenation. There is no set string length at which the concatenation ends, nor is it a function of only getting through two cells and then stopping. The truncation occurs in the middle of cells for nearly every single instance of truncation that I have gone through. After concatenation, the finished string lengths vary from 300s - 8700, so there does not seem to be a rhyme or reason to the truncation.
A couple problems I tried to address:
1) The NoteText variable string texts have multiple carriage returns and/or line breaks because they were directly copied from the chart. I have performed the concatenation with and without removing these using the various string trim functions (both unicode and standard). The results are the same whether carriage returns have been removed or not (that is, the string ends abruptly at the exact same point regardless of whether these extra characters are removed). Side note: ideally I could keep the carriage returns, because I'd like to output these op notes back to a text file for chart review purposes...
2) Creation of a strL variable before vs during: I know that STATA is automatically supposed to bump strings "to the next level" if the string created exceeds the set limit. When initially creating the variable, I just generated it from the gen or egen command. When that didn't work, I tried to create an empty strL variable and use the replace command. Again, the results are identical, and strings are truncated at seemingly random points
3) Use of concat vs var+ var+ var: I have tried both thetyped out gen (or replace) varname = notetext1 + notetext2 + notetext3... and also egen varname = concat(notetext*). Again, both approaches result in the same outcome.
4) .xlsx vs delimited .txt files. I have been importing the original data from Excel because it seems to handle it correctly. I tried importing as tab-delimited .txt thinking this might address it, but I get jumbled data; even when specifying tab rather than automatic delimiter, each carriage return in the op note becomes a new observation/row.
I guess the question is, what might I be missing that could be causing the strings to truncate rather than continuing the concatenation in strL variables?
The code I've tried is below.
Thanks!
Michelle
reshape wide NOTE_TEXT PROVIDER CONTACT_SERIAL_NUM, i(NOTE_ID) j(LINE)
(note: j = 1 2 3 4 5 6)
Data long -> wide
-----------------------------------------------------------------------------
Number of obs. 8048 -> 3182
Number of variables 10 -> 24
j variable (6 values) LINE -> (dropped)
xij variables:
NOTE_TEXT -> NOTE_TEXT1 NOTE_TEXT2 ... NOTE_TEXT6
PROVIDER -> PROVIDER1 PROVIDER2 ... PROVIDER6
CONTACT_SERIAL_NUM -> CONTACT_SERIAL_NUM1 CONTACT_SERIAL_NUM2 ... CONTACT_SERIAL_NUM6
-----------------------------------------------------------------------------
egen fullnote = concat(NOTE_TEXT*)
(3 missing values generated)
drop fullnote
gen strL fullnote = ""
replace fullnote = (NOTE_TEXT1 + NOTE_TEXT2 + ...NOTE_TEXT6)
(3 missing values generated)
I am hoping to get some input on a concatenated strings issue I can't seem to figure out. Using STATA MP 14.2, fully updated.
My data are patient-level data, but include the text of operative notes from patient charts, and the original file is .xlsx. Because of size limitations, the operative notes spilled over into additional rows. The person who created the dataset was lovely and each row repeats the patient's information with an identifier for LINE which identifies which part of the operative note. I don't know which program was used to create the original text of the operative notes.
Example data:
PatientID EncounterID NoteID LINE NOTE_TEXT
XXXX YYYY ZZZZ 1 Blah blah blah
XXXX YYYY ZZZZ 2 More of the same note
XXXX YYYY AAAA 1 New operative note during same hospital stay
XXXX YYYY AAAA 2 that keeps going into this line...
Occasionally, a patient will go to the OR more than once during the same hospital stay (encounter ID), and so there may be two NoteIDs, with multiple values of LINE for each NoteID. Some patients also have more than one encounter in the system over the time during which we have taken our data.
WIth STATAs new-ish strL function, I was very much hoping I could get all of the note text for each op note into one cell for each patient...
I started by reshaping the data to wide based on Note ID so that for each noteID, i have the text of that note in NoteText1, NoteText2, NoteText3, etc.
This all seems to work fine. The issue comes when I try to concatenate the Note text one to another and get them all into one cell. For some reason, the string data are truncated at what seem to be very random points in the concatenation. There is no set string length at which the concatenation ends, nor is it a function of only getting through two cells and then stopping. The truncation occurs in the middle of cells for nearly every single instance of truncation that I have gone through. After concatenation, the finished string lengths vary from 300s - 8700, so there does not seem to be a rhyme or reason to the truncation.
A couple problems I tried to address:
1) The NoteText variable string texts have multiple carriage returns and/or line breaks because they were directly copied from the chart. I have performed the concatenation with and without removing these using the various string trim functions (both unicode and standard). The results are the same whether carriage returns have been removed or not (that is, the string ends abruptly at the exact same point regardless of whether these extra characters are removed). Side note: ideally I could keep the carriage returns, because I'd like to output these op notes back to a text file for chart review purposes...
2) Creation of a strL variable before vs during: I know that STATA is automatically supposed to bump strings "to the next level" if the string created exceeds the set limit. When initially creating the variable, I just generated it from the gen or egen command. When that didn't work, I tried to create an empty strL variable and use the replace command. Again, the results are identical, and strings are truncated at seemingly random points
3) Use of concat vs var+ var+ var: I have tried both thetyped out gen (or replace) varname = notetext1 + notetext2 + notetext3... and also egen varname = concat(notetext*). Again, both approaches result in the same outcome.
4) .xlsx vs delimited .txt files. I have been importing the original data from Excel because it seems to handle it correctly. I tried importing as tab-delimited .txt thinking this might address it, but I get jumbled data; even when specifying tab rather than automatic delimiter, each carriage return in the op note becomes a new observation/row.
I guess the question is, what might I be missing that could be causing the strings to truncate rather than continuing the concatenation in strL variables?
The code I've tried is below.
Thanks!
Michelle
reshape wide NOTE_TEXT PROVIDER CONTACT_SERIAL_NUM, i(NOTE_ID) j(LINE)
(note: j = 1 2 3 4 5 6)
Data long -> wide
-----------------------------------------------------------------------------
Number of obs. 8048 -> 3182
Number of variables 10 -> 24
j variable (6 values) LINE -> (dropped)
xij variables:
NOTE_TEXT -> NOTE_TEXT1 NOTE_TEXT2 ... NOTE_TEXT6
PROVIDER -> PROVIDER1 PROVIDER2 ... PROVIDER6
CONTACT_SERIAL_NUM -> CONTACT_SERIAL_NUM1 CONTACT_SERIAL_NUM2 ... CONTACT_SERIAL_NUM6
-----------------------------------------------------------------------------
egen fullnote = concat(NOTE_TEXT*)
(3 missing values generated)
drop fullnote
gen strL fullnote = ""
replace fullnote = (NOTE_TEXT1 + NOTE_TEXT2 + ...NOTE_TEXT6)
(3 missing values generated)
Comment