Mathematical operations within a string variable

Marcelo Dell

Join Date: Aug 2020

Posts: 6
#1

Mathematical operations within a string variable

31 Aug 2021, 08:43

Dears,

I have a dataset with data on pharmaceutical supplies with around 7million observations. One of the variables is on the pharmaceutical presentation, and I have around 1200 different types of presentations. I want to go from the pharmaceutical presentation to the total amount of substance per box
The variable is in string as follows and I want to multiply the first number (amount of substance per pill) per the number in the end of the string (number of pills in a box)

40 MG COM REV CT BL AL PLAS TRANS X 10
40 MG COM REV CT BL AL PLAS TRANS X 20
40 MG COM REV CT BL AL PLAS TRANS X 30
5 MG COM CT 2 BL AL PLAS INC X 10
5 MG COM CT BL AL / AL X 20
5 MG COM CT BL AL / AL X 30
5 MG COM CT BL AL PLAS BCO LEIT X 20
5 MG COM CT BL AL PLAS INC X 100
5 MG COM CT BL AL PLAS INC X 20
5 MG COM CT BL AL PLAS INC X 20
5 MG COM CT BL AL PLAS INC X 30
5 MG COM CT BL AL PLAS OPC X 20
5 MG COM CT BL AL PLAS OPC X 30

Problem 2: some of thee strings (less than 10%) have information in parenthesis that do not interest me in the end, i.e:
20 MG COM DISP CT 2 BL AL PLAS INC X 14 (PORT 344/98 L - C1)
25 MG CAP GEL DURA CT BL AL PLAS INC X 50 (EMB HOSP)

Do you have any suggestions on how to solve it?

Thanks so much

Last edited by Marcelo Dell; 31 Aug 2021, 08:46.
Tags: regular expression, Regular expressions, regularexpression, string, strings

Mike Lacy

Join Date: Apr 2014
Posts: 2404

31 Aug 2021, 09:05

Assuming that the second number is always the next one after the "X"

Code:

clear
input str40 s
"40 MG COM REV CT BL AL PLAS TRANS X 10"
"40 MG COM REV CT BL AL PLAS TRANS X 20"
"40 MG COM REV CT BL AL PLAS TRANS X 30"
"5 MG COM CT 2 BL AL PLAS INC X 10"
"5 MG COM CT BL AL / AL X 20"
"5 MG COM CT BL AL / AL X 30"
"5 MG COM CT BL AL PLAS BCO LEIT X 20"
"5 MG COM CT BL AL PLAS INC X 100"
"5 MG COM CT BL AL PLAS INC X 20"
"5 MG COM CT BL AL PLAS INC X 20"
"5 MG COM CT BL AL PLAS INC X 30"
"5 MG COM CT BL AL PLAS OPC X 20"
"5 MG COM CT BL AL PLAS OPC X 30"
end
//
replace s = lower(s) // avoid case problems
gen n1 = real(word(s,1))
gen n2 = .
// n2 assumed at next position past "x"
local maxword = 100 // should be big enough
forval i = 1/`maxword' {
    quiet replace n2 = real(word(s,`i' + 1)) ///
          if (word(s,`i') == "x")
}
gen npill = n1 * n2

Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783

02 Sep 2021, 01:06

Code:

gen p = real(ustrregexs(1))*real(ustrregexs(2)) if ustrregexm(ustrregexra(s,"\s\(.*?\)",""), "(^\d+)\s.*\s(\d+$)")

Comment

Marcelo Dell

Join Date: Aug 2020

Posts: 6
#4

07 Sep 2021, 08:50

Dear Mike Lacy,

Thank you so much, you helped a lot but there are still some issues going on.

For n1 I got 1,516,658 missing numbers out of roughly 8million observations. I believe it was due to decimal numbers and when MG came attached to the number
, i.e:
0,25 MG COM CT BL AL PLAS INC X 20
0,25 MG COM CT BL AL PLAS INC X 30
2,0 MG COM CT BL AL PLAS TRANS X 30
3,0 MG COM CT BL AL PLAS TRANS X 20
15MG COMP REV CT BL AL PLAS INC X 60

For n2 I had also quite a lot of missing values (1.062.967) generated but I could not figure out the reason, probably there is more than one reason such as the string being too long and letters attached to the number. Follow some examples:
0,25 MG COM CT BL AL PLAS TRANS X 30
20 MG COM REV CT BL AL PLAS INC X 28
2,5 MG/ML SOL OR CT FR VD AMB CGT X 20ML
10 MG COM REV CX BL AL PLAS TRANS X 30
25 MG COM CT BL AL PLAS TRANS X 20
25 MG COM REV CT BL AL PLAS TRANS X 20
20 MG CAP DURA CT BL AL PLAS TRANS X 28
25 MG COM REV CT BL AL PLAS TRANS X 28
2 MG COM CT BL AL PLAS INC X 20
2,5 MG/ML SOL OR CT FR VD AMB CGT X 20 ML
37,5 MG CAP DURA LIB PROL CT BL AL PLAS TRANS X30
25 MG COM REV CT BL AL PLAS INC X 20&NBSP;01

Do you think it is possible to fix at least some of these issues? Thanks again

Dear Bjarte Aagnes, thanks for you help but with your code I got more than 7million missing values out of 8million.
Comment

Ches Zin

Join Date: Jun 2023
Posts: 20

18 Jun 2023, 04:20

Dears,
I have a similar dataset with Marcelo. I want to calculate the number of daily dose from the string descriptions of dosing regimen (unstructured free text). My dataset has about 7 million observations and the string variables describing the dosing are many, more thousands of descriptions. The following is the example.

1-2 TAB PRN. MAX 4
1 TABLET 3 TIMES DAILY .
1-2 TABLET 2-3 TIMES DAILY
1-2 TABLET 2-3 TIMES DAILY
1 TABLET 3-4 TIMES DAILY
2 KAPSUL 4 TIMES DAILY
1 TABLETT 2-3 TIMES DAILY

I want to calculate the number of pills per day. Due to many descriptions, I have extracted only the numerical. So it becomes like this:

dosing	extracted numerical	required for calculation
1-2 TAB PRN. MAX 4	124	1*4
1 TABLET 3 TIMES DAILY .	1223	1*3
1-2 TABLET 2-3 TIMES DAILY	1223	1*3
1-2 TABLET 2-3 TIMES DAILY	1223	1*3
1 TABLET 3-4 TIMES DAILY	134	1*4
2 KAPSUL 4 TIMES DAILY	24	2*4
1 TABLETT 2-3 TIMES DAILY	123	1*3

For most observations, I need to multiply the first number with the last number to get the number of pills. I don't know how to do this as some of the extracted have 2 digits, 3 digits, 4 digits and etc...I also want to restrict the calculation is only for any of extracted numerical <=4. Appreciate any helps.
Thanks

Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1326

18 Jun 2023, 06:27

Consider this:

Code:

clear
input str40 dosage
"1-2 TAB PRN. MAX 4" 
"1 TABLET 3 TIMES DAILY ." 
"1-2 TABLET 2-3 TIMES DAILY" 
"1-2 TABLET 2-3 TIMES DAILY"
"1 TABLET 3-4 TIMES DAILY" 
"2 KAPSUL 4 TIMES DAILY"
"1 TABLETT 2-3 TIMES DAILY"
end

gen first_num = real(ustrregexs(1)) if ustrregexm(dosage,"([\d]+)")
gen last_num = real(ustrregexs(1)) if ustrregexm(ustrreverse(dosage),"([\d]+)")

which produces:

Code:

. list, noobs sep(0) abbrev(10)

  +---------------------------------------------------+
  |                     dosage   first_num   last_num |
  |---------------------------------------------------|
  |         1-2 TAB PRN. MAX 4           1          4 |
  |   1 TABLET 3 TIMES DAILY .           1          3 |
  | 1-2 TABLET 2-3 TIMES DAILY           1          3 |
  | 1-2 TABLET 2-3 TIMES DAILY           1          3 |
  |   1 TABLET 3-4 TIMES DAILY           1          4 |
  |     2 KAPSUL 4 TIMES DAILY           2          4 |
  |  1 TABLETT 2-3 TIMES DAILY           1          3 |
  +---------------------------------------------------+

Comment

Ches Zin

Join Date: Jun 2023

Posts: 20
#7

18 Jun 2023, 07:23

Hi Hemanshu Kumar,
Many thanks. Highly appreciated! It works for most of the observations. Only for some two digits numerical, it extracts both two and make it in reverse order. Example as below:

dosage first_num last_num
1-2 kapsul prn. Max 14 1 41

Appreciate if you could share how to resolve for the above to get only '4' extracted.
Thanks

Last edited by Ches Zin; 18 Jun 2023, 07:27.
Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1326

18 Jun 2023, 07:34

Ah of course. So here are three versions of the code for the last number, which produce different results (of which version 0 is the same as above, and is wrong)

Code:

gen last_num0 = real(ustrregexs(1)) if ustrregexm(ustrreverse(dosage),"([\d]+)")
gen last_num1 = real(ustrreverse(ustrregexs(1))) if ustrregexm(ustrreverse(dosage),"([\d]+)")
gen last_num2 = real(usubstr(ustrregexs(1), 1, 1)) if ustrregexm(ustrreverse(dosage),"([\d]+)")

which produce

Code:

. list, noobs sep(0) abbrev(10)

  +----------------------------------------------------------------------------+
  |                     dosage   first_num   last_num0   last_num1   last_num2 |
  |----------------------------------------------------------------------------|
  |         1-2 TAB PRN. MAX 4           1           4           4           4 |
  |   1 TABLET 3 TIMES DAILY .           1           3           3           3 |
  | 1-2 TABLET 2-3 TIMES DAILY           1           3           3           3 |
  | 1-2 TABLET 2-3 TIMES DAILY           1           3           3           3 |
  |   1 TABLET 3-4 TIMES DAILY           1           4           4           4 |
  |     2 KAPSUL 4 TIMES DAILY           2           4           4           4 |
  |  1 TABLETT 2-3 TIMES DAILY           1           3           3           3 |
  |     1-2 kapsul prn. Max 14           1          41          14           4 |
  +----------------------------------------------------------------------------+

Comment

Ches Zin

Join Date: Jun 2023

Posts: 20
#9

18 Jun 2023, 09:45

Dear Hemanshu,
Thanks again. It works!
Comment

Ches Zin

Join Date: Jun 2023
Posts: 20

#10

17 Jul 2023, 07:13

Dears,
I have the following datasets of pharmacy dosing that require me to calculate the number of tablets per days. I have done the extraction of the first number(first_num) and last number(last_num) and then multiply these two variables like my previous sharing in this forum, but it did not work for most of datasets. Then I tried to extract all the numerical characters from the free text of doser and I got the number as recorded in the ndd and I changed it one by one according to the ndd extracted. But this requires a lot of times and I have about 2 millions of these datasets. Then I think that maybe regex command for a certain word can help such as "1 capsule" but I don't know how to use this for a subsequent "1 capsule" and total up as the number of tablets per day. I appreciate any help on this. Thanks

doser	first_num	last_num	ndd	num of tab_day
1 capsule at 07:00, 1 capsule at 11:00, 1 capsule at 15:00, 1 capsule at 19:00, 1 capsule at 23:00.	1	0	1070011100115001190012300	5
1 capsule at 07:00, 1 capsule at 12:00.	1	0	1070011200	2
1 capsule at 07:00, 1 capsule at 12:00.	1	0	1070011200	2
1 tablet at 08:00, 3 tablet at 22:00. For pain	1	0	1080032200	4
1 capsule at 08, 1 capsule at 12, 1 capsule at 16, 1 capsule at 20 pain reliever	1	2	108112116120	4
0.5 tablet morning, 0.5 tablet evening for pain for 1 week after operation	0	1	5051	1
One tablet morning at 08:00, one tablet evening at 20:00 for pain	8	0	8002000	2
1 capsule 1-3 times per day as needed when treatment with Alvedon 665 mg and Oxycodone 5 mg is insufficient	1	5	1136655	3
2 tablet morning and evening for pain	2	2	2	4
One tablet 2 times daily	2	2	2	2
6 tablet morning, 5 tablet lunch, 5 tablet evening	6	5	655	16

Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1326

#11

17 Jul 2023, 09:02

You may want to do certain substitutions before you pick up the first and last numbers. For example:

Code:

gen test = ustrregexra(doser, "\d{2}\:\d{2}", "") // removes timings in hh:mm format
replace test = subinstr(lower(test), "one", "1", .) // substitutes all instances of the word "One" (case insensitive) with the digit 1

Comment

Ches Zin

Join Date: Jun 2023
Posts: 20

#12

17 Jul 2023, 11:34

Dear Hemanshu,
Million thanks for your help!. After I ran the command, there are still the timing that was recorded like at 08, at 12, etc.. not changed. I am not sure how to differentiate this timing from the number of tablets. Also, I don't know why the "one" is still not changed to "1" after the replace command. Appreciate your help. Also on removing the strength ie.665 mg and 5 mg. Thanks in advance!
Regards

test

1 capsule at , 1 capsule at , 1 capsule at , 1 capsule at , 1 capsule at .

1 capsule at , 1 capsule at .

1 tablet at , 3 tablet at . for pain

1 capsule at 08, 1 capsule at 12, 1 capsule at 16, 1 capsule at 20 pain reliever

0.5 tablet morning, 0.5 tablet evening for pain for 1 week after operation

one tablet morning at , one tablet evening at for pain

1 capsule 1-3 times per day as needed when treatment with alvedon 665 mg and oxycodone 5 mg is insufficient

2 tablet morning and evening for pain

one tablet 2 times daily

6 tablet morning, 5 tablet lunch, 5 tablet evening

Comment

Hemanshu Kumar

Join Date: Mar 2015

Posts: 1326
#13

17 Jul 2023, 11:41

Try

Code:

replace test = ustrregexra(test, "[\d]+\smg", "") replace test = ustrregexra(test, "\bat\s[\d]+", "") replace test = ustrregexra(lower(test), "\bone\b", "1")

Also, in #11, please remove the second line of code, to replace "one"; I just realised it will create false positives whenever the letters "one" appear as part of some other word, e.g. "alone". Use the third line of code given here, instead.

Last edited by Hemanshu Kumar; 17 Jul 2023, 12:04.
Comment
Ches Zin

Join Date: Jun 2023

Posts: 20
#14

17 Jul 2023, 12:22

Dear Hemanshu,
Brilliant! It works well. Thank you and highly appreciated.
Best,
Comment

Ches Zin

Join Date: Jun 2023
Posts: 20

#15

18 Jul 2023, 07:17

Dear Dr Hemanshu and Stata folks,
Thanks for your help in cleaning my dataset. I have run all the above useful suggested commands and it works well. Just further questions on certain dosing regimens as below. For the dosing that start with 0.5 tablet, the total should be 0.5 +0.5 =1. However, the extracted total number(tnum) was 5051 and it ended with total tablets of 11. For the dosing that start with "2 tablet morning and evening for pain", the total number should be 4 (i.e, 2 in the morning and 2 in the evening). However, the 2 in the evening was not included in this unstructured free text dosing regimen. Appreciate any help. Many thanks

dos_ndd	tnum	length	tndd	wanted
1 capsule , 1 capsule , 1 capsule , 1 capsule pain reliever	1111	4	4	4
0.5 tablet morning, 0.5 tablet evening for pain for 1 week after operation	5051	5	11	1
1 tablet morning at , 1 tablet evening at for pain	11	2	2	2
1 capsule 1-3 times per day as needed when treatment with alvedon and oxycodone is insufficient	113	3	3	3
2 tablet morning and evening for pain	2	1	2	4

Announcement