Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Mathematical operations within a string variable

    Dears,

    I have a dataset with data on pharmaceutical supplies with around 7million observations. One of the variables is on the pharmaceutical presentation, and I have around 1200 different types of presentations. I want to go from the pharmaceutical presentation to the total amount of substance per box
    The variable is in string as follows and I want to multiply the first number (amount of substance per pill) per the number in the end of the string (number of pills in a box)

    40 MG COM REV CT BL AL PLAS TRANS X 10
    40 MG COM REV CT BL AL PLAS TRANS X 20
    40 MG COM REV CT BL AL PLAS TRANS X 30
    5 MG COM CT 2 BL AL PLAS INC X 10
    5 MG COM CT BL AL / AL X 20
    5 MG COM CT BL AL / AL X 30
    5 MG COM CT BL AL PLAS BCO LEIT X 20
    5 MG COM CT BL AL PLAS INC X 100
    5 MG COM CT BL AL PLAS INC X 20
    5 MG COM CT BL AL PLAS INC X 20
    5 MG COM CT BL AL PLAS INC X 30
    5 MG COM CT BL AL PLAS OPC X 20
    5 MG COM CT BL AL PLAS OPC X 30

    Problem 2: some of thee strings (less than 10%) have information in parenthesis that do not interest me in the end, i.e:
    20 MG COM DISP CT 2 BL AL PLAS INC X 14 (PORT 344/98 L - C1)
    25 MG CAP GEL DURA CT BL AL PLAS INC X 50 (EMB HOSP)

    Do you have any suggestions on how to solve it?

    Thanks so much
    Last edited by Marcelo Dell; 31 Aug 2021, 08:46.

  • #2
    Assuming that the second number is always the next one after the "X"
    Code:
    clear
    input str40 s
    "40 MG COM REV CT BL AL PLAS TRANS X 10"
    "40 MG COM REV CT BL AL PLAS TRANS X 20"
    "40 MG COM REV CT BL AL PLAS TRANS X 30"
    "5 MG COM CT 2 BL AL PLAS INC X 10"
    "5 MG COM CT BL AL / AL X 20"
    "5 MG COM CT BL AL / AL X 30"
    "5 MG COM CT BL AL PLAS BCO LEIT X 20"
    "5 MG COM CT BL AL PLAS INC X 100"
    "5 MG COM CT BL AL PLAS INC X 20"
    "5 MG COM CT BL AL PLAS INC X 20"
    "5 MG COM CT BL AL PLAS INC X 30"
    "5 MG COM CT BL AL PLAS OPC X 20"
    "5 MG COM CT BL AL PLAS OPC X 30"
    end
    //
    replace s = lower(s) // avoid case problems
    gen n1 = real(word(s,1))
    gen n2 = .
    // n2 assumed at next position past "x"
    local maxword = 100 // should be big enough
    forval i = 1/`maxword' {
        quiet replace n2 = real(word(s,`i' + 1)) ///
              if (word(s,`i') == "x")
    }
    gen npill = n1 * n2

    Comment


    • #3
      Code:
      gen p = real(ustrregexs(1))*real(ustrregexs(2)) if ustrregexm(ustrregexra(s,"\s\(.*?\)",""), "(^\d+)\s.*\s(\d+$)")

      Comment


      • #4
        Dear Mike Lacy,

        Thank you so much, you helped a lot but there are still some issues going on.

        For n1 I got 1,516,658 missing numbers out of roughly 8million observations. I believe it was due to decimal numbers and when MG came attached to the number
        , i.e:
        0,25 MG COM CT BL AL PLAS INC X 20
        0,25 MG COM CT BL AL PLAS INC X 30
        2,0 MG COM CT BL AL PLAS TRANS X 30
        3,0 MG COM CT BL AL PLAS TRANS X 20
        15MG COMP REV CT BL AL PLAS INC X 60

        For n2 I had also quite a lot of missing values (1.062.967) generated but I could not figure out the reason, probably there is more than one reason such as the string being too long and letters attached to the number. Follow some examples:
        0,25 MG COM CT BL AL PLAS TRANS X 30
        20 MG COM REV CT BL AL PLAS INC X 28
        2,5 MG/ML SOL OR CT FR VD AMB CGT X 20ML
        10 MG COM REV CX BL AL PLAS TRANS X 30
        25 MG COM CT BL AL PLAS TRANS X 20
        25 MG COM REV CT BL AL PLAS TRANS X 20
        20 MG CAP DURA CT BL AL PLAS TRANS X 28
        25 MG COM REV CT BL AL PLAS TRANS X 28
        2 MG COM CT BL AL PLAS INC X 20
        2,5 MG/ML SOL OR CT FR VD AMB CGT X 20 ML
        37,5 MG CAP DURA LIB PROL CT BL AL PLAS TRANS X30
        25 MG COM REV CT BL AL PLAS INC X 20 01

        Do you think it is possible to fix at least some of these issues? Thanks again


        Dear Bjarte Aagnes, thanks for you help but with your code I got more than 7million missing values out of 8million.

        Comment


        • #5
          Dears,
          I have a similar dataset with Marcelo. I want to calculate the number of daily dose from the string descriptions of dosing regimen (unstructured free text). My dataset has about 7 million observations and the string variables describing the dosing are many, more thousands of descriptions. The following is the example.


          1-2 TAB PRN. MAX 4
          1 TABLET 3 TIMES DAILY .
          1-2 TABLET 2-3 TIMES DAILY
          1-2 TABLET 2-3 TIMES DAILY
          1 TABLET 3-4 TIMES DAILY
          2 KAPSUL 4 TIMES DAILY
          1 TABLETT 2-3 TIMES DAILY

          I want to calculate the number of pills per day. Due to many descriptions, I have extracted only the numerical. So it becomes like this:
          dosing extracted numerical required for calculation
          1-2 TAB PRN. MAX 4 124 1*4
          1 TABLET 3 TIMES DAILY . 1223 1*3
          1-2 TABLET 2-3 TIMES DAILY 1223 1*3
          1-2 TABLET 2-3 TIMES DAILY 1223 1*3
          1 TABLET 3-4 TIMES DAILY 134 1*4
          2 KAPSUL 4 TIMES DAILY 24 2*4
          1 TABLETT 2-3 TIMES DAILY 123 1*3
          For most observations, I need to multiply the first number with the last number to get the number of pills. I don't know how to do this as some of the extracted have 2 digits, 3 digits, 4 digits and etc...I also want to restrict the calculation is only for any of extracted numerical <=4. Appreciate any helps.
          Thanks



          Comment


          • #6
            Consider this:
            Code:
            clear
            input str40 dosage
            "1-2 TAB PRN. MAX 4" 
            "1 TABLET 3 TIMES DAILY ." 
            "1-2 TABLET 2-3 TIMES DAILY" 
            "1-2 TABLET 2-3 TIMES DAILY"
            "1 TABLET 3-4 TIMES DAILY" 
            "2 KAPSUL 4 TIMES DAILY"
            "1 TABLETT 2-3 TIMES DAILY"
            end
            
            gen first_num = real(ustrregexs(1)) if ustrregexm(dosage,"([\d]+)")
            gen last_num = real(ustrregexs(1)) if ustrregexm(ustrreverse(dosage),"([\d]+)")
            which produces:
            Code:
            . list, noobs sep(0) abbrev(10)
            
              +---------------------------------------------------+
              |                     dosage   first_num   last_num |
              |---------------------------------------------------|
              |         1-2 TAB PRN. MAX 4           1          4 |
              |   1 TABLET 3 TIMES DAILY .           1          3 |
              | 1-2 TABLET 2-3 TIMES DAILY           1          3 |
              | 1-2 TABLET 2-3 TIMES DAILY           1          3 |
              |   1 TABLET 3-4 TIMES DAILY           1          4 |
              |     2 KAPSUL 4 TIMES DAILY           2          4 |
              |  1 TABLETT 2-3 TIMES DAILY           1          3 |
              +---------------------------------------------------+

            Comment


            • #7
              Hi Hemanshu Kumar,
              Many thanks. Highly appreciated! It works for most of the observations. Only for some two digits numerical, it extracts both two and make it in reverse order. Example as below:

              dosage first_num last_num
              1-2 kapsul prn. Max 14 1 41

              Appreciate if you could share how to resolve for the above to get only '4' extracted.
              Thanks
              Last edited by Ches Zin; 18 Jun 2023, 07:27.

              Comment


              • #8
                Ah of course. So here are three versions of the code for the last number, which produce different results (of which version 0 is the same as above, and is wrong)

                Code:
                gen last_num0 = real(ustrregexs(1)) if ustrregexm(ustrreverse(dosage),"([\d]+)")
                gen last_num1 = real(ustrreverse(ustrregexs(1))) if ustrregexm(ustrreverse(dosage),"([\d]+)")
                gen last_num2 = real(usubstr(ustrregexs(1), 1, 1)) if ustrregexm(ustrreverse(dosage),"([\d]+)")
                which produce
                Code:
                . list, noobs sep(0) abbrev(10)
                
                  +----------------------------------------------------------------------------+
                  |                     dosage   first_num   last_num0   last_num1   last_num2 |
                  |----------------------------------------------------------------------------|
                  |         1-2 TAB PRN. MAX 4           1           4           4           4 |
                  |   1 TABLET 3 TIMES DAILY .           1           3           3           3 |
                  | 1-2 TABLET 2-3 TIMES DAILY           1           3           3           3 |
                  | 1-2 TABLET 2-3 TIMES DAILY           1           3           3           3 |
                  |   1 TABLET 3-4 TIMES DAILY           1           4           4           4 |
                  |     2 KAPSUL 4 TIMES DAILY           2           4           4           4 |
                  |  1 TABLETT 2-3 TIMES DAILY           1           3           3           3 |
                  |     1-2 kapsul prn. Max 14           1          41          14           4 |
                  +----------------------------------------------------------------------------+

                Comment


                • #9
                  Dear Hemanshu,
                  Thanks again. It works!

                  Comment


                  • #10
                    Dears,
                    I have the following datasets of pharmacy dosing that require me to calculate the number of tablets per days. I have done the extraction of the first number(first_num) and last number(last_num) and then multiply these two variables like my previous sharing in this forum, but it did not work for most of datasets. Then I tried to extract all the numerical characters from the free text of doser and I got the number as recorded in the ndd and I changed it one by one according to the ndd extracted. But this requires a lot of times and I have about 2 millions of these datasets. Then I think that maybe regex command for a certain word can help such as "1 capsule" but I don't know how to use this for a subsequent "1 capsule" and total up as the number of tablets per day. I appreciate any help on this. Thanks
                    doser first_num last_num ndd num of tab_day
                    1 capsule at 07:00, 1 capsule at 11:00, 1 capsule at 15:00, 1 capsule at 19:00, 1 capsule at 23:00. 1 0 1070011100115001190012300 5
                    1 capsule at 07:00, 1 capsule at 12:00. 1 0 1070011200 2
                    1 capsule at 07:00, 1 capsule at 12:00. 1 0 1070011200 2
                    1 tablet at 08:00, 3 tablet at 22:00. For pain 1 0 1080032200 4
                    1 capsule at 08, 1 capsule at 12, 1 capsule at 16, 1 capsule at 20 pain reliever 1 2 108112116120 4
                    0.5 tablet morning, 0.5 tablet evening for pain for 1 week after operation 0 1 5051 1
                    One tablet morning at 08:00, one tablet evening at 20:00 for pain 8 0 8002000 2
                    1 capsule 1-3 times per day as needed when treatment with Alvedon 665 mg and Oxycodone 5 mg is insufficient 1 5 1136655 3
                    2 tablet morning and evening for pain 2 2 2 4
                    One tablet 2 times daily 2 2 2 2
                    6 tablet morning, 5 tablet lunch, 5 tablet evening 6 5 655 16

                    Comment


                    • #11
                      You may want to do certain substitutions before you pick up the first and last numbers. For example:

                      Code:
                      gen test = ustrregexra(doser, "\d{2}\:\d{2}", "") // removes timings in hh:mm format
                      replace test = subinstr(lower(test), "one", "1", .) // substitutes all instances of the word "One" (case insensitive) with the digit 1

                      Comment


                      • #12
                        Dear Hemanshu,
                        Million thanks for your help!. After I ran the command, there are still the timing that was recorded like at 08, at 12, etc.. not changed. I am not sure how to differentiate this timing from the number of tablets. Also, I don't know why the "one" is still not changed to "1" after the replace command. Appreciate your help. Also on removing the strength ie.665 mg and 5 mg. Thanks in advance!
                        Regards


                        test
                        1 capsule at , 1 capsule at , 1 capsule at , 1 capsule at , 1 capsule at .
                        1 capsule at , 1 capsule at .
                        1 capsule at , 1 capsule at .
                        1 tablet at , 3 tablet at . for pain
                        1 capsule at 08, 1 capsule at 12, 1 capsule at 16, 1 capsule at 20 pain reliever
                        0.5 tablet morning, 0.5 tablet evening for pain for 1 week after operation
                        one tablet morning at , one tablet evening at for pain
                        1 capsule 1-3 times per day as needed when treatment with alvedon 665 mg and oxycodone 5 mg is insufficient
                        2 tablet morning and evening for pain
                        one tablet 2 times daily
                        6 tablet morning, 5 tablet lunch, 5 tablet evening

                        Comment


                        • #13
                          Try
                          Code:
                          replace test = ustrregexra(test, "[\d]+\smg", "")
                          replace test = ustrregexra(test, "\bat\s[\d]+", "")
                          replace test = ustrregexra(lower(test), "\bone\b", "1")
                          Also, in #11, please remove the second line of code, to replace "one"; I just realised it will create false positives whenever the letters "one" appear as part of some other word, e.g. "alone". Use the third line of code given here, instead.
                          Last edited by Hemanshu Kumar; 17 Jul 2023, 12:04.

                          Comment


                          • #14
                            Dear Hemanshu,
                            Brilliant! It works well. Thank you and highly appreciated.
                            Best,

                            Comment


                            • #15
                              Dear Dr Hemanshu and Stata folks,
                              Thanks for your help in cleaning my dataset. I have run all the above useful suggested commands and it works well. Just further questions on certain dosing regimens as below. For the dosing that start with 0.5 tablet, the total should be 0.5 +0.5 =1. However, the extracted total number(tnum) was 5051 and it ended with total tablets of 11. For the dosing that start with "2 tablet morning and evening for pain", the total number should be 4 (i.e, 2 in the morning and 2 in the evening). However, the 2 in the evening was not included in this unstructured free text dosing regimen. Appreciate any help. Many thanks


                              dos_ndd tnum length tndd wanted
                              1 capsule , 1 capsule , 1 capsule , 1 capsule pain reliever 1111 4 4 4
                              0.5 tablet morning, 0.5 tablet evening for pain for 1 week after operation 5051 5 11 1
                              1 tablet morning at , 1 tablet evening at for pain 11 2 2 2
                              1 capsule 1-3 times per day as needed when treatment with alvedon and oxycodone is insufficient 113 3 3 3
                              2 tablet morning and evening for pain 2 1 2 4

                              Comment

                              Working...
                              X