  • Splitting strings by semicolon delimiter with few conditions

    Hello all:

    I am trying to split multiple treatment regimens in the CLLtx column and tally the number of treatment lines (clltxn). The regimens are separated by commas and semicolons, and most are acronyms. Observations with missing data, ".", "None", or "No treatment given" should have 0 treatment lines.
    I tried changing all commas to semicolons and then splitting, but the counts are wrong for some rows. In particular, when only one regimen was given (BR, one line of chemo, or radiation), the count stays at 0 instead of 1 because there is no semicolon to find.

    This is the best I could do, but you will see it miscodes "EBRT" as 0 because it could not find a semicolon (all treatments seem to end in an uppercase letter, and I don't know how to add another clause that codes 1 when the ending letter is uppercase).

    gen clltxbetter = ustrregexra(CLLtx, ",", ";") //replace all comma to semicolon
    gen clltxn = ustrregexm(clltxbetter, ";") //if semicolon then a match of 1
    replace clltxn = length(clltxbetter) - length(subinstr(clltxbetter, ";", "", .)) + 1 if clltxn == 1
    replace clltxn = 1 in 56 // CALGB trial is FCR and alemtuzumab
    order clltxn, after(CLLtx)

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str102(CLLtx clltxbetter) float clltxn
    "FCR, Ibrutinib"                                                                                         "FCR; Ibrutinib"                                                                                         2
    "FCO"                                                                                                    "FCO"                                                                                                    0
    ""                                                                                                       ""                                                                                                       0
    ""                                                                                                       ""                                                                                                       0
    ""                                                                                                       ""                                                                                                       0
    "Chlorambucil, Rituximab, Ibrutinib, Idelalisib/Rituximab, Venetoclax"                                   "Chlorambucil; Rituximab; Ibrutinib; Idelalisib/Rituximab; Venetoclax"                                   5
    "Ibrutinib"                                                                                              "Ibrutinib"                                                                                              0
    "PCR, BR, Ibrutinib"                                                                                     "PCR; BR; Ibrutinib"                                                                                     3
    "Ibrutinib, Entospletinib, Obintuzumab"                                                                  "Ibrutinib; Entospletinib; Obintuzumab"                                                                  3
    "BR"                                                                                                     "BR"                                                                                                     0
    "BR, FCR, Solumerol/Rituximab, Ibrutinib"                                                                "BR; FCR; Solumerol/Rituximab; Ibrutinib"                                                                4
    ""                                                                                                       ""                                                                                                       0
    "FCR, Venetoclax, ABT-199"                                                                               "FCR; Venetoclax; ABT-199"                                                                               3
    "BR, IVIG"                                                                                               "BR; IVIG"                                                                                               2
    "FCR"                                                                                                    "FCR"                                                                                                    0
    "EPOCH-R, DA-EPOCH-R"                                                                                    "EPOCH-R; DA-EPOCH-R"                                                                                    2
    "Prednisone/Chlorambucil, Rituximab, Ibrutinib"                                                          "Prednisone/Chlorambucil; Rituximab; Ibrutinib"                                                          3
    "FCR"                                                                                                    "FCR"                                                                                                    0
    ""                                                                                                       ""                                                                                                       0
    ""                                                                                                       ""                                                                                                       0
    "FCR"                                                                                                    "FCR"                                                                                                    0
    "Ibrutinib"                                                                                              "Ibrutinib"                                                                                              0
    "R-CVP, Rituximab (maintenance), FCR, Ibritumomab tiuxetan, Ibrutinib, ABT-199"                          "R-CVP; Rituximab (maintenance); FCR; Ibritumomab tiuxetan; Ibrutinib; ABT-199"                          6
    "Chlorambucil, Fludarabine/Rituximab, BR, Ibrutinib"                                                     "Chlorambucil; Fludarabine/Rituximab; BR; Ibrutinib"                                                     4
    ""                                                                                                       ""                                                                                                       0
    "FCR, BR"                                                                                                "FCR; BR"                                                                                                2
    "FCR, BR"                                                                                                "FCR; BR"                                                                                                2
    "Prednisone/Chlorambucil, FCR, BR"                                                                       "Prednisone/Chlorambucil; FCR; BR"                                                                       3
    ""                                                                                                       ""                                                                                                       0
    "BR"                                                                                                     "BR"                                                                                                     0
    "Fludarabine, FCR, Bendamustine, FC"                                                                     "Fludarabine; FCR; Bendamustine; FC"                                                                     4
    ""                                                                                                       ""                                                                                                       0
    ""                                                                                                       ""                                                                                                       0
    ""                                                                                                       ""                                                                                                       0
    "SCT"                                                                                                    "SCT"                                                                                                    0
    "Fludarabine/Rituximab, Alemtuzumab, RT, R-CHOP"                                                         "Fludarabine/Rituximab; Alemtuzumab; RT; R-CHOP"                                                         4
    ""                                                                                                       ""                                                                                                       0
    ""                                                                                                       ""                                                                                                       0
    "Fludarabine, Cyclophosphamide, RT, BR"                                                                  "Fludarabine; Cyclophosphamide; RT; BR"                                                                  4
    "FCR, Alemtuzumab, SCT"                                                                                  "FCR; Alemtuzumab; SCT"                                                                                  3
    "Chlorambucil/Prednisone/Allopurinol, Rituximab, CHOP, FCR, Alvocidib"                                   "Chlorambucil/Prednisone/Allopurinol; Rituximab; CHOP; FCR; Alvocidib"                                   5
    "Fludarabine/Cyclophosphamide, Alemtuzumab, Pentostatin/Cyclophosphamide/Rituximab"                      "Fludarabine/Cyclophosphamide; Alemtuzumab; Pentostatin/Cyclophosphamide/Rituximab"                      3
    ""                                                                                                       ""                                                                                                       0
    ""                                                                                                       ""                                                                                                       0
    "Chlorambucil/Allopurinol"                                                                               "Chlorambucil/Allopurinol"                                                                               0
    "Cyclophosphamide/Fludarabine"                                                                           "Cyclophosphamide/Fludarabine"                                                                           0
    "Fludarabine/Rituximab"                                                                                  "Fludarabine/Rituximab"                                                                                  0
    "FCR"                                                                                                    "FCR"                                                                                                    0
    ""                                                                                                       ""                                                                                                       0
    "FCR, Chlorambucil"                                                                                      "FCR; Chlorambucil"                                                                                      2
    ""                                                                                                       ""                                                                                                       0
    "None"                                                                                                   "None"                                                                                                   0
    "Cetuximab, Cetuximab/Fludarabine, Rituximab/Fludarabine/Cyclophosphamide, Rituximab, R-CHOP"            "Cetuximab; Cetuximab/Fludarabine; Rituximab/Fludarabine/Cyclophosphamide; Rituximab; R-CHOP"            5
    ""                                                                                                       ""                                                                                                       0
    "Rituximab, Chlorambucil, Fludarabine/Cyclophosphamide, Prednisone/Vincristine/Rituximab"                "Rituximab; Chlorambucil; Fludarabine/Cyclophosphamide; Prednisone/Vincristine/Rituximab"                4
    "CALGB trial 10101"                                                                                      "CALGB trial 10101"                                                                                      1
    "Cyclophosphamide, R-CVP"                                                                                "Cyclophosphamide; R-CVP"                                                                                2
    ""                                                                                                       ""                                                                                                       0
    "Fludarabine/Rituximab, Fludarabine/Alemtuzumab"                                                         "Fludarabine/Rituximab; Fludarabine/Alemtuzumab"                                                         2
    ""                                                                                                       ""                                                                                                       0
    "Fludarabine/Cyclophosphamide"                                                                           "Fludarabine/Cyclophosphamide"                                                                           0
    "CHOP, R-CHOP, FCR"                                                                                      "CHOP; R-CHOP; FCR"                                                                                      3
    ""                                                                                                       ""                                                                                                       0
    "Fludarabine, Rituximab, Fludarabine/Rituximab, CHOP, Alemtuzumab, Cyclophosphamide/Rituximab/Steroids"  "Fludarabine; Rituximab; Fludarabine/Rituximab; CHOP; Alemtuzumab; Cyclophosphamide/Rituximab/Steroids"  6
    "Chlorambucil/Prednisone, Fludarabine/Oblimersen/Cyclophosphamide"                                       "Chlorambucil/Prednisone; Fludarabine/Oblimersen/Cyclophosphamide"                                       2
    ""                                                                                                       ""                                                                                                       0
    "FCR"                                                                                                    "FCR"                                                                                                    0
    ""                                                                                                       ""                                                                                                       0
    "Fludarabine"                                                                                            "Fludarabine"                                                                                            0
    "IVIG/Prednisone, Fludarabine/Rituximab, Vincristine/Prednisone"                                         "IVIG/Prednisone; Fludarabine/Rituximab; Vincristine/Prednisone"                                         3
    "CHOPP, Fludarabine, Cyclophosphamide, Vincristine, Rituximab"                                           "CHOPP; Fludarabine; Cyclophosphamide; Vincristine; Rituximab"                                           5
    "Chlorambucil/Prednisone, CHOP, Fludarabine/Cyclophosphamide, Fludarabine"                               "Chlorambucil/Prednisone; CHOP; Fludarabine/Cyclophosphamide; Fludarabine"                               4
    ""                                                                                                       ""                                                                                                       0
    "Fludarabine/Mitoxantrone, Fludarabine/Rituximab"                                                        "Fludarabine/Mitoxantrone; Fludarabine/Rituximab"                                                        2
    "Fludarabine"                                                                                            "Fludarabine"                                                                                            0
    ""                                                                                                       ""                                                                                                       0
    "Fludarabine/Rituximab, Bendomustine/Rituximab, Imatinib, Ibrutinib/Nilotinib, Nilotinib"                "Fludarabine/Rituximab; Bendomustine/Rituximab; Imatinib; Ibrutinib/Nilotinib; Nilotinib"                5
    "Ibrutinib, R-CHOP"                                                                                      "Ibrutinib; R-CHOP"                                                                                      2
    "BR"                                                                                                     "BR"                                                                                                     0
    "Alemtuzumab/Ofatumomab, Ibrutinib, Venetoclax, Ibrutinib/Venetoclax, Obinutuzumab/Ibrutinib/Venetoclax" "Alemtuzumab/Ofatumomab; Ibrutinib; Venetoclax; Ibrutinib/Venetoclax; Obinutuzumab/Ibrutinib/Venetoclax" 5
    "BR"                                                                                                     "BR"                                                                                                     0
    "Chemotherapy (type not specified)"                                                                      "Chemotherapy (type not specified)"                                                                      0
    ""                                                                                                       ""                                                                                                       0
    "BR, R-CHOP, Ibrutinib, Venetoclax/Rituximab"                                                            "BR; R-CHOP; Ibrutinib; Venetoclax/Rituximab"                                                            4
    "Fludarabine/Cyclophosphamide"                                                                           "Fludarabine/Cyclophosphamide"                                                                           0
    ""                                                                                                       ""                                                                                                       0
    "BR, Ibrutinib, Venetoclax"                                                                              "BR; Ibrutinib; Venetoclax"                                                                              3
    "Ibrutinib, Acalabrutinib"                                                                               "Ibrutinib; Acalabrutinib"                                                                               2
    "FCR, Ibrutinib"                                                                                         "FCR; Ibrutinib"                                                                                         2
    "Ofatumumab/Fludarabine/Cyclophosphamide, Acalabrutinib"                                                 "Ofatumumab/Fludarabine/Cyclophosphamide; Acalabrutinib"                                                 2
    "FCR, Ibrutinib, Venetoclax"                                                                             "FCR; Ibrutinib; Venetoclax"                                                                             3
    "Ibrutinib"                                                                                              "Ibrutinib"                                                                                              0
    "Venetoclax/Ibrutinib"                                                                                   "Venetoclax/Ibrutinib"                                                                                   0
    "Acalabrutinib"                                                                                          "Acalabrutinib"                                                                                          0
    ""                                                                                                       ""                                                                                                       0
    "FCR, Rituximab"                                                                                         "FCR; Rituximab"                                                                                         2
    "Chlorambucil/Obinutuzumab"                                                                              "Chlorambucil/Obinutuzumab"                                                                              0
    "EBRT"                                                                                                   "EBRT"                                                                                                   0
    "Chemotherapy (type not specified)"                                                                      "Chemotherapy (type not specified)"                                                                      0
    "Ibrutinib, Venetoclax"                                                                                  "Ibrutinib; Venetoclax"                                                                                  2
    end

  • #2
    Code:
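    * Strip both delimiters; each one removed shortens the string by one character
    gen shorter = subinstr(CLLtx, ",", "", .)
    replace shorter = subinstr(shorter, ";", "", .)
    * Number of regimens = number of delimiters + 1
    gen clltxn = strlen(CLLtx) - strlen(shorter) + 1
    * No treatment recorded means zero treatment lines
    replace clltxn = 0 if inlist(CLLtx, "", "None", "No treatment given")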
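
To see why the length difference counts regimens, here is a minimal sketch on a handful of made-up rows (they are hypothetical, not the poster's data). The "." entry and the strtrim() call are assumptions added here to cover the missing-data code mentioned in the question; they are not part of the answer above.

```stata
* Toy data: hypothetical rows that mimic the CLLtx variable
clear
input str40 CLLtx
"FCR, Ibrutinib"
"EBRT"
""
"None"
"."
"PCR; BR, Ibrutinib"
end

* Remove commas and semicolons; the drop in length equals the delimiter count
gen shorter = subinstr(CLLtx, ",", "", .)
replace shorter = subinstr(shorter, ";", "", .)

* One more regimen than delimiters, so a lone "EBRT" is counted as 1
gen clltxn = strlen(CLLtx) - strlen(shorter) + 1

* Assumption: "." and stray surrounding blanks also mean no treatment
replace clltxn = 0 if inlist(strtrim(CLLtx), "", ".", "None", "No treatment given")

list CLLtx clltxn, clean
```

Because the count is driven by the delimiters rather than by the text of each regimen, single-regimen rows need no special rule; only the "no treatment" values have to be listed explicitly.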



    • #3
      I learnt something. That totally worked, Clyde. Thanks much. On a different note, how do I get rid of those strange characters, shown as a question mark inside a square box, in the list window? Are these spaces that need to be trimmed?
      Last edited by Girish Venkataraman; 14 Feb 2022, 20:02. Reason: posted by accident before completion.



      • #4
        They are various non-printing characters, and they are not, in general, all the same. To get rid of them, you first need to find out exactly what they are. For that, I recommend the -chartab- program, by Robert Picard, available from SSC. Run it and you will get a complete listing of all of the characters that appear in the variables you specify, along with a helpful description of what each one "means." You can then decide which ones you want to get rid of and write a loop over those to remove them. I don't have a data set handy that presents this problem, but here's some code that shows the approach, illustrated by replacing hyphens and periods with spaces in the make variable of the auto.dta data set:

        Code:
        . sysuse auto, clear
        (1978 automobile data)
        
        . chartab make
        
           decimal  hexadecimal   character |     frequency    unique name
        ------------------------------------+----------------------------------------
                32       \u0020             |            81    SPACE
                45       \u002d       -     |             1    HYPHEN-MINUS
                46       \u002e       .     |            30    FULL STOP
                48       \u0030       0     |            11    DIGIT ZERO
                49       \u0031       1     |             3    DIGIT ONE
                50       \u0032       2     |             4    DIGIT TWO
                51       \u0033       3     |             1    DIGIT THREE
                52       \u0034       4     |             1    DIGIT FOUR
                53       \u0035       5     |             2    DIGIT FIVE
                54       \u0036       6     |             2    DIGIT SIX
                55       \u0037       7     |             1    DIGIT SEVEN
                56       \u0038       8     |             4    DIGIT EIGHT
                57       \u0039       9     |             1    DIGIT NINE
                65       \u0041       A     |             7    LATIN CAPITAL LETTER A
                66       \u0042       B     |             9    LATIN CAPITAL LETTER B
                67       \u0043       C     |            29    LATIN CAPITAL LETTER C
                68       \u0044       D     |            13    LATIN CAPITAL LETTER D
                69       \u0045       E     |             2    LATIN CAPITAL LETTER E
                70       \u0046       F     |             6    LATIN CAPITAL LETTER F
                71       \u0047       G     |             2    LATIN CAPITAL LETTER G
                72       \u0048       H     |             3    LATIN CAPITAL LETTER H
                73       \u0049       I     |             1    LATIN CAPITAL LETTER I
                76       \u004c       L     |             7    LATIN CAPITAL LETTER L
                77       \u004d       M     |            20    LATIN CAPITAL LETTER M
                78       \u004e       N     |             1    LATIN CAPITAL LETTER N
                79       \u004f       O     |             9    LATIN CAPITAL LETTER O
                80       \u0050       P     |            15    LATIN CAPITAL LETTER P
                82       \u0052       R     |             6    LATIN CAPITAL LETTER R
                83       \u0053       S     |            12    LATIN CAPITAL LETTER S
                84       \u0054       T     |             4    LATIN CAPITAL LETTER T
                86       \u0056       V     |             8    LATIN CAPITAL LETTER V
                87       \u0057       W     |             5    LATIN CAPITAL LETTER W
                88       \u0058       X     |             1    LATIN CAPITAL LETTER X
                90       \u005a       Z     |             1    LATIN CAPITAL LETTER Z
                97       \u0061       a     |            62    LATIN SMALL LETTER A
                98       \u0062       b     |             8    LATIN SMALL LETTER B
                99       \u0063       c     |            28    LATIN SMALL LETTER C
               100       \u0064       d     |            30    LATIN SMALL LETTER D
               101       \u0065       e     |            53    LATIN SMALL LETTER E
               102       \u0066       f     |             1    LATIN SMALL LETTER F
               103       \u0067       g     |            11    LATIN SMALL LETTER G
               104       \u0068       h     |            12    LATIN SMALL LETTER H
               105       \u0069       i     |            41    LATIN SMALL LETTER I
               107       \u006b       k     |            10    LATIN SMALL LETTER K
               108       \u006c       l     |            40    LATIN SMALL LETTER L
               109       \u006d       m     |            10    LATIN SMALL LETTER M
               110       \u006e       n     |            34    LATIN SMALL LETTER N
               111       \u006f       o     |            55    LATIN SMALL LETTER O
               112       \u0070       p     |             9    LATIN SMALL LETTER P
               113       \u0071       q     |             1    LATIN SMALL LETTER Q
               114       \u0072       r     |            46    LATIN SMALL LETTER R
               115       \u0073       s     |            22    LATIN SMALL LETTER S
               116       \u0074       t     |            37    LATIN SMALL LETTER T
               117       \u0075       u     |            27    LATIN SMALL LETTER U
               118       \u0076       v     |            13    LATIN SMALL LETTER V
               119       \u0077       w     |             1    LATIN SMALL LETTER W
               120       \u0078       x     |             3    LATIN SMALL LETTER X
               121       \u0079       y     |            11    LATIN SMALL LETTER Y
               122       \u007a       z     |             3    LATIN SMALL LETTER Z
        ------------------------------------+----------------------------------------
        
                                            freq. count   distinct
        ASCII characters              =             871         59
        Multibyte UTF-8 characters    =               0          0
        Unicode replacement character =               0          0
        Total Unicode characters      =             871         59
        
        
        . foreach n of numlist 45 46 {
          2. replace make = subinstr(make, "`=char(`n')'", " ", .)
          3. }
        (1 real change made)
        (30 real changes made)
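
The same loop can be sketched for characters beyond printable ASCII. The version below assumes, purely for illustration, that -chartab- reported decimal codes 9 (tab) and 160 (no-break space); the variable s and its contents are made up. It swaps in uchar() and usubinstr(), the Unicode-aware counterparts of char() and subinstr(), since code points above 127 occupy more than one byte.

```stata
* Toy data: hypothetical strings with an embedded tab and a no-break space
clear
set obs 2
gen str20 s = "FCR" + uchar(9) + "BR" in 1
replace s = "Ibrutinib" + uchar(160) + "x" in 2

* Loop over the decimal code points to clean up (assumed values)
foreach n of numlist 9 160 {
    * uchar() builds the character from its code point;
    * a missing last argument replaces every occurrence
    replace s = usubinstr(s, uchar(`n'), " ", .)
}

list s, clean
```

Replacing with a space rather than an empty string keeps adjacent words apart; use "" instead if the character should simply disappear.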




        • #5
          This is so useful to know. I will implement this in my dataset and trim those characters as appropriate. I sure learn something every day. Thanks once again.



          • #6
            I installed the chartab program and identified that character to be a "REPLACEMENT CHARACTER" with decimal number as below
            65,533 \ufffd � | 12 REPLACEMENT CHARACTER

            I tried the code:

            replace CLLtx = usubinstr(CLLtx, "=char(65,533)", " ", .)

            But that does not remove the character even though the code does not throw an error and says 9 changes made.



            • #7
              Ah, those are Unicode characters, and you already identified that you need to use -usubinstr()- in place of -subinstr()-. The problem is that "=char(65,533)" is wrong in several ways. First, the whole expression needs to be wrapped inside local macro quotes (`...'). Second, you cannot use a comma in the number. Third, you have to use -uchar()- instead of -char()-.

              Code:
              replace CLLtx = usubinstr(CLLtx, "`=uchar(65533)'", "", .)
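
Before removing anything, it can help to see which observations are affected. The following is a minimal sketch on made-up data; the flag variable has_fffd is a hypothetical name, and ustrpos() is used here only as one way to test for the character.

```stata
* Toy data: one hypothetical row containing U+FFFD, one clean row
clear
set obs 2
gen str40 CLLtx = "FCR" + uchar(65533) + ", Ibrutinib" in 1
replace CLLtx = "BR" in 2

* Flag rows that contain the replacement character (has_fffd is illustrative)
gen byte has_fffd = ustrpos(CLLtx, uchar(65533)) > 0
list CLLtx if has_fffd, clean

* Remove every occurrence, then confirm that none remain
replace CLLtx = usubinstr(CLLtx, uchar(65533), "", .)
assert ustrpos(CLLtx, uchar(65533)) == 0
```

The replacement character usually marks text whose original encoding was lost, so inspecting the flagged rows first shows whether deleting it is enough or whether the source file should be re-imported with the correct encoding.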



              • #8
                Thanks once again for the encouragement and corrections to my code. This worked.
