
  • String variables

    Hi STATALIST,

    I have a string variable and flagged ga68 in it with:


    Code:
    gen ga68 = (strpos(lower(var1), "ga") > 0) & ///
                       (strpos(lower(var1), "68") > 0)

    ga68 is 1 if var1 includes ga68 or 68ga.

    I need to pick up ga68 only, because the order is important. I can't simply search for the literal "ga68", because there could be spaces or non-numeric characters between ga and 68 (e.g. ga 68, ga _ 68, ...).

    Could you please let me know your advice?


    Regards,









    Last edited by Masoumeh Sanagou; 01 Oct 2020, 20:14.

  • #2
    Well, if anything could appear between the ga and 68 and all that matters is the order you can do:

    Code:
    gen ga68 = strmatch(lower(var1), "*ga*68*")
    But this will also pick up things like gab68.



    • #3
      How about:
      ga68=0 if digits or letters appear between them, and ga68=1 if only other characters (e.g. -, /, _) or spaces appear between them?

      Regards,



      • #4
        Originally posted by Masoumeh Sanagou:
        I need to pick up ga68 only because the order is important.
        Well, you're using the function strpos() and that gives you string position, and not only presence. Use the position information that the function gives you too in order to discern the relative position that you seek.
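
        A minimal sketch of this idea, using four made-up values of var1 for illustration. It compares only the first occurrence of each substring, and, like the strmatch() suggestion in #2, it still accepts letters or digits in between (e.g. gab68):

        ```stata
        * Hypothetical example values, not taken from the poster's data
        clear
        input str20 var1
        "ga68"
        "68ga"
        "ga _ 68"
        "gab68"
        end

        * 1 only when "ga" is present and the first "68" starts after the first "ga"
        gen byte ga68_ordered = strpos(lower(var1), "ga") > 0 & ///
            strpos(lower(var1), "68") > strpos(lower(var1), "ga")
        list
        ```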



        • #5
          Re #3: in this situation, you would need to use the regular expression functions in Stata. See -help regexm()-.
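
          A rough sketch of what that could look like, assuming the rule in #3 means "only characters other than letters and digits may sit between ga and 68". The four values of var1 are made up for illustration:

          ```stata
          * Hypothetical example values, not taken from the poster's data
          clear
          input str20 var1
          "ga68"
          "68ga"
          "ga _ 68"
          "gab68"
          end

          * "ga", then zero or more characters that are neither a-z nor 0-9, then "68"
          gen byte ga68 = regexm(lower(var1), "ga[^a-z0-9]*68")
          list
          ```

          Here lower() is applied first, so the class only needs to exclude lowercase letters and digits.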



          • #6
            Thanks for all advice.

            Code:
            gen a=regexm(lower(var1),  "[p][t][^a-z0-9]*[o][t][h][e][r]")
            
            gen b=regexm(lower(var1),  "[pt][^a-z0-9]*[o][t][h][e][r]")
            
            gen c=regexm(lower(var1),  "[p][t][^a-z0-9]*[other]")
            
            
            gen e=regexm(lower(var1),  "[pt][^a-z0-9]*[other]")

            Why is e not equal to a? (a=b=c)

            Regards,



            • #7
              It isn't apparent to me why e should not be the same as a, b, and c. Please use the -dataex- command to post some example data that illustrates the difference. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.



              • #8
                Thank you for the reply.




                Code:
                gen a=regexm(lower(var1), "[p][t][^a-z0-9]*[o][t][h][e][r]")
                gen b=regexm(lower(var1), "[pt][^a-z0-9]*[o][t][h][e][r]")
                gen c=regexm(lower(var1), "[p][t][^a-z0-9]*[other]")
                
                gen e=regexm(lower(var1), "[pt][^a-z0-9]*[other]")


                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str125 var1 float(a b c e)
                "CT Abdomen and Pelvis"                        0 0 0 1
                "CT Brain"                                     0 0 0 0
                "CT Injection"                                 0 0 0 0
                "CT Angiogram Abdominal Aorta"                 0 0 0 0
                "CT Angiogram Pulmonary CTPA"                  0 0 0 0
                "CT 4D Tracheomalacia Dynamic Airways"         0 0 0 1
                "CT Humerus Right"                             0 0 0 1
                "PT Other F Torso"                             1 1 1 1
                "PT Other F Torso+Diag CT"                     1 1 1 1
                "PT Other"                                     1 1 1 1
                end

                I really appreciate your time and help.
                Regards,



                • #9
                  If possible use the more general ustrregexm(), and other ustrregex functions, which use the ICU regex library documented at http://userguide.icu-project.org/strings/regexp

                  Why is e not equal to a?
                  You use a regex character class [] and a quantifier (*) in your pattern:

                  A character class [] accepts ANY ONE of the characters within the square brackets.

                  Thus, [pt] (one of "p" OR "t") is not the same as [p][t] (a "p" followed by a "t"), and your "[o][t][h][e][r]" is better expressed as the string "other".

                  Code:
                  di regexm("other", "[o][t][h][e][r]")
                  di regexm("other", "other")
                  di regexm("other", "[other]")
                  di regexm("r", "[other]")
                  Your pattern "[pt][^a-z0-9]*[other]" can be described as:
                  1. [pt] matches a single character that is "p" or "t"
                  2. [^a-z0-9] matches a single character that is NOT in a-z or 0-9
                  3. * repeats the preceding class zero or more times, as many times as possible
                  4. [other] matches a single character that is one of "o", "t", "h", "e", "r"
                  Thus the match in the string "ct abdomen and pelvis" will be the "pe" in "pelvis": a "p", then zero characters from [^a-z0-9], then an "e".
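
                  As a small sketch of the fix, the literal letters can be written without brackets, keeping a class only for the "anything but letters and digits" part. The test strings are lowercased values from the -dataex- example in #8:

                  ```stata
                  * Literal "pt", optional non-alphanumeric separators, then literal "other"
                  di ustrregexm("pt other f torso", "pt[^a-z0-9]*other")
                  di ustrregexm("ct abdomen and pelvis", "pt[^a-z0-9]*other")

                  * Display the substring that the bracketed pattern actually matched
                  if regexm("ct abdomen and pelvis", "[pt][^a-z0-9]*[other]") di regexs(0)
                  ```

                  The first line should display 1 and the second 0, while the last line shows which part of the string made e equal to 1.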
                  Last edited by Bjarte Aagnes; 04 Oct 2020, 04:53.

