
  • How to use moss or regexm to find all occurrences of two patterns in a variable

    Code:
    input str100 result str100 f
    "PR: L63P A71T V77I"    "L63P A71T V77I"
    "RT: A98S K104R E122K I135V D177E T200A Q207E R211K L214F V245M"    "A98S K104R E122K I135V D177E T200A Q207E R211K"
    "PR: E35ED S37N R41K I72L"    "E35ED S37N R41K I72L"
    "ATV Mutations: A71T"    "A71T"
    "ATV/r Mutations: L63P A71T V77I"    "L63P A71T V77I"
    "DRV/r Mutations: L63P A71T V77I"    "L63P A71T V77I"
    "AMP Mutations: A71T"    "A71T"
    "AMP/r Mutations: A71T"  "A71T"
    "IDV Mutations: A71T V77I"    "A71T V77I"
    "IDV/r Mutations: A71T V77I"    "A71T V77I"
    "LPV/r Mutations: L63P A71T"    "L63P A71T"
    "NFV Mutations: A71T"    "A71T"
    "SQV/r Mutations: A71T V77I"    "A71T V77I"
    "Protease: L63P A71T/A"    "L63P A71T"
    "PR: L63P V77I"    "L63P V77I"
    "RT: E122K D123E I178L G196E T200I L214F V245E"    "E122K D123E I178L G196E T200I L214F V245E"
    end
    
    
    gen f = regexs(0) if (regexm(result, "([A-Z][0-9]+[A-Z]+)"))
    replace f = regexs(0) if (regexm(result, "([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+)"))
    replace f = regexs(0) if (regexm(result, "([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+)"))
    replace f = regexs(0) if (regexm(result, "([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+)"))
    replace f = regexs(0) if (regexm(result, "([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+)"))
    replace f = regexs(0) if (regexm(result, "([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+)"))
    replace f = regexs(0) if (regexm(result, "([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+)"))
    replace f = regexs(0) if (regexm(result, "([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+) ([A-Z][0-9]+[A-Z]+)"))
    f is what the second block of code generated, but I entered it as data for your convenience.
    I am facing two issues.
    I have one pattern, ([A-Z][0-9]+[A-Z]+), and another, ([A-Z][0-9]+[A-Z]+/[A-Z]+).
    I do not know whether regexm or moss can return all instances of more than one pattern at a time.
    The regexm series I used is a very inefficient way to extract all instances of only ONE pattern, and I believe there is also a limit: my real data contain as many as 13 such sequences per observation, while regexm may not go beyond 10? The error returned is "regexp: too many ()".

    In observation #14 I would like the command to return L63P and A71T/A; in other words, the command needs to look for both patterns.

  • #2
    I am aware that the second block of code is a folly but that is what I could think of before asking for help.



    • #3
      Hi Saurabh,

      I think this comes down to how you formulate the regular expression. You don't need to repeat the whole pattern; you can define a repeating sub-pattern inside your expression. The point you missed is that parentheses may be nested in regular expressions.

      This also means that, as your second pattern just adds "/[A-Z]+" to the first one, both can be combined into a single expression: "([A-Z][0-9]+[A-Z]+(/[A-Z]+)?)".

      The following code does the trick for me, at least if I understood your wish correctly:
      Code:
      clear
      input str100 result str100 f
      "PR: L63P A71T V77I"    "L63P A71T V77I"
      "RT: A98S K104R E122K I135V D177E T200A Q207E R211K L214F V245M"    "A98S K104R E122K I135V D177E T200A Q207E R211K"
      "PR: E35ED S37N R41K I72L"    "E35ED S37N R41K I72L"
      "ATV Mutations: A71T"    "A71T"
      "ATV/r Mutations: L63P A71T V77I"    "L63P A71T V77I"
      "DRV/r Mutations: L63P A71T V77I"    "L63P A71T V77I"
      "AMP Mutations: A71T"    "A71T"
      "AMP/r Mutations: A71T"  "A71T"
      "IDV Mutations: A71T V77I"    "A71T V77I"
      "IDV/r Mutations: A71T V77I"    "A71T V77I"
      "LPV/r Mutations: L63P A71T"    "L63P A71T"
      "NFV Mutations: A71T"    "A71T"
      "SQV/r Mutations: A71T V77I"    "A71T V77I"
      "Protease: L63P A71T/A"    "L63P A71T"
      "PR: L63P V77I"    "L63P V77I"
      "RT: E122K D123E I178L G196E T200I L214F V245E"    "E122K D123E I178L G196E T200I L214F V245E"
      end
      
      generate myf=regexs(1) if (regexm(result,"(([A-Z][0-9]+[A-Z]+(/[A-Z]+)? ?)+)"))
      list
      Does this help?
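
      If each mutation is needed in its own variable rather than as one string, the single-mutation pattern can also be applied repeatedly. This is only a minimal sketch, assuming the variable result from the example above; the names rest, mut1, mut2, ... and the locals pat, i, and n are made up for illustration. Each pass stores the leftmost remaining match and then removes it from a working copy, so the number of mutations per row is not tied to the number of parentheses in the expression:

      ```stata
      * Hypothetical helper names: rest, mut1, mut2, ...
      local pat "([A-Z][0-9]+[A-Z]+(/[A-Z]+)?)"
      * working copy of the string that is consumed match by match
      generate rest = result
      local i = 1
      count if regexm(rest, "`pat'")
      local n = r(N)
      while `n' > 0 {
          * leftmost remaining match in each row
          generate mut`i' = regexs(1) if regexm(rest, "`pat'")
          * remove that match so the next pass finds the following one
          replace rest = subinstr(rest, mut`i', "", 1) if mut`i' != ""
          local ++i
          count if regexm(rest, "`pat'")
          local n = r(N)
      }
      drop rest
      list result mut*
      ```

      For observation 14 this should give L63P in mut1 and A71T/A in mut2.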

      Regards
      Bela



      • #4
        Indeed it helps and I did miss that point of nesting expressions.
        Thank you Bela. The code extracts exactly what I wanted and the regexs(1) made sure to get all such instances.
        Thanks again.



        • #5
          I'm glad this helped. Just as a final remark: on second thought, you could also solve this without regular expressions at all. It seems to me that the part of the string you want to extract is always the part that follows the colon.

          If this is true, the following would also do the trick:
          Code:
          generate myf=trim(substr(result,strpos(result,":")+1,.))
          Regards
          Bela



          • #6
            Hi, I have a somewhat similar problem.
            I have strings that contain addresses in the form "Neighborhood Municipality". They look something like this:
            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input str56 origin
            "neighborhoodA municipalityA"                             
            "neighborhoodBA neighborhoodBB municipalityA"              
            "neighborhoodA municipalityB"                            
            "neighborhoodBA neighborhoodBB municipalityBA municipalityBB"
            end
            I need to extract the municipality from the string.
            The problem is that both the municipality name and the neighborhood name may consist of more than one word, so this may be a little complicated.
            I have the list of municipalities, so I'm using it to identify the name in the string.
            So far I have identified the municipalities with one-word names. Now I want to move on to municipalities with two-word names, and so on.
            So basically I need to be able to identify the last word of the string, then the last two words, and so on.
            I've tried the regex functions, but I still have problems using them. Any ideas?
            Thanks in advance!



            • #7
              #5 Bela, you are right but to keep the example simple, I did not include the more complex rows of data. There weren't always colons before the data of interest and conversely, colons were also followed by useless chunks of information. But thanks for teaching me the different approach; it is definitely an elegant solution for more standard data. Thank you so much.



              • #8
                Saurabh Chavan That being so, we may prefer a rather minimalistic approach that gives the same results:

                Code:
                split result, p(":")
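
                By default split names the new variables after the source variable, so here the pieces should land in result1 and result2. A minimal sketch of turning that into the wanted variable, assuming every row has at most one colon and at least one row has one (myf2 is a made-up name):

                ```stata
                * same command as above, written out in full; run it only once
                * creates result1 (text before the colon) and result2 (text after it)
                split result, parse(":")
                * keep the part after the colon, without surrounding blanks
                generate myf2 = trim(result2)
                list result myf2
                ```

                Rows without a colon end up with the whole string in result1 and an empty result2.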
                Last edited by Marcos Almeida; 04 Jan 2018, 07:31.
                Best regards,

                Marcos



                • #9
                  Marcos Almeida Thank you. I have used split before, when the fields were populated by standardized strings. In this case, however, the data are an entire lab result note split into multiple rows. Only some rows contain the genetic information I am looking for, and even then not everything in the row is mutations (the part I need to extract). On top of that, the mutations themselves follow a pattern that is sometimes generalizable and sometimes not. My real quandary, I suppose, was how best to generalize the expression so that regexs can be used successfully, given the variation in the patterns of the mutation sequences. Nevertheless, as I said, if the data are more standardized and contain less junk, split would definitely do the trick!
                  Thanks again.
                  Saurabh



                  • #10
                    Santiago Cantillo If your list of municipality names is not very long, an inelegant but functional solution might be strmatch()?
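
                    A minimal sketch of that idea, assuming the variable origin from #6 and a short, made-up list of municipality names; the variable names municipality, last1, and last2 are also hypothetical. Listing longer names first keeps a one-word name from claiming a row that ends in a two-word municipality:

                    ```stata
                    * hypothetical list of names; longest names first
                    generate municipality = ""
                    foreach m in "municipalityBA municipalityBB" "municipalityA" "municipalityB" {
                        * match rows that end in the name, or consist of it entirely
                        replace municipality = "`m'" if municipality == "" & (strmatch(origin, "* `m'") | origin == "`m'")
                    }
                    * last word and last two words, e.g. for checking names not on the list
                    generate last1 = word(origin, -1)
                    generate last2 = word(origin, -2) + " " + word(origin, -1) if wordcount(origin) >= 2
                    list origin municipality last1 last2
                    ```

                    A negative second argument to word() counts words from the end of the string, which covers the "last word, then last two words" request without regular expressions.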

