Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Extracting last part of a string that occurs after a comma

    Hi

    I have data stored under variable name ad (str199) which looks like this

    1 cawood av, kathu, kathu, northern cape
    1 hanger 30, 11 lanseria airport st, lanseria airport, randburg, gauteng
    12 pretoria str, kemptonpark, kempton park

    I want to create 2 variables: 1 that contains the phrase that occurs after the last comma (so, "northern cape" "gauteng" "kempton park"), and variable 2 that contains the phrase that occurs after the second last comma (so, "kathu, northern cape" "randburg, guateng" "kemptonpark, kempton park")

    I'm trying to use -split- or -regexm- but can't work out how to do it! Any help would be appreciated!

    Thanks
    Megan


  • #2
    Here is a solution using regular expressions which in this case I find simpler than string functions.

    Code:
    clear
    inp str199 ad
    "1 cawood av, kathu, kathu, northern cape"
    "1 hanger 30, 11 lanseria airport st, lanseria airport, randburg, gauteng"
    "12 pretoria str, kemptonpark, kempton park"
    end
    
    list
    
    generate var1 = regexs(2) if regexm(ad, "^.*,(.*),(.*)$")
    generate var2 = regexs(1) if regexm(ad, "^.*,(.*,.*)$")
    
    list
    Best
    Daniel
    Last edited by daniel klein; 20 Oct 2016, 05:51.

    Comment


    • #3
      There are lots of ways to do this. Here are two:

      Code:
      clear
      input str21 whatever
      "frog,toad,newt"
      "frog,toad,newt,dragon"
      end
      gen reversed = reverse(whatever)
      split reversed, p(,) limit(2)
      quietly forval j = 1/2 {
          replace reversed`j' = reverse(reversed`j')
          rename reversed`j' whatever`j'
      }
      
      list
      
           +---------------------------------------------------------------------+
           |              whatever                reversed   whatev~1   whatev~2 |
           |---------------------------------------------------------------------|
        1. |        frog,toad,newt          twen,daot,gorf       newt       toad |
        2. | frog,toad,newt,dragon   nogard,twen,daot,gorf     dragon       newt |
           +---------------------------------------------------------------------+
      
      
      gen lastcommapos = strrpos(whatever, ",")
      gen part1 = substr(whatever, lastcommapos + 1, .)
      gen nextcommapos = strrpos(substr(whatever, 1, lastcommapos - 1), ",")
      gen part2 = substr(whatever, nextcommapos + 1, lastcommapos - nextcommapos - 1)
      Crossed with Daniel's. Here string functions appeal to me because I've been using them much longer than regex.
      Last edited by Nick Cox; 20 Oct 2016, 06:03.

      Comment


      • #4
        While I agree with Daniel, that regular expressions is the best solution to this problem, it is possible to do the same using only string functions:
        Code:
        clear
        input str199 ad
        "1 cawood av, kathu, kathu, northern cape"
        "1 hanger 30, 11 lanseria airport st, lanseria airport, randburg, gauteng"
        "12 pretoria str, kemptonpark, kempton park"
        end
        
        gen var1 = substr(ad, strrpos(ad, ",") + 2, length(ad))
        gen var2 = substr(ad, ustrrpos(ad, ",", strrpos(ad, ",") - 1) + 2, length(ad))
        
        list, noobs
        Result:
        Code:
        . list, noobs
        
          +----------------------------------------------------------------------------------------------------------------------+
          |                                                                       ad            var1                        var2 |
          |----------------------------------------------------------------------------------------------------------------------|
          |                                 1 cawood av, kathu, kathu, northern cape   northern cape        kathu, northern cape |
          | 1 hanger 30, 11 lanseria airport st, lanseria airport, randburg, gauteng         gauteng           randburg, gauteng |
          |                               12 pretoria str, kemptonpark, kempton park    kempton park   kemptonpark, kempton park |
          +----------------------------------------------------------------------------------------------------------------------+

        Comment


        • #5
          Note that Mathias's solution hinges on commas always being followed by spaces.

          I wouldn't assume that as a universal. Note that you can always use trim() to remove leading and/or trailing spaces.

          Comment


          • #6
            Hi, thank you so much these are exactly what I needed! Have a great day all

            Comment


            • #7
              Hi all,

              Apologies for revisiting this thread, I just have a related question on this topic.

              What if someone had similar data to Megan's and still wanted to extract the last part of the string, however it was no longer separated by a comma but by a space. Therefore it would be e.g.
              Code:
              northern cape
              and the researcher would just want to extract
              Code:
              cape
              ?

              Comment


              • #8
                What follows the last space is to Stata the last word (if not on anything, then in that particular string):

                Code:
                . display word("This is how to get what you want", -1)
                want

                Comment


                • #9
                  A white space is denoted by the character "\s" in ICU regular expressions. This suggests the following modification to #2

                  Code:
                  * Example generated by -dataex-. For more info, type help dataex
                  clear
                  input str20 text
                  "2400 North Cape"
                  "88 South Beach"
                  "44 West Side"  
                  "21 East Bay"    
                  end
                  
                  gen wanted= ustrregexra(text, "(.*\s)(.*)", "$2")
                  Res.:

                  Code:
                  . gen wanted= ustrregexra(text, "(.*\s)(.*)", "$2")
                  
                  . l
                  
                       +--------------------------+
                       |            text   wanted |
                       |--------------------------|
                    1. | 2400 North Cape     Cape |
                    2. |  88 South Beach    Beach |
                    3. |    44 West Side     Side |
                    4. |     21 East Bay      Bay |
                       +--------------------------+
                  Last edited by Andrew Musau; 03 Oct 2021, 03:51.

                  Comment


                  • #10
                    Hi all,

                    Apologies for revisiting this thread again but my question is also related.

                    What if someone had similar data and wants to extract the two strings following the dot. Therefore it would be e.g.

                    Code:
                    420.22YZ4
                    and one wants to extract solely

                    Code:
                    22
                    Thank you very much in advance!


                    Comment


                    • #11
                      Code:
                      . clear
                      
                      . set obs 1
                      Number of observations (_N) was 0, now 1.
                      
                      . gen test = "420.22YZ4"
                      
                      . gen wanted = substr(test, strpos(test, ".") + 1, 2)
                      
                      . l
                      
                           +--------------------+
                           |      test   wanted |
                           |--------------------|
                        1. | 420.22YZ4       22 |
                           +--------------------+

                      Comment


                      • #12
                        I am guessing that the lengths of the digits to be extracted may be variable. In that case, the following regex does it:

                        Code:
                        clear
                        input str12 mystring
                        "420.22YZ4"
                        "3000.459Z22"
                        end
                        
                        gen wanted= ustrregexra(mystring,  "(.*\.)(\d+)(\w+)", "$2")
                        Res.:

                        Code:
                        . gen wanted= ustrregexra(mystring,  "(.*\.)(\d+)(\w+)", "$2")
                        
                        . l
                        
                             +----------------------+
                             |    mystring   wanted |
                             |----------------------|
                          1. |   420.22YZ4       22 |
                          2. | 3000.459Z22      459 |
                             +----------------------+

                        Comment


                        • #13
                          Thank you very much for your help, it worked perfectly.

                          I still have another related question:

                          some of my observations contain information on how often the patient got the medicine, e.g. they look the following way: 420.225X74, where the part 5X7 refers to 5 times a day, seven days a week. Is it possible to extract this information even if some observations do not have the 5X7 information?

                          So I want to extract it like this:
                          Code:
                          420.225X74
                          Code:
                          5X7
                          Is it possible to adapt the code of Nick and extract the value before and after the X ? e.g.,
                          Code:
                             
                           gen wanted = substr(test, strpos(test, "X") + 1, 2)
                          Thank you very much in advance!

                          Comment


                          • #14
                            I think you need

                            Code:
                             
                             substr(test, strpos(test, "X") - 1, 3)
                            although for problems like that I too would start to reach for regular expressions. Much depends on whether

                            X has another possible meaning so that you need to exclude other patterns.

                            x rather than X might occur some of the time.

                            The second is easier, as in

                            Code:
                             
                             substr(test, strpos(upper(test), "X") - 1, 3)

                            Comment


                            • #15
                              Thank you very much, the second case works properly but in some cases for example for the observation

                              Code:
                              123.19
                              the newly generate variable with the code

                              Code:
                               
                                substr(test, strpos(upper(test), "X") - 1, 3)
                              displays

                              Code:
                              9
                              Even though there is no X in the observation.

                              Can one restrict the command to only show apply when there is an X in the observation and nothing instead?

                              Comment

                              Working...
                              X