Extracting last part of a string that occurs after a comma

Megan Little

Join Date: Sep 2015

Posts: 35
#1

Extracting last part of a string that occurs after a comma

20 Oct 2016, 04:39

Hi

I have data stored under variable name ad (str199) which looks like this

1 cawood av, kathu, kathu, northern cape
1 hanger 30, 11 lanseria airport st, lanseria airport, randburg, gauteng
12 pretoria str, kemptonpark, kempton park

I want to create 2 variables: 1 that contains the phrase that occurs after the last comma (so, "northern cape" "gauteng" "kempton park"), and variable 2 that contains the phrase that occurs after the second last comma (so, "kathu, northern cape" "randburg, guateng" "kemptonpark, kempton park")

I'm trying to use -split- or -regexm- but can't work out how to do it! Any help would be appreciated!

Thanks
Megan
Tags: None

daniel klein

Join Date: Mar 2014
Posts: 3824

20 Oct 2016, 04:49

Here is a solution using regular expressions which in this case I find simpler than string functions.

Code:

clear
inp str199 ad
"1 cawood av, kathu, kathu, northern cape"
"1 hanger 30, 11 lanseria airport st, lanseria airport, randburg, gauteng"
"12 pretoria str, kemptonpark, kempton park"
end

list

generate var1 = regexs(2) if regexm(ad, "^.*,(.*),(.*)$")
generate var2 = regexs(1) if regexm(ad, "^.*,(.*,.*)$")

list

Best
Daniel

Last edited by daniel klein; 20 Oct 2016, 04:51.

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35451

20 Oct 2016, 04:54

There are lots of ways to do this. Here are two:

Code:

clear
input str21 whatever
"frog,toad,newt"
"frog,toad,newt,dragon"
end
gen reversed = reverse(whatever)
split reversed, p(,) limit(2)
quietly forval j = 1/2 {
    replace reversed`j' = reverse(reversed`j')
    rename reversed`j' whatever`j'
}

list

     +---------------------------------------------------------------------+
     |              whatever                reversed   whatev~1   whatev~2 |
     |---------------------------------------------------------------------|
  1. |        frog,toad,newt          twen,daot,gorf       newt       toad |
  2. | frog,toad,newt,dragon   nogard,twen,daot,gorf     dragon       newt |
     +---------------------------------------------------------------------+


gen lastcommapos = strrpos(whatever, ",")
gen part1 = substr(whatever, lastcommapos + 1, .)
gen nextcommapos = strrpos(substr(whatever, 1, lastcommapos - 1), ",")
gen part2 = substr(whatever, nextcommapos + 1, lastcommapos - nextcommapos - 1)

Crossed with Daniel's. Here string functions appeal to me because I've been using them much longer than regex.

Last edited by Nick Cox; 20 Oct 2016, 05:03.

Comment

Mathias Pedersen Heinze

Join Date: Jun 2015
Posts: 78

20 Oct 2016, 04:58

While I agree with Daniel, that regular expressions is the best solution to this problem, it is possible to do the same using only string functions:

Code:

clear
input str199 ad
"1 cawood av, kathu, kathu, northern cape"
"1 hanger 30, 11 lanseria airport st, lanseria airport, randburg, gauteng"
"12 pretoria str, kemptonpark, kempton park"
end

gen var1 = substr(ad, strrpos(ad, ",") + 2, length(ad))
gen var2 = substr(ad, ustrrpos(ad, ",", strrpos(ad, ",") - 1) + 2, length(ad))

list, noobs

Result:

Code:

. list, noobs

  +----------------------------------------------------------------------------------------------------------------------+
  |                                                                       ad            var1                        var2 |
  |----------------------------------------------------------------------------------------------------------------------|
  |                                 1 cawood av, kathu, kathu, northern cape   northern cape        kathu, northern cape |
  | 1 hanger 30, 11 lanseria airport st, lanseria airport, randburg, gauteng         gauteng           randburg, gauteng |
  |                               12 pretoria str, kemptonpark, kempton park    kempton park   kemptonpark, kempton park |
  +----------------------------------------------------------------------------------------------------------------------+

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35451
#5

20 Oct 2016, 05:12

Note that Mathias's solution hinges on commas always being followed by spaces.

I wouldn't assume that as a universal. Note that you can always use trim() to remove leading and/or trailing spaces.
Comment
Megan Little

Join Date: Sep 2015

Posts: 35
#6

20 Oct 2016, 05:30

Hi, thank you so much these are exactly what I needed! Have a great day all
Comment
Maxence Morlet

Join Date: Mar 2021

Posts: 650
#7

03 Oct 2021, 00:35

Hi all,

Apologies for revisiting this thread, I just have a related question on this topic.

What if someone had similar data to Megan's and still wanted to extract the last part of the string, however it was no longer separated by a comma but by a space. Therefore it would be e.g.

Code:

northern cape

and the researcher would just want to extract

Code:

cape

?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35451
#8

03 Oct 2021, 02:25

What follows the last space is to Stata the last word (if not on anything, then in that particular string):

Code:

. display word("This is how to get what you want", -1) want
1 like
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10091

03 Oct 2021, 02:47

A white space is denoted by the character "\s" in ICU regular expressions. This suggests the following modification to #2

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str20 text
"2400 North Cape"
"88 South Beach"
"44 West Side"  
"21 East Bay"    
end

gen wanted= ustrregexra(text, "(.*\s)(.*)", "$2")

Res.:

Code:

. gen wanted= ustrregexra(text, "(.*\s)(.*)", "$2")

. l

     +--------------------------+
     |            text   wanted |
     |--------------------------|
  1. | 2400 North Cape     Cape |
  2. |  88 South Beach    Beach |
  3. |    44 West Side     Side |
  4. |     21 East Bay      Bay |
     +--------------------------+

Last edited by Andrew Musau; 03 Oct 2021, 02:51.

Comment

Lea Birm

Join Date: Feb 2020

Posts: 54
#10

03 Feb 2022, 07:18

Hi all,

Apologies for revisiting this thread again but my question is also related.

What if someone had similar data and wants to extract the two strings following the dot. Therefore it would be e.g.

Code:

420.22YZ4

and one wants to extract solely

Code:

22

Thank you very much in advance!
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35451

#11

03 Feb 2022, 07:27

Code:

. clear

. set obs 1
Number of observations (_N) was 0, now 1.

. gen test = "420.22YZ4"

. gen wanted = substr(test, strpos(test, ".") + 1, 2)

. l

     +--------------------+
     |      test   wanted |
     |--------------------|
  1. | 420.22YZ4       22 |
     +--------------------+

Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10091

#12

03 Feb 2022, 07:41

I am guessing that the lengths of the digits to be extracted may be variable. In that case, the following regex does it:

Code:

clear
input str12 mystring
"420.22YZ4"
"3000.459Z22"
end

gen wanted= ustrregexra(mystring,  "(.*\.)(\d+)(\w+)", "$2")

Res.:

Code:

. gen wanted= ustrregexra(mystring,  "(.*\.)(\d+)(\w+)", "$2")

. l

     +----------------------+
     |    mystring   wanted |
     |----------------------|
  1. |   420.22YZ4       22 |
  2. | 3000.459Z22      459 |
     +----------------------+

Comment

Lea Birm

Join Date: Feb 2020

Posts: 54
#13

08 Feb 2022, 03:24

Thank you very much for your help, it worked perfectly.

I still have another related question:

some of my observations contain information on how often the patient got the medicine, e.g. they look the following way: 420.225X74, where the part 5X7 refers to 5 times a day, seven days a week. Is it possible to extract this information even if some observations do not have the 5X7 information?

So I want to extract it like this:

Code:

420.225X74

Code:

5X7

Is it possible to adapt the code of Nick and extract the value before and after the X ? e.g.,

Code:

gen wanted = substr(test, strpos(test, "X") + 1, 2)

Thank you very much in advance!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35451
#14

08 Feb 2022, 03:33

I think you need

Code:

substr(test, strpos(test, "X") - 1, 3)

although for problems like that I too would start to reach for regular expressions. Much depends on whether

X has another possible meaning so that you need to exclude other patterns.

x rather than X might occur some of the time.

The second is easier, as in

Code:

substr(test, strpos(upper(test), "X") - 1, 3)
Comment
Lea Birm

Join Date: Feb 2020

Posts: 54
#15

08 Feb 2022, 03:50

Thank you very much, the second case works properly but in some cases for example for the observation

Code:

123.19

the newly generate variable with the code

Code:

substr(test, strpos(upper(test), "X") - 1, 3)

displays

Code:

9

Even though there is no X in the observation.

Can one restrict the command to only show apply when there is an X in the observation and nothing instead?
Comment

Announcement