How to extract part of string between two symbols?

Eric Li

Join Date: Feb 2016

Posts: 12
#1

25 Nov 2018, 14:59

I have a string variable and want to keep only part of it.
Its value is "Series 6.1: Complementary indicators - Gross propensity - Units".
I tried regexm() on it; I would like to get "Gross propensity" from this string.
So, how do I extract the part of a string between two symbols?
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

25 Nov 2018, 15:15

Your question is not very clear and specific. I'll assume you want to extract whatever is between occurrences of the - character, and that each value of this variable contains at most one pair of hyphens like that.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str63 var1
"Series 6.1: Complementary indicators - Gross propensity - Units"
end

split var1, gen(part) parse("-")
gen wanted = part2 if !missing(part3)

The final step deals with the possibility that some value of the string might contain an incidental - but not a pair of them, so there would be nothing to extract.
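Since the original question mentioned regexm(), here is a minimal sketch of that route, under the same assumption of at most one pair of hyphens per value. The variable name wanted_re and the second example row are made up for illustration. Because `.*` is greedy, the pattern captures everything between the first and the last hyphen, and trim() removes the surrounding spaces.

```stata
clear
input str63 var1
"Series 6.1: Complementary indicators - Gross propensity - Units"
"A value with no hyphens at all"
end

* wanted_re is an illustrative name, not from the thread.
* regexs(1) returns the text matched by the first parenthesized group;
* observations that do not match the pattern are left empty.
gen wanted_re = trim(regexs(1)) if regexm(var1, "-(.*)-")
list
```

Values without a pair of hyphens simply get an empty string, which mirrors the `if !missing(part3)` guard above.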

If your situation is more complicated than this approach can handle, do post back with a fuller data example that exhibits the difficulties. Be sure to use the -dataex- command for that. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Marcos Almeida

Join Date: Apr 2014
Posts: 4047

25 Nov 2018, 15:19

You may use this:

Code:

. input str100 var1

                                                                                                     var1
  1. "Series 6.1: Complementary indicators - Gross propensity - Units"
  2. end

. gen var2 = var1

. split var2, parse(-) gen(myvar)
variables created as string:
myvar1  myvar2  myvar3

. drop myvar1 myvar3

. list

     +--------------------------------------------------------------------------------------+
  1. |                                                                      var1            |
     |           Series 6.1: Complementary indicators - Gross propensity - Units            |
     |--------------------------------------------------------------------------------------|
     |                                                            var2 |             myvar2 |
     | Series 6.1: Complementary indicators - Gross propensity - Units |  Gross propensity  |
     +--------------------------------------------------------------------------------------+

Note that I prefer to make edits on a copy of the variable rather than on the original, just in case something goes wrong. This way the original data are also preserved.

Hopefully that helps.

P.S: Crossed with Clyde's reply.

Last edited by Marcos Almeida; 25 Nov 2018, 15:23.

Best regards,

Marcos


David Benson

Join Date: Oct 2018
Posts: 489

25 Nov 2018, 18:57

Hi Eric,

The code provided by Clyde & Marcos will work (and is far easier than what I am about to demonstrate). I am just showing an alternate strategy so that you and others are aware of it.

The alternate strategy is to use strpos() to locate the position of the first "-" and strrpos() to locate the last one, and then use substr() to extract the text between the two positions. The same approach would work, for example, if you wanted to extract the "Complementary indicators" text between the colon and the first dash.
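As a rough sketch of that colon-to-first-dash case (the variable names colon_pos, dash_pos and category are illustrative assumptions, and the two rows are simplified from the series labels):

```stata
clear
input str119 var1
"Series 6.3: Complementary indicators - Inbound tourism expenditure over GDP - Percent"
"Series 5.3: Employment - Other accommodation services - ('000)"
end

* Positions of the colon and of the first dash (0 if not found)
gen colon_pos = strpos(var1, ":")
gen dash_pos  = strpos(var1, "-")

* Take the characters strictly between the two positions, then trim spaces.
* The if condition skips values where either symbol is missing or out of order.
gen category = trim(substr(var1, colon_pos + 1, dash_pos - colon_pos - 1)) ///
    if colon_pos > 0 & dash_pos > colon_pos
list var1 category
```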

Since you are using the Tourism Statistics database put out by the UN World Tourism Organization (see here), I thought I would continue with it.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str119 var1
"Series 6.3: Complementary indicators - Inbound tourism expenditure over GDP - Percent"                                  
"Series 6.4: Complementary indicators - Outbound tourism expenditure over GDP - Percent"                                 
"Series 6.5: Complementary indicators - Tourism balance (inbound minus outbound tourism expenditure) over GDP - Percent" 
"Series 6.6: Complementary indicators - Tourism openness (inbound plus outbound tourism expenditure) over GDP  - Percent"
"Series 5.3: Employment - ♦ Other accommodation services - ('000)"                                                     
"Series 5.4: Employment - ♦ Food and beverage serving activities - ('000)"                                             
"Series 5.5: Employment - ♦ Passenger transportation - ('000)"                                                         
"Series 5.6: Employment - ♦ Travel agencies and other reservation services activities - ('000)"                        
end

Code:

gen pos1 = strpos( var1, "-")
label var pos1 "Position of first dash"
gen pos2 = strrpos( var1, "-")  // Note that this is strrpos(), not strpos()
label var pos2 "Position of last dash"
gen short_desc2 = substr( var1, pos1 + 1, pos2 - pos1 - 2)


* This next part is just to remove the ♦ character in the string (also to remove leading and trailing spaces)
ssc install charlist  // allows you to see what ASCII characters are in the string

. charlist var1
 '()-.03456:CDEFGIOPSTabcdeghilmnoprstuvxy���


. display r(ascii)
32 39 40 41 45 46 48 51 52 53 54 58 67 68 69 70 71 73 79 80 83 84 97 98 99 100 101 103 104 105 108 109 110 111 112 114 115 116 117 118 120 121
>  153 166 226 

* Could have put these in a loop (foreach i in 153 166 226)
replace short_desc2 = subinstr( short_desc2, char(153), "",.)  // The next 3 are what make up the ♦
replace short_desc2 = subinstr( short_desc2, char(166), "",.)
replace short_desc2 = subinstr( short_desc2, char(226), "",.)
replace short_desc2 = trim( short_desc2)  // just trimming extra spaces
replace short_desc2 = itrim( short_desc2)
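The loop mentioned in the comment above could look roughly like the following. This is a minimal, self-contained sketch: the two example rows are made up, and it assumes Stata 14 or later with the do-file saved as UTF-8, so that ♦ is stored as the three bytes 226, 153 and 166.

```stata
clear
input str80 short_desc2
" ♦ Other accommodation services"
" ♦ Passenger transportation"
end

* Remove each of the three bytes that make up the ♦ character
foreach i in 153 166 226 {
    replace short_desc2 = subinstr(short_desc2, char(`i'), "", .)
}

* Drop leading/trailing spaces and collapse repeated internal spaces
replace short_desc2 = itrim(trim(short_desc2))
list
```

In a Unicode-aware Stata, replacing uchar(9830) (the code point of ♦, U+2666) in a single subinstr() call should have the same effect, since that character is encoded as exactly those three bytes.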
