Using substring functions in Stata 16.1

Amani Siyam

Join Date: Aug 2014

Posts: 12
#1

Using substring functions in Stata 16.1

23 Oct 2021, 04:18

Dear Statalist,

I have a string variable "comment" stored as "strL" that contains a mix of numbers, characters and spaces .

storage display value
variable name type format label variable label
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
comment strL %9s comment

I need to extract information from it for the source of data the record comes from, one source can be "SoWMy", the other "NRI", ....etc potentially 10 different data sources. I have about 28,000 records.

I am a bit lost on which substring command I should use.

I tried "gen source1 = regexs(1) if(regexm(comment, "SoWMY"))" and got the error "invalid number, outside of allowed range" -

Any advice would be most appreciated.

Kind regards,

Amani
Tags: None
Andrew Musau

Join Date: Oct 2014

Posts: 10089
#2

23 Oct 2021, 05:06

Case insensitive: Either

Code:

gen source1= ustrregexm(comment, "(?i)sowmy")

or

Code:

gen source1= strpos(lower(comment), "sowmy")>0

Last edited by Andrew Musau; 23 Oct 2021, 05:14.
Comment
Amani Siyam

Join Date: Aug 2014

Posts: 12
#3

23 Oct 2021, 06:24

Great thanks Andrew, the "ustrregexm" works not the "strpos" one - not sure why - kindly see below

. gen source1a= ustrregexm(comment, "(?i)SoWMy")

. gen source1b= strpos(lower(comment), "SoWMy")>0

. tab1 source1*

-> tabulation of source1a

source1a | Freq. Percent Cum.
------------+-----------------------------------
0 | 26,261 95.14 95.14
1 | 1,341 4.86 100.00
------------+-----------------------------------
Total | 27,602 100.00

-> tabulation of source1b

source1b | Freq. Percent Cum.
------------+-----------------------------------
0 | 27,602 100.00 100.00
------------+-----------------------------------
Total | 27,602 100.00

Cheers,

Amani
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10089
#4

23 Oct 2021, 06:34

Notice the lower case specification in my code:

gen source1= strpos(lower(comment), "sowmy")>0

If you want case-sensitive matching, then

Code:

gen source1= strpos(comment, "SoWMy")>0

and correspondingly, remove "(?i)" in the ustrregexm command.
Comment
Amani Siyam

Join Date: Aug 2014

Posts: 12
#5

24 Oct 2021, 03:14

Dear Andrew,
Thanks very much for the tip. I think "ustrregexm" command is more accurate than "strpo" - see me two examples below:

/* Example 1 */

. gen source_MOH1=ustrregexm(comment, "MOH")
. gen source_MOH2=strpos(comment, "MOH")>0
. tab1 source_MOH1 source_MOH2

/* That gives identical results */

-> tabulation of source_MOH1
source_MOH1 | Freq. Percent Cum.
------------+-----------------------------------
0 | 27,042 97.97 97.97
1 | 560 2.03 100.00
------------+-----------------------------------
Total | 27,602 100.00

-> tabulation of source_MOH2
source_MOH2 | Freq. Percent Cum.
------------+-----------------------------------
0 | 27,042 97.97 97.97
1 | 560 2.03 100.00
------------+-----------------------------------
Total | 27,602 100.00

/* Example 2, I use "?i" with ustrregexn" and it picks up more records than "strpos" - I can't see why*/

. drop source_MOH1

. gen source_MOH1=ustrregexm(comment, "(?i)MOH")

. tab1 source_MOH1

-> tabulation of source_MOH1

source_MOH1 | Freq. Percent Cum.
------------+-----------------------------------
0 | 26,840 97.24 97.24
1 | 762 2.76 100.00
------------+-----------------------------------
Total | 27,602 100.00

Thanks very much for your help.

Amani
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10089
#6

24 Oct 2021, 08:23

Maybe my point in #2 and #4 is best illustrated with an example. Consider the following

Code:

. l +-----------+ | text | |-----------| 1. | xyz MOH33 | 2. | ABC mOh22 | 3. | efg moh11 | +-----------

We want to identify the substring "moh". But an issue is that in observation 1, the substring is capitalized, i.e., "MOH". In observation 2, it is a mixture of capital and small letters (also called upper-case and lower-case, respectively), i.e., "mOh". Finally, in observation 3, it is all in small letters, i.e., "moh". Now, what I mean by case-insensitive matching is that I should match the substring regardless of capitalization. This is achieved by either

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str29 text "xyz MOH33" "ABC mOh22" "efg moh11" "hij CONTROL" end gen match1= ustrregexm(text, "(?i)moh") gen match2= strpos(lower(text), "moh")>0

Notice the use of the -lower()- function in generating "match2" when using the -strpos()- function. Excluding "(?i)" and the -lower()- function in -ustrregexm()- and -strpos()-, respectively, executes case-sensitive matching, which as your example above shows, yields equivalent results.

Res.:

Code:

. l +-------------------------------+ | text match1 match2 | |-------------------------------| 1. | xyz MOH33 1 1 | 2. | ABC mOh22 1 1 | 3. | efg moh11 1 1 | 4. | hij CONTROL 0 0 | +-------------------------------+
Comment

Announcement

Using substring functions in Stata 16.1

Comment

Comment

Comment

Comment

Comment