String variables

Masoumeh Sanagou

Join Date: May 2017

Posts: 107
#1

01 Oct 2020, 20:03

Hi STATALIST,

I have a string variable and created an indicator for ga68 from it with:

Code:

gen ga68 = (strpos(lower(var1), "ga") > 0) & ///
           (strpos(lower(var1), "68") > 0)

ga68 is 1 if var1 contains both "ga" and "68", in either order (e.g. ga68 or 68ga).

I need to pick up ga68 only because the order is important. I can't simply search for the literal string "ga68", because there could be spaces or non-numeric characters between ga and 68 (e.g. ga 68, ga _ 68, ...).

Could you please let me know your advice?

Regards,

Last edited by Masoumeh Sanagou; 01 Oct 2020, 20:14.
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#2

01 Oct 2020, 21:28

Well, if anything could appear between the ga and the 68, and all that matters is the order, you can do:

Code:

gen ga68 = strmatch(lower(var1), "*ga*68*")

But this will also pick up things like gab68.
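
As a quick illustration of that caveat (a sketch using literal strings rather than var1), the * wildcard in strmatch() matches any run of characters, letters included, so the first two calls should display 1 and the third 0:

```stata
* Sketch: * matches any run of characters, including letters and digits
display strmatch("ga 68", "*ga*68*")
display strmatch("gab68", "*ga*68*")
display strmatch("68ga",  "*ga*68*")
```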
Masoumeh Sanagou

Join Date: May 2017

Posts: 107
#3

01 Oct 2020, 21:41

How about:
ga68=0 if digits or letters appear between them, and ga68=1 if only other characters (e.g. -, /, _) or spaces appear between them?

Regards,
Joseph Coveney

Join Date: Apr 2014

Posts: 4374
#4

01 Oct 2020, 21:47

Originally posted by Masoumeh Sanagou

I need to pick up ga68 only because the order is important.

Well, you're using the function strpos() and that gives you string position, and not only presence. Use the position information that the function gives you too in order to discern the relative position that you seek.
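
A minimal sketch of that idea, assuming the variable is named var1 as in #1: store the positions that strpos() returns and require that "ga" is present and that "68" starts after it. The names pos_ga, pos_68, and ga_before_68 are hypothetical. Note that strpos() reports only the first occurrence of each substring, so later occurrences are ignored, and this check says nothing about what lies between the two.

```stata
* Sketch: position of the first "ga" and the first "68" (0 if absent)
gen pos_ga = strpos(lower(var1), "ga")
gen pos_68 = strpos(lower(var1), "68")

* 1 if "ga" is present and the first "68" starts after the first "ga"
gen ga_before_68 = pos_ga > 0 & pos_68 > pos_ga
```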
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#5

02 Oct 2020, 14:13

Re #3: in this situation, you would need to use the regular expression functions in Stata. See -help regexm()-.
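
A minimal sketch of what that could look like for the rule in #3, assuming var1 as in #1. The negated character class [^a-z0-9] matches any single character that is not a lowercase letter or digit, and * allows zero or more of them, so the pattern should accept ga68, ga 68, and ga_-68 but reject gab68 and ga168. It does not check what comes before "ga" or after "68", so strings such as mega68 would still match.

```stata
* Sketch: "ga", then zero or more non-alphanumeric characters, then "68"
gen ga68 = regexm(lower(var1), "ga[^a-z0-9]*68")
```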

Masoumeh Sanagou

Join Date: May 2017
Posts: 107
#6

03 Oct 2020, 07:12

Thanks for all the advice.

Code:

gen a=regexm(lower(var1), "[p][t][^a-z0-9]*[o][t][h][e][r]")

gen b=regexm(lower(var1), "[pt][^a-z0-9]*[o][t][h][e][r]")

gen c=regexm(lower(var1), "[p][t][^a-z0-9]*[other]")

gen e=regexm(lower(var1), "[pt][^a-z0-9]*[other]")

Why is e not equal to a? (a=b=c)

Regards,


Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#7

03 Oct 2020, 10:32

It isn't apparent to me why e should not be the same as a, b, and c. Please use the -dataex- command to post some example data that illustrates the difference. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Masoumeh Sanagou

Join Date: May 2017
Posts: 107
#8

03 Oct 2020, 21:53

Thank you for the reply.

Code:

gen a=regexm(lower(var1), "[p][t][^a-z0-9]*[o][t][h][e][r]")
gen b=regexm(lower(var1), "[pt][^a-z0-9]*[o][t][h][e][r]")
gen c=regexm(lower(var1), "[p][t][^a-z0-9]*[other]")

gen e=regexm(lower(var1), "[pt][^a-z0-9]*[other]")

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str125 var1 float(a b c e)
"CT Abdomen and Pelvis"                        0 0 0 1
"CT Brain"                                     0 0 0 0
"CT Injection"                                 0 0 0 0
"CT Angiogram Abdominal Aorta"                 0 0 0 0
"CT Angiogram Pulmonary CTPA"                  0 0 0 0
"CT 4D Tracheomalacia Dynamic Airways"         0 0 0 1
"CT Humerus Right"                             0 0 0 1
"PT Other F Torso"                             1 1 1 1
"PT Other F Torso+Diag CT"                     1 1 1 1
"PT Other"                                     1 1 1 1
end

I really appreciate your time and help.
Regards,


Bjarte Aagnes

Join Date: Apr 2014

Posts: 783
#9

04 Oct 2020, 04:49

If possible use the more general ustrregexm(), and other ustrregex functions, which use the ICU regex library documented at http://userguide.icu-project.org/strings/regexp

Why is e not equal to a?

You use a regex character class [] and a quantifier (*) in your regex:

A character class [] accepts ANY ONE of the characters within the square brackets.

Thus, [pt] ( one of "p" OR "t") is not the same as [p][t] ( a "p" followed by a "t"), and your "[o][t][h][e][r]" is better expressed as the string "other"

Code:

di regexm("other", "[o][t][h][e][r]")
di regexm("other", "other")
di regexm("other", "[other]")
di regexm("r", "[other]")

Your pattern "[pt][^a-z0-9]*[other]" can be described as:

[pt] matches a single character that is either "p" or "t"

[^a-z0-9] matches a single character that is NOT a lowercase letter or a digit

* repeats the preceding class zero or more times, as many times as possible

[other] matches a single character that is one of "o", "t", "h", "e", "r"

Thus the match in the string "ct abdomen and pelvis" will be the "pe" at the start of "pelvis".
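
Putting these points together, a hedged sketch of how the intended pattern might be written with literal strings instead of character classes, assuming var1 from the example data. The variable names pt_other, pt_other_u, and matched are hypothetical. The last line uses regexs(0), which returns the text matched by the most recent regexm() call, to show which characters the pattern used for e actually picked up.

```stata
* Sketch: literal "pt", zero or more non-alphanumerics, literal "other"
gen pt_other = regexm(lower(var1), "pt[^a-z0-9]*other")

* The same pattern with the Unicode-aware function (Stata 14 or later)
gen pt_other_u = ustrregexm(lower(var1), "pt[^a-z0-9]*other")

* Show the text that the pattern used for e matched in each observation
gen matched = regexs(0) if regexm(lower(var1), "[pt][^a-z0-9]*[other]")
```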

Last edited by Bjarte Aagnes; 04 Oct 2020, 04:53.
