Remove Spaces When Followed by Lowercase Letter

Carl Klarner

Join Date: Apr 2017

Posts: 20
#1

26 Apr 2019, 19:45

Statalisters:

I've got some data that was OCRed (optical character recognition), and many spaces have been inserted inside words (e.g., "T ues day, Jan uary 3rd"), while many characters, especially commas and periods, are missing. One thing that would help is if I could remove spaces that are followed by lower case letters, to create a temporary variable I can use to identify dates. (So far I've been working with strings after converting to all lower case and removing all symbols, including spaces, or sometimes leaving a few symbols in, such as commas and periods.)

I could loop through each adjacent pair of letters, and for each of those pairs loop through the lower case letters of the alphabet to see if there's a match with the second character in a pair, and in that case, if the first letter of the pair is a space, get rid of it. I could also get rid of all spaces and then insert spaces in front of every capital letter, again by looping. I'm concerned that would take a long time.

I don't see how I can use regexr or regexs to do this.

I tried

set obs 2
gen v1="G h h D" in 1
replace v1="G H j " in 2
gen v2=regexr(v1," [a-z]","[a-z]")

But that resulted in

list

     | v1        v2         |
     |----------------------|
  1. | G h h D   G[a-z] h D |
  2. | G H j     G H[a-z]   |

Advice on how to do this better would be much appreciated.

I'm using Stata 15.1.

Thanks,

Carl
Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#2

26 Apr 2019, 20:34

Try this:

Code:

clear
set obs 2
gen v1="G h k D" in 1
replace v1="G H j " in 2
gen v2 = regexs(1) if regexm(v1, "(( [a-z])+)")
gen v3 = subinstr(v2, " ", "", .)
gen v4 = subinstr(v1, v2, v3, .) if !missing(v2)

v4 is, I believe, what you are looking for. But if your v1 contains these spurious blanks embedded at a distance from each other, then you will probably have to repeat this code several times.
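
A possible way to avoid the repetition is sketched below. It swaps in a different function, ustrregexra(), which is not used anywhere in this thread; it is available in Stata 14 and later and replaces every match in one pass. The pattern relies on a lookahead, (?=[a-z]), which is assumed here to be supported by the Unicode regular expression functions. Treat this as an untested alternative rather than the method discussed in the thread.

```stata
* Sketch only: ustrregexra() is an alternative not used in this thread.
clear
set obs 2
gen v1 = "G h k D" in 1
replace v1 = "G H j " in 2

* " (?=[a-z])" matches a space only when the next character is a
* lower case letter. The lookahead does not consume that letter,
* so only the space itself is replaced by the empty string.
gen v2 = ustrregexra(v1, " (?=[a-z])", "")
list
```

Because every qualifying space is removed in a single call, no loop is needed with this approach.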

Carl Klarner

Join Date: Apr 2017

Posts: 20
#3

27 Apr 2019, 10:10

Thanks Clyde, that is great!

I'm just learning regular expressions, so I've got a question about the regexm in your answer.

Where your code says the following

regexm(v1, "(( [a-z])+)")

, why are the brackets just inside the quotes necessary? Why not just the following?

regexm(v1, "( [a-z])+")

The extra parentheses definitely change things: see row 6 of the listing below, v4 versus v7. However, I get the same result as your code if I drop the outer parentheses and change regexs from 1 to 0 (compare v4 and v10). Can you help me understand what I'm missing?

clear
set obs 6
gen v1="G h h D" in 1
replace v1="G H j " in 2
replace v1="G hhh h D" in 3
replace v1="G H jhh " in 4
replace v1="G H j Ghhh G h" in 5
replace v1="G h j D" in 6
*Clyde's code
gen v2 = regexs(1) if regexm(v1, "(( [a-z])+)")
gen v3 = subinstr(v2, " ", "", .)
gen v4 = subinstr(v1, v2, v3, .) if !missing(v2)
*Alternate code: takes out parentheses.
gen v5 = regexs(1) if regexm(v1, "( [a-z])+")
gen v6 = subinstr(v5, " ", "", .)
gen v7 = subinstr(v1, v5, v6, .) if !missing(v5)
*Alternate code: takes out parentheses and changes regexs from 1 to 0.
gen v8 = regexs(0) if regexm(v1, "( [a-z])+")
gen v9 = subinstr(v8, " ", "", .)
gen v10 = subinstr(v1, v8, v9, .) if !missing(v8)
list v1 v4 v7 v10

	v1	v4	v7	v10
1	G h h D	Ghh D	Ghh D	Ghh D
2	G H j	G Hj	G Hj	G Hj
3	G hhh h D	Ghhhh D	Ghhhh D	Ghhhh D
4	G H jhh	G Hjhh	G Hjhh	G Hjhh
5	G H j Ghhh G h	G Hj Ghhh G h	G Hj Ghhh G h	G Hj Ghhh G h
6	G h j D	Ghj D	G hj D	Ghj D

Next, the following will get rid of all the errant spaces, even if there are areas where the rule holds (pairs composed of a space followed by a lower case letter) separated from each other by strings that don't follow the rule. Advice on how to write better code is always appreciated.

clear
set obs 4
gen v1="G h h D" in 1
replace v1="G H j" in 2
replace v1="G H j Ghhh G h" in 3
replace v1="G H j Ghhh g G hhhh" in 4
gen len=0
gen v2=""
gen v3=""
gen v4=v1
local meanlen1 1
local meanlen2 0
while `meanlen1'>`meanlen2' {
    replace len=length(v4)
    quietly sum len
    local meanlen1=r(mean)
    replace v2 = regexs(1) if regexm(v4, "( [a-z])")
    replace v3 = subinstr(v2, " ", "", .)
    replace v4 = subinstr(v4, v2, v3, .) if !missing(v2)
    replace len=length(v4)
    quietly sum len
    local meanlen2=r(mean)
}
list

	v1	len	v2	v3	v4
1	G h h D	5	h	h	Ghh D
2	G H j	4	j	j	G Hj
3	G H j Ghhh G h	12	h	h	G Hj Ghhh Gh
4	G H j Ghhh g G hhhh	16	h	h	G Hj Ghhhg Ghhhh

Thanks again!

Carl


Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#4

27 Apr 2019, 11:23

Well, regular expressions are not my forte, and I find them as confusing as the next person. I don't use them very often. But I can explain this particular situation.

"( [a-z])+" will match a string that contains one or more consecutive sequences of a space followed by a lower case letter, but what it returns as -regexs(1)- is only one of those: the last repetition matched, as row 6 of your listing shows (v7 loses only the space before "j").

"(( [a-z])+)" will match exactly the same strings but it returns all of them as -regexs(1)-.

But your alternate code does even better: it captures everything in -regexs(0)-. I had forgotten about that. That's simpler and better than what I suggested.
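
To make the difference concrete, here is a small sketch that stores each subexpression for the row 6 string from the listing above. The variable names whole, group1 and outer1 are made up for illustration. Going by that listing, whole and outer1 should both hold " h j", while group1 should hold only " j".

```stata
clear
set obs 1
gen v1 = "G h j D"

* Single pair of parentheses: group 1 is ( [a-z]), repeated by +.
* regexs(0) is the entire matched substring.
gen whole  = regexs(0) if regexm(v1, "( [a-z])+")
gen group1 = regexs(1) if regexm(v1, "( [a-z])+")

* Extra outer pair: group 1 now wraps the whole repetition.
gen outer1 = regexs(1) if regexm(v1, "(( [a-z])+)")

list whole group1 outer1
```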

As for iterating the process until all of the spurious blanks have been purged, it can be done more simply as follows:

Code:

set obs 6
gen v1="G h h D" in 1
replace v1="G H j " in 2
replace v1="G hhh h D" in 3
replace v1="G H jhh " in 4
replace v1="G H j Ghhh G h" in 5
replace v1="G h j D" in 6

local go_on 1
gen result = v1
while (`go_on') {
    gen v2 = regexs(0) if regexm(result, "( [a-z])+")
    gen v3 = subinstr(v2, " ", "", .)
    replace result = subinstr(result, v2, v3, .) if !missing(v2)
    drop v2 v3
    capture assert regexm(result, "( [a-z])+") == 0
    if c(rc) == 0 { // NO MORE TO DO
        local go_on 0
    }
}
Carl Klarner

Join Date: Apr 2017

Posts: 20
#5

27 Apr 2019, 13:34

Hi Clyde,

I understand regexs and the role of the double parentheses now, thanks! And my "better code" was the result of dumb luck, not understanding.

The new code is also very helpful, I'll be using that capture assert combo a lot after this.

Carl
Clyde Schechter

Join Date: Apr 2014

Posts: 29948
#6

27 Apr 2019, 13:54

I'll be using that capture assert combo a lot after this

Yes, it comes in handy in a lot of situations.

Since you plan to adopt it, I should show you the right way to do it. What I wrote there was the quick and dirty way and it can sometimes get you into trouble. The problem is that -capture- is indiscriminate: it will capture any return code whatsoever, and move on to the next command. Now, the intent here was that c(rc) will be 0 if we're done, and 9 (the Stata error code for "assertion is false") otherwise. But, in fact, there are other possibilities. If, for example, I had made a typo and bungled the balancing of quotes or parentheses in the regexm() expression, or had mistyped the name of a variable and landed on the wrong variable or a non-existent variable, these would also throw a non-zero return code, but not the one that we're looking for. Similarly, we would want to know if, for example, Stata encountered memory limits while trying to evaluate that expression, etc. As written, the code would result in Stata blundering on through the loop having ignored an unexpected problem. In this case, the problem would almost certainly recur at each iteration of the loop, so you would end up with Stata stuck in an infinite loop. You can easily envision that in other contexts, however, the result of this construction would be to just produce incorrect results without giving any warning that anything is amiss.

So the right way to use it is not as I showed. The right way to use it is to check specifically for the return code(s) that you are anticipating as a signal for what to do next, and then, have another branch that deals with unexpected problems. Thus:

Code:

clear*
set obs 6
gen v1="G h h D" in 1
replace v1="G H j " in 2
replace v1="G hhh h D" in 3
replace v1="G H jhh " in 4
replace v1="G H j Ghhh G h" in 5
replace v1="G h j D" in 6

local go_on 1
gen result = v1
while (`go_on') {
    gen v2 = regexs(0) if regexm(result, "( [a-z])+")
    gen v3 = subinstr(v2, " ", "", .)
    replace result = subinstr(result, v2, v3, .) if !missing(v2)
    drop v2 v3
    capture assert regexm(result, "( [a-z])+") == 0
    if c(rc) == 0 { // NO MORE TO DO
        local go_on 0
    }
    else if c(rc) != 9 { // UNANTICIPATED PROBLEM
        display as error "Unanticipated problem in while-loop"
        exit `c(rc)'
    }
    // ELSE JUST CONTINUE THE LOOP
}
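
The distinction between the anticipated return code and an unanticipated one can be seen in isolation with a minimal sketch like the following. The variable x and the misspelled name no_such_var are hypothetical, chosen only for illustration; _rc is used here and holds the same value as c(rc).

```stata
clear
set obs 1
gen x = 1

* A false assertion: the anticipated signal, return code 9.
capture assert x == 2
display "false assertion:  _rc = " _rc

* A mistyped variable name: -capture- swallows this error too, but the
* return code is not 9, so it can be told apart from the first case.
* (no_such_var is a hypothetical name that does not exist in the data.)
capture assert no_such_var == 2
display "missing variable: _rc = " _rc
```

Both commands run silently under -capture-, which is why checking the specific code afterwards matters.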
Carl Klarner

Join Date: Apr 2017

Posts: 20
#7

28 Apr 2019, 13:39

Hi Clyde,

Great, that is extremely useful, thank you for taking the time to help me.

Carl