Statalisters:
I've got some data that was OCRed (optical character recognition), and many spaces have been inserted inside words (e.g., "T ues day, Jan uary 3rd"); many characters, especially commas and periods, are also missing. One thing that would help is removing every space that is followed by a lower case letter, to create a temporary variable I can use to identify dates. (So far I've been working with the strings after converting them to all lower case and removing all symbols, including spaces, or sometimes leaving a few symbols in, such as commas and periods.)
I could loop through each adjacent pair of characters, and for each pair loop through the lower case letters of the alphabet to see whether one matches the second character of the pair; if it does and the first character of the pair is a space, I would drop that space. Alternatively, I could remove all spaces and then insert a space in front of every capital letter, again by looping. I'm concerned that either approach would take a long time.
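For concreteness, here is a minimal sketch of the first loop idea, simplified so that it loops only over the 26 lower case letters and lets subinstr() handle every position in the string at once. The variable names raw and cleaned are hypothetical, the two observations are toy data, and it assumes plain ASCII letters; it says nothing about how fast this would be on a large data set.

```stata
* Sketch only: drop each space that is immediately followed by a lower case letter.
* raw and cleaned are hypothetical names used for illustration.
clear
set obs 2
gen raw = "G h h D" in 1
replace raw = "G H j " in 2

gen cleaned = raw
* c(alpha) holds the lower case letters a through z, separated by spaces
foreach c in `c(alpha)' {
    * "." as the last argument replaces every occurrence, not just the first
    replace cleaned = subinstr(cleaned, " `c'", "`c'", .)
}
list
```

With this toy data, cleaned should come out as "Ghh D" in the first observation and "G Hj " in the second, with spaces before capital letters left alone.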
I don't see how I can use regexr or regexs to do this.
I tried
set obs 2
gen v1="G h h D" in 1
replace v1="G H j " in 2
gen v2=regexr(v1," [a-z]","[a-z]")
But that resulted in
list
     |      v1           v2 |
     |----------------------|
  1. | G h h D   G[a-z] h D |
  2. |  G H j      G H[a-z]  |
Advice on how to do this better would be much appreciated.
I'm using Stata 15.1.
Thanks,
Carl