Removing character at beginning or end of string

Anne Todd

Join Date: Dec 2018

Posts: 163
#1


31 Jul 2023, 19:25

My last question a couple hours ago led me to realize a new problem: there are a handful of instances where I end up with a string variable that has a "." either at the beginning, or at the end, of the string. If this were the only "." present in the string, it would be easier to remove, but I'm not sure how to tell Stata that I want to remove the "." only if it is at the beginning, or at the end, of the string. Data here:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str6(var1 var2)
".4.20"  "2.01."
".10.09" "12.32."
".1.04"  ".98"
".45"    "1.24."
".0.98"  "1.1"
end
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#2

31 Jul 2023, 19:44

Code:

foreach v of varlist var1 var2 {
    replace `v' = ustrregexrf(`v', "^\.", "")
    replace `v' = ustrregexrf(`v', "\.$", "")
}
list, noobs clean

Added: Although the above code does what you asked in #1, I wonder if it is really what you want. If you look at observations 3 and 4 of your data, the starting strings in var2 and var1, respectively, are .98 and .45. If you remove the initial . character, you will be left with 98 and 45. Is that what you want? I raise it because .98 and .45 are legitimate decimal numbers and those initial dots might be meaningful, not spurious. (A similar problem would not arise with . at the end of the string.) If you want to keep the initial . if it is the only . in the string, then the code would be:

Code:

foreach v of varlist var1 var2 {
    gen dot_count = strlen(`v') - strlen(subinstr(`v', ".", "", .))
    replace `v' = ustrregexrf(`v', "^\.", "") if dot_count > 1
    replace `v' = ustrregexrf(`v', "\.$", "")
    drop dot_count
}
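As a side note, the unconditional version of the clean-up (the first code block in this post) can probably be collapsed into a single call. This is only a sketch, not a replacement for the code above: it assumes that `ustrregexra()`, which replaces every match rather than only the first, is given a pattern that alternates between "dot at the start" and "dot at the end". It does not preserve a lone leading dot such as .98.

```stata
* Sketch only: strip a leading and/or a trailing "." in one pass.
* Uses the example data from #1.
clear
input str6(var1 var2)
".4.20"  "2.01."
".10.09" "12.32."
".1.04"  ".98"
".45"    "1.24."
".0.98"  "1.1"
end

foreach v of varlist var1 var2 {
    * "^\." matches a dot at the start, "\.$" a dot at the end;
    * ustrregexra() replaces all matches, so both are removed if present
    replace `v' = ustrregexra(`v', "^\.|\.$", "")
}
list, noobs clean
```

Dots in the middle of a string are untouched because neither alternative can match there.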

Last edited by Clyde Schechter; 31 Jul 2023, 19:50.
Fahad Mirza

Join Date: Sep 2018

Posts: 241
#3

01 Aug 2023, 00:42

This is neat code.
daniel klein

Join Date: Mar 2014

Posts: 3848
#4

01 Aug 2023, 03:08

Originally posted by Anne Todd:

My last question a couple hours ago led me to realize a new problem: there are a handful of instances where I end up with a string variable that has a "." either at the beginning, or at the end, of the string.

I suppose you refer to this question. If you had four two-digit numbers, separated by dots, and were using the suggested code, how could you possibly end up with the results you now report here? I believe you should really make sure that the contents of your original variable are as you expect before throwing all kinds of different "solutions" at it.

Here is one way

Code:

clear
input str11 myvar
"29.34.52.92"
"1.2.3.4"
end

local four_two_digit_numbers "^(\d{2}\.\d{2})\.(\d{2}\.\d{2})$"

generate part1 = ustrregexs(1) if ustrregexm(myvar, "`four_two_digit_numbers'")
generate part2 = ustrregexs(2) if ustrregexm(myvar, "`four_two_digit_numbers'")

list

which will produce

Code:

. list

     +-----------------------------+
     |       myvar   part1   part2 |
     |-----------------------------|
  1. | 29.34.52.92   29.34   52.92 |
  2. |     1.2.3.4                 |
     +-----------------------------+

Missing values in part1 and/or part2 indicate unexpected patterns in the original variable.
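To look at the offending observations directly, something along these lines might help. It is a minimal sketch on the same toy data; the indicator name `unexpected` is made up for illustration and is not part of the code above.

```stata
* Sketch: flag and inspect observations that do not match the expected pattern.
clear
input str11 myvar
"29.34.52.92"
"1.2.3.4"
end

local four_two_digit_numbers "^(\d{2}\.\d{2})\.(\d{2}\.\d{2})$"

* unexpected (hypothetical name) is 1 when myvar does not match the pattern
generate byte unexpected = !ustrregexm(myvar, "`four_two_digit_numbers'")

list myvar if unexpected
count if unexpected
```

In the toy data only the second observation, 1.2.3.4, is flagged, because its parts are one digit long rather than two.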

btw a more direct test of your assumption that you have four two-digit numbers is

Code:

assert ustrregexm(myvar, "^(\d{2}\.\d{2})\.(\d{2}\.\d{2})$")

Last edited by daniel klein; 01 Aug 2023, 03:19.