
  • Removing character at beginning or end of string

    My last question a couple hours ago led me to realize a new problem: there are a handful of instances where I end up with a string variable that has a "." either at the beginning, or at the end, of the string. If this were the only "." present in the string, it would be easier to remove, but I'm not sure how to tell Stata that I want to remove the "." only if it is at the beginning, or at the end, of the string. Data here:



    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str6(var1 var2)
    ".4.20"  "2.01." 
    ".10.09" "12.32."
    ".1.04"  ".98"   
    ".45"    "1.24." 
    ".0.98"  "1.1"   
    end

  • #2
    Code:
    foreach v of varlist var1 var2 {
        * remove a "." at the very beginning of the string
        replace `v' = ustrregexrf(`v', "^\.", "")
        * remove a "." at the very end of the string
        replace `v' = ustrregexrf(`v', "\.$", "")
    }
    
    list, noobs clean
    Added: Although the above code does what you asked in #1, I wonder if it is really what you want. If you look at observations 3 and 4 of your data, the starting strings in var2 and var1, respectively, are .98 and .45. If you remove the initial . character, you will be left with 98 and 45. Is that what you want? I raise it because .98 and .45 are legitimate decimal numbers and those initial dots might be meaningful, not spurious. (A similar problem would not arise with . at the end of the string.) If you want to keep the initial . if it is the only . in the string, then the code would be:
    Code:
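    foreach v of varlist var1 var2 {
        * number of "." characters in the string
        gen dot_count = strlen(`v') - strlen(subinstr(`v', ".", "", .))
        * remove a leading "." only when it is not the sole "." in the string
        replace `v' = ustrregexrf(`v', "^\.", "") if dot_count > 1
        * a trailing "." is always removed
        replace `v' = ustrregexrf(`v', "\.$", "")
        drop dot_count
    }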
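    A quick sanity check on the result can be sketched as follows. It assumes that every cleaned value is meant to be a single decimal number, which the thread does not confirm. real() returns missing for a string that cannot be read as a number, so the assert stops with an error if anything unexpected is left over. The data are the example from #1.

    ```stata
    * Sketch only: recreate the example data from #1
    clear
    input str6(var1 var2)
    ".4.20"  "2.01."
    ".10.09" "12.32."
    ".1.04"  ".98"
    ".45"    "1.24."
    ".0.98"  "1.1"
    end

    foreach v of varlist var1 var2 {
        * number of "." characters in the string
        gen dot_count = strlen(`v') - strlen(subinstr(`v', ".", "", .))
        * remove a leading "." only when it is not the sole "."
        replace `v' = ustrregexrf(`v', "^\.", "") if dot_count > 1
        * a trailing "." is always removed
        replace `v' = ustrregexrf(`v', "\.$", "")
        drop dot_count
        * assumption: each cleaned value should now parse as a number
        assert !missing(real(`v'))
    }
    ```

    If the assert fails, list `v' if missing(real(`v')) inside the loop shows the offending values.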
    Last edited by Clyde Schechter; 31 Jul 2023, 19:50.



    • #3
      This is neat code.



      • #4
        Originally posted by Anne Todd
        My last question a couple hours ago led me to realize a new problem: there are a handful of instances where I end up with a string variable that has a "." either at the beginning, or at the end, of the string.
        I suppose you are referring to this question. If you had four two-digit numbers, separated by dots, and were using the suggested code, how could you possibly end up with the results you now report here? I believe you should really make sure that the contents of your original variable are as you expect before throwing all kinds of different "solutions" at it.

        Here is one way

        Code:
        clear
        input str11 myvar
        "29.34.52.92"
        "1.2.3.4"
        end
        
        local four_two_digit_numbers "^(\d{2}\.\d{2})\.(\d{2}\.\d{2})$"
        
        generate part1 = ustrregexs(1) if ustrregexm(myvar, "`four_two_digit_numbers'")
        
        generate part2 = ustrregexs(2) if ustrregexm(myvar, "`four_two_digit_numbers'")
            
        list
        which will produce

        Code:
        . list
        
             +-----------------------------+
             |       myvar   part1   part2 |
             |-----------------------------|
          1. | 29.34.52.92   29.34   52.92 |
          2. |     1.2.3.4                 |
             +-----------------------------+
        Missing values in part1 and/or part2 indicate unexpected patterns in the original variable.


        btw a more direct test of your assumption that you have four two-digit numbers is

        Code:
        assert ustrregexm(myvar, "^(\d{2}\.\d{2})\.(\d{2}\.\d{2})$")
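        When such an assert fails, it only reports how many observations contradict it. A minimal sketch for looking at those observations, reusing the example data above, could be the following. The variable name bad_pattern is made up here for illustration.

        ```stata
        clear
        input str11 myvar
        "29.34.52.92"
        "1.2.3.4"
        end

        local four_two_digit_numbers "^(\d{2}\.\d{2})\.(\d{2}\.\d{2})$"

        * bad_pattern (hypothetical helper): 1 if myvar does not match the pattern
        generate byte bad_pattern = !ustrregexm(myvar, "`four_two_digit_numbers'")

        * show and count the observations with an unexpected pattern
        list myvar if bad_pattern
        count if bad_pattern
        ```

        With the two example observations, only the second one ("1.2.3.4") is flagged.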
        Last edited by daniel klein; 01 Aug 2023, 03:19.
