  • Extracting last three digits from a string, regex

    Colleagues,

    I have a variable that resembles the list below:

    Code:
    var
    def (abc) 6
    def (byz) 12
    abc (ghi) 100
    I would like to extract the last portion of the string and obtain var2 containing the figures only:

    Code:
    var2
    6
    12
    100
    Attempts with substr, like the one below, result in var2 containing the ")" sign, as it falls within the last three characters. In short, I'm looking for a regex that would extract the last three numeric characters, ignoring everything else.

    Code:
    generate var2= substr(var,-3,.)
    Kind regards,
    Konrad
    Version: Stata/IC 13.1

  • #2
    Why not stay with what you have and then follow up with: destring var2, replace ignore(")")



    • #3
      Here are a few ways of doing this using regex functions

      Code:
      clear
      input str50 var
      "def (abc) 6"
      "def (byz) 12"
      "abc (ghi) 100"
      end
      
      * remove everything up to and including the last space (greedy match)
      gen var2 = regexr(var,".+ ","")
      
      * skip anything that is not a number and match anything after
      gen var3 = regexs(1) if regexm(var,"[^0-9]+(.+)")
      
      * match a space that is followed by a series of digits at the end of the string
      * and return the digits
      gen var4 = regexs(1) if regexm(var," ([0-9]+)$")
      
      list



      • #4
        To be precise, the variable has a number of odd entries, like "12 abc (def) 6" or "10 xyz (KLM) 12". In the context of this exercise, I am only interested in keeping the last 1 to 3 digits.
        Kind regards,
        Konrad
        Version: Stata/IC 13.1



        • #5
          Two ways (one will fail under assumed condition):

          Code:
          clear all
          set more off
          
          input ///
          str20 var1
          "def (abc) 6"
          "def (byz) 12"
          "abc (ghi) 100"
          "abc jkl 100"
          end
          
          list
          
          *-----
          
          gen var3 = trim(substr(var1, strpos(var1, ")") + 1, .))
          
          *-----
          gen var4 = regexs(0) if(regexm(var1, "[0-9]*$"))
          
          list
          You should:

          1. Read the FAQ carefully.

          2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

          3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

          4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.



          • #6
            Revised example

            Code:
            clear
            input str50 var
            "def (abc) 6"
            "def (byz) 12"
            "abc (ghi) 100"
            "12 abc (def) 6"
            "10 xyz (KLM) 12"
            end
            
            * remove everything up to and including the last space (greedy match)
            gen var2 = regexr(var,".+ ","")
            
            * match a space that is followed by a series of digits at the end of the string
            * and return the digits
            gen var3 = regexs(1) if regexm(var," ([0-9]+)$")
            
            * match 1 to 3 digits at the end of the string
            gen var4 = regexs(1) if regexm(var,"([0-9]*[0-9]*[0-9])$")
            
            list



            • #7
              Actually, I messed up the last example. To match 1 to 3 digits at the end of the string, use:

              Code:
              gen var4 = regexs(1) if regexm(var,"([0-9]?[0-9]?[0-9])$")



              • #8
                Robert,

                Thank you very much. The last code works like a charm. Just for educational purposes, may I ask why regexs(1) is used rather than regexs(0), as suggested here, and what the role of the "$" sign is in regexm?
                Last edited by Konrad Zdeb; 31 Jul 2014, 02:30. Reason: Spelling.
                Kind regards,
                Konrad
                Version: Stata/IC 13.1



                • #9
                  regexs(0) returns the whole string that matches the regular expression. Parentheses are used to identify subexpressions. regexs(1) returns the first subexpression, regexs(2) the second, and so on. In my last example, both regexs(0) and regexs(1) will return the same string. If you modify the statement as follows:

                  Code:
                  gen var1 = regexs(1) if regexm(var," ([0-9]?[0-9]?[0-9])$")
                  then you require that the matched digits be preceded by a space. regexs(0) will include the leading space but not regexs(1).

                  The "$" is a special character that represents the end of the string.
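
                  As a rough illustration of the points above, here is a minimal sketch. The variable names whole, digits and unanchored are hypothetical, and the two observations are taken from the examples earlier in the thread.

                  ```stata
                  clear
                  input str50 var
                  "def (abc) 6"
                  "10 xyz (KLM) 12"
                  end

                  * regexs(0): the whole match, which includes the leading space
                  gen whole = regexs(0) if regexm(var, " ([0-9]?[0-9]?[0-9])$")

                  * regexs(1): only the parenthesized subexpression, i.e. the digits
                  gen digits = regexs(1) if regexm(var, " ([0-9]?[0-9]?[0-9])$")

                  * without "$" the match is not tied to the end of the string,
                  * so the first run of digits found is returned instead
                  gen unanchored = regexs(1) if regexm(var, "([0-9]?[0-9]?[0-9])")

                  * compare string lengths to see the extra leading space in whole
                  gen len_whole = length(whole)
                  gen len_digits = length(digits)

                  list
                  ```

                  Comparing len_whole with len_digits should make the leading space visible, and the second observation shows why the "$" anchor matters when the string also starts with digits.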

                  There's an FAQ on regular expressions in Stata but it's pretty basic. There's of course plenty of information on the internets.

                  Note that the word() string function can count words from the end of the string. So if you are targeting numbers at the end of a string:

                  Code:
                  gen nlast = real(word(var,-1))



                  • #10
                    Robert,

                    Thanks very much for the useful comments. Once I came across this tutorial but it doesn't cover all possibilities and issues specific to Stata.
                    Kind regards,
                    Konrad
                    Version: Stata/IC 13.1



                    • #11
                      I have a variable named chid that contains 10 digits. I want to keep the first 8 digits and delete the last two. Could someone please send me the command?


                      • #12
                        I have a variable named chid that contains 10-digit data for each entry. I want to keep the first 8 digits and delete the last two digits from every value of chid. Could someone please send me the command?


                        • #13
                          For future reference, it would have been better to start a new thread, rather than tacking this on to a (somewhat) related thread that is essentially already closed.

                          That said, it depends on whether child is a string variable or a numeric variable.

                          Code:
                          gen child_8digit = substr(child, 1, 8) // IF CHILD IS A STRING VARIABLE
                          
                          gen long child_8digit = floor(child/100) // IF CHILD IS A NUMERIC VARIABLE
                          Note: The code I have shown for string variables will only work if all of the entries in child actually have 10 digits. If some of them are shorter, it will fail. But then you need to clarify whether you want, in that case, the first 8 digits, or everything but the last 2 digits.
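
                          If what you want is everything but the last 2 digits, whatever the length, a minimal sketch for the string case might look like this. It assumes child is a string variable; child_trim is a hypothetical name and the two observations are made-up test values.

                          ```stata
                          clear
                          input str10 child
                          "1234567890"
                          "12345678"
                          end

                          * keep all characters except the last two, regardless of length
                          * (child_trim is a hypothetical variable name)
                          gen child_trim = substr(child, 1, length(child) - 2)

                          list
                          ```

                          For a numeric variable, the floor(child/100) approach already drops the last two digits regardless of how many digits the value has.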

