  • extract numbers/values from string

    I'm trying to extract dollar amounts from strings. The strings may contain Unicode rather than only ASCII characters, and I can't review them all.

    Code:
    clear
    input str60 phrase
    "the $30.0 million shares"
    "if $999,999 dollars are"
    "can add $45 billion"
    "greater than $3.02  per share"
    "the 75 free turkeys "
    end
    I want the numbers following the $ only.

    This code doesn't work, but I think it is close.

    Code:
    gen amount = ""
    replace amount = regexs(1) if regexm(phrase, "\$(\d{1,3}(?:,\d{3})*(?:\.\d+)?)")
    The solution should be:

    Code:
    clear
    input float amount
    30.0
    999999
    45
    3.02
    .
    end
    Thanks in advance.

    A related question: if I want to practice regex at https://regex101.com/, which "Flavor" (left tab) corresponds to what Stata uses?
    Last edited by Kyle Smith; 30 Mar 2025, 20:29.

  • #2
    Andrew Musau is good with regular expressions. Until he provides a solution, you can use the user-written command -chimchar- as a stopgap.
    Code:
    ssc install chimchar
    chimchar phrase, numonly
    Now removing uniquely obnoxious characters like ` and () from phrase
    Uniquely obnoxious characters like ` and () have now been removed from phrase
    Now replacing special letters like Æ and ĸ with normal letters in phrase
    Special letters like Æ and ĸ have now been replaced with normal letters in phrase
    Now removing all remaining non-numeric characters from phrase
    All remaining non-numeric characters have been removed from phrase
    phrase is clean now!
    
    list
    
         +--------+
         | phrase |
         |--------|
      1. |   30.0 |
      2. | 999999 |
      3. |     45 |
      4. |   3.02 |
      5. |     75 |
         +--------+


    • #3
      This finds the (last) word that begins with a dollar sign. If wanted numbers may occur more than once, a more elaborate approach is needed.

      Code:
      clear
      input str60 phrase
      "the $30.0 million shares"
      "if $999,999 dollars are"
      "can add $45 billion"
      "greater than $3.02  per share"
      "the 75 free turkeys "
      end
      
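      * split phrase into words: phrase1, phrase2, ...
      split phrase

      gen wanted = ""

      * keep the (last) word starting with "$", dropping the "$" and any commas
      foreach v of var phrase? {
          replace wanted = substr(`v', 2, .) if substr(`v', 1, 1) == "$"
          replace wanted = subinstr(wanted, ",", "", .)
      }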
      
      list phrase wanted
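
      The result above is a string variable, while the output requested in #1 is numeric. A minimal sketch of the conversion, assuming wanted has been created as above (amount3 is a name chosen here purely for illustration):

      ```stata
      * real() returns missing (.) when the string is empty or not a valid number,
      * so a value with trailing punctuation such as "5.0!" would become missing
      gen double amount3 = real(wanted)

      list phrase wanted amount3
      ```

      Because real() yields missing for anything that is not a clean number, listing wanted next to amount3 is a quick way to spot rows where extra characters were picked up along with the amount.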


      • #4
        Here is a solution with regular expressions:

        Code:
        clear
        input str60 phrase
        "the $30.0 million shares"
        "if $999,999 dollars are"
        "can add $45 billion"
        "greater than $3.02  per share"
        "the 75 free turkeys "
        "$5.0! gosh!"
        "the password is am$135"
        end
        
        gen amount = real(subinstr(regexs(2), ",", "", .)) if regexm(phrase, "(^|\s)[$]([\d,\.]+)\b")
        which produces:

        Code:
        . list, noobs sep(0)
        
          +----------------------------------------+
          |                        phrase   amount |
          |----------------------------------------|
          |      the $30.0 million shares       30 |
          |       if $999,999 dollars are   999999 |
          |           can add $45 billion       45 |
          | greater than $3.02  per share     3.02 |
          |          the 75 free turkeys         . |
          |                   $5.0! gosh!        5 |
          |        the password is am$135        . |
          +----------------------------------------+
        Also a quick note: #3 works nicely for OP's example, but may need some improvement if characters such as exclamation marks can follow the dollar amount (as in my second-to-last observation). In that case #3 will also include that character.
        Last edited by Hemanshu Kumar; 31 Mar 2025, 03:12.
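
        On why the attempt in #1 finds nothing: as far as I understand Stata's macro rules, a backslash before $ inside a string is treated as an escape against macro expansion and is removed, so the pattern that reaches the regex engine starts with a bare $, which is the end-of-string anchor. Writing the dollar sign as the character class [$], as in the pattern above, sidesteps this. A sketch of the pattern from #1 with only that change, assuming a Stata version whose regexm() accepts \d and (?:...) as the code above does (amount_str and amount_num are names chosen here for illustration):

        ```stata
        * pattern from #1, with "\$" replaced by "[$]"
        gen amount_str = regexs(1) if regexm(phrase, "[$](\d{1,3}(?:,\d{3})*(?:\.\d+)?)")

        * strip thousands separators and convert to numeric
        gen double amount_num = real(subinstr(amount_str, ",", "", .))
        ```

        Unlike the pattern in the regexm() call above, this version does not require whitespace or the start of the string before the dollar sign, so it would also pick up 135 from "am$135". It also expects comma separators in longer numbers, so "$1000" would yield 100.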


        • #5
          Nick Cox : Thanks for your solution.
          Hemanshu Kumar : Thanks for the solution.
          Chen Samulsion : Thanks for your solution.

          You guys are great at regex. I'm jealous. Thx for your time/help.
