Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to extract numeric part of a string var in STATA

    Dear everyone,

    I would like to know if someone knows a STATA code that I can use to extract numeric part of a string variable in STATA.

    My string data is the following:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str516 salario
    "Desde $70,000 bruto por mes "                  
    "Desde $70,000 bruto por mes "                  
    "Desde $70,000 bruto por mes "                  
    "Desde $70,000 bruto por mes "                  
    "No especificado "                  
    "No especificado "                  
    "No especificado "                  
    "De $1,400,000 a $2,000,000 bruto por año "
    "De $1,400,000 a $2,000,000 bruto por año "
    "De $1,400,000 a $2,000,000 bruto por año "
    "De $1,400,000 a $2,000,000 bruto por año "
    "No especificado "                  
    "Desde $70,000 bruto por mes "                  
    "Desde $70,000 bruto por mes "                  
    "No especificado "                  
    "No especificado "                  
    "De $14,000 a $18,000 bruto por mes "
    "De $9,500 a $10,000 bruto por mes "
    "De $1 a $14,500 bruto por mes "                  
    "De $1 a $14,500 bruto por mes "                  
    "De $1 a $14,500 bruto por mes "                  
    "De $1 a $14,500 bruto por mes "                  
    "De $1,080,000 a $1,440,000 bruto por año "                  
    "De $1,080,000 a $1,440,000 bruto por año "                  
    "De $1,080,000 a $1,440,000 bruto por año "                  
    "De $1,080,000 a $1,440,000 bruto por año "                  
    "De $1,080,000 a $1,440,000 bruto por año "                  
    "No especificado "                  
    "No especificado "                  
    "No especificado "                  
    "De $5,000 a $7,000 bruto por mes "  
    "De $16,000 a $16,500 bruto por mes "
    "De $5,000 a $7,000 bruto por mes "  
    "Hasta $104,000 bruto por mes "                  
    "Hasta $104,000 bruto por mes "                  
    "Hasta $104,000 bruto por mes "                  
    "No especificado "                  
    "No especificado "                  
    "No especificado "                  
    end
    I want to extract for each row of variable "salario" the numeric part.

    For instance: { "De $5,000 a $7,000 bruto por mes " --> 5000 } or { "De $1 a $14,500 bruto por mes " --> 14500 } or { "Hasta $104,000 bruto por mes " --> 104000 }

    Thanks a lot for your help

    Alexis Rodas
    Last edited by Alexis Rodas; 21 Mar 2019, 16:11.

  • #2
    This isn't the most elegant, and perhaps someone else will provide a better alternative.
    Code:
    . generate str20 s1 = ustrregexs(0) if ustrregexm(salario,"[0-9,]+")
    (12 missing values generated)
    
    . destring s1, ignore(",") generate(s2)
    s1: character , removed; s2 generated as long
    (12 missing values generated)
    
    . list if s1 != s1[_n-1], clean
    
                                             salario          s1        s2  
      1.                Desde $70,000 bruto por mes       70,000     70000  
      5.                            No especificado                      .  
      8.   De $1,400,000 a $2,000,000 bruto por año    1,400,000   1400000  
     12.                            No especificado                      .  
     13.                Desde $70,000 bruto por mes       70,000     70000  
     15.                            No especificado                      .  
     17.         De $14,000 a $18,000 bruto por mes       14,000     14000  
     18.          De $9,500 a $10,000 bruto por mes        9,500      9500  
     19.              De $1 a $14,500 bruto por mes            1         1  
     23.   De $1,080,000 a $1,440,000 bruto por año    1,080,000   1080000  
     28.                            No especificado                      .  
     31.           De $5,000 a $7,000 bruto por mes        5,000      5000  
     32.         De $16,000 a $16,500 bruto por mes       16,000     16000  
     33.           De $5,000 a $7,000 bruto por mes        5,000      5000  
     34.               Hasta $104,000 bruto por mes      104,000    104000  
     37.                            No especificado                      .

    Comment


    • #3
      Thanks, Willian.

      An extra question, can the second part be extracted instead of the first?
      In your code you extract the first part: From $ 1 to $ 14,500 gross per month - >> you extract "1". Can be extracted the second part: "14500"?.

      Thanks a lot for your help

      Alexis Rodas

      Comment


      • #4
        Code:
        . generate str20 s1 = ustrregexs(1) if ustrregexm(salario,"([\d,]+)")
        (12 missing values generated)
        
        . generate str20 s2 = ustrregexs(2) if ustrregexm(salario,"([\d,]+)[^\d,]+([\d,]+)")
        (21 missing values generated)
        
        . destring s1 s2, ignore(",") generate(n1 n2)
        s1: character , removed; n1 generated as long
        (12 missing values generated)
        s2: character , removed; n2 generated as long
        (21 missing values generated)
        
        . list if salario != salario[_n-1], clean
        
                                                 salario          s1        n1          s2        n2  
          1.                Desde $70,000 bruto por mes       70,000     70000                     .  
          5.                            No especificado                      .                     .  
          8.   De $1,400,000 a $2,000,000 bruto por año    1,400,000   1400000   2,000,000   2000000  
         12.                            No especificado                      .                     .  
         13.                Desde $70,000 bruto por mes       70,000     70000                     .  
         15.                            No especificado                      .                     .  
         17.         De $14,000 a $18,000 bruto por mes       14,000     14000      18,000     18000  
         18.          De $9,500 a $10,000 bruto por mes        9,500      9500      10,000     10000  
         19.              De $1 a $14,500 bruto por mes            1         1      14,500     14500  
         23.   De $1,080,000 a $1,440,000 bruto por año    1,080,000   1080000   1,440,000   1440000  
         28.                            No especificado                      .                     .  
         31.           De $5,000 a $7,000 bruto por mes        5,000      5000       7,000      7000  
         32.         De $16,000 a $16,500 bruto por mes       16,000     16000      16,500     16500  
         33.           De $5,000 a $7,000 bruto por mes        5,000      5000       7,000      7000  
         34.               Hasta $104,000 bruto por mes      104,000    104000                     .  
         37.                            No especificado                      .                     .

        Comment


        • #5
          Consider this example using moss (SSC) by Robert Picard and friend:

          Code:
          . gen work = subinstr(salario, ",", "", .)
          
          . moss work, match("([0-9]+)")  regex
          
          . l _match*
          
               +-------------------+
               | _match1   _match2 |
               |-------------------|
            1. |   70000           |
            2. |   70000           |
            3. |   70000           |
            4. |   70000           |
            5. |                   |
               |-------------------|
            6. |                   |
            7. |                   |
            8. | 1400000   2000000 |
            9. | 1400000   2000000 |
           10. | 1400000   2000000 |
               |-------------------|
           11. | 1400000   2000000 |
           12. |                   |
           13. |   70000           |
           14. |   70000           |
           15. |                   |
               |-------------------|
           16. |                   |
           17. |   14000     18000 |
           18. |    9500     10000 |
           19. |       1     14500 |
           20. |       1     14500 |
               |-------------------|
           21. |       1     14500 |
           22. |       1     14500 |
           23. | 1080000   1440000 |
           24. | 1080000   1440000 |
           25. | 1080000   1440000 |
               |-------------------|
           26. | 1080000   1440000 |
           27. | 1080000   1440000 |
           28. |                   |
           29. |                   |
           30. |                   |
               |-------------------|
           31. |    5000      7000 |
           32. |   16000     16500 |
           33. |    5000      7000 |
           34. |  104000           |
           35. |  104000           |
               |-------------------|
           36. |  104000           |
           37. |                   |
           38. |                   |
           39. |                   |
               +-------------------+
          Here for the example data moss finds 0, 1, or 2 matches, but it would find more if they existed. Naturally you can follow with destring if desired.

          Comment


          • #6
            Aha, Nick rose to the bait and provided the more elegant solution.

            Comment


            • #7
              The price is an obligation to read and act on https://www.statalist.org/forums/help#spelling

              Comment


              • #8
                Most beautiful solutions. Wow.

                Comment


                • #9
                  I just used and found this helpful too. Thank you!

                  Comment


                  • #10
                    Just what I was looking. Thank you! How can Nicks' code be modified if the number is a decimal?? Also, I have a variable which is like the following

                    var1 $xx.xx var2 $xx.xx var3 $xx.xx
                    var2 $xx.xx var3 $xx.xx
                    var1 $xx.xx var3$xx.xx

                    I want to create three variables with the above variable with names var1, var2 and var3. However, the problem is that all entries of the variable do not have three values. So, for the second entry, var1 should be empty. However, if I simply extract all the numerals, var1 will take var2's value and and var3 will take var3's value. How can I solve this? Thanks in advance!

                    Comment


                    • #11
                      Concrete data example please

                      Comment


                      • #12
                        An improved version of #10 was posted and answered at https://stackoverflow.com/questions/...ween-two-words

                        Comment


                        • #13
                          Thanks a lot!

                          Comment

                          Working...
                          X