Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generate multiple variables from a variable containing symbols and numbers

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str29 salary_today
    "243,250 (307,840) (253,454)"  
    "322,043 (342,970)"            
    "279,102 (365,736)"            
    "126,025[12]"                  
    "247,579††"                
    "166218"                       
    "138,740†††"             
    "161349"                       
    "130,646 (204,309)"            
    "254160"                       
    "238908"                       
    "129,517 (175,081)***‡‡‡"
    "228190"                       
    "117763"                       
    ""                             
    "188,723"                      
    "135586"                       
    "161,349 (197,454)"            
    "162056"                       
    end
    The variable salary_today looks like the following
    Code:
    . list
    
         +-----------------------------+
         |                salary_today |
         |-----------------------------|
      1. | 243,250 (307,840) (253,454) |
      2. |           322,043 (342,970) |
      3. |           279,102 (365,736) |
      4. |                 126,025[12] |
      5. |                   247,579†† |
         |-----------------------------|
      6. |                      166218 |
      7. |                  138,740††† |
      8. |                      161349 |
      9. |           130,646 (204,309) |
     10. |                      254160 |
         |-----------------------------|
     11. |                      238908 |
     12. |     129,517 (175,081)***‡‡‡ |
     13. |                      228190 |
     14. |                      117763 |
     15. |                             |
         |-----------------------------|
     16. |                     188,723 |
     17. |                      135586 |
     18. |           161,349 (197,454) |
     19. |                      162056 |
         +-----------------------------+
    I want to generate four variables salary, salary_p1, salary_p2, and salary_note : salary will contain the numbers before parenthesis; salary_p1 will contain the numbers between the first two parentheses; salary_p2 will contain the numbers between the last parentheses ; salary_note will contain all the symbols ( including [12] in observation 4).

    For example, for observation 12 salary will be 129517, salary_p1 will be 175081, salary_p2 will be missing , and salary_note will be ***‡‡‡.

  • #2
    I did it as a 4-step process (but others may come up with something better, especially if they are good with regex (which I am not):
    Step 1. Create salary_orig (so can make modifications to salary_today)
    Step 2. Get rid of [], the * and "†", and "‡") from salary_today
    Step 3. Split salary_today into words (can then destring to convert them to numbers)
    This will gen salary, salary_p2, salary_p3
    Step 4. Use regex or substr to get the symbols (assuming these are always far to the right)

    Code:
    * Create salary_orig (so can make modifications to salary_today)
    clonevar salary_orig = salary_today
    order salary_orig, before(salary_today)
    
    ssc install charlist // in case you don't have it
    charlist salary_today
    di r(ascii)
    
    /*
    ASCII characters
      40 = "("   , 41= ")"
      42 = *
      91 = [  , 93=]
      The characters "†" and "‡" are 160, 161, 226
    */
    
    * Get rid of the * and "†", and "‡" from salary_today (I had to do brackets later)
    replace salary_today = subinstr( salary_today, "*", "", .)
    replace salary_today = subinstr( salary_today, "†", "", .)
    replace salary_today = subinstr( salary_today, "‡", "", .)
    
    * Split salary_today into words
    split salary_today, gen(sal)  // creates sal1, sal2, sal3 (which can be renamed to salary, salary_p1, salary_p2)
    * split() defaults to splitting at spaces
    gen word_count = wordcount(salary_orig)   // may be useful to have
    
    * Finding positions of the various symbols (so can create salary_note)
    gen has_char = 0
    replace has_char = 1 if (strpos(salary_today, "*") | strpos(salary_today, "†") | strpos(salary_today, "‡") | strpos(salary_today, "["))
    * Above is not strictly necessary, but nice to find if salary_today contains any of the listed characters 
    
    gen s_pos1 = strpos( salary_orig, "*") if strpos( salary_orig, "*")>=1
    gen s_pos2 = strpos( salary_orig, char(160)) if strpos( salary_orig, char(160))>=1
    gen s_pos3 = strpos( salary_orig, char(161)) if strpos( salary_orig, char(161))>=1
    gen s_pos4 = strpos( salary_orig, char(226)) if strpos( salary_orig, char(226))>=1
    gen s_pos_bracket = strpos( salary_orig, "[") if strpos( salary_orig, "[")>=1
    egen note_start = rowmin( s_pos1- s_pos_bracket)  // finding position where first symbol starts
    gen sal_note = substr( salary_orig, note_start, .)

    After all that you end up with
    Code:
    . list salary_orig sal1 sal2 sal3 note_start sal_note , noobs abbrev(12)
    
      +-------------------------------------------------------------------------------------------+
      |                 salary_orig          sal1        sal2        sal3   note_start   sal_note |
      |-------------------------------------------------------------------------------------------|
      | 243,250 (307,840) (253,454)       243,250   (307,840)   (253,454)            .            |
      |           322,043 (342,970)       322,043   (342,970)                        .            |
      |           279,102 (365,736)       279,102   (365,736)                        .            |
      |                 126,025[12]   126,025[12]                                    8       [12] |
      |                   247,579††       247,579                                    8         †† |
      |-------------------------------------------------------------------------------------------|
      |                      166218        166218                                    .            |
      |                  138,740†††       138,740                                    8        ††† |
      |                      161349        161349                                    .            |
      |           130,646 (204,309)       130,646   (204,309)                        .            |
      |                      254160        254160                                    .            |
      |-------------------------------------------------------------------------------------------|
      |                      238908        238908                                    .            |
      |     129,517 (175,081)***‡‡‡       129,517   (175,081)                       18     ***‡‡‡ |
      |                      228190        228190                                    .            |
      |                      117763        117763                                    .            |
      |                                                                              .            |
      |-------------------------------------------------------------------------------------------|
      |                     188,723       188,723                                    .            |
      |                      135586        135586                                    .            |
      |           161,349 (197,454)       161,349   (197,454)                        .            |
      |                      162056        162056                                    .            |
      +-------------------------------------------------------------------------------------------+
    
    . list salary_orig has_char s_pos1- s_pos_bracket, noobs abbrev(12)
    
      +-------------------------------------------------------------------------------------------+
      |                 salary_orig   has_char   s_pos1   s_pos2   s_pos3   s_pos4   s_pos_brac~t |
      |-------------------------------------------------------------------------------------------|
      | 243,250 (307,840) (253,454)          0        .        .        .        .              . |
      |           322,043 (342,970)          0        .        .        .        .              . |
      |           279,102 (365,736)          0        .        .        .        .              . |
      |                 126,025[12]          1        .        .        .        .              8 |
      |                   247,579††          1        .       10        .        8              . |
      |-------------------------------------------------------------------------------------------|
      |                      166218          0        .        .        .        .              . |
      |                  138,740†††          1        .       10        .        8              . |
      |                      161349          0        .        .        .        .              . |
      |           130,646 (204,309)          0        .        .        .        .              . |
      |                      254160          0        .        .        .        .              . |
      |-------------------------------------------------------------------------------------------|
      |                      238908          0        .        .        .        .              . |
      |     129,517 (175,081)***‡‡‡          1       18        .       23       21              . |
      |                      228190          0        .        .        .        .              . |
      |                      117763          0        .        .        .        .              . |
      |                                      0        .        .        .        .              . |
      |-------------------------------------------------------------------------------------------|
      |                     188,723          0        .        .        .        .              . |
      |                      135586          0        .        .        .        .              . |
      |           161,349 (197,454)          0        .        .        .        .              . |
      |                      162056          0        .        .        .        .              . |
      +-------------------------------------------------------------------------------------------+
    NOTE: you will probably want then to destring sal1, sal2, and sal3. Also, I couldn't find an effective way to replace "[12]" (and presumably you will have many more with [1], [2]. etc). You may need to do that manually or with a loop, But hopefully this will help get you started!
    Last edited by David Benson; 27 Dec 2018, 16:40.

    Comment


    • #3
      Thank you @ David Benson for the detailed instructions. As you noted, "[12]" is giving me trouble.

      I tried another approach, but the issue remains. Here is what I did:

      Code:
      *get rid of ()
      split salary_today , parse(( )) gen(v)
      *rename
      rename v1 salary //still contains characters
      rename v2 salary_p1
      rename v4 salary_p2
      rename v3 salary_note
      * salary_p1 and salary_p2 can be destringed immediately 
      destring salary_p1 salary_p2, replace ignore(",")
      order salary_p2, after(salary_p1)
      Code:
      . list
      
           +----------------------------------------------------------------------------+
           |                salary_today        salary   salary~1   salary~2   salary~e |
           |----------------------------------------------------------------------------|
        1. | 243,250 (307,840) (253,454)      243,250      307840     253454            |
        2. |           322,043 (342,970)      322,043      342970          .            |
        3. |           279,102 (365,736)      279,102      365736          .            |
        4. |                 126,025[12]   126,025[12]          .          .            |
        5. |                   247,579††     247,579††          .          .            |
           |----------------------------------------------------------------------------|
        6. |                      166218        166218          .          .            |
        7. |                  138,740†††    138,740†††          .          .            |
        8. |                      161349        161349          .          .            |
        9. |           130,646 (204,309)      130,646      204309          .            |
       10. |                      254160        254160          .          .            |
           |----------------------------------------------------------------------------|
       11. |                      238908        238908          .          .            |
       12. |     129,517 (175,081)***‡‡‡      129,517      175081          .     ***‡‡‡ |
       13. |                      228190        228190          .          .            |
       14. |                      117763        117763          .          .            |
       15. |                                                    .          .            |
           |----------------------------------------------------------------------------|
       16. |                     188,723       188,723          .          .            |
       17. |                      135586        135586          .          .            |
       18. |           161,349 (197,454)      161,349      197454          .            |
       19. |                      162056        162056          .          .            |
           +----------------------------------------------------------------------------+
      Now my question is: how can I remove the characters from salary (see observations 4, 5, and 7) and put them salary_note?

      Comment


      • #4
        I would also note that your code also didn't bring over the "††" or "††† " in obs #5 or #7. (And they are still in salary for those obs too).

        So I made a few changes to my code in #2:
        1) I clone salary_today and make a var called salary_copy where I remove all the symbols and then split into salary, salary_p1, and salary_p2. That way salary_today remains unchanged.
        2) I added a loop to remove all of the [1], [2], [3]....[15]. I did up to 15 in the loop, but you can replace that with whatever is the highest number footnote you would like.

        Note, I added a few more observations to your sample data, adding some more brackets (are those footnotes) just to make sure the code could handle it. See obs 13 and 17-20.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str39 salary_today
        "243,250 (307,840) (253,454)"            
        "322,043 (342,970)"                      
        "279,102 (365,736)"                      
        "126,025[12]"                            
        "247,579††"                          
        "166218"                                
        "138,740†††"                      
        "130,646 (204,309)"                      
        "254160"                                
        "238908"                                
        "129,517 (175,081)***‡‡‡"          
        "228190"                                
        "129,517 (135,687) (175,081)***‡‡‡"
        "188,723"                                
        "135586"                                
        "161,349 (197,454)"                      
        "227,025[5]"                            
        "447,125[1]"                            
        "336,125[2]***"                          
        "243,250 (317,840) (463,454)[8]†††"
        end
        
        . list salary_today
        
             +-----------------------------------+
             |                      salary_today |
             |-----------------------------------|
          1. |       243,250 (307,840) (253,454) |
          2. |                 322,043 (342,970) |
          3. |                 279,102 (365,736) |
          4. |                       126,025[12] |
          5. |                         247,579†† |
             |-----------------------------------|
          6. |                            166218 |
          7. |                        138,740††† |
          8. |                 130,646 (204,309) |
          9. |                            254160 |
         10. |                            238908 |
             |-----------------------------------|
         11. |           129,517 (175,081)***‡‡‡ |
         12. |                            228190 |
         13. | 129,517 (135,687) (175,081)***‡‡‡ |
         14. |                           188,723 |
         15. |                            135586 |
             |-----------------------------------|
         16. |                 161,349 (197,454) |
         17. |                        227,025[5] |
         18. |                        447,125[1] |
         19. |                     336,125[2]*** |
         20. | 243,250 (317,840) (463,454)[8]††† |
             +-----------------------------------+
        Code:
        * Create salary_copy (so can make modifications to salary_copy while leaving salary_today untouched) 
        clonevar salary_copy = salary_today
        order salary_copy, after(salary_today)
        
        charlist salary_today
        di r(ascii)
        
        
        * Get rid of the * and "†", and "‡" from salary_today (I had to do brackets later)  
        replace salary_copy = subinstr( salary_copy, "*", "", .)
        replace salary_copy = subinstr( salary_copy, "†", "", .)
        replace salary_copy = subinstr( salary_copy, "‡", "", .)
        
        * NEW - loop that removes [1], [2], [3]...[15]
        * I set this to go from 1 to 15, but choose the highest value in brackets
        forvalue i = 1/15  {
        replace salary_copy = subinstr( salary_copy, "[`i']", "", .)
        }
        
        
        * Split salary_today into words
        split salary_copy, gen(sal)  // creates sal1, sal2, sal3 (which will be renamed to salary, salary_p1, salary_p2)
        gen word_count = wordcount(salary_today)   // may be useful to have
        
        * Finding positions of the various symbols (so can create salary_note)
        gen has_char = 0
        replace has_char = 1 if (strpos(salary_today, "*") | strpos(salary_today, "†") | strpos(salary_today, "‡") | strpos(salary_today, "["))
        
        gen s_pos1 = strpos( salary_today, "*") if strpos( salary_today, "*")>=1
        gen s_pos2 = strpos( salary_today, char(160)) if strpos( salary_today, char(160))>=1
        gen s_pos3 = strpos( salary_today, char(161)) if strpos( salary_today, char(161))>=1
        gen s_pos4 = strpos( salary_today, char(226)) if strpos( salary_today, char(226))>=1
        gen s_pos5 = strpos( salary_today, "[") if strpos( salary_today, "[")>=1
        egen note_start = rowmin( s_pos1- s_pos5) // finding position where first symbol starts
        gen sal_note = substr( salary_today, note_start, .)  // this is where I grab everything after the first symbol and place that in salary_note
        
        * Just labeling some variables
        label var has_char "1 if salary_today contains *, [, †, or ‡"
        label var note_start "1st position in salary_today where special character appears"
        
        destring sal1 sal2 sal3, replace ignore("," "(" ")" )
        format sal1 sal2 sal3 %9.0gc // formatting to comma format
        order sal_note, after(sal3)
        
        * Naming them what you wanted
        rename sal1 salary
        rename sal2 salary_p1
        rename sal3 salary_p2
        rename sal_note salary_note
        
        
        list salary_today salary salary_p1 salary_p2 salary_note, abbrev(12)
        
             +-----------------------------------------------------------------------------------+
             |                      salary_today    salary   salary_p1   salary_p2   salary_note |
             |-----------------------------------------------------------------------------------|
          1. |       243,250 (307,840) (253,454)   243,250     307,840     253,454               |
          2. |                 322,043 (342,970)   322,043     342,970           .               |
          3. |                 279,102 (365,736)   279,102     365,736           .               |
          4. |                       126,025[12]   126,025           .           .          [12] |
          5. |                         247,579††   247,579           .           .            †† |
             |-----------------------------------------------------------------------------------|
          6. |                            166218   166,218           .           .               |
          7. |                        138,740†††   138,740           .           .           ††† |
          8. |                 130,646 (204,309)   130,646     204,309           .               |
          9. |                            254160   254,160           .           .               |
         10. |                            238908   238,908           .           .               |
             |-----------------------------------------------------------------------------------|
         11. |           129,517 (175,081)***‡‡‡   129,517     175,081           .        ***‡‡‡ |
         12. |                            228190   228,190           .           .               |
         13. | 129,517 (135,687) (175,081)***‡‡‡   129,517     135,687     175,081        ***‡‡‡ |
         14. |                           188,723   188,723           .           .               |
         15. |                            135586   135,586           .           .               |
             |-----------------------------------------------------------------------------------|
         16. |                 161,349 (197,454)   161,349     197,454           .               |
         17. |                        227,025[5]   227,025           .           .           [5] |
         18. |                        447,125[1]   447,125           .           .           [1] |
         19. |                     336,125[2]***   336,125           .           .        [2]*** |
         20. | 243,250 (317,840) (463,454)[8]†††   243,250     317,840     463,454        [8]††† |
             +-----------------------------------------------------------------------------------+
        Last edited by David Benson; 28 Dec 2018, 13:32.

        Comment


        • #5
          Late, but If someone want to play with a regex solution, the followings seems to do the task:
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str29 salary_today
          "243,250 (307,840) (253,454)"             
          "126,025[12]"                  
          "247,579††"                                  
          "138,740†††"                                   
          "130,646 (204,309)"            
          "129,517 (175,081)***‡‡‡"
          end
          
          gen salary = salary_today /* keep org */
          
          gen note = ustrregexs(1) if ustrregexm(salary,"\){0,1}(\D\d{0,2}\D+?$)")
          replace salary = ustrregexra(salary,"[,)(]+|\){0,1}(\D\d{0,2}\D+?$)","") 
          split salary , destring
          Code:
          . list , sep(10)
          
               +-------------------------------------------------------------------------------------------+
               |                salary_today                 salary     note   salary1   salary2   salary3 |
               |-------------------------------------------------------------------------------------------|
            1. | 243,250 (307,840) (253,454)   243250 307840 253454             243250    307840    253454 |
            2. |                 126,025[12]                 126025     [12]    126025         .         . |
            3. |                   247,579††                 247579       ††    247579         .         . |
            4. |                  138,740†††                 138740      †††    138740         .         . |
            5. |           130,646 (204,309)          130646 204309             130646    204309         . |
            6. |     129,517 (175,081)***‡‡‡          129517 175081   ***‡‡‡    129517    175081         . |
               +-------------------------------------------------------------------------------------------+

          Comment


          • #6
            David Benson Thanks a lot, your code does the job! I really like how you explain every step.
            Bjarte Aagnes Many thanks! The code is very neat and efficacious.

            Comment


            • #7
              Budu Gulo Glad I could help!
              Bjarte Aagnes You've convinced me that I need to knuckle down and really learn regex! I can't believe that your code can solve this in 3 lines! (Not counting creating a copy of salary_today that you can modify :-).

              Comment


              • #8
                Actually, Bjarte Aagnes , would you mind walking us (me) through the regex part below?

                Code:
                gen note = ustrregexs(1) if ustrregexm(salary,"\){0,1}(\D\d{0,2}\D+?$)")
                replace salary = ustrregexra(salary,"[,)(]+|\){0,1}(\D\d{0,2}\D+?$)","")
                I know that ustrregexm(salary, re) is going to return 1 if the regex matches the string in salary, and that ustrregexs(1) is going to return the first substring that matches from ustrregexm(salary,"\){0,1}(\D\d{0,2}\D+?$)"). But that's about it.

                Comment


                • #9
                  David Benson, Below is some description (I hope I don't introduce errors or confusion):

                  The ustrreg*() functions use the ICU regex engine, http://userguide.icu-project.org/strings/regexp
                  The following description of some regex elements used below is taken from the ICU pages. Text marked with * is added/changed by me:
                  Code:
                  ( ... ) Capturing parentheses *defining subexpression n - see -help ustrregexs- *
                  
                  \       Treat the following character as a literal, suppressing any special meaning.  *escaping*
                  {n,m}   Match between n and m times
                  \d      Match any character with the Unicode General Category of Nd (Number, Decimal Digit.)
                  \D      Match any character that is not a decimal digit.
                  +?      Match 1 or more times. Match as few times as possible. *"non-greedy"*
                  $      *End of String Anchor*
                  Code:
                  gen note = ustrregexs(1) if ustrregexm(salary,"\){0,1}(\D\d{0,2}\D+?$)")
                  Match "\){0,1}(\D\d{0,2}\D+?$)" and capture input matching the subexpression "(\D\d{0,2}\D+?$)" to be refered to by ustrregexs().

                  A closer look at the elements:
                  Code:
                  "
                  \){0,1}        right parenthesis zero or one time
                  (             Capture start  
                  \D             non-digit
                  \d{0,2}        zero to two digits
                  \D             non-digit
                  +?             one or more times. Match as few times as possible.
                  $              the end of string
                  )             Capture end
                  "
                  The following replace expression should then be quite easy to read stripping off the characters not wanted including the "note".

                  Below two elements not explaned above is used:
                  Code:
                  [abc]    "character class"/"character set" Match any of the characters a, b or c
                  |        Alternation. A|B matches either A or B.
                  Code:
                  replace salary = ustrregexra(salary,"[,)(]+|\){0,1}(\D\d{0,2}\D+?$)","")
                  
                  "
                  [,)(]+     one ore more of the characters defined
                  |          OR
                  \){0,1}(\D\d{0,2}\D+?$)    "note expression"
                  "
                  Note that within the
                  square brackets [] parentheses have no special meaning and no escaping is necessary.

                  Also, part of the "note"-expression could/should have been shorter,not just copied from the first match expression.
                  Code:
                   replace salary = ustrregexra(salary,"[,)(]+|\D\d{0,2}\D+?$","")
                  For some information on Stata implementations of regular expression see the Statalist post by
                  Hua Peng at https://www.statalist.org/forums/for...ressions/page2
                  Last edited by Bjarte Aagnes; 29 Dec 2018, 07:30.

                  Comment


                  • #10
                    Playing with Regular Expression (re) is usually like walking in a maze. Below code might serves for your specific issue while trying to utilize a memorable re pattern.
                    Code:
                    gen salary_note = ustrregexs(0) if ustrregexm(salary_today,"([^(,) 0-9].*)")
                    
                    gen _s = subinstr(salary_today, salary_note,"",.)
                    split _s, gen(salary) p("(") destring ignore(",)")
                    drop _s

                    Further explanation for the relevant pattern, which also cover cases like [1234] or ***123***:
                    Code:
                    ([^(,) 0-9].*)
                    
                    ( )    Defines a marked subexpression
                    [^ ]   Un-matches characters contained within the brackets (notice the space)
                    .*     Matches one and followed by zero or more characters

                    Comment


                    • #11
                      Bjarte Aagnes , Romalpa Akzo Thanks for the detailed posts! They are super helpful!!!

                      Comment

                      Working...
                      X