Generate multiple variables from a variable containing symbols and numbers

Budu Gulo

Join Date: Feb 2018
Posts: 238

Generate multiple variables from a variable containing symbols and numbers

27 Dec 2018, 13:33

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str29 salary_today
"243,250 (307,840) (253,454)"  
"322,043 (342,970)"            
"279,102 (365,736)"            
"126,025[12]"                  
"247,579††"                
"166218"                       
"138,740†††"             
"161349"                       
"130,646 (204,309)"            
"254160"                       
"238908"                       
"129,517 (175,081)***‡‡‡"
"228190"                       
"117763"                       
""                             
"188,723"                      
"135586"                       
"161,349 (197,454)"            
"162056"                       
end

The variable salary_today looks like the following

Code:

. list

     +-----------------------------+
     |                salary_today |
     |-----------------------------|
  1. | 243,250 (307,840) (253,454) |
  2. |           322,043 (342,970) |
  3. |           279,102 (365,736) |
  4. |                 126,025[12] |
  5. |                   247,579†† |
     |-----------------------------|
  6. |                      166218 |
  7. |                  138,740††† |
  8. |                      161349 |
  9. |           130,646 (204,309) |
 10. |                      254160 |
     |-----------------------------|
 11. |                      238908 |
 12. |     129,517 (175,081)***‡‡‡ |
 13. |                      228190 |
 14. |                      117763 |
 15. |                             |
     |-----------------------------|
 16. |                     188,723 |
 17. |                      135586 |
 18. |           161,349 (197,454) |
 19. |                      162056 |
     +-----------------------------+

I want to generate four variables salary, salary_p1, salary_p2, and salary_note : salary will contain the numbers before parenthesis; salary_p1 will contain the numbers between the first two parentheses; salary_p2 will contain the numbers between the last parentheses ; salary_note will contain all the symbols ( including [12] in observation 4).

For example, for observation 12 salary will be 129517, salary_p1 will be 175081, salary_p2 will be missing , and salary_note will be ***‡‡‡.

Tags: None

David Benson

Join Date: Oct 2018
Posts: 489

27 Dec 2018, 15:40

I did it as a 4-step process (but others may come up with something better, especially if they are good with regex (which I am not):
Step 1. Create salary_orig (so can make modifications to salary_today)
Step 2. Get rid of [], the * and "†", and "‡") from salary_today
Step 3. Split salary_today into words (can then destring to convert them to numbers)
This will gen salary, salary_p2, salary_p3
Step 4. Use regex or substr to get the symbols (assuming these are always far to the right)

Code:

* Create salary_orig (so can make modifications to salary_today)
clonevar salary_orig = salary_today
order salary_orig, before(salary_today)

ssc install charlist // in case you don't have it
charlist salary_today
di r(ascii)

/*
ASCII characters
  40 = "("   , 41= ")"
  42 = *
  91 = [  , 93=]
  The characters "†" and "‡" are 160, 161, 226
*/

* Get rid of the * and "†", and "‡" from salary_today (I had to do brackets later)
replace salary_today = subinstr( salary_today, "*", "", .)
replace salary_today = subinstr( salary_today, "†", "", .)
replace salary_today = subinstr( salary_today, "‡", "", .)

* Split salary_today into words
split salary_today, gen(sal)  // creates sal1, sal2, sal3 (which can be renamed to salary, salary_p1, salary_p2)
* split() defaults to splitting at spaces
gen word_count = wordcount(salary_orig)   // may be useful to have

* Finding positions of the various symbols (so can create salary_note)
gen has_char = 0
replace has_char = 1 if (strpos(salary_today, "*") | strpos(salary_today, "†") | strpos(salary_today, "‡") | strpos(salary_today, "["))
* Above is not strictly necessary, but nice to find if salary_today contains any of the listed characters 

gen s_pos1 = strpos( salary_orig, "*") if strpos( salary_orig, "*")>=1
gen s_pos2 = strpos( salary_orig, char(160)) if strpos( salary_orig, char(160))>=1
gen s_pos3 = strpos( salary_orig, char(161)) if strpos( salary_orig, char(161))>=1
gen s_pos4 = strpos( salary_orig, char(226)) if strpos( salary_orig, char(226))>=1
gen s_pos_bracket = strpos( salary_orig, "[") if strpos( salary_orig, "[")>=1
egen note_start = rowmin( s_pos1- s_pos_bracket)  // finding position where first symbol starts
gen sal_note = substr( salary_orig, note_start, .)

After all that you end up with

Code:

. list salary_orig sal1 sal2 sal3 note_start sal_note , noobs abbrev(12)

  +-------------------------------------------------------------------------------------------+
  |                 salary_orig          sal1        sal2        sal3   note_start   sal_note |
  |-------------------------------------------------------------------------------------------|
  | 243,250 (307,840) (253,454)       243,250   (307,840)   (253,454)            .            |
  |           322,043 (342,970)       322,043   (342,970)                        .            |
  |           279,102 (365,736)       279,102   (365,736)                        .            |
  |                 126,025[12]   126,025[12]                                    8       [12] |
  |                   247,579††       247,579                                    8         †† |
  |-------------------------------------------------------------------------------------------|
  |                      166218        166218                                    .            |
  |                  138,740†††       138,740                                    8        ††† |
  |                      161349        161349                                    .            |
  |           130,646 (204,309)       130,646   (204,309)                        .            |
  |                      254160        254160                                    .            |
  |-------------------------------------------------------------------------------------------|
  |                      238908        238908                                    .            |
  |     129,517 (175,081)***‡‡‡       129,517   (175,081)                       18     ***‡‡‡ |
  |                      228190        228190                                    .            |
  |                      117763        117763                                    .            |
  |                                                                              .            |
  |-------------------------------------------------------------------------------------------|
  |                     188,723       188,723                                    .            |
  |                      135586        135586                                    .            |
  |           161,349 (197,454)       161,349   (197,454)                        .            |
  |                      162056        162056                                    .            |
  +-------------------------------------------------------------------------------------------+

. list salary_orig has_char s_pos1- s_pos_bracket, noobs abbrev(12)

  +-------------------------------------------------------------------------------------------+
  |                 salary_orig   has_char   s_pos1   s_pos2   s_pos3   s_pos4   s_pos_brac~t |
  |-------------------------------------------------------------------------------------------|
  | 243,250 (307,840) (253,454)          0        .        .        .        .              . |
  |           322,043 (342,970)          0        .        .        .        .              . |
  |           279,102 (365,736)          0        .        .        .        .              . |
  |                 126,025[12]          1        .        .        .        .              8 |
  |                   247,579††          1        .       10        .        8              . |
  |-------------------------------------------------------------------------------------------|
  |                      166218          0        .        .        .        .              . |
  |                  138,740†††          1        .       10        .        8              . |
  |                      161349          0        .        .        .        .              . |
  |           130,646 (204,309)          0        .        .        .        .              . |
  |                      254160          0        .        .        .        .              . |
  |-------------------------------------------------------------------------------------------|
  |                      238908          0        .        .        .        .              . |
  |     129,517 (175,081)***‡‡‡          1       18        .       23       21              . |
  |                      228190          0        .        .        .        .              . |
  |                      117763          0        .        .        .        .              . |
  |                                      0        .        .        .        .              . |
  |-------------------------------------------------------------------------------------------|
  |                     188,723          0        .        .        .        .              . |
  |                      135586          0        .        .        .        .              . |
  |           161,349 (197,454)          0        .        .        .        .              . |
  |                      162056          0        .        .        .        .              . |
  +-------------------------------------------------------------------------------------------+

NOTE: you will probably want then to destring sal1, sal2, and sal3. Also, I couldn't find an effective way to replace "[12]" (and presumably you will have many more with [1], [2]. etc). You may need to do that manually or with a loop, But hopefully this will help get you started!

Last edited by David Benson; 27 Dec 2018, 16:40.

Comment

Budu Gulo

Join Date: Feb 2018
Posts: 238

28 Dec 2018, 06:43

Thank you @ David Benson for the detailed instructions. As you noted, "[12]" is giving me trouble.

I tried another approach, but the issue remains. Here is what I did:

Code:

*get rid of ()
split salary_today , parse(( )) gen(v)
*rename
rename v1 salary //still contains characters
rename v2 salary_p1
rename v4 salary_p2
rename v3 salary_note
* salary_p1 and salary_p2 can be destringed immediately 
destring salary_p1 salary_p2, replace ignore(",")
order salary_p2, after(salary_p1)

Code:

. list

     +----------------------------------------------------------------------------+
     |                salary_today        salary   salary~1   salary~2   salary~e |
     |----------------------------------------------------------------------------|
  1. | 243,250 (307,840) (253,454)      243,250      307840     253454            |
  2. |           322,043 (342,970)      322,043      342970          .            |
  3. |           279,102 (365,736)      279,102      365736          .            |
  4. |                 126,025[12]   126,025[12]          .          .            |
  5. |                   247,579††     247,579††          .          .            |
     |----------------------------------------------------------------------------|
  6. |                      166218        166218          .          .            |
  7. |                  138,740†††    138,740†††          .          .            |
  8. |                      161349        161349          .          .            |
  9. |           130,646 (204,309)      130,646      204309          .            |
 10. |                      254160        254160          .          .            |
     |----------------------------------------------------------------------------|
 11. |                      238908        238908          .          .            |
 12. |     129,517 (175,081)***‡‡‡      129,517      175081          .     ***‡‡‡ |
 13. |                      228190        228190          .          .            |
 14. |                      117763        117763          .          .            |
 15. |                                                    .          .            |
     |----------------------------------------------------------------------------|
 16. |                     188,723       188,723          .          .            |
 17. |                      135586        135586          .          .            |
 18. |           161,349 (197,454)      161,349      197454          .            |
 19. |                      162056        162056          .          .            |
     +----------------------------------------------------------------------------+

Now my question is: how can I remove the characters from salary (see observations 4, 5, and 7) and put them salary_note?

Comment

David Benson

Join Date: Oct 2018
Posts: 489

28 Dec 2018, 13:26

I would also note that your code also didn't bring over the "††" or "††† " in obs #5 or #7. (And they are still in salary for those obs too).

So I made a few changes to my code in #2:
1) I clone salary_today and make a var called salary_copy where I remove all the symbols and then split into salary, salary_p1, and salary_p2. That way salary_today remains unchanged.
2) I added a loop to remove all of the [1], [2], [3]....[15]. I did up to 15 in the loop, but you can replace that with whatever is the highest number footnote you would like.

Note, I added a few more observations to your sample data, adding some more brackets (are those footnotes) just to make sure the code could handle it. See obs 13 and 17-20.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str39 salary_today
"243,250 (307,840) (253,454)"            
"322,043 (342,970)"                      
"279,102 (365,736)"                      
"126,025[12]"                            
"247,579††"                          
"166218"                                
"138,740†††"                      
"130,646 (204,309)"                      
"254160"                                
"238908"                                
"129,517 (175,081)***‡‡‡"          
"228190"                                
"129,517 (135,687) (175,081)***‡‡‡"
"188,723"                                
"135586"                                
"161,349 (197,454)"                      
"227,025[5]"                            
"447,125[1]"                            
"336,125[2]***"                          
"243,250 (317,840) (463,454)[8]†††"
end

. list salary_today

     +-----------------------------------+
     |                      salary_today |
     |-----------------------------------|
  1. |       243,250 (307,840) (253,454) |
  2. |                 322,043 (342,970) |
  3. |                 279,102 (365,736) |
  4. |                       126,025[12] |
  5. |                         247,579†† |
     |-----------------------------------|
  6. |                            166218 |
  7. |                        138,740††† |
  8. |                 130,646 (204,309) |
  9. |                            254160 |
 10. |                            238908 |
     |-----------------------------------|
 11. |           129,517 (175,081)***‡‡‡ |
 12. |                            228190 |
 13. | 129,517 (135,687) (175,081)***‡‡‡ |
 14. |                           188,723 |
 15. |                            135586 |
     |-----------------------------------|
 16. |                 161,349 (197,454) |
 17. |                        227,025[5] |
 18. |                        447,125[1] |
 19. |                     336,125[2]*** |
 20. | 243,250 (317,840) (463,454)[8]††† |
     +-----------------------------------+

Code:

* Create salary_copy (so can make modifications to salary_copy while leaving salary_today untouched) 
clonevar salary_copy = salary_today
order salary_copy, after(salary_today)

charlist salary_today
di r(ascii)


* Get rid of the * and "†", and "‡" from salary_today (I had to do brackets later)  
replace salary_copy = subinstr( salary_copy, "*", "", .)
replace salary_copy = subinstr( salary_copy, "†", "", .)
replace salary_copy = subinstr( salary_copy, "‡", "", .)

* NEW - loop that removes [1], [2], [3]...[15]
* I set this to go from 1 to 15, but choose the highest value in brackets
forvalue i = 1/15  {
replace salary_copy = subinstr( salary_copy, "[`i']", "", .)
}


* Split salary_today into words
split salary_copy, gen(sal)  // creates sal1, sal2, sal3 (which will be renamed to salary, salary_p1, salary_p2)
gen word_count = wordcount(salary_today)   // may be useful to have

* Finding positions of the various symbols (so can create salary_note)
gen has_char = 0
replace has_char = 1 if (strpos(salary_today, "*") | strpos(salary_today, "†") | strpos(salary_today, "‡") | strpos(salary_today, "["))

gen s_pos1 = strpos( salary_today, "*") if strpos( salary_today, "*")>=1
gen s_pos2 = strpos( salary_today, char(160)) if strpos( salary_today, char(160))>=1
gen s_pos3 = strpos( salary_today, char(161)) if strpos( salary_today, char(161))>=1
gen s_pos4 = strpos( salary_today, char(226)) if strpos( salary_today, char(226))>=1
gen s_pos5 = strpos( salary_today, "[") if strpos( salary_today, "[")>=1
egen note_start = rowmin( s_pos1- s_pos5) // finding position where first symbol starts
gen sal_note = substr( salary_today, note_start, .)  // this is where I grab everything after the first symbol and place that in salary_note

* Just labeling some variables
label var has_char "1 if salary_today contains *, [, †, or ‡"
label var note_start "1st position in salary_today where special character appears"

destring sal1 sal2 sal3, replace ignore("," "(" ")" )
format sal1 sal2 sal3 %9.0gc // formatting to comma format
order sal_note, after(sal3)

* Naming them what you wanted
rename sal1 salary
rename sal2 salary_p1
rename sal3 salary_p2
rename sal_note salary_note


list salary_today salary salary_p1 salary_p2 salary_note, abbrev(12)

     +-----------------------------------------------------------------------------------+
     |                      salary_today    salary   salary_p1   salary_p2   salary_note |
     |-----------------------------------------------------------------------------------|
  1. |       243,250 (307,840) (253,454)   243,250     307,840     253,454               |
  2. |                 322,043 (342,970)   322,043     342,970           .               |
  3. |                 279,102 (365,736)   279,102     365,736           .               |
  4. |                       126,025[12]   126,025           .           .          [12] |
  5. |                         247,579††   247,579           .           .            †† |
     |-----------------------------------------------------------------------------------|
  6. |                            166218   166,218           .           .               |
  7. |                        138,740†††   138,740           .           .           ††† |
  8. |                 130,646 (204,309)   130,646     204,309           .               |
  9. |                            254160   254,160           .           .               |
 10. |                            238908   238,908           .           .               |
     |-----------------------------------------------------------------------------------|
 11. |           129,517 (175,081)***‡‡‡   129,517     175,081           .        ***‡‡‡ |
 12. |                            228190   228,190           .           .               |
 13. | 129,517 (135,687) (175,081)***‡‡‡   129,517     135,687     175,081        ***‡‡‡ |
 14. |                           188,723   188,723           .           .               |
 15. |                            135586   135,586           .           .               |
     |-----------------------------------------------------------------------------------|
 16. |                 161,349 (197,454)   161,349     197,454           .               |
 17. |                        227,025[5]   227,025           .           .           [5] |
 18. |                        447,125[1]   447,125           .           .           [1] |
 19. |                     336,125[2]***   336,125           .           .        [2]*** |
 20. | 243,250 (317,840) (463,454)[8]†††   243,250     317,840     463,454        [8]††† |
     +-----------------------------------------------------------------------------------+

Last edited by David Benson; 28 Dec 2018, 13:32.

Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783

28 Dec 2018, 15:20

Late, but If someone want to play with a regex solution, the followings seems to do the task:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str29 salary_today
"243,250 (307,840) (253,454)"             
"126,025[12]"                  
"247,579††"                                  
"138,740†††"                                   
"130,646 (204,309)"            
"129,517 (175,081)***‡‡‡"
end

gen salary = salary_today /* keep org */

gen note = ustrregexs(1) if ustrregexm(salary,"\){0,1}(\D\d{0,2}\D+?$)")
replace salary = ustrregexra(salary,"[,)(]+|\){0,1}(\D\d{0,2}\D+?$)","") 
split salary , destring

Code:

. list , sep(10)

     +-------------------------------------------------------------------------------------------+
     |                salary_today                 salary     note   salary1   salary2   salary3 |
     |-------------------------------------------------------------------------------------------|
  1. | 243,250 (307,840) (253,454)   243250 307840 253454             243250    307840    253454 |
  2. |                 126,025[12]                 126025     [12]    126025         .         . |
  3. |                   247,579††                 247579       ††    247579         .         . |
  4. |                  138,740†††                 138740      †††    138740         .         . |
  5. |           130,646 (204,309)          130646 204309             130646    204309         . |
  6. |     129,517 (175,081)***‡‡‡          129517 175081   ***‡‡‡    129517    175081         . |
     +-------------------------------------------------------------------------------------------+

Comment

Budu Gulo

Join Date: Feb 2018

Posts: 238
#6

28 Dec 2018, 17:52

David Benson Thanks a lot, your code does the job! I really like how you explain every step.
Bjarte Aagnes Many thanks! The code is very neat and efficacious.
Comment
David Benson

Join Date: Oct 2018

Posts: 489
#7

28 Dec 2018, 18:37

Budu Gulo Glad I could help!
Bjarte Aagnes You've convinced me that I need to knuckle down and really learn regex! I can't believe that your code can solve this in 3 lines! (Not counting creating a copy of salary_today that you can modify :-).
Comment
David Benson

Join Date: Oct 2018

Posts: 489
#8

28 Dec 2018, 22:02

Actually, Bjarte Aagnes , would you mind walking us (me) through the regex part below?

Code:

gen note = ustrregexs(1) if ustrregexm(salary,"\){0,1}(\D\d{0,2}\D+?$)") replace salary = ustrregexra(salary,"[,)(]+|\){0,1}(\D\d{0,2}\D+?$)","")

I know that ustrregexm(salary, re) is going to return 1 if the regex matches the string in salary, and that ustrregexs(1) is going to return the first substring that matches from ustrregexm(salary,"\){0,1}(\D\d{0,2}\D+?$)"). But that's about it.
Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783

29 Dec 2018, 07:04

David Benson, Below is some description (I hope I don't introduce errors or confusion):

The ustrreg*() functions use the ICU regex engine, http://userguide.icu-project.org/strings/regexp
The following description of some regex elements used below is taken from the ICU pages. Text marked with * is added/changed by me:

Code:

( ... ) Capturing parentheses *defining subexpression n - see -help ustrregexs- *

\       Treat the following character as a literal, suppressing any special meaning.  *escaping*
{n,m}   Match between n and m times
\d      Match any character with the Unicode General Category of Nd (Number, Decimal Digit.)
\D      Match any character that is not a decimal digit.
+?      Match 1 or more times. Match as few times as possible. *"non-greedy"*
$      *End of String Anchor*

Code:

gen note = ustrregexs(1) if ustrregexm(salary,"\){0,1}(\D\d{0,2}\D+?$)")

Match "\){0,1}(\D\d{0,2}\D+?$)" and capture input matching the subexpression "(\D\d{0,2}\D+?$)" to be refered to by ustrregexs().

A closer look at the elements:

Code:

"
\){0,1}        right parenthesis zero or one time
(             Capture start  
\D             non-digit
\d{0,2}        zero to two digits
\D             non-digit
+?             one or more times. Match as few times as possible.
$              the end of string
)             Capture end
"

The following replace expression should then be quite easy to read stripping off the characters not wanted including the "note".

Below two elements not explaned above is used:

Code:

[abc]    "character class"/"character set" Match any of the characters a, b or c
|        Alternation. A|B matches either A or B.

Code:

replace salary = ustrregexra(salary,"[,)(]+|\){0,1}(\D\d{0,2}\D+?$)","")

"
[,)(]+     one ore more of the characters defined
|          OR
\){0,1}(\D\d{0,2}\D+?$)    "note expression"
"

Note that within the square brackets [] parentheses have no special meaning and no escaping is necessary.

Also, part of the "note"-expression could/should have been shorter,not just copied from the first match expression.

Code:

 replace salary = ustrregexra(salary,"[,)(]+|\D\d{0,2}\D+?$","")

For some information on Stata implementations of regular expression see the Statalist post by Hua Peng at https://www.statalist.org/forums/for...ressions/page2

Last edited by Bjarte Aagnes; 29 Dec 2018, 07:30.

Comment

Romalpa Akzo

Join Date: Oct 2017
Posts: 369

#10

29 Dec 2018, 09:09

Playing with Regular Expression (re) is usually like walking in a maze. Below code might serves for your specific issue while trying to utilize a memorable re pattern.

Code:

gen salary_note = ustrregexs(0) if ustrregexm(salary_today,"([^(,) 0-9].*)")

gen _s = subinstr(salary_today, salary_note,"",.)
split _s, gen(salary) p("(") destring ignore(",)")
drop _s

Further explanation for the relevant pattern, which also cover cases like [1234] or ***123***:

Code:

([^(,) 0-9].*)

( )    Defines a marked subexpression
[^ ]   Un-matches characters contained within the brackets (notice the space)
.*     Matches one and followed by zero or more characters

Comment

David Benson

Join Date: Oct 2018

Posts: 489
#11

29 Dec 2018, 13:00

Bjarte Aagnes , Romalpa Akzo Thanks for the detailed posts! They are super helpful!!!
Comment

Announcement

Generate multiple variables from a variable containing symbols and numbers

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment