Extract content in paranthesis

Stephen Okiya

Join Date: Jul 2025
Posts: 280

Extract content in paranthesis

17 Feb 2022, 02:36

Dear Stata users,

I am using Stata 16 and I have country names in a string variable and would like to extract them. Below is an example of data with the original variable and my desired

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str47 original str23 new
"XXX TRFD (China)"                                "China"                  
"SGDTE  (Uruguay) FGRED BHEGR (New Zealand)"      "Uruguay / New Zealand"  
"GHENDGG (Nigeria)"                               "Nigeria"                
"MKGSER (South Africa)"                           "South Africa"           
"BHGED (United States) VGDENW (Vietnam) HGDNNEEE" "United States / Vietnam"
end

Thanks in advance!

Tags: None

Andrew Musau

Join Date: Oct 2014
Posts: 10191

17 Feb 2022, 04:58

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str47 original str23 new
"XXX TRFD (China)"                                "China"                  
"SGDTE  (Uruguay) FGRED BHEGR (New Zealand)"      "Uruguay / New Zealand"  
"GHENDGG (Nigeria)"                               "Nigeria"                
"MKGSER (South Africa)"                           "South Africa"          
"BHGED (United States) VGDENW (Vietnam) HGDNNEEE" "United States / Vietnam"
end

gen text=original
gen howmany= (length(original) - length(ustrregexra(original, "\(|\)", "")))/2
qui sum howmany
forval i=1/`r(max)'{
    gen wanted`i'=ustrregexra(text,  "^.*\((.*)\).*$", "$1")
    replace text= ustrregexra(text, "\("+wanted`i'+"\)", "", 1)
    replace wanted`i'="" if wanted`i'==text
}
drop text

Res.:

Code:

. l

     +--------------------------------------------------------------------------------------------------------------------+
     |                                        original                       new   howmany        wanted1         wanted2 |
     |--------------------------------------------------------------------------------------------------------------------|
  1. |                                XXX TRFD (China)                     China         1          China                 |
  2. |      SGDTE  (Uruguay) FGRED BHEGR (New Zealand)     Uruguay / New Zealand         2    New Zealand         Uruguay |
  3. |                               GHENDGG (Nigeria)                   Nigeria         1        Nigeria                 |
  4. |                           MKGSER (South Africa)              South Africa         1   South Africa                 |
  5. | BHGED (United States) VGDENW (Vietnam) HGDNNEEE   United States / Vietnam         2        Vietnam   United States |
     +--------------------------------------------------------------------------------------------------------------------+

Last edited by Andrew Musau; 17 Feb 2022, 05:09.

Comment

Stephen Okiya

Join Date: Jul 2025

Posts: 280
#3

17 Feb 2022, 08:55

Thanks so much Andrew Musau for the hints. This is super helpful!!
Comment
Stephen Okiya

Join Date: Jul 2025

Posts: 280
#4

17 Feb 2022, 09:50

Is it possible to find unique values as in the example below

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str42 original str7 new "SGDTE (Uruguay) FGRED BHEGR (Uruguay)" "Uruguay" end

Thanks in advance
Comment

Andrew Musau

Join Date: Oct 2014
Posts: 10191

17 Feb 2022, 12:11

I think the code in #2 already does that. Including your example plus one other:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str99 original str23 new
"XXX TRFD (China)"                                "China"                  
"SGDTE  (Uruguay) FGRED BHEGR (New Zealand)"      "Uruguay / New Zealand"  
"GHENDGG (Nigeria)"                               "Nigeria"                
"MKGSER (South Africa)"                           "South Africa"          
"BHGED (United States) VGDENW (Vietnam) HGDNNEEE" "United States / Vietnam"
"SGDTE (Uruguay) FGRED BHEGR (Uruguay)" "Uruguay"
"SGDTE  (Uruguay) FGRED (Peru) BHEGR (Uruguay) XYZ (France) LMZ (Peru) (Uruguay)" "Uruguay"
end

gen text=original
gen howmany= (length(original) - length(ustrregexra(original, "\(|\)", "")))/2
qui sum howmany
forval i=1/`r(max)'{
    gen wanted`i'=ustrregexra(text,  "^.*\((.*)\).*$", "$1")
    replace text= ustrregexra(text, "\("+wanted`i'+"\)", "", 1)
    replace wanted`i'="" if wanted`i'==text
    sort wanted`i'
    if missing(wanted`i'[_N]){
        drop wanted`i'
    }
}
drop text

I just add to the code to eliminate empty variables that are due to duplicates.

Res.:

Code:

. l original howmany wanted*, sep(0)

     +------------------------------------------------------------------------------------------------------------------------------------+
     |                                                                        original   howmany        wanted1         wanted2   wanted3 |
     |------------------------------------------------------------------------------------------------------------------------------------|
  1. |                                                               GHENDGG (Nigeria)         1        Nigeria                           |
  2. |                                 BHGED (United States) VGDENW (Vietnam) HGDNNEEE         2        Vietnam   United States           |
  3. |                                      SGDTE  (Uruguay) FGRED BHEGR (New Zealand)         2    New Zealand         Uruguay           |
  4. |                                                           MKGSER (South Africa)         1   South Africa                           |
  5. |                                                                XXX TRFD (China)         1          China                           |
  6. |                                           SGDTE (Uruguay) FGRED BHEGR (Uruguay)         2        Uruguay                           |
  7. | SGDTE  (Uruguay) FGRED (Peru) BHEGR (Uruguay) XYZ (France) LMZ (Peru) (Uruguay)         6        Uruguay            Peru    France |
     +------------------------------------------------------------------------------------------------------------------------------------+

Comment

Stephen Okiya

Join Date: Jul 2025

Posts: 280
#6

17 Feb 2022, 12:24

Andrew Musau Thanks a bunch! This is the desired output.
Comment

Romalpa Akzo

Join Date: Oct 2017
Posts: 369

17 Feb 2022, 18:17

Another way to go:

Code:

split original, gen(_) p("(" ")")

forval i = 4(2)`r(nvars)' {
    replace _2 = _2 + ")" + _`i' if !strpos(_2, _`i')
}

split _2, gen(wanted) p(")")
drop _*

Comment

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783

18 Feb 2022, 06:07

Code:

gen new2 = ustrregexra(original, "(:?^|\)).*?(:?\(|$)", "|") 

split new2 , parse("|") 

replace new2 = ""

foreach v of varlist `r(varlist)' {
      
  replace new2 = new2 + `v' +  (strlen(`v') > 0 ) * " / "   if !ustrpos(new2, `v')    

}

drop `r(varlist)'

replace new2 = ustrregexrf(new2, "[/ ]+$","") // strip of ending "/"

Code:

. list new2 , noobs sep(100)

  +-------------------------+
  |                    new2 |
  |-------------------------|
  |                   China |
  |   Uruguay / New Zealand |
  |                 Nigeria |
  |            South Africa |
  | United States / Vietnam |
  |                 Uruguay |
  | Uruguay / Peru / France |
  +-------------------------+

Last edited by Bjarte Aagnes; 18 Feb 2022, 06:17.

Comment

Stephen Okiya

Join Date: Jul 2025

Posts: 280
#9

23 Feb 2022, 03:54

Thanks Romalpa Akzo for your guidance
Comment
Stephen Okiya

Join Date: Jul 2025

Posts: 280
#10

23 Feb 2022, 03:55

Bjarte Aagnes I am grateful for your awesome solution!!!
Comment

Announcement

Extract content in paranthesis

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment