Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Extract content in paranthesis

    Dear Stata users,

    I am using Stata 16 and I have country names in a string variable and would like to extract them. Below is an example of data with the original variable and my desired

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str47 original str23 new
    "XXX TRFD (China)"                                "China"                  
    "SGDTE  (Uruguay) FGRED BHEGR (New Zealand)"      "Uruguay / New Zealand"  
    "GHENDGG (Nigeria)"                               "Nigeria"                
    "MKGSER (South Africa)"                           "South Africa"           
    "BHGED (United States) VGDENW (Vietnam) HGDNNEEE" "United States / Vietnam"
    end
    Thanks in advance!


  • #2
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str47 original str23 new
    "XXX TRFD (China)"                                "China"                  
    "SGDTE  (Uruguay) FGRED BHEGR (New Zealand)"      "Uruguay / New Zealand"  
    "GHENDGG (Nigeria)"                               "Nigeria"                
    "MKGSER (South Africa)"                           "South Africa"          
    "BHGED (United States) VGDENW (Vietnam) HGDNNEEE" "United States / Vietnam"
    end
    
    gen text=original
    gen howmany= (length(original) - length(ustrregexra(original, "\(|\)", "")))/2
    qui sum howmany
    forval i=1/`r(max)'{
        gen wanted`i'=ustrregexra(text,  "^.*\((.*)\).*$", "$1")
        replace text= ustrregexra(text, "\("+wanted`i'+"\)", "", 1)
        replace wanted`i'="" if wanted`i'==text
    }
    drop text
    Res.:

    Code:
    . l
    
         +--------------------------------------------------------------------------------------------------------------------+
         |                                        original                       new   howmany        wanted1         wanted2 |
         |--------------------------------------------------------------------------------------------------------------------|
      1. |                                XXX TRFD (China)                     China         1          China                 |
      2. |      SGDTE  (Uruguay) FGRED BHEGR (New Zealand)     Uruguay / New Zealand         2    New Zealand         Uruguay |
      3. |                               GHENDGG (Nigeria)                   Nigeria         1        Nigeria                 |
      4. |                           MKGSER (South Africa)              South Africa         1   South Africa                 |
      5. | BHGED (United States) VGDENW (Vietnam) HGDNNEEE   United States / Vietnam         2        Vietnam   United States |
         +--------------------------------------------------------------------------------------------------------------------+
    Last edited by Andrew Musau; 17 Feb 2022, 06:09.

    Comment


    • #3
      Thanks so much Andrew Musau for the hints. This is super helpful!!

      Comment


      • #4
        Is it possible to find unique values as in the example below

        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str42 original str7 new
        "SGDTE  (Uruguay) FGRED BHEGR (Uruguay)" "Uruguay"
        end

        Thanks in advance

        Comment


        • #5
          I think the code in #2 already does that. Including your example plus one other:

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input str99 original str23 new
          "XXX TRFD (China)"                                "China"                  
          "SGDTE  (Uruguay) FGRED BHEGR (New Zealand)"      "Uruguay / New Zealand"  
          "GHENDGG (Nigeria)"                               "Nigeria"                
          "MKGSER (South Africa)"                           "South Africa"          
          "BHGED (United States) VGDENW (Vietnam) HGDNNEEE" "United States / Vietnam"
          "SGDTE (Uruguay) FGRED BHEGR (Uruguay)" "Uruguay"
          "SGDTE  (Uruguay) FGRED (Peru) BHEGR (Uruguay) XYZ (France) LMZ (Peru) (Uruguay)" "Uruguay"
          end
          
          gen text=original
          gen howmany= (length(original) - length(ustrregexra(original, "\(|\)", "")))/2
          qui sum howmany
          forval i=1/`r(max)'{
              gen wanted`i'=ustrregexra(text,  "^.*\((.*)\).*$", "$1")
              replace text= ustrregexra(text, "\("+wanted`i'+"\)", "", 1)
              replace wanted`i'="" if wanted`i'==text
              sort wanted`i'
              if missing(wanted`i'[_N]){
                  drop wanted`i'
              }
          }
          drop text
          I just add to the code to eliminate empty variables that are due to duplicates.

          Res.:

          Code:
          . l original howmany wanted*, sep(0)
          
               +------------------------------------------------------------------------------------------------------------------------------------+
               |                                                                        original   howmany        wanted1         wanted2   wanted3 |
               |------------------------------------------------------------------------------------------------------------------------------------|
            1. |                                                               GHENDGG (Nigeria)         1        Nigeria                           |
            2. |                                 BHGED (United States) VGDENW (Vietnam) HGDNNEEE         2        Vietnam   United States           |
            3. |                                      SGDTE  (Uruguay) FGRED BHEGR (New Zealand)         2    New Zealand         Uruguay           |
            4. |                                                           MKGSER (South Africa)         1   South Africa                           |
            5. |                                                                XXX TRFD (China)         1          China                           |
            6. |                                           SGDTE (Uruguay) FGRED BHEGR (Uruguay)         2        Uruguay                           |
            7. | SGDTE  (Uruguay) FGRED (Peru) BHEGR (Uruguay) XYZ (France) LMZ (Peru) (Uruguay)         6        Uruguay            Peru    France |
               +------------------------------------------------------------------------------------------------------------------------------------+

          Comment


          • #6
            Andrew Musau Thanks a bunch! This is the desired output.

            Comment


            • #7
              Another way to go:
              Code:
              split original, gen(_) p("(" ")")
              
              forval i = 4(2)`r(nvars)' {
                  replace _2 = _2 + ")" + _`i' if !strpos(_2, _`i')
              }
              
              split _2, gen(wanted) p(")")
              drop _*

              Comment


              • #8
                Code:
                gen new2 = ustrregexra(original, "(:?^|\)).*?(:?\(|$)", "|") 
                
                split new2 , parse("|") 
                
                replace new2 = ""
                
                foreach v of varlist `r(varlist)' {
                      
                  replace new2 = new2 + `v' +  (strlen(`v') > 0 ) * " / "   if !ustrpos(new2, `v')    
                
                }
                
                drop `r(varlist)'
                
                replace new2 = ustrregexrf(new2, "[/ ]+$","") // strip of ending "/"
                Code:
                . list new2 , noobs sep(100)
                
                  +-------------------------+
                  |                    new2 |
                  |-------------------------|
                  |                   China |
                  |   Uruguay / New Zealand |
                  |                 Nigeria |
                  |            South Africa |
                  | United States / Vietnam |
                  |                 Uruguay |
                  | Uruguay / Peru / France |
                  +-------------------------+
                Last edited by Bjarte Aagnes; 18 Feb 2022, 07:17.

                Comment


                • #9
                  Thanks Romalpa Akzo for your guidance

                  Comment


                  • #10
                    Bjarte Aagnes I am grateful for your awesome solution!!!

                    Comment

                    Working...
                    X