  • Removing multiple words at the end of a string variable

    Hello all,

    For my master's thesis research, I need to fuzzy match two datasets on company names. To make this easier, I am cleaning up the names, e.g. by removing generic terms like 'Limited', 'Ltd.', 'Co.' at the end of the company names. I am using the following code to do so:

    Code:
    local to_remove ltd limited inc llc co corp corporation gmbh ag nv bv international int holding pjsc sa se spa plc incorporated holdings aktiengesellschaft coltd as group groep groupe sa/nv
    gen rcompanynamelow_clean = reverse(companynamelow_clean)
    foreach t of local to_remove {
        local trev = reverse(`"`t'"')
        replace companynamelow_clean = reverse(subinword(rcompanynamelow_clean, `"`trev'"', "", 1))  ///
            if strpos(rcompanynamelow_clean, `"`trev'"') == 1
    }
    drop rcompanynamelow_clean
    I find that some companies have multiple generic words at the end of their names:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str100 companyname
    "China Oriental Group Co. Ltd."
    "West Fraser Timber Co. Ltd."  
    "West Fraser Timber Co. Ltd."  
    "West Fraser Timber Co. Ltd."  
    "West Fraser Timber Co. Ltd."  
    "West Fraser Timber Co. Ltd."  
    "West Fraser Timber Co. Ltd."  
    "West Fraser Timber Co. Ltd."  
    "West Fraser Timber Co. Ltd."  
    "West Fraser Timber Co. Ltd."  
    end
    I first transform all strings to lower case and remove all punctuation. After I apply the code above, only the last generic term is removed, whereas I would like all generic terms at the end of the string to be deleted. My question is: how can I change the above code so that it removes all of these terms at the end of the string, and not only the very last one?

    With kind regards, and thank you in advance,

    Christian Spek
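
The preprocessing described in #1 (lower case, then punctuation clean-up) is not shown in the post. A minimal sketch of one way to do it follows, assuming Stata 14 or later for the Unicode regex functions. Replacing punctuation with a blank rather than deleting it is a choice made here, not necessarily what the original poster did; the variable name companynamelow_clean is taken from the code in #1.

```stata
clear
input str100 companyname
"China Oriental Group Co. Ltd."
"West Fraser Timber Co. Ltd."
end

* lower-case the name, then replace every Unicode punctuation character with a blank
gen companynamelow_clean = ustrregexra(ustrlower(companyname), "\p{P}", " ")

* collapse runs of blanks to a single blank and trim both ends
replace companynamelow_clean = stritrim(strtrim(companynamelow_clean))

list, clean
```

With this choice a term such as "sa/nv" becomes the two words "sa nv", so the removal list would need to match that form.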


  • #2
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str100 companyname
    "China Oriental Group Co. Ltd."
    "West Fraser Timber Co. Ltd."  
    "West Fraser Timber Co. Ltd."  
    "West Fraser Timber Co. Ltd."  
    "West Fraser Timber Co. Ltd."  
    "West Fraser Timber Co. Ltd."  
    "West Fraser Timber Co. Ltd."  
    "West Fraser Timber Co. Ltd."  
    "West Fraser Timber Co. Ltd."  
    "West Fraser Timber Co. Ltd."  
    end
    
    replace companyname= ustrregexra(" "+ lower(companyname)+ " ", "\b(co|ltd|limited)\b|\.", "")
    Res.:

    Code:
    . l, sep(0)
    
         +--------------------------+
         |              companyname |
         |--------------------------|
      1. |  china oriental group    |
      2. |    west fraser timber    |
      3. |    west fraser timber    |
      4. |    west fraser timber    |
      5. |    west fraser timber    |
      6. |    west fraser timber    |
      7. |    west fraser timber    |
      8. |    west fraser timber    |
      9. |    west fraser timber    |
     10. |    west fraser timber    |
         +--------------------------+



    • #3
      Andrew Musau, thank you for your response. Do you also have a suggestion that builds on the code I posted above? Perhaps a loop of some sort that looks at the last word of each string?



      • #4
        Do you just want a list of words that appear at the end of each observation? What does #2 not accomplish?

        Code:
        sysuse auto, clear
        list make in 1/10, clean
        gen keywords= word(make, -1)
        levelsof keywords in 1/10, local(keywords) sep(|) clean
        display "`keywords'"
        Res.:

        Code:
        . list make in 1/10, clean
        
               make           
          1.   AMC Concord    
          2.   AMC Pacer      
          3.   AMC Spirit     
          4.   Buick Century  
          5.   Buick Electra  
          6.   Buick LeSabre  
          7.   Buick Opel     
          8.   Buick Regal    
          9.   Buick Riviera  
         10.   Buick Skylark  
        
        . 
        . gen keywords= word(make, -1)
        
        . 
        . levelsof keywords in 1/10, local(keywords) sep(|) clean
        Century|Concord|Electra|LeSabre|Opel|Pacer|Regal|Riviera|Skylark|Spirit
        
        . 
        . display "`keywords'"
        Century|Concord|Electra|LeSabre|Opel|Pacer|Regal|Riviera|Skylark|Spirit
        Last edited by Andrew Musau; 16 Aug 2023, 07:38.



        • #5
          Dear Andrew,

          The code you gave me also takes away terms at the beginning of the string and adds a space, which is undesirable for me. What I would like is code that looks at the last word in a given string and deletes it if it is on the list of words I provide. The code in my original post already does this, but it does not check again after deleting the first term.

          I hope this clarifies.

          With kind regards,

          Christian Spek
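
The behaviour asked for in #5 (check the last word, delete it if it is on the list, then check again) can be sketched as a loop that repeats until a pass removes nothing. This is an illustration under assumptions, not code from the thread: the names are assumed to be already lower-cased and free of punctuation, the list is a shortened version of the one in #1, and lastword and is_generic are hypothetical helper variables.

```stata
clear
input str100 companynamelow_clean
"china oriental group co ltd"
"west fraser timber co ltd"
"groupe delhaize sa"
"co ltd"
end

* shortened version of the list in #1
local to_remove ltd limited inc co corp sa group groupe

replace companynamelow_clean = strtrim(companynamelow_clean)
local more 1
while `more' {
    * last word of each name as it currently stands
    gen lastword = word(companynamelow_clean, -1)
    * is it on the list? (a one-word name is never stripped down to nothing)
    gen byte is_generic = strpos(" `to_remove' ", " " + lastword + " ") > 0 ///
        & wordcount(companynamelow_clean) > 1
    * cut the last word off and trim the blank left behind
    replace companynamelow_clean = strtrim(substr(companynamelow_clean, 1, ///
        strlen(companynamelow_clean) - strlen(lastword))) if is_generic
    * go round again only if something was removed on this pass
    count if is_generic
    local more = r(N) > 0
    drop lastword is_generic
}
list, clean
```

Only the final word is ever examined, so "groupe" at the start of the third name is left alone. Because "group" is on the list, the first name ends up as "china oriental"; drop it from the list if that is too aggressive.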



          • #6
            The original code I used was suggested by Clyde Schechter in this thread: https://www.statalist.org/forums/for...nd-from-string. Hopefully someone can add to it.



            • #7
              Originally posted by Christian Spek
              The code you gave me also takes away terms at the beginning of the string, and adds a space, which is undesirable for me.
              Provide a dataex example with words at the beginning deleted.



              • #8
                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input str100 companyname str97 companyname_clean
                `"Etablissements Delhaize Frères et Cie ""Le Lion"" (Groupe Delhaize) SA"' `" etablissements delhaize frères et cie ""le lion"" ( delhaize) sa "'
                `"Etablissements Delhaize Frères et Cie ""Le Lion"" (Groupe Delhaize) SA"' `" etablissements delhaize frères et cie ""le lion"" ( delhaize) sa "'
                `"Etablissements Delhaize Frères et Cie ""Le Lion"" (Groupe Delhaize) SA"' `" etablissements delhaize frères et cie ""le lion"" ( delhaize) sa "'
                `"Etablissements Delhaize Frères et Cie ""Le Lion"" (Groupe Delhaize) SA"' `" etablissements delhaize frères et cie ""le lion"" ( delhaize) sa "'
                `"Etablissements Delhaize Frères et Cie ""Le Lion"" (Groupe Delhaize) SA"' `" etablissements delhaize frères et cie ""le lion"" ( delhaize) sa "'
                end
                Not at the beginning of the string, but in the middle of it, which I do not want. 'Groupe' is also one of the words on my list.
                Last edited by Christian Spek; 16 Aug 2023, 08:47.



                • #9
                  I do not see the word "Groupe" repeated, though "Delhaize" is. For the modified example below, here is how you can eliminate a word only if it appears at the end of the string.

                  See help strtrim() on space management.

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input str100 companyname
                  `"Etablissements Groupe Frères et Cie ""Le Lion"" Groupe"'
                  `"Etablissements Delhaize Frères et Cie ""Le Lion"" Delhaize"'
                  end
                  
                  replace companyname= trim(ustrregexra(" "+ companyname+ " ", `"[\s\b](Delhaize|Groupe)[\b\s]$"', ""))
                  Res.:

                  Code:
                  . l
                  
                       +---------------------------------------------------+
                       |                                       companyname |
                       |---------------------------------------------------|
                    1. |   Etablissements Groupe Frères et Cie ""Le Lion"" |
                    2. | Etablissements Delhaize Frères et Cie ""Le Lion"" |
                       +---------------------------------------------------+
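
The end-of-string anchor used in #9 removes a single trailing word. As a sketch of how the same idea might be extended to several trailing terms in one pass, the pattern below repeats the "blank plus listed word" group one or more times before `$`. It assumes Stata 14 or later, names that are already lower-cased, trimmed and free of punctuation, and a shortened word list; name and name_clean are hypothetical variable names.

```stata
clear
input str100 name
"china oriental group co ltd"
"west fraser timber co ltd"
"co operative bank ltd"
end

* one or more listed words, each preceded by whitespace, running up to the end of the string
gen name_clean = ustrregexra(name, "(\s+(co|ltd|limited|group))+$", "")

list, clean
```

Because every listed word must be preceded by whitespace and the run must reach the end of the string, "co" at the start of the third name is kept, and a name is never emptied completely.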

