Removing multiple words at the end of a string variable

Christian Spek

Join Date: Aug 2023

Posts: 8
#1


16 Aug 2023, 07:07

Hello all,

For my master's thesis research, I need to fuzzy match two datasets based on company names. To make this process easier, I am cleaning up the names, e.g. by removing generic terms like 'Limited', 'Ltd.', 'Co.' at the end of the company names. I am using the following code to do so:

Code:

local to_remove ltd limited inc llc co corp corporation gmbh ag nv bv international int holding pjsc sa se spa plc incorporated holdings aktiengesellschaft coltd as group groep groupe sa/nv
gen rcompanynamelow_clean = reverse(companynamelow_clean)
foreach t of local to_remove {
    local trev = reverse(`"`t'"')
    replace companynamelow_clean = reverse(subinword(rcompanynamelow_clean, `"`trev'"', "", 1)) ///
        if strpos(rcompanynamelow_clean, `"`trev'"') == 1
}
drop rcompanynamelow_clean

I find that some companies have multiple generic words at the end of their names:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str100 companyname
"China Oriental Group Co. Ltd."
"West Fraser Timber Co. Ltd."
"West Fraser Timber Co. Ltd."
"West Fraser Timber Co. Ltd."
"West Fraser Timber Co. Ltd."
"West Fraser Timber Co. Ltd."
"West Fraser Timber Co. Ltd."
"West Fraser Timber Co. Ltd."
"West Fraser Timber Co. Ltd."
"West Fraser Timber Co. Ltd."
end

I first convert all strings to lowercase and remove all punctuation. After I apply the code above, only the last generic term is removed, whereas I would like all generic terms at the end of the string to be deleted. My question is: how can I change the above code to remove all of these terms at the end of the string, and not only the very last one?

With kind regards, and thank you in advance,

Christian Spek

Andrew Musau

Join Date: Oct 2014
Posts: 10069

16 Aug 2023, 07:23

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str100 companyname
"China Oriental Group Co. Ltd."
"West Fraser Timber Co. Ltd."  
"West Fraser Timber Co. Ltd."  
"West Fraser Timber Co. Ltd."  
"West Fraser Timber Co. Ltd."  
"West Fraser Timber Co. Ltd."  
"West Fraser Timber Co. Ltd."  
"West Fraser Timber Co. Ltd."  
"West Fraser Timber Co. Ltd."  
"West Fraser Timber Co. Ltd."  
end

replace companyname= ustrregexra(" "+ lower(companyname)+ " ", "\b(co|ltd|limited)\b|\.", "")

Res.:

Code:

. l, sep(0)

     +--------------------------+
     |              companyname |
     |--------------------------|
  1. |  china oriental group    |
  2. |    west fraser timber    |
  3. |    west fraser timber    |
  4. |    west fraser timber    |
  5. |    west fraser timber    |
  6. |    west fraser timber    |
  7. |    west fraser timber    |
  8. |    west fraser timber    |
  9. |    west fraser timber    |
 10. |    west fraser timber    |
     +--------------------------+
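A possible variation on the pattern in #2, offered as a sketch rather than a solution tested on the full data: anchoring the match at the end of the string with `$` and letting the group repeat with `+` removes a whole run of trailing terms in one pass, while the same words elsewhere in the name are left alone. The variable name `name_clean` and the three-term list are placeholders chosen for illustration.

```stata
clear
input str100 companyname
"China Oriental Group Co. Ltd."
"West Fraser Timber Co. Ltd."
end

* Lower-case and strip full stops as in #2, but without padding spaces
* (name_clean is a hypothetical variable name)
gen name_clean = strtrim(ustrregexra(lower(companyname), "\.", ""))

* (\s+(co|ltd|limited))+$ matches one or more listed words, each preceded
* by whitespace, provided the run extends to the end of the string
replace name_clean = ustrregexra(name_clean, "(\s+(co|ltd|limited))+$", "")

list, sep(0)
```

On these two names this should leave `china oriental group` and `west fraser timber`. Because every listed word must be preceded by whitespace, a name that consists only of a listed word is not emptied. Adding `group` to the alternation would also strip it from the first name.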


Christian Spek

Join Date: Aug 2023

Posts: 8
#3

16 Aug 2023, 07:27

Andrew Musau, thank you for your response. Do you also have a suggestion that builds on the code I posted above? Perhaps a loop of some sort that checks the last word of the string each time?

Andrew Musau

Join Date: Oct 2014
Posts: 10069

16 Aug 2023, 07:32

Do you just want a list of words that appear at the end of each observation? What does #2 not accomplish?

Code:

sysuse auto, clear
list make in 1/10, clean
gen keywords= word(make, -1)
levelsof keywords in 1/10, local(keywords) sep(|) clean
display "`keywords'"

Res.:

Code:

. list make in 1/10, clean

       make           
  1.   AMC Concord    
  2.   AMC Pacer      
  3.   AMC Spirit     
  4.   Buick Century  
  5.   Buick Electra  
  6.   Buick LeSabre  
  7.   Buick Opel     
  8.   Buick Regal    
  9.   Buick Riviera  
 10.   Buick Skylark  

. 
. gen keywords= word(make, -1)

. 
. levelsof keywords in 1/10, local(keywords) sep(|) clean
Century|Concord|Electra|LeSabre|Opel|Pacer|Regal|Riviera|Skylark|Spirit

. 
. display "`keywords'"
Century|Concord|Electra|LeSabre|Opel|Pacer|Regal|Riviera|Skylark|Spirit

Last edited by Andrew Musau; 16 Aug 2023, 07:38.


Christian Spek

Join Date: Aug 2023

Posts: 8
#5

16 Aug 2023, 07:52

Dear Andrew,

The code you gave me also takes away terms at the beginning of the string, and adds a space, which is undesirable for me. What I would like is code that looks at the last word in a given string and deletes it if it is in the list of words I provide. The code in my original post already does this, but it does not check again after deleting the first term.

I hope this clarifies.

With kind regards,

Christian Spek
Christian Spek

Join Date: Aug 2023

Posts: 8
#6

16 Aug 2023, 08:10

The original code I used was suggested by Clyde Schechter in this thread: https://www.statalist.org/forums/for...nd-from-string. Hopefully someone can add to it.
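A loop in the spirit of the request in #5 (look at the last word, delete it if it is on the list, then look again) could be sketched as below. This is only an illustration under stated assumptions: the names are already lowercase, free of punctuation and trimmed, the variable `name` and the short term list are hypothetical, and the `wordcount(name) > 1` guard is an added safeguard so that a name consisting solely of a listed word is kept.

```stata
clear
input str100 name
"china oriental group co ltd"
"west fraser timber co ltd"
"holding"
end

* Illustrative subset of the full list in #1
local to_remove ltd limited co group holding

* Repeat full passes over the list until a pass changes nothing
local changed 1
while `changed' {
    local changed 0
    foreach t of local to_remove {
        * Count names with more than one word whose last word is the current term
        quietly count if word(name, -1) == "`t'" & wordcount(name) > 1
        if r(N) > 0 {
            * Keep everything before the last space, then trim
            quietly replace name = strtrim(substr(name, 1, strrpos(name, " ") - 1)) ///
                if word(name, -1) == "`t'" & wordcount(name) > 1
            local changed 1
        }
    }
}

list, sep(0)
```

With this term list the first name ends up as `china oriental`, because `group` is also treated as generic once `co` and `ltd` are gone. The third observation stays `holding`.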
Andrew Musau

Join Date: Oct 2014

Posts: 10069
#7

16 Aug 2023, 08:19

Originally posted by Christian Spek

The code you gave me also takes away terms at the beginning of the string, and adds a space, which is undesirable for me.

Provide a dataex example with words at the beginning deleted.

Christian Spek

Join Date: Aug 2023
Posts: 8

16 Aug 2023, 08:45

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str100 companyname str97 companyname_clean
`"Etablissements Delhaize Frères et Cie ""Le Lion"" (Groupe Delhaize) SA"' `" etablissements delhaize frères et cie ""le lion"" ( delhaize) sa "'
`"Etablissements Delhaize Frères et Cie ""Le Lion"" (Groupe Delhaize) SA"' `" etablissements delhaize frères et cie ""le lion"" ( delhaize) sa "'
`"Etablissements Delhaize Frères et Cie ""Le Lion"" (Groupe Delhaize) SA"' `" etablissements delhaize frères et cie ""le lion"" ( delhaize) sa "'
`"Etablissements Delhaize Frères et Cie ""Le Lion"" (Groupe Delhaize) SA"' `" etablissements delhaize frères et cie ""le lion"" ( delhaize) sa "'
`"Etablissements Delhaize Frères et Cie ""Le Lion"" (Groupe Delhaize) SA"' `" etablissements delhaize frères et cie ""le lion"" ( delhaize) sa "'
end

Not at the beginning of the string, but in the middle of it, which I do not want. 'Groupe' is also one of the words on my list.

Last edited by Christian Spek; 16 Aug 2023, 08:47.


Andrew Musau

Join Date: Oct 2014
Posts: 10069

16 Aug 2023, 09:28

I do not see the word "Groupe" repeated, though "Delhaize" is. For the modified example below, here is how you can eliminate a word only if it appears at the end of the string. For space management, see

Code:

help strtrim()

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str100 companyname
`"Etablissements Groupe Frères et Cie ""Le Lion"" Groupe"'
`"Etablissements Delhaize Frères et Cie ""Le Lion"" Delhaize"'
end

replace companyname= trim(ustrregexra(" "+ companyname+ " ", `"[\s\b](Delhaize|Groupe)[\b\s]$"', ""))

Res.:

Code:

. l

     +---------------------------------------------------+
     |                                       companyname |
     |---------------------------------------------------|
  1. |   Etablissements Groupe Frères et Cie ""Le Lion"" |
  2. | Etablissements Delhaize Frères et Cie ""Le Lion"" |
     +---------------------------------------------------+
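To carry the end-of-string idea over to a longer term list and to names that keep their original capitalization, one option is to build the alternation from a local macro and pass the optional fourth argument of `ustrregexra()` to request a case-insensitive match. The sketch below is an assumption-laden illustration, not the code used in this thread: the two names, the short list, and the names `alt` and `stripped` are made up for the example.

```stata
clear
input str100 companyname
"Groupe Delhaize Group SA"
"China Oriental Group Co Ltd"
end

* Illustrative subset of the list in #1
local to_remove ltd limited co group groupe sa

* Turn the space-separated list into a regex alternation: ltd|limited|...
local alt : subinstr local to_remove " " "|", all

* Remove every listed word in the trailing run; the final 1 makes the
* match case-insensitive. A listed word at the very start of the name
* is kept because the pattern requires whitespace before each word.
gen stripped = ustrregexra(strtrim(companyname), "(\s+(`alt'))+$", "", 1)

list, sep(0)
```

Here the leading "Groupe" in the first name survives while the trailing "Group SA" is removed, giving `Groupe Delhaize`. The second name becomes `China Oriental`.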
