Generate new grouping variable using two string variables of non-matching structure requiring collapse

Michael Middletown

Join Date: Mar 2022
Posts: 10

Generate new grouping variable using two string variables of non-matching structure requiring collapse

21 Mar 2022, 07:39

Good morning all,

My first STATA post, so apologies if I have structured this request for help incorrectly!

I am working with a reasonably large dataset (500,000 observations) in Stata/IC 15.

I am trying to generate a new grouping variable based on the content of 2 string variables, with the view to collapsing sub-families etc.

Example data

String variable 1	String variable 2	Desired variable
Catalogue number	Product description	Product brand alone

001-456-789	Brand 1, Size 2, Colour 3	Brand 1
001-456-123	Brand 1, Size 2, Colour 7	Brand 1
001-246-789	Brand 1, Size 8, Colour 3	Brand 1

ABCDE123XYZ	Brand 2, Size 4, Colour 2	Brand 2
ABCDE456XYZ	Brand 2, Size 6, Colour 2	Brand 2
ABCDE123XYZ	Brand 2, Size 4, Colour 2	Brand 2
ABCDE289GBH	Brand 2, Size 7, Colour 3	Brand 2

Xyy289GHH	Brand 3, Colour 3, Size 1	Brand 3
Xyy289BNH	Brand 3, Colour 3, Size 3	Brand 3
Xyy567JHJ	Brand 3, Colour 4, Size 4	Brand 3

*Colors added only to highlight that catalogue number data structure can represent different product features dependent on manufacturer. Total length of this varies (as does proportion of string length per characteristic), as well as some being numeric and some alphanumeric.

There are ~1500 unique catalogue numbers/descriptions with frequencies ranging from 1 to ~70,000 (a few variables represent the majority of cases etc).

I have been unable to determine a method to loop across these variables and define them by product brand alone.

The closest I have achieved is manually sorting on product description, comparing the catalogue number and product description, and then using strmatch to create a flag to then generate the product brand variable based on the common structure for that particular brand (copied below):

This would however mean manually going through the below for every one of the ~1500 possible brands (a thought I do not relish).

Code:

//compare count of cat numbers vs Brand 1 Size(x) Colour (y)
//useful for manual checking of smaller groups e.g. visual inspection of data shows there are 20 observations of this brand, if strmatch returns 20, unlikely will incorrectly code other brands as Brand 1
//this not used in generating productbrand field, purely a manual check
count if strmatch(cataloguenumber, "001-???-???")==1

//create variable to code brand
//labels to be modified on rolling basis of each brand
gen productbrand=.
label define brands 1 "Brand1"
label values productbrand brands

//brandstemp used for each strmatch loop
gen brandstemp=.

//flags if cataloguenumber belongs to 'common' group/structure for that brand, flags temp as 1
replace brandstemp = strmatch(cataloguenumber, "001-???-???")

//create flag for Brand 1
replace productbrand=1 if brandstemp==1

//reset temp field for next grouping
replace brandstemp=.

I have been unable to find a way to loop the above process.

My other concern is that human error could result in faults - the above requires manually updating the labels list by 1 each time, so that productbrand ends up as 1=Brand1, 2=Brand2, 3=Brand3 etc.

My suspicion is that there is no way to loop or automate the above process, due to the lack of consistent data structure in the catalogue numbers etc. Having spent some time trying to find a way around this, I wanted to ask the STATA community if my suspicions are correct before spending more hours trying to solve an unsolvable problem! I have tried the STATA help and forums, but did not find a solution.

Many thanks in advance,

Michael

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

21 Mar 2022, 09:15

I'm very confused by your post. Your tableau, as I see it, says that extracting the brand from the product description variable requires nothing more than isolating the substring that precedes the first comma, and has nothing whatsoever to do with the catalogue number. I think that the matter might be better understood if you show actual example data, from your Stata data set, using the -dataex- command.

If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
Comment

Michael Middletown

Join Date: Mar 2022
Posts: 10

21 Mar 2022, 10:35

Good afternoon Clyde

Many thanks for your reply. Apologies, -dataex- output below. Good example of why my manual table was unhelpful - the real data is not as structured as I made out above. There are no limiters in the product description, nor are they of the same length or word order between each observation. Similarly the catalogue numbers can vary in length and alphanumeric structure.

My logic to date was ordering data on productdescription, and manually reviewing the catno and productdescription together to identify the common feature of the catalogue number, to then create the grouping variable on this via strmatch. It is this step that makes me think it is something that needs to be done manually?

Regards,

Michael

Code:

clear
input str20 catno str99 productdescription
"201256394"  "Blueprint Clothing Green Shirt Small"             
"201256394"  "Blueprint Clothing Green Shirt Small"                    
"201256395"  "Blueprint Clothing Green Shirt Large   
"110011"     "Clothex Goretex Shirt Large Green"                      
"110011"     "Clothex Goretex Shirt Large Green"
"110011"     "Clothex Goretex Shirt Large Green"                                                                                
"4224-11-12" "Exhibit Wear Tee Maroon Large"            
"4224-11-12" "Exhibit Wear Tee Maroon Large"
"4224-11-12" "Exhibit Wear Tee Maroon Large"
"4224-16-20" "Exhibit Wear Tee Green Small"
"4224-16-20" "Exhibit Wear Tee Green Small"
"4224-16-20" "Exhibit Wear Tee Green Small"
"4224-16-22" "Exhibit Wear Tee Green XSmall"                                                            
"9892-15-02" "Treez Shirt Navy Medium"                  
"9892-15-02" "Treez Shirt Navy Medium"
"9892-15-02" "Treez Shirt Navy Medium"                    
"4324G6512"  "QuadWare Short Khaki Medium"                 
"4324G6512"  "QuadWare Short Khaki Medium"                  
"4324G6512"  "QuadWare Short Khaki Medium"                                      
"4324G6512"  "QuadWare Short Khaki Medium"                 
end

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

21 Mar 2022, 12:55

So you are really trying to identify the common substring in the catalog number among various products whose product description suggests they are the same brand. I have to agree, that this is not the kind of task that is well-suited to Stata. I don't want to categorically say it can't be done, but I can't think of any way to do it. It sounds like a task that would be better suited to some kind of text-processing artificial intelligence software.
Comment
Michael Middletown

Join Date: Mar 2022

Posts: 10
#5

22 Mar 2022, 03:32

Good morning Clyde,

Many thanks for your advice. That was my concern as well, so will look at alternative solutions!

Thanks again

Michael
Comment

Announcement

Generate new grouping variable using two string variables of non-matching structure requiring collapse

Comment

Comment

Comment

Comment