Good morning all,
My first STATA post, so apologies if I have structured this request for help incorrectly!
I am working with a reasonably large dataset (500,000 observations) in Stata/IC 15.
I am trying to generate a new grouping variable based on the content of 2 string variables, with the view to collapsing sub-families etc.
Example data
There are ~1500 unique catalogue numbers/descriptions with frequencies ranging from 1 to ~70,000 (a few variables represent the majority of cases etc).
I have been unable to determine a method to loop across these variables and define them by product brand alone.
The closest I have achieved is manually sorting on product description, comparing the catalogue number and product description, and then using strmatch to create a flag to then generate the product brand variable based on the common structure for that particular brand (copied below):
This would however mean manually going through the below for every one of the ~1500 possible brands (a thought I do not relish).
I have been unable to find a way to loop the above process.
My other concern is that human error could result in faults - the above requires manually updating the labels list by 1 each time, so that productbrand ends up as 1=Brand1, 2=Brand2, 3=Brand3 etc.
My suspicion is that there is no way to loop or automate the above process, due to the lack of consistent data structure in the catalogue numbers etc. Having spent some time trying to find a way around this, I wanted to ask the STATA community if my suspicions are correct before spending more hours trying to solve an unsolvable problem! I have tried the STATA help and forums, but did not find a solution.
Many thanks in advance,
Michael
My first STATA post, so apologies if I have structured this request for help incorrectly!
I am working with a reasonably large dataset (500,000 observations) in Stata/IC 15.
I am trying to generate a new grouping variable based on the content of 2 string variables, with the view to collapsing sub-families etc.
Example data
String variable 1 | String variable 2 | Desired variable | ||
Catalogue number | Product description | Product brand alone | ||
001-456-789 | Brand 1, Size 2, Colour 3 | Brand 1 | ||
001-456-123 | Brand 1, Size 2, Colour 7 | Brand 1 | ||
001-246-789 | Brand 1, Size 8, Colour 3 | Brand 1 | ||
ABCDE123XYZ | Brand 2, Size 4, Colour 2 | Brand 2 | ||
ABCDE456XYZ | Brand 2, Size 6, Colour 2 | Brand 2 | ||
ABCDE123XYZ | Brand 2, Size 4, Colour 2 | Brand 2 | ||
ABCDE289GBH | Brand 2, Size 7, Colour 3 | Brand 2 | ||
Xyy289GHH | Brand 3, Colour 3, Size 1 | Brand 3 | ||
Xyy289BNH | Brand 3, Colour 3, Size 3 | Brand 3 | ||
Xyy567JHJ | Brand 3, Colour 4, Size 4 | Brand 3 |
*Colors added only to highlight that catalogue number data structure can represent different product features dependent on manufacturer. Total length of this varies (as does proportion of string length per characteristic), as well as some being numeric and some alphanumeric.
There are ~1500 unique catalogue numbers/descriptions with frequencies ranging from 1 to ~70,000 (a few variables represent the majority of cases etc).
I have been unable to determine a method to loop across these variables and define them by product brand alone.
The closest I have achieved is manually sorting on product description, comparing the catalogue number and product description, and then using strmatch to create a flag to then generate the product brand variable based on the common structure for that particular brand (copied below):
This would however mean manually going through the below for every one of the ~1500 possible brands (a thought I do not relish).
Code:
//compare count of cat numbers vs Brand 1 Size(x) Colour (y) //useful for manual checking of smaller groups e.g. visual inspection of data shows there are 20 observations of this brand, if strmatch returns 20, unlikely will incorrectly code other brands as Brand 1 //this not used in generating productbrand field, purely a manual check count if strmatch(cataloguenumber, "001-???-???")==1 //create variable to code brand //labels to be modified on rolling basis of each brand gen productbrand=. label define brands 1 "Brand1" label values productbrand brands //brandstemp used for each strmatch loop gen brandstemp=. //flags if cataloguenumber belongs to 'common' group/structure for that brand, flags temp as 1 replace brandstemp = strmatch(cataloguenumber, "001-???-???") //create flag for Brand 1 replace productbrand=1 if brandstemp==1 //reset temp field for next grouping replace brandstemp=.
My other concern is that human error could result in faults - the above requires manually updating the labels list by 1 each time, so that productbrand ends up as 1=Brand1, 2=Brand2, 3=Brand3 etc.
My suspicion is that there is no way to loop or automate the above process, due to the lack of consistent data structure in the catalogue numbers etc. Having spent some time trying to find a way around this, I wanted to ask the STATA community if my suspicions are correct before spending more hours trying to solve an unsolvable problem! I have tried the STATA help and forums, but did not find a solution.
Many thanks in advance,
Michael
Comment