I'm trying to clean and restructure a poorly designed dataset on industry and sector classifications. I was provided with a spreadsheet of businesses, one per row, which could have selected any number of 24 main industry categories and 112 subcategories. I want to create 24 binary main industry categories and 112 subcategories. .
The data provided was a spreadsheet where all industry selections were presented in a single cell for main category, and a single cell for subcategory, separated by commas. I have separated them out using the comma delimiter. This created 6 main category variables, but with industries randomly ordered/distributed throughout the new variables (i.e. the businesses selected up to six main categories) and 17 subcategory variables, again with the names of the subcategories randomly ordered and distributed across the 17 variables.
These variables are currently mcat1-6 and scat1-17, and are string variables, where the values are the name of an industry selected by the business, e.g. "Accommodation Food Services Travel and Tourism". Where the industry selected fewer than 6 (or 17) categories, there is just blanks.
I want to generate one variable for each industry category, e.g. for "Accommodation Food Services Travel and Tourism" I had hoped I could use my usual approach with a numeric variable but using quotes:
gen afstt=0
replace afstt=1 if ///
mcat1=="Accommodation Food Services Travel and Tourism" | ///
mcat2=="Accommodation Food Services Travel and Tourism" | ///
mcat3=="Accommodation Food Services Travel and Tourism" | ///
mcat4=="Accommodation Food Services Travel and Tourism" | ///
mcat5=="Accommodation Food Services Travel and Tourism" | ///
mcat6=="Accommodation Food Services Travel and Tourism"
This runs, but creates zero changes. I guess the 'generate' syntax is not recognising the string content of the variables?
I did not use encode because that results in different numeric codes for the same industry across the 23 different mcat/scat variables, which would make it even more complicated to create the variables.
I'm seeking a way to generate the variables based on the actual string content across different variables, hopefully someone has some good suggestions?
I have looked up the existing discussions but haven't found anything that specifically answers my question. Thanks in advance!
The data provided was a spreadsheet where all industry selections were presented in a single cell for main category, and a single cell for subcategory, separated by commas. I have separated them out using the comma delimiter. This created 6 main category variables, but with industries randomly ordered/distributed throughout the new variables (i.e. the businesses selected up to six main categories) and 17 subcategory variables, again with the names of the subcategories randomly ordered and distributed across the 17 variables.
These variables are currently mcat1-6 and scat1-17, and are string variables, where the values are the name of an industry selected by the business, e.g. "Accommodation Food Services Travel and Tourism". Where the industry selected fewer than 6 (or 17) categories, there is just blanks.
I want to generate one variable for each industry category, e.g. for "Accommodation Food Services Travel and Tourism" I had hoped I could use my usual approach with a numeric variable but using quotes:
gen afstt=0
replace afstt=1 if ///
mcat1=="Accommodation Food Services Travel and Tourism" | ///
mcat2=="Accommodation Food Services Travel and Tourism" | ///
mcat3=="Accommodation Food Services Travel and Tourism" | ///
mcat4=="Accommodation Food Services Travel and Tourism" | ///
mcat5=="Accommodation Food Services Travel and Tourism" | ///
mcat6=="Accommodation Food Services Travel and Tourism"
This runs, but creates zero changes. I guess the 'generate' syntax is not recognising the string content of the variables?
I did not use encode because that results in different numeric codes for the same industry across the 23 different mcat/scat variables, which would make it even more complicated to create the variables.
I'm seeking a way to generate the variables based on the actual string content across different variables, hopefully someone has some good suggestions?
I have looked up the existing discussions but haven't found anything that specifically answers my question. Thanks in advance!
Comment