Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating new variables based on string categories in other variables

    I'm trying to clean and restructure a poorly designed dataset on industry and sector classifications. I was provided with a spreadsheet of businesses, one per row, which could have selected any number of 24 main industry categories and 112 subcategories. I want to create 24 binary main industry categories and 112 subcategories. .
    The data provided was a spreadsheet where all industry selections were presented in a single cell for main category, and a single cell for subcategory, separated by commas. I have separated them out using the comma delimiter. This created 6 main category variables, but with industries randomly ordered/distributed throughout the new variables (i.e. the businesses selected up to six main categories) and 17 subcategory variables, again with the names of the subcategories randomly ordered and distributed across the 17 variables.
    These variables are currently mcat1-6 and scat1-17, and are string variables, where the values are the name of an industry selected by the business, e.g. "Accommodation Food Services Travel and Tourism". Where the industry selected fewer than 6 (or 17) categories, there is just blanks.

    I want to generate one variable for each industry category, e.g. for "Accommodation Food Services Travel and Tourism" I had hoped I could use my usual approach with a numeric variable but using quotes:

    gen afstt=0
    replace afstt=1 if ///
    mcat1=="Accommodation Food Services Travel and Tourism" | ///
    mcat2=="Accommodation Food Services Travel and Tourism" | ///
    mcat3=="Accommodation Food Services Travel and Tourism" | ///
    mcat4=="Accommodation Food Services Travel and Tourism" | ///
    mcat5=="Accommodation Food Services Travel and Tourism" | ///
    mcat6=="Accommodation Food Services Travel and Tourism"

    This runs, but creates zero changes. I guess the 'generate' syntax is not recognising the string content of the variables?

    I did not use encode because that results in different numeric codes for the same industry across the 23 different mcat/scat variables, which would make it even more complicated to create the variables.
    I'm seeking a way to generate the variables based on the actual string content across different variables, hopefully someone has some good suggestions?
    I have looked up the existing discussions but haven't found anything that specifically answers my question. Thanks in advance!
    Last edited by Zora Ng; 21 Jan 2025, 21:53.

  • #2
    Your description of your problem and the structure of your data is difficult to understand. (Having a colleague look at a description before posting is always a good idea.) Moreover, while I think I understand the general kind of problem you face, giving you precise and relevant help would be much easier if, per the StataList FAQ, you had offered a data example.

    Here's a sketch of the kind of thing that *might* be relevant for your situation:
    Code:
    local industrycat1 = "Some Distinctive Text"
    local industrycat2 = "Some Kind of Text"
    ...
    local industrycat24 = "Some Other Text"
    //
    forval i = 1/24 {
       gen byte industry`i' = strpos(YourMainCategoryVariable, "`industry`i'") > 0
    }
    A similar kind of approach might work for your subcategories.

    I'm not going to offer any more precise detail here, as doing so efficiently and accurately would depend on seeing an actual example of your data. At worst, what I have suggested may be quite irrelevant or make quite erroneous assumptions about your data.

    Comment


    • #3
      I agree with Mike Lacy I offer only the small point that your displayed code could be trimmed to

      Code:
      gen afstt = inlist("Accommodation Food Services Travel and Tourism", mcat1, mcat2, mcat3, mcat4, mcat5, mcat6)
      although as Mike points out it seems more likely that you need strpos().

      See also https://journals.sagepub.com/doi/pdf...867X1101100308

      Comment

      Working...
      X