  • Creating variable from searching multiple strings

    Hi there - probably an easy question for most of you, but I'm new to Stata and can't find the answer to this.
    I am trying to create a new variable "ethnicity_Hisp" by searching seven other variables (ethnicity1; ethnicity2; etc) for the string "Hispanic or Latino".
    I've tried to do this a couple different ways - shortened to just two variables here for simplicity:
    gen variable ethnicity_Hisp=1 if ethnicity1 == "Hispanic or Latino" | ethnicity2 == "Hispanic or Latino"
    gen variable ethnicity_Hisp=1 if strpos(ethnicity1, "Hispanic or Latino) | strpos(ethnicity2, "Hispanic or Latino"

    I keep getting back "too many variables"
    Is there a way to do this without a multistep code where I generate the variable and then replace with 1 for each of the ethnicity variables?

    Thanks!

  • #2
    Welcome to the Stata Forum / Statalist

    To start, your error message is not related to the "string" issue.

    Please see the example below:

    Code:
    . gen variable myvar =1
    too many variables specified
    r(103);
    
    . gen  myvar =1
    In short, you are not supposed to add the term "variable" in the command line.

    That said, should you have further difficulties, please read the FAQ. There you'll find how to share datasets, commands, and output.

    Hopefully that helps.
    Best regards,

    Marcos


    • #3
      yikes, head smack on that one. Thank you!


      • #4
        The word variable is out of place there. Stata thinks you are trying to give two names where only one is allowed.

        Stata is bailing out at that error and tells you nothing about other errors lurking thereafter.

        For two variables, this is a reasonable recipe

        Code:
        gen ethnicity_Hisp =  strpos(ethnicity1, "Hispanic or Latino") | strpos(ethnicity2, "Hispanic or Latino")
        You were missing a fair amount of punctuation there. That creates an indicator variable 1 or 0, much more useful than one that is 1 or missing.
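
        As a small sketch of why that expression yields 0 or 1: strpos() returns the position of the first match, or 0 if there is none, and | treats any nonzero value as true. The literal strings here are made-up examples, not values from the poster's data.

        Code:
        * position of the match, so nonzero when found
        display strpos("Hispanic or Latino", "Hispanic or Latino")
        * 0 when not found
        display strpos("Asian", "Hispanic or Latino")
        * logical or of two results is always 1 or 0
        display strpos("Asian", "Hispanic or Latino") | strpos("White", "Hispanic or Latino")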

        For seven variables, you need more tricks.


        Code:
        gen ethnicity_Hisp = 0
        
        forval j = 1/7 {
              replace ethnicity_Hisp = 1 if strpos(ethnicity`j', "Hispanic or Latino")
        }
        Note that the code here is utterly literal. "hispanic or latino" fails, for example.
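
        If case should not matter, one possible variation (a sketch, not part of the recipe above) is to lower-case each value before searching. This assumes plain ASCII text in the ethnicity variables, which is what lower() handles.

        Code:
        gen ethnicity_Hisp = 0
        
        forval j = 1/7 {
              * compare in lower case so "hispanic or latino" also matches
              replace ethnicity_Hisp = 1 if strpos(lower(ethnicity`j'), "hispanic or latino")
        }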

        Further reading if needed:

        https://journals.sagepub.com/doi/ful...36867X19830921

        https://www.stata-journal.com/articl...article=pr0046


        • #5
          Here is another trick I like.

          This works too and is less painful than a more obvious alternative.

          Code:
          gen wanted = inlist("Hispanic or Latino", ethnicity1, ethnicity2, ethnicity3, ethnicity4, ethnicity5, ethnicity6, ethnicity7)
          I think we are all trained from childhood to write

          if X = 1 or Y = 1 or Z = 1

          whereas

          if 1 = X or 1 = Y or 1 = Z

          is just another way to write the same thing, less conventional but no less logical, and inlist() captures the same idea.
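
          One difference worth checking, offered as a sketch and not as part of the trick above: inlist() tests exact equality, whereas strpos() in #4 finds the string anywhere inside a longer value. A value such as "Not Hispanic or Latino" (a hypothetical category, not one reported in this thread) would be flagged by strpos() but not by inlist(). Assuming both ethnicity_Hisp from #4 and wanted have been created, a cross-tabulation shows any disagreement.

          Code:
          * substring search finds a match starting at position 5
          display strpos("Not Hispanic or Latino", "Hispanic or Latino")
          * exact comparison does not match
          display inlist("Hispanic or Latino", "Not Hispanic or Latino")
          * compare the two indicators observation by observation
          tabulate ethnicity_Hisp wanted, missing
          Note also that inlist() accepts at most 10 string arguments, so this one-liner would not stretch to many more variables.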

          More about this in the first paper linked in #4.
