
  • How can I create one binary variable that =1 if an observation contains one of multiple substrings?

    Hi, I want to create a binary variable that =1 if an observation of var1 contains certain substrings. It is survey data, and I am looking for a particular word. However, the word is often misspelt, so I want to search for all the likely spellings.

    I am using:
    foreach word in xx xy xz {
    egen `word' = incss(var1), sub(`word') insensitive
    }

    This works, but it creates a separate variable for each spelling. I want a single variable that =1 if any of the spelling variations is present. Thanks.
    Last edited by Joe Mo; 01 Feb 2019, 14:48.

  • #2
    You might try:
    Code:
    gen has_word = 0
    
    foreach word in xx xy xz {
        replace has_word = 1 if strpos(var1, "`word'") > 0
    }
    Code:
    * Example data (output of: dataex text)
    clear
    input str47 text
    "The speaker occasionally referred to his notes" 
    "The speaker often referred to his notes"        
    "The speaker frequently referred to his notes"   
    "The speaker occasionelly referred to his notes" 
    "The speaker occasionally referred to his notes" 
    "The speaker ocasionally referred to his notes"  
    "The speaker occasionaly referred to his notes"  
    "The speaker occassionally referred to his notes"
    "The speaker occasionnally referred to his notes"
    end
    
    gen has_word=0
    
    * strpos() is case-sensitive, so the capitalised spelling is listed separately
    foreach word in occasionaly occasionelly occasionally ocasionally Occasionally {
        replace has_word = 1 if strpos(text, "`word'") > 0
    }
    
    . list, noobs
    
      +------------------------------------------------------------+
      |                                            text   has_word |
      |------------------------------------------------------------|
      |  The speaker occasionally referred to his notes          1 |
      |         The speaker often referred to his notes          0 |
      |    The speaker frequently referred to his notes          0 |
      |  The speaker occasionelly referred to his notes          1 |
      |  The speaker occasionally referred to his notes          1 |
      |------------------------------------------------------------|
      |   The speaker ocasionally referred to his notes          1 |
      |   The speaker occasionaly referred to his notes          1 |
      | The speaker occassionally referred to his notes          0 |
      | The speaker occasionnally referred to his notes          0 |
      +------------------------------------------------------------+
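
    As a possible alternative to the loop, the same spellings could be checked in a single step with a regular expression. This is only a sketch: it assumes the `text` variable from the example above, uses the new variable name `has_word_re` purely for illustration, and relies on `regexm()` treating `|` as "or". Like `strpos()`, it is case-sensitive.

    ```stata
    * Sketch: one pass, no loop. has_word_re is an illustrative name.
    * regexm() returns 1 if the pattern matches anywhere in text, 0 otherwise.
    gen has_word_re = regexm(text, "occasionaly|occasionelly|occasionally|ocasionally|Occasionally")

    * Optional check that it agrees with the loop-based variable
    assert has_word_re == has_word
    ```

    With many spellings the loop may be easier to read and maintain; the regular expression is mainly convenient when the list is short.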



    • #3
      Originally posted by David Benson

      That worked perfectly, thank you very much!



      • #4
        Originally posted by David Benson
        Also, is there a way to make the search case-insensitive, or should I just include each spelling in both capitalised and non-capitalised form?



        • #5
          You can use strupper() to convert both the text and the search word to uppercase before comparing them, so that case no longer matters.

          Code:
          * Change what's in the loop to:
          replace has_word=1 if strpos(strupper(text), strupper("`word'")) > 0
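
          Put together, the full case-insensitive loop might look like the sketch below. It assumes the `text` variable from the earlier example and a Stata version that has `strlower()` (older versions spell it `lower()`). The helper variable `text_lc` is a name made up for illustration. Lowercasing the text once, outside the loop, means each spelling only needs to be listed in lowercase.

          ```stata
          * Sketch: case-insensitive search for any of several spellings.
          * text_lc is a hypothetical helper holding a lowercase copy of text.
          gen text_lc = strlower(text)

          gen has_word = 0
          foreach word in occasionaly occasionelly occasionally ocasionally {
              * all spellings are written in lowercase to match text_lc
              replace has_word = 1 if strpos(text_lc, "`word'") > 0
          }

          drop text_lc
          ```

          This should give the same result as applying strupper() to both sides inside the loop; it just avoids repeating the conversion for every spelling.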

