  • Keep observations if they contain at least one word of a given list

    Hello Stata Community

    I have a very large number of observations and I want to keep only the ones which contain at least one of the 20 key words I have. The problem is that the observations are sentences and not just single words.

    I have tried the command:

    keep if inlist(Resolution,"drill" , "dioxin" , "clean up" , "nuclear" , "environment" , "environmental" , "pollution" , "energy" , "power" , "chlorine" , "trees" , "GHG" , "emissions" , "forest" , "recycling" , "recycled" , "mercury" , "water" , "filter" , "gene-engineered" , "mining" , "PVC" , "old growth wood" , "waste" , "paper" , "radioactive" , "toxic" , "plutonium" , "renewable" , "greenhouse gas" , "climate" , "CO2" , "parabens" , "phthalates")

    but there is always an error saying that the expression is too long.

    Do you have an idea how I could do this? Thank you very much for your help.

  • #2
    As the help says, for strings inlist() accepts no more than 10 arguments, so break what you are doing into several inlist() calls with an "or" (|) between each pair of calls.
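
    A minimal sketch of that split, assuming a do-file (the /// line continuation does not work at the interactive prompt). The four toy observations are made up for illustration; only the variable name Resolution and the key words come from the question. Each call passes the variable plus at most nine values, which stays within the limit of 10 arguments. Note that this still only keeps observations whose whole string equals one of the values.

    ```stata
    * Hypothetical toy data, for illustration only
    clear
    input str20 Resolution
    "drill"
    "water"
    "phthalates"
    "dividend"
    end

    * Each inlist() call gets the variable plus at most nine string values
    keep if inlist(Resolution, "drill", "dioxin", "clean up", "nuclear", "environment", "environmental", "pollution", "energy", "power") ///
          | inlist(Resolution, "chlorine", "trees", "GHG", "emissions", "forest", "recycling", "recycled", "mercury", "water") ///
          | inlist(Resolution, "filter", "gene-engineered", "mining", "PVC", "old growth wood", "waste", "paper", "radioactive", "toxic") ///
          | inlist(Resolution, "plutonium", "renewable", "greenhouse gas", "climate", "CO2", "parabens", "phthalates")
    ```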



    • #3
      Rich Goldstein is bang on, but see also https://www.stata.com/support/faqs/d...s-for-subsets/ for another way to do it.

      On a different level, note that your inlist() call is not a test for "contains". It is a test for "equals".
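
      To make the difference concrete, here is a small sketch on two made-up observations (the data and the names equals_water and contains_water are hypothetical). inlist() is true only when the whole string equals one of the values, whereas strpos() returns the position of a substring, or 0 if it does not occur.

      ```stata
      * Hypothetical toy data, for illustration only
      clear
      input str40 Resolution
      "water"
      "Report on water use and recycling"
      end

      * True only if the entire string is exactly "water"
      gen byte equals_water = inlist(Resolution, "water")

      * True if "water" occurs anywhere in the string
      gen byte contains_water = strpos(Resolution, "water") > 0

      list, noobs
      ```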



      • #4
        Lisa, you might look at the examples here, here, and here.

        Picking an example from the first link:

        Code:
        * Data example produced with: dataex text
        clear
        input str47 text
        "The speaker occasionally referred to his notes" 
        "The speaker often referred to his notes"        
        "The speaker frequently referred to his notes"   
        "The speaker occasionelly referred to his notes" 
        "The speaker occasionally referred to his notes" 
        "The speaker ocasionally referred to his notes"  
        "The speaker occasionaly referred to his notes"  
        "The speaker occassionally referred to his notes"
        "The speaker occasionnally referred to his notes"
        end
        
        gen has_word=0
        
        * Flag rows whose text contains any of the listed spellings (case sensitive)
        foreach word in occasionaly occasionelly occasionally ocasionally Occasionally {
            replace has_word=1 if strpos(text, "`word'") > 0
        }
        
        *** To make the above loop case insensitive
        foreach word in occasionaly occasionelly occasionally ocasionally {
            replace has_word=1 if strpos(strupper(text), strupper("`word'")) > 0
        }
        
        . list, noobs
        
          +------------------------------------------------------------+
          |                                            text   has_word |
          |------------------------------------------------------------|
          |  The speaker occasionally referred to his notes          1 |
          |         The speaker often referred to his notes          0 |
          |    The speaker frequently referred to his notes          0 |
          |  The speaker occasionelly referred to his notes          1 |
          |  The speaker occasionally referred to his notes          1 |
          |------------------------------------------------------------|
          |   The speaker ocasionally referred to his notes          1 |
          |   The speaker occasionaly referred to his notes          1 |
          | The speaker occassionally referred to his notes          0 |
          | The speaker occasionnally referred to his notes          0 |
          +------------------------------------------------------------+
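
        Applied to the original question, the same loop might look like the sketch below. The four observations are invented for illustration; the variable name Resolution and the key words are from post #1. Multi-word phrases are wrapped in double quotes so that foreach treats each phrase as a single item, and the /// continuation assumes a do-file. Because strpos() matches substrings, a key word such as "power" would also match "empower", which may or may not be what is wanted.

        ```stata
        * Hypothetical toy data, for illustration only
        clear
        input str60 Resolution
        "Report on greenhouse gas emissions"
        "Phase out PVC packaging"
        "Elect directors annually"
        "Adopt policy on old growth wood"
        end

        gen byte has_word = 0

        * Case-insensitive substring search for every key word or phrase
        foreach word in drill dioxin "clean up" nuclear environment environmental ///
            pollution energy power chlorine trees GHG emissions forest recycling ///
            recycled mercury water filter gene-engineered mining PVC ///
            "old growth wood" waste paper radioactive toxic plutonium renewable ///
            "greenhouse gas" climate CO2 parabens phthalates {
            replace has_word = 1 if strpos(strupper(Resolution), strupper("`word'")) > 0
        }

        * Keep only the observations that matched at least one key word
        keep if has_word
        ```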



        • #5
          Originally posted by David Benson (#4)
          Thank you so much David. This worked perfectly!
