  • finding keywords in string variables

    Dear Statalister

    I'm working on a dataset and I need to create a dummy variable that equals 1 if a string variable contains any word or phrase from a list of keywords. I'm using the following code:

    global keyword "AA BB CC"

    foreach i of global keyword {

    gen Dummy_`i'= strpos(string, "`i'") > 0
    }

    This code works fine if each keyword is a single word. However, if a keyword is a phrase, such as "low AA", "one AA two", or "AA three four", the code creates a separate variable for each word in the phrase.

    The list of keywords is very long, so I would prefer to use a loop.
    I would highly appreciate any help with this issue.

    Thank you very much in advance.
    Last edited by Mia Pham; 28 Oct 2021, 17:34.

  • #2
    See #3 https://www.statalist.org/forums/for...g-observations


    • #3
      Thank you very much, Andrew. Your post provides a great example of extracting words and phrases from string variables. However, my main problem now is how to use a loop to extract the words and phrases, because the lists can be very long. The keywords and phrases are also arbitrary and don't follow any pattern. Can you please give me some hints? Thank you.

      Kind regards,
      Mia


      • #4
        Note the lower-case input of the keyword phrases.

        Code:
        input strL tagline
        "Cable-free live TV is here. You Tube TV"
        "Join a better network! Because better matters. Verizon"
        "COLONEL QUALITY GUARANTEED. KFC"
        "Goodyear, more driven."
        "15 minutes could save you 15% or more on car insurance.GEICO"
        "All the News That's Fit to Print. NYT"
        "America Runs on Dunkin'. Dunkin' Donuts"
        "Imagination at Work. GE"
        "CAN YOU EAT THREE SHREDDED WHEAT? – Nabisco"
        end
        
        
        local keywords `" "live tv" "quality guaranteed" "three shredded wheat" "'
        local i 1
        foreach k of local keywords{
            gen match`i' = regexm(" " + lower(tagline) + " ", "['!?,\. ](`k')['!?,\. ]")
            local ++i
        }
        Res.:

        Code:
        . l, sep(0)
        
             +-----------------------------------------------------------------------------------------+
             |                                                      tagline   match1   match2   match3 |
             |-----------------------------------------------------------------------------------------|
          1. |                      Cable-free live TV is here. You Tube TV        1        0        0 |
          2. |       Join a better network! Because better matters. Verizon        0        0        0 |
          3. |                              COLONEL QUALITY GUARANTEED. KFC        0        1        0 |
          4. |                                       Goodyear, more driven.        0        0        0 |
          5. | 15 minutes could save you 15% or more on car insurance.GEICO        0        0        0 |
          6. |                        All the News That's Fit to Print. NYT        0        0        0 |
          7. |                      America Runs on Dunkin'. Dunkin' Donuts        0        0        0 |
          8. |                                      Imagination at Work. GE        0        0        0 |
          9. |                  CAN YOU EAT THREE SHREDDED WHEAT? – Nabisco        0        0        1 |
             +-----------------------------------------------------------------------------------------+
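
The loop treats each phrase as a single item because the keyword list is wrapped in compound double quotes, with every phrase in its own pair of quotes. As a minimal sketch, the same idea can be applied to the `strpos()` approach from #1. The toy data and the names `text` and `dummy1` to `dummy3` are made up for illustration. Unlike the regular expression above, `strpos()` matches substrings anywhere, so the fourth observation ("below AA") is also flagged for "low aa".

```stata
clear
input str30 text
"a low AA rating"
"one AA two"
"BB only"
"below AA"
end

* Each quoted phrase is one element of the list (hypothetical keywords)
local keywords `" "low aa" "one aa two" "bb" "'
local i 1
foreach k of local keywords {
    * strpos() returns the position of the first occurrence, 0 if none
    gen byte dummy`i' = strpos(lower(text), "`k'") > 0
    label variable dummy`i' "contains: `k'"
    local ++i
}
list, sep(0)
```

If whole-word matching matters, the delimiter-based `regexm()` pattern above is the safer choice.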
        Last edited by Andrew Musau; 28 Oct 2021, 19:09.


        • #5
          Thank you very much Andrew. I got it now. This works perfectly.

          Kind regards,
          Mia


          • #6
            Dear Andrew

            Thank you again for the code.
            I have one more question. How would you deal with cases where the keyword is enclosed in brackets or quotation marks, as in the example below?

            The keywords are the same: "live tv" "quality guaranteed" "three shredded wheat"


            input strL tagline
            "Cable-free (live TV) is here. You Tube TV"
            "Join a better network! Because better matters. Verizon"
            "COLONEL "QUALITY GUARANTEED". KFC"
            "Goodyear, more driven."
            "15 minutes could save you 15% or more on car insurance.GEICO"
            "All the News That's Fit to Print. NYT"
            "America Runs on Dunkin'. Dunkin' Donuts"
            "Imagination at Work. GE"
            "CAN YOU EAT THREE SHREDDED WHEAT? – Nabisco"
            end


            Thank you very much
            Last edited by Mia Pham; 03 Nov 2021, 22:31.


            • #7
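              Add parentheses and double quotes to the set of delimiter characters in the regular expression's character classes. Because the pattern now contains a double quote, it has to be enclosed in compound double quotes, as does any tagline in the input that contains double quotes. Stata's parsing behavior with double quotes when applying regex functions is illustrated in my reply #14 in https://www.statalist.org/forums/for...-within-quotes. All in all: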

              Code:
              input strL tagline
              "Cable-free (live TV) is here. You Tube TV"
              "Join a better network! Because better matters. Verizon"
              `"COLONEL "QUALITY GUARANTEED". KFC"'
              "Goodyear, more driven."
              "15 minutes could save you 15% or more on car insurance.GEICO"
              "All the News That's Fit to Print. NYT"
              "America Runs on Dunkin'. Dunkin' Donuts"
              "Imagination at Work. GE"
              "CAN YOU EAT THREE SHREDDED WHEAT? – Nabisco"
              end
              
              local keywords `" "live tv" "quality guaranteed" "three shredded wheat" "'
              local i 1
              foreach k of local keywords{
                  gen match`i' = regexm(" " + lower(tagline) + " ", `"['!?,\.\(\) " ](`k')['!?,\.\(\) " ]"')
                  local ++i
              }
              Res.:

              Code:
              . l
              
                   +-----------------------------------------------------------------------------------------+
                   |                                                      tagline   match1   match2   match3 |
                   |-----------------------------------------------------------------------------------------|
                1. |                    Cable-free (live TV) is here. You Tube TV        1        0        0 |
                2. |       Join a better network! Because better matters. Verizon        0        0        0 |
                3. |                            COLONEL "QUALITY GUARANTEED". KFC        0        1        0 |
                4. |                                       Goodyear, more driven.        0        0        0 |
                5. | 15 minutes could save you 15% or more on car insurance.GEICO        0        0        0 |
                   |-----------------------------------------------------------------------------------------|
                6. |                        All the News That's Fit to Print. NYT        0        0        0 |
                7. |                      America Runs on Dunkin'. Dunkin' Donuts        0        0        0 |
                8. |                                      Imagination at Work. GE        0        0        0 |
                9. |                  CAN YOU EAT THREE SHREDDED WHEAT? – Nabisco        0        0        1 |
                   +-----------------------------------------------------------------------------------------+
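
Every new delimiter has to be added to the character class by hand. An alternative sketch, assuming Stata 14 or later, uses the Unicode regex function `ustrregexm()` with the word-boundary anchor `\b`, which matches next to any non-word character, and a nonzero third argument for case-insensitive matching. The names `umatch1` to `umatch3` and `anymatch` are made up for illustration, and the data are a subset of the taglines above. The last line collapses the matches into the single dummy asked for in #1.

```stata
clear
input strL tagline
"Cable-free (live TV) is here. You Tube TV"
`"COLONEL "QUALITY GUARANTEED". KFC"'
"Goodyear, more driven."
"CAN YOU EAT THREE SHREDDED WHEAT?"
end

local keywords `" "live tv" "quality guaranteed" "three shredded wheat" "'
local i 1
foreach k of local keywords {
    * \b matches at a word boundary; third argument 1 = ignore case
    gen byte umatch`i' = ustrregexm(tagline, "\b(`k')\b", 1)
    local ++i
}
* One overall indicator: 1 if any keyword matched
egen byte anymatch = rowmax(umatch*)
list, sep(0)
```

As with the `regexm()` version, a keyword that contains regex metacharacters such as `.` or `(` would need to be escaped first.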
              Last edited by Andrew Musau; 04 Nov 2021, 07:24.


              • #8
                Thank you very much. I have learned a lot from you and the other posts that you mentioned.

                Kind regards,
                Mia
