  • How to create binary variables from words in phrases?

    I have a string variable that, for each person, contains a short phrase, for example: "new video post." For each word in each phrase, I need to create a binary variable that indicates whether the person's phrase contained (1) or didn't contain (0) the word. The names of the binary variables would be the words themselves. You can assume that the words in each phrase are all cleanly separated by a blank space.

    So, for this person, new=1, video=1, post=1, and all other binary variables=0. If there were, say, 1000 unique words over all the phrases from all the people, there would be a total of 1000 binary variables, each corresponding to a unique word.

    Trying to figure out the most efficient way to do this in Stata.

  • #2
    Code:
    gen new = 1 if strpos(yourstringvariable,"new")
    To understand the mechanics or find other possibilities, have a look at the string functions (see help string functions), in particular strpos().



    • #3
      ...or, to generate a true binary 1/0 variable:
      Code:
      gen new = strpos(yourstringvariable, "new")>0
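One caveat with a bare strpos() test is that it matches substrings, so "new" would also be flagged inside "newer". A minimal sketch of a whole-word test follows, assuming words are separated by single blanks as stated in the question. The toy data and the variable names phrase, new_substring and new_word are made up for illustration.

```stata
* Hypothetical toy data; the variable name phrase is an assumption
clear
input str20 phrase
"new video post"
"newer tweet"
"brand new"
end

* Plain substring search: also flags "newer"
generate byte new_substring = strpos(phrase, "new") > 0

* Pad both the phrase and the word with blanks to match whole words only
generate byte new_word = strpos(" " + phrase + " ", " new ") > 0

list, noobs
```

Padding the phrase on both sides lets the same test catch a word at the start, middle or end of the phrase.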



      • #4
        Are you asking how to get the list of variable names if you don't already know what unique words exist in all of the phrases from all of the people? If so, then I believe that you will be better off with two passes through the data. You could consider something along the lines of the toy example below. Start at the "Begin here" comment. (The typographical error was unintended, but it does illustrate the kinds of problems that you're likely to encounter.)

        . version 14.0

        . clear *

        . set more off

        . input str6 (first second third)

                 first     second      third
          1. new video post
          2. old video post
          3. new audio post
          4. old audio post
          5. new video thread
          6. old video threat
          7. new audio thread
          8. old audio thread
          9. end

        . generate str text = first + " " + second + " " + third

        . drop first-third

        . *
        . * Begin here
        . *
        . // First pass through data to generate variables
        . forvalues i = 1/`=_N' {
          2.         local observation_text = text[`i']
          3.         local word_list `word_list' `observation_text'
          4.         local word_list : list uniq word_list
          5. }

        . local variable_tally : list sizeof word_list

        . forvalues i = 1/`variable_tally' {
          2.         local variable_name : word `i' of `word_list'
          3.         generate byte `variable_name' = 0
          4. }

        . // Second pass through data to set indicator variables
        . foreach var of varlist `word_list' {
          2.         quietly replace `var' = strpos(text, "`var'") > 0
          3. }

        . list, noobs separator(0) abbreviate(20)

          +-----------------------------------------------------------------------+
          |             text   new   video   post   old   audio   thread   threat |
          |-----------------------------------------------------------------------|
          |   new video post     1       1      1     0       0        0        0 |
          |   old video post     0       1      1     1       0        0        0 |
          |   new audio post     1       0      1     0       1        0        0 |
          |   old audio post     0       0      1     1       1        0        0 |
          | new video thread     1       1      0     0       0        1        0 |
          | old video threat     0       1      0     1       0        0        1 |
          | new audio thread     1       0      0     0       1        1        0 |
          | old audio thread     0       0      0     1       1        1        0 |
          +-----------------------------------------------------------------------+

        . exit

        end of do-file


        It probably won't be a problem if you're talking about only a thousand short words, but watch your macro string length limit, especially if you're not using Stata SE or MP. See the documentation at help limits and help maxvar.

        Also, if you're expecting many people to use the exact same phrase, then you'll gain efficiency by doing the following at the beginning of the first pass:
        Code:
        contract text, freq(person_tally)
        and taking the tally into account when assessing the indicator variables' contents.
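As a rough illustration of the contract idea, the sketch below collapses duplicate phrases before building an indicator and then uses the tally as a frequency weight. The toy data and the whole-word test on "new" are assumptions for illustration; only contract text, freq(person_tally) comes from the suggestion above.

```stata
* Hypothetical toy data: three people share the same phrase
clear
input str20 text
"new video post"
"new video post"
"old audio post"
"new video post"
end

* One row per distinct phrase; person_tally records how many people used it
contract text, freq(person_tally)

* Build the indicator once per distinct phrase (whole-word match on "new")
generate byte new = strpos(" " + text + " ", " new ") > 0

* Use the tally as a frequency weight to count people, not phrases
quietly summarize new [fweight = person_tally]
display "people whose phrase contains new: " r(sum)
```

Note that contract keeps only the listed variables and the frequency, so any person identifiers would have to be kept in a separate copy of the data and matched back on text.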



        • #5
          Here's another approach

          Code:
          clear
          input byte id str17 phrase
          1 "new video post" 
          2 "newer tweet" 
          3 "add 12 new photos" 
          4 "removed 1 photo" 
          end
          format %-17s phrase
          
          split phrase
          local nwords = r(nvars)
          forvalues i = 1/`nwords' {
              
              levelsof phrase`i', clean
              foreach word in `r(levels)' {
                  // in case a word is not a valid Stata name
                  local vname = strtoname("`word'")
                  // ignore error if var already exists
                  cap gen byte `vname' = 0
                  // look for space delimited words
                  qui replace `vname' = 1 if strpos(" " + phrase`i' + " "," `word' ")
              }
              
          }



          • #6
            If you have a lot of cases or words, or you want frequencies, try the program precoin, written in Mata:
            Code:
            clear
            input byte id str17 phrase
            1 "new video post" 
            2 "newer tweet" 
            3 "add 12 new photos" 
            4 "removed 1 photo" 
            end
            net install coin, from(http://sociocav.usal.es/stata)
            precoin phrase, stub(labels) f sep(" ") replace
            For more about precoin, type help precoin once installed.



            • #7
              Dear all,

              I have a list of codes (e.g., C03, CFD, HHTRDE) and I want to create a variable Y that equals 1 if a string variable X begins with any of these codes. The code must appear at the start of X, not just anywhere in it.

              I tried the syntax suggested in this thread, but it matches the code anywhere in the string.

              Here is what I need:

              If X = C03T0F, Y = 1
              If X = XTSC03, Y = 0
              If X = XTC03N, Y = 0

              I would appreciate it if you could help me.

              Thanks in advance for your help
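A minimal sketch of one way to restrict the match to the start of the string: strpos() returns the position of the first occurrence, so testing for a result of exactly 1 keeps only values that begin with the code. The toy data below extends the three examples in the question with two extra made-up values, and the names X and Y follow the question.

```stata
* Toy data: the first three values come from the question, the last two are made up
clear
input str10 X
"C03T0F"
"XTSC03"
"XTC03N"
"CFD123"
"HHTRDE"
end

generate byte Y = 0
foreach code in C03 CFD HHTRDE {
    * strpos() equals 1 only when the code sits at the very start of X
    quietly replace Y = 1 if strpos(X, "`code'") == 1
}

list, noobs
```

An equivalent test would compare substr(X, 1, strlen("`code'")) with the code. Either way, the comparison is case sensitive.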

