  • Counting the number of unique and distinct words occurring within a string variable

    In the following simulated data (full code provided) I am seeking to count the number of unique and distinct words occurring within one string variable, -allcolors-. I have read the FAQ at:
    http://www.stata.com/support/faqs/da...stinct-values/
    However, that FAQ deals with single-word string variables, whereas I wish to count the distinct words within a string variable that holds several words per observation.

    In this simulation, the variable -total- holds the manually counted value that I would like to compute with code.

    Code:
            clear
            * input creates the observations itself, so no -set obs- is needed
            input group id visitno str25 allcolors total
                1    11 1 "Red" 1
                1    24 1 "Red" 1
                1    24 2 "Red Blue" 2
                2    18 1 "Red" 1
                2    18 2 "Red Blue" 2
                2    18 3 "Red Blue Green Yellow" 4
                2    44 1 "Red" 1
                2    44 2 "Red Blue" 2
            end
            list, noobs sepby(group)

  • #2
    try this:
    Code:
    gen count=wordcount(allcolors)
    The above works for your examples. However, wordcount() counts every word, so if a value such as "red red" should count as 1, it will return 2 instead.

    • #3
      Thanks Rich for the very helpful suggestion. Prior to running your code, I will use the following code to remove duplicate words within the string we're discussing. Is it your opinion that my code below would correctly remove any duplicate words within each string record?

      Code:
                      * loop over observations, keeping the first occurrence of each word
                      quietly forvalues n = 1/`=_N' {
                          local t `"`=allcolors[`n']'"'
                          local t2 : list uniq t
                          replace allcolors = `"`t2'"' in `n'
                      }
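
As a possible variation on the loop above, here is a minimal sketch that leaves -allcolors- unchanged and stores the number of distinct words directly, using the macro list functions `uniq` and `sizeof`. The variable name -ndistinct- and the fourth data row ("Red Red Blue") are made up for illustration. Note that `list uniq` compares words exactly, so "red" and "Red" would count as two different words.

```stata
* Sketch only: -ndistinct- and the "Red Red Blue" row are hypothetical
clear
input str25 allcolors total
"Red" 1
"Red Blue" 2
"Red Blue Green Yellow" 4
"Red Red Blue" 2
end

gen ndistinct = .
quietly forvalues n = 1/`=_N' {
    * copy this observation's string into a local macro
    local t `"`=allcolors[`n']'"'
    * drop repeated words, then count what is left
    local u : list uniq t
    replace ndistinct = `: list sizeof u' in `n'
}

* compare with the manually counted values
assert ndistinct == total
list allcolors ndistinct total, noobs
```

Looping over observations is slow on large datasets, so this is probably best suited to data of modest size.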

      • #4
        See also http://www.stata-journal.com/sjpdf.h...iclenum=pr0046 from 2009, section 7, and rowsvals() from egenmore (SSC). That presupposes working on several variables rowwise, but split gets you there.
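
The split-then-rowwise route mentioned above might look like the sketch below. It assumes egenmore has been installed from SSC (`ssc install egenmore`) and that rowsvals() ignores empty strings by default, which is my understanding of its help file; please check the help before relying on it. The stub -word- and the variable -ndistinct- are made-up names, and the data are a small illustration rather than the original example.

```stata
* Sketch only: requires egenmore from SSC (ssc install egenmore)
clear
input str25 allcolors
"Red"
"Red Blue"
"Red Red Blue"
end

* one new string variable per word: word1, word2, word3
split allcolors, generate(word)

* number of distinct string values across the word* variables
* (assumption: empty strings are ignored unless -missing- is specified)
egen ndistinct = rowsvals(word*)

list allcolors ndistinct, noobs
```

Compared with a loop over observations, this works on the whole dataset at once, at the cost of creating one extra variable per word.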

        • #5
          Thanks Nick for referring me to the Rowwise article; looks like some very useful and interesting reading.
