
  • Code Improvement

    Hi everyone,

    I have a question about analyzing online reviews. I want to count how many reviews mention each character in each episode of a show. For instance, with five characters named Jack, Lisa, Kyle, Frank, and Mandy, I want to know how many reviews of each episode include the name "Jack", and likewise for the other names. An example of the original data structure is below:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str1 show_name byte episode str27 reviews
    "a" 1 "Jack did well!"             
    "a" 1 "I like this one"            
    "a" 1 "Good"                       
    "a" 1 "What's this?"               
    "a" 1 "Lisa is angry loll"         
    "a" 1 "Tired of the show"          
    "a" 1 "Not as good as the last one"
    "a" 2 "Lisa killed"                
    "a" 2 "Kyle is upset"              
    "a" 2 "Jannifer looks good"        
    "a" 2 "Lisa looks young"           
    "a" 2 "Starving"                   
    "a" 2 "Kyle is back!"              
    "a" 2 "Like Jack "                 
    "a" 2 "Lisa!!!!"                   
    end

    In the end, I want a data structure like this:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str1 show_name byte episode str5 id byte comment
    "a" 1 "Jack"  1
    "a" 2 "Jack"  1
    "a" 1 "Lisa"  1
    "a" 2 "Lisa"  3
    "a" 1 "Kyle"  0
    "a" 2 "Kyle"  2
    "a" 1 "Frank" 0
    "a" 2 "Frank" 0
    "a" 1 "Mandy" 0
    "a" 2 "Mandy" 0
    end
    The code I'm currently using is below, but I don't think it's very efficient. I'd like to know whether it can be improved, since I have many shows, each with its own set of characters (sometimes up to 20), and coding every name by hand is impractical.

    Code:
    gen Count_Jack=0
    gen Count_Lisa=0
    gen Count_Kyle=0
    gen Count_Frank=0
    gen Count_Mandy=0

    replace Count_Jack=1 if ustrpos(reviews, "Jack")>0
    replace Count_Lisa=1 if ustrpos(reviews, "Lisa")>0
    replace Count_Kyle=1 if ustrpos(reviews, "Kyle")>0
    replace Count_Frank=1 if ustrpos(reviews, "Frank")>0
    replace Count_Mandy=1 if ustrpos(reviews, "Mandy")>0

    bysort show_name episode: egen comment_Jack=total(Count_Jack)
    bysort show_name episode: egen comment_Lisa=total(Count_Lisa)
    bysort show_name episode: egen comment_Kyle=total(Count_Kyle)
    bysort show_name episode: egen comment_Frank=total(Count_Frank)
    bysort show_name episode: egen comment_Mandy=total(Count_Mandy)

    by show_name episode, sort: gen nvals = _n == 1
    keep if nvals==1

    keep show_name episode comment*

    reshape long comment_, i(show_name episode) j(ID) string
    Please let me know if you have any thoughts. Thank you; I look forward to your reply.



  • #2
    I'd convert the reviews to all lower or all upper case in case anyone spelled a name with irregular capitalization, such as "JACK rocks!!" or "frank is awesome." The rest can be done in a loop. To learn more, see -help foreach-.

    Code:
    * Create a new one with lower case:
    gen low_review = lower(reviews)
    
    * Flag character
    foreach x in jack lisa kyle frank mandy {
        gen mention_`x' = ustrpos(low_review, "`x'") > 0
    }
    
    * Reshape to long
    gen id_entry = _n
    reshape long mention_, i(id_entry) j(actor, string)
    
    * Collapse
    collapse (sum) mention_, by(show_name episode actor)
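
    One possible refinement, offered as a sketch rather than a definitive version: plain substring matching also counts a name inside a longer word (a review mentioning "Jackson" would count for "Jack"). The variant below assumes Stata 14 or later, keeps the names in a local macro called names (a name chosen here for illustration), matches whole words only while ignoring case, and names the result variables id and comment to follow the layout requested in #1. Run it as one block in a do-file so the local macro stays defined.

    Code:
    * Character names kept in one local macro (edit per show)
    local names Jack Lisa Kyle Frank Mandy

    foreach x of local names {
        * \b marks a word boundary; the third argument (1) makes the match case-insensitive
        gen byte mention_`x' = ustrregexm(reviews, "\b`x'\b", 1)
    }

    * Reshape to long: one row per review and character
    gen long id_entry = _n
    reshape long mention_, i(id_entry) j(id) string

    * Count reviews mentioning each character per show and episode
    collapse (sum) comment = mention_, by(show_name episode id)

    Because the flags are created before the reshape, characters with no mentions in an episode still appear with a count of 0.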



    • #3
      Maybe I've misunderstood, but wouldn't this work?
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input str1 show_name byte episode str5 id byte comment
      "a" 1 "Jack"  1
      "a" 2 "Jack"  1
      "a" 1 "Lisa"  1
      "a" 2 "Lisa"  3
      "a" 1 "Kyle"  0
      "a" 2 "Kyle"  2
      "a" 1 "Frank" 0
      "a" 2 "Frank" 0
      "a" 1 "Mandy" 0
      "a" 2 "Mandy" 0
      end
      
      set obs 11
      
      replace id = "Lisa" in 11
      
      replace episode = 2 in 11
      
      replace show_name = "a" in 11
      
      
      cls
      * Flag character
      
      bys episode id: egen mention = total(strpos(id, id) > 0)



      • #4
        Originally posted by Ken Chui
        Hi Ken,

        The code works very well for me! It saved me a lot of time. Thank you so much for your help!



        • #5
          Originally posted by Jared Greathouse
          Hi Jared,

          Thank you so much for your reply. I actually meant that I want to derive this structure from the original dataset. Thank you all the same!
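
          For the case raised in #1, where each show has its own set of characters, one way to avoid a hand-written loop per show is a lookup table joined to the reviews. The following is only a sketch under stated assumptions: the lookup dataset (one row per show and character, held in a temporary file called characters) is hypothetical and would have to be built from the real cast lists, and the matching is the same case-insensitive substring test used in #2. It should be run as one block in a do-file with the review data from #1 in memory.

          Code:
          * Hypothetical lookup: one row per show-character pair
          preserve
          clear
          input str1 show_name str5 id
          "a" "Jack"
          "a" "Lisa"
          "a" "Kyle"
          "a" "Frank"
          "a" "Mandy"
          end
          tempfile characters
          save `characters'
          restore

          * Pair every review with every character of its own show
          joinby show_name using `characters'

          * Flag reviews that mention the character, ignoring case
          gen byte hit = ustrpos(lower(reviews), lower(id)) > 0

          * Count flagged reviews per show, episode, and character
          collapse (sum) comment = hit, by(show_name episode id)

          Adding a show then only means adding rows to the lookup. Note that joinby drops shows that have no rows in the lookup unless its unmatched() option is set.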
