Code Improvement

Meng JI

Join Date: May 2021

Posts: 77
#1

23 Apr 2022, 15:39

Hi everyone,

I have a question about analyzing online reviews. For each episode of a show, I want to count how many reviews mention each character. For instance, suppose a show has five characters named Jack, Lisa, Kyle, Frank, and Mandy. For each episode, I want to count how many reviews include the name "Jack", and likewise for the other names. An example of the original data structure is below:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 show_name byte episode str27 reviews
"a" 1 "Jack did well!"
"a" 1 "I like this one"
"a" 1 "Good"
"a" 1 "What's this?"
"a" 1 "Lisa is angry loll"
"a" 1 "Tired of the show"
"a" 1 "Not as good as the last one"
"a" 2 "Lisa killed"
"a" 2 "Kyle is upset"
"a" 2 "Jannifer looks good"
"a" 2 "Lisa looks young"
"a" 2 "Starving"
"a" 2 "Kyle is back!"
"a" 2 "Like Jack "
"a" 2 "Lisa!!!!"
end

In the end, I want a data structure like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 show_name byte episode str5 id byte comment
"a" 1 "Jack"  1
"a" 2 "Jack"  1
"a" 1 "Lisa"  1
"a" 2 "Lisa"  3
"a" 1 "Kyle"  0
"a" 2 "Kyle"  2
"a" 1 "Frank" 0
"a" 2 "Frank" 0
"a" 1 "Mandy" 0
"a" 2 "Mandy" 0
end

The code I'm currently using is below, but I don't think it is very efficient. Is there a way to improve it? I have many shows, and each show has different characters (sometimes up to 20), so it is very hard to code them all manually.

gen Count_Jack=0
gen Count_Lisa=0
gen Count_Kyle=0
gen Count_Frank=0
gen Count_Mandy=0

replace Count_Jack=1 if ustrpos(reviews, "Jack")>0
replace Count_Lisa=1 if ustrpos(reviews, "Lisa")>0
replace Count_Kyle=1 if ustrpos(reviews, "Kyle")>0
replace Count_Frank=1 if ustrpos(reviews, "Frank")>0
replace Count_Mandy=1 if ustrpos(reviews, "Mandy")>0

bysort episode: egen comment_Jack=total(Count_Jack)
bysort episode: egen comment_Lisa=total(Count_Lisa)
bysort episode: egen comment_Kyle=total(Count_Kyle)
bysort episode: egen comment_Frank=total(Count_Frank)
bysort episode: egen comment_Mandy=total(Count_Mandy)

by episode, sort: gen nvals = _n == 1
keep if nvals==1

keep show_name episode comment*

reshape long comment_, i(show_name episode) j(ID) string

Please let me know if you have any thoughts. Thank you, and I look forward to your replies.

Ken Chui

Join Date: Aug 2014
Posts: 1058

23 Apr 2022, 16:01

I'd convert the reviews to all lower case or all upper case, in case anyone spelled the names with irregular capitalization, such as "JACK rocks!!" or "frank is awesome." The rest can be put into a loop. To learn more, type help foreach.

Code:

* Create a new one with lower case:
gen low_review = lower(reviews)

* Flag character
foreach x in jack lisa kyle frank mandy{
    gen mention_`x' = ustrpos(low_review, "`x'") > 0
}

* Reshape to long
gen id_entry = _n
reshape long mention_, i(id_entry) j(actor, string)

* Collapse
collapse (sum) mention_, by(show_name episode actor)
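
A minimal sketch of two possible refinements to the loop above, offered as an untested variation rather than as part of the original answer. First, the character names are kept in a local macro (the name `names' is hypothetical), so the list can be swapped for each show without touching the loop. Second, it assumes that whole-word matching is wanted, so that "jack" would not match inside a longer word such as "jackson"; this uses ustrregexm() with the \b word-boundary anchor instead of ustrpos(). The input rows are taken from the example in #1, and the output name mentions is also an assumption.

Code:

* Sketch only: a few rows copied from the example data in #1
clear
input str1 show_name byte episode str27 reviews
"a" 1 "Jack did well!"
"a" 1 "Lisa is angry loll"
"a" 2 "Lisa killed"
"a" 2 "Kyle is upset"
"a" 2 "Like Jack "
end

* Hypothetical macro holding the characters of one show (lower case)
local names jack lisa kyle frank mandy

* Lower-case the reviews so matching ignores capitalization
gen low_review = ustrlower(reviews)

* Flag each character; \b restricts the match to whole words
foreach x of local names {
    gen byte mention_`x' = ustrregexm(low_review, "\b`x'\b")
}

* Reshape to one row per review and character, then sum the flags
gen long id_entry = _n
reshape long mention_, i(id_entry) j(actor) string
collapse (sum) mentions = mention_, by(show_name episode actor)

list, sepby(episode)

Because every review gets a row for every character after the reshape, characters with no mentions in an episode should still appear with a count of 0, which matches the structure requested in #1.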


Jared Greathouse

Join Date: Sep 2021
Posts: 2170

23 Apr 2022, 16:56

Maybe I've misunderstood, but wouldn't this work?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 show_name byte episode str5 id byte comment
"a" 1 "Jack"  1
"a" 2 "Jack"  1
"a" 1 "Lisa"  1
"a" 2 "Lisa"  3
"a" 1 "Kyle"  0
"a" 2 "Kyle"  2
"a" 1 "Frank" 0
"a" 2 "Frank" 0
"a" 1 "Mandy" 0
"a" 2 "Mandy" 0
end

set obs 11

replace id = "Lisa" in 11

replace episode = 2 in 11

replace show = "a" in 11


cls
* Flag character

bys episode id: egen mention = total(strpos(id, id) > 0)


Meng JI

Join Date: May 2021

Posts: 77
#4

23 Apr 2022, 19:41

Hi Ken,

The code works very well for me! Saved lots of time. Thank you so much for your help!

Meng JI

Join Date: May 2021
Posts: 77

23 Apr 2022, 20:29

Hi Jared,

Thank you so much for your reply. I actually meant that I want to derive this table from the original dataset structure. Thank you all the same!
