Word cloud and sentiment analysis (text mining - content analysis) in Stata

Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#1

02 Jan 2018, 04:31

Dear Forum Members,

I need to apply content analysis (text mining) strategies in a current project of mine. However, I've found far fewer resources for Stata than for R, for example. That said, I'd really like to stick with Stata as much as possible for the analysis.

With regard to the analysis of words, I'm delving into the user-written commands ngram, precoin and coin. I also checked out other programs, as mentioned in this Stata Meeting.

That said, I'm facing a couple of obstacles. First, there is the issue of too many words, as previously reported here. (For this, hopefully a higher flavour of Stata than IC will do the trick, so I decided to upgrade.)

Besides, I got the impression that, in contrast to what I'm getting with R, most programs in Stata won't perform well with large chunks of text combined with a large sample size, which will be my scenario.

Second, unfortunately, I haven't yet found a command or program for key steps of text mining I'm eager to apply, such as sentiment analysis graphs and word clouds.

Given this situation, I wonder whether you could offer some guidance.

Thank you in advance.

Best regards,

Marcos
River Huang

Join Date: Mar 2016

Posts: 1906
#2

03 Jan 2018, 00:50

Dear Marcos, I searched and found that the 'commercial' software at https://provalisresearch.com/product...tat-for-stata/ might be relevant to your topic.

Ho-Chuan (River) Huang
Stata 19.0, MP(4)
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#3

03 Jan 2018, 06:37

Thank you for the reply, River. Indeed, I was already aware of WordStat and had visited this page. By the way, this software even has a "for Stata" version. That said, by getting it, we'd basically have "foreign" software running within Stata. Actually, the commands are not "Stataish"; they run in SPSS-like windows, since they belong to the "foreign" software itself. Last but not least, despite the free 30-day trial version, this software is very expensive, all the more so considering that I intend to do text mining as an occasional task, not as full-time work, so paying that much wouldn't be worthwhile, IMHO. Thank you again.

Best regards,

Marcos
Red Owl

Join Date: Nov 2016

Posts: 127
#4

03 Jan 2018, 07:54

Marcos,

KH Coder is free, open-source software which runs with R in the background and offers features similar to WordStat. You might find KH Coder useful for performing the calculations you want directly, or you may want to use it to create specialized data sets, such as word co-occurrence similarity matrices in .csv format, that can be imported into Stata for further analysis. My doctoral students and I have used KH Coder along with Stata for several years.
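
To give a rough idea of the kind of data set meant here, the following is a minimal Python sketch that counts, for each pair of words, the number of documents in which both appear, and renders the result as CSV text. It is not KH Coder's own algorithm or output format; the function names, the column names (word1, word2, n_docs) and the example texts are all assumptions made up for illustration, and it uses raw co-occurrence counts rather than a similarity coefficient.

```python
import csv
import io
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase a text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cooccurrence_counts(documents):
    """For each unordered word pair, count the documents containing both words."""
    counts = Counter()
    for doc in documents:
        vocab = sorted(set(tokenize(doc)))  # each word counted once per document
        for pair in combinations(vocab, 2):
            counts[pair] += 1
    return counts


def to_csv(counts):
    """Render the pair counts as CSV text (hypothetical column names)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["word1", "word2", "n_docs"])
    for (w1, w2), n in sorted(counts.items()):
        writer.writerow([w1, w2, n])
    return buffer.getvalue()


# Made-up example documents, for illustration only.
docs = ["Stata is great", "R is great too", "Stata and R"]
pairs = cooccurrence_counts(docs)
print(to_csv(pairs))
```

Text in this layout, once saved to a .csv file, could be read into Stata with import delimited.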

KH Coder (which is also available in Portuguese) is distributed under the GNU General Public License at http://khc.sourceforge.net/en/. Even though the stable version is version 2, I recommend version 3, which has been in testing for more than two years and has worked fine for me. The manual for version 3 is available at http://khc.sourceforge.net/en/manual_en_v3.pdf. You can also view several screenshots demonstrating KH Coder's features on the software's main web page.

Red Owl
Stata/IC 15.1, Windows 10 (64-bit)
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#5

03 Jan 2018, 08:38

Red Owl, thank you for the information about KH Coder (under R). I had never heard of it; I'm curious to know what it adds to R's current resources, and I'll surely take a close look at it. Well, for now, let me at least suggest that (a full-fledged suite of) text mining resources be added to the wishlist for Stata 16!

Best regards,

Marcos
Red Owl

Join Date: Nov 2016

Posts: 127
#6

03 Jan 2018, 08:53

Marcos,

Just for clarification, KH Coder does not run "under R" but, rather, is a free-standing program with its own graphical user interface that loads and employs an R kernel in the background. KH Coder users do not need to have R installed on their systems and do not need to learn R programming or data management.

I hope you find KH Coder helpful, and I second your suggestion to add more text analysis features to Stata. I expect, however, that StataCorp's arrangement with WordStat in producing WordStat for Stata makes it unlikely that Stata will add new text analysis features in the near future.

Good luck.

Red Owl
Stata/IC 15.1, Windows 10 (64-bit)
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#7

03 Jan 2018, 12:39

Thank you again for the information, Red. I appreciate it.

Out of curiosity, as you correctly remarked, I see that the background R kernel is even a different (i.e. "older") version from "my" R. As for improving text mining resources in Stata, or better yet, making them an ordinary suite of the statistical package: while I agree that the points you presented might well entail a somewhat inauspicious forecast, I'll keep my fingers crossed on an optimistic note. The increasing interest in, and range of application of, this (somewhat) "recent" quantitative approach in other packages may well prompt Stata developers to take on this new endeavour, as they recently did by adding a bunch of cutting-edge resources with the release of Stata 15.

Best regards,

Marcos
Red Owl

Join Date: Nov 2016

Posts: 127
#8

03 Jan 2018, 15:25

Marcos,

There is a way to have KH Coder point to your up-to-date version of R and avoid having two versions of R on your system. If you're interested, please send me a direct message, and I'll give you the steps.

I like your perspective on the possibility that Stata will add more powerful text analysis features, and I hope that my pessimism is unwarranted. I had made a similarly pessimistic -- but now disproved -- forecast about the possibility that Stata would ever add Latent Class Analysis. I thought Stata would not add LCA because it could already be done with the gllamm ado program (SSC) or with a plugin from the Methodology Center at Penn State University. I was delighted to be proved wrong when Stata 15 came out!

Cheers,

Red Owl
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#9

04 Jan 2018, 01:53

Hello Red.

Thank you for offering to help me make KH Coder use only the newest R version.

For now, I'm starting to try the package out to see how it works in general terms.

Hopefully we'll be able to do all this in the near future within the "standard" Stata 16!

Best regards,

Marcos
Tiago Pereira

Join Date: Jan 2016

Posts: 375
#10

04 Jan 2018, 15:26

Marcos,

It may be time to start to work with R and Python within Stata.
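
As a rough idea of what the Python side could look like, here is a minimal sketch using only the standard library: it computes the word frequencies that a word cloud would be drawn from, plus a naive lexicon-based sentiment score (positive words minus negative words). The word lists, stopword list, function names and example texts are all hypothetical and made up for illustration; a real analysis would rely on a published sentiment lexicon and a proper plotting package.

```python
import re
from collections import Counter

# Hypothetical toy lexicon and stopword list, for illustration only.
POSITIVE = {"good", "great", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}
STOPWORDS = {"the", "a", "is", "and", "but", "was"}


def tokenize(text):
    """Lowercase a text and split it into simple alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(texts):
    """Count non-stopword tokens across all texts (the input to a word cloud)."""
    counts = Counter()
    for text in texts:
        counts.update(t for t in tokenize(text) if t not in STOPWORDS)
    return counts


def sentiment_score(text):
    """Naive score: number of positive tokens minus number of negative tokens."""
    tokens = tokenize(text)
    positives = sum(t in POSITIVE for t in tokens)
    negatives = sum(t in NEGATIVE for t in tokens)
    return positives - negatives


# Made-up example texts.
texts = [
    "The service was great and the food was excellent",
    "The room was bad but the view was great",
]
freqs = word_frequencies(texts)
print(freqs.most_common(3))
print([sentiment_score(t) for t in texts])
```

The per-text scores could be exported and brought back into Stata as an ordinary numeric variable for graphing.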

Tiago
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#11

04 Jan 2018, 22:31

Hello Tiago,

I agree. That would be a nice option. Thanks.

Best regards,

Marcos