Text analysis - find frequency of several keywords within several strings + stop words removal + sentiment analysis

Robert Adrian Piper

Join Date: Jan 2020

Posts: 12
#1

31 Jan 2020, 09:20

Hello! I am a new user of Stata and I have a problem with a task I have to solve.
As a part of my thesis, I have to do a text analysis. Afterwards, I will perform the statistical analysis with Stata and I am not sure whether I can use Stata for the text analysis as well.

I have tried it and researched a lot. But I am not sure whether my approach is correct.
There are three aspects I want to address in this post:

Most importantly, I would like to generate a new variable that shows the frequency of a key word (substring) in a text (string variable), for instance "machine learning". What is the best way to do this? Would you recommend using Stata itself, the WordStat integration, or Python? I have tried all three options, but so far none of them has been successful.
I have installed WordStat in Stata and used it to import PDF files of annual reports. These reports are stored as text (strings) in the variable DOCUMENT in Stata. I would like to use WordStat for the text analysis as well. So I used User>Wordstat>content analysis and generated a dictionary with key words. However, when I request frequencies I get the error "No valid cases".

So far, I have found functions in Stata that show whether a particular substring occurs in a string (strpos, regexm, substr, subinstr), but I need the frequency. As far as I can tell, noccur is the only command that offers this. However, for some key words the frequency it calculates is lower than the actual frequency I find in the PDF or text file of the same annual report using the search function (Cmd+F).

Python can be integrated in Stata, but I have not figured out how the two interact: how to use a variable from the Stata dataset in a Python command, and how to write the Python results back to the Stata dataset as a new variable. In Python I have found the regex function re.findall(pattern, string, flags=0). Is it recommended to use Python instead of Stata for the text analysis and to do the statistics in Stata afterwards? Should I install Python and load the files there? In that case, I would need to save the Python table as an Excel file and create a Stata file from it. If one variable is the same in both Stata datasets, they can be merged afterwards. Is that correct?
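For illustration, here is a minimal sketch of counting a key word with re.findall in plain Python. It is not a tested recipe for this particular data. The dictionary `documents` and the function name `count_keyword` are made up for the example and stand in for the DOCUMENT variable. The pattern ignores case and allows any run of whitespace (including line breaks) between the words of the key word. These are two differences that could plausibly make a literal substring count come out lower than a manual search, though that is only a guess about the cause.

```python
import re

def count_keyword(text, keyword):
    """Count non-overlapping, case-insensitive occurrences of keyword in text."""
    words = [re.escape(w) for w in keyword.split()]
    if not words:
        return 0
    # Allow spaces, tabs or line breaks between the words of the key word,
    # and require word boundaries at both ends.
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Hypothetical stand-in for the DOCUMENT variable: one string per report.
documents = {
    "report_a": "Machine learning is used widely. We invest in machine\nlearning.",
    "report_b": "No relevant terms here.",
}

counts = {doc_id: count_keyword(text, "machine learning")
          for doc_id, text in documents.items()}
print(counts)  # {'report_a': 2, 'report_b': 0}
```

The resulting counts are keyed by a document identifier, so they could be written to a file and merged back into the Stata dataset on that identifier, as described above. Reading the variable directly from within Stata is not shown here.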

In a second step, I would like to remove stop words in order to obtain the total number of relevant words in a document. The variable DOCUMENT stores the text files as strings that contain both relevant words and stop words. I would like to generate a new variable that contains the strings without stop words. I think it is possible to use the command txttool for this, but so far I have not been successful.
For the stop word removal I would use the lists provided by WordStat, which offers stop word lists in the four languages I need: English, French, German and Spanish. How can I do this with Stata and/or WordStat? Is it recommended to use Stata and/or WordStat, or Python?
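As a rough sketch of what stop word removal involves, the following Python snippet lowercases a text, splits it into word tokens and drops every token found in a stop word set. The set `STOP_WORDS` is a tiny made-up example, not one of the WordStat lists, and `remove_stop_words` is a hypothetical helper name. A real list would be loaded from a file, one per language.

```python
import re

# Tiny illustrative stop word list (an assumption, NOT a WordStat list).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to"}

def remove_stop_words(text, stop_words):
    """Return the text as lowercase tokens with stop words removed."""
    # \w+ is Unicode-aware for Python 3 strings, so accented letters
    # in French, German or Spanish words stay inside their tokens.
    tokens = re.findall(r"\w+", text.lower())
    return " ".join(t for t in tokens if t not in stop_words)

cleaned = remove_stop_words(
    "The use of machine learning in the report is growing.", STOP_WORDS
)
print(cleaned)               # use machine learning report growing
print(len(cleaned.split()))  # 5
```

Note that this also strips punctuation, so it is suited to counting relevant words but not to later sentence-level analysis of the cleaned text.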

Moreover, I am not sure if a sentiment analysis is possible in Stata itself or with the integration of Python. The sentiment analysis could show which documents use rather positive or negative words in the same sentences that contain a particular key word.
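To make the idea concrete, here is a minimal lexicon-based sketch in Python. It splits a text into sentences, keeps only the sentences that contain the key word, and counts positive and negative words in them. The word sets `POSITIVE` and `NEGATIVE` are invented for the example and the sentence splitting is deliberately naive. A real analysis would need an established sentiment dictionary for each language.

```python
import re

# Invented mini-lexicons for illustration only.
POSITIVE = {"growth", "improved", "strong", "success"}
NEGATIVE = {"loss", "risk", "decline", "weak"}

def keyword_sentiment(text, keyword):
    """Return (positive, negative) word counts over sentences containing keyword."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    pos = neg = 0
    for sentence in sentences:
        lowered = sentence.lower()
        if keyword.lower() not in lowered:
            continue  # only sentences mentioning the key word are scored
        tokens = re.findall(r"\w+", lowered)
        pos += sum(t in POSITIVE for t in tokens)
        neg += sum(t in NEGATIVE for t in tokens)
    return pos, neg

text = ("Machine learning drove strong growth this year. "
        "Currency risk caused a loss. "
        "Machine learning also carries some risk.")
print(keyword_sentiment(text, "machine learning"))  # (2, 1)
```

The second sentence is ignored because it does not mention the key word, so its negative words do not affect the score.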

Thank you and best regards
Mike Lacy

Join Date: Apr 2014

Posts: 2404
#2

31 Jan 2020, 10:14

I don't have any experience with Python integration or Wordstat, so I can't speak to that. But as for "regular" Stata approaches, here are some thoughts:

Beyond the -noccur()- function for -egen- in the user-written -egenmore- package, there's a user-written add-on of tools for text processing. See -ssc describe txttool- . The help for that command (which I have not used) is a bit terse, so you would likely need to read the associated Stata Journal article cited in the help file. It's possible those commands only work with ASCII text, which I suppose might not be ok in your situation.

There are also ways to count the frequency of a string on a more do-it-yourself basis that would involve writing a program to use various string functions and loop through successive searches in your "document" variable. That's not a trivial thing to do, so let's hope that -txttool- will work for you.
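To show the logic of such a do-it-yourself count, here is a sketch in Python rather than Stata, with a made-up function name. It searches for the key word, then repeatedly searches again starting just past the previous hit. A shortcut that needs no loop is included for comparison: the number of characters removed when every occurrence is deleted, divided by the length of the key word.

```python
def count_by_successive_search(text, keyword):
    """Count non-overlapping occurrences by repeated searching."""
    if not keyword:
        return 0
    count = 0
    start = text.find(keyword)
    while start != -1:
        count += 1
        # Resume the search just past the occurrence found.
        start = text.find(keyword, start + len(keyword))
    return count

def count_by_length_difference(text, keyword):
    """Same count via the length lost when all occurrences are removed."""
    if not keyword:
        return 0
    return (len(text) - len(text.replace(keyword, ""))) // len(keyword)

sample = "risk and risk again; Risk once more"
print(count_by_successive_search(sample, "risk"))          # 2
print(count_by_length_difference(sample, "risk"))          # 2
print(count_by_successive_search(sample.lower(), "risk"))  # 3
```

Both searches are case-sensitive, which is why lowercasing the text first raises the count from 2 to 3. Any literal substring count will behave the same way unless case is normalized beforehand.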
Maarten Buis

Join Date: Mar 2014

Posts: 3426
#3

31 Jan 2020, 11:34

Also see the various answers in https://www.statalist.org/forums/for...nd-inefficient

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------