How to find all user-written programs related to Text Mining / Content Analysis

Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#1

14 Aug 2019, 05:51

Dear Forum Members,

I wish to find all available ado-files related to Text Mining / Content Analysis.

I know there are some handouts on the Web as well as Stata Meeting presentations, but it seems we don't have an up-to-date overview of them.

By typing "search content analysis" and "search text mining" in the Command window I found only a few ado-files (3, to be precise).

I wonder whether there is some way of browsing by theme somewhere on the Web.

Thanks in advance.

Best regards,

Marcos
Richard Williams

Join Date: Apr 2014

Posts: 4992
#2

14 Aug 2019, 06:13

I sort of hate it when people just keep files on their own site and don't register them with StataCorp so that findit can find them. If you don't know where specifically to look, I suppose you just have to google around.

On the other hand, if something is that hard to find, maybe it isn't worth finding.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#3

14 Aug 2019, 07:13

Thank you for the reply, Richard. I'm doing some content analysis and am still resisting doing it in other software. Having the programs registered and classified would be very helpful.

Best regards,

Marcos
Chen Samulsion

Join Date: Jan 2018

Posts: 923
#4

14 Aug 2019, 07:24

Provalis Research provides WordStat for Stata, which applies text analytics techniques to any string variable stored in a Stata data file. WordStat combines natural language processing, content analysis and statistical techniques to quickly extract topics, patterns and relationships from large amounts of text. See https://provalisresearch.com/product...tat-for-stata/ That said, I hope that someday Stata will develop a thorough module like Python's NLTK (Natural Language Toolkit).

Last edited by Chen Samulsion; 14 Aug 2019, 07:27.
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#5

14 Aug 2019, 07:57

Useful info. If NLTK is part of Python, then presumably it can be called directly via Stata version 16 (which has full Python integration). Yes?

Chen Samulsion

Join Date: Jan 2018
Posts: 923
#6

14 Aug 2019, 09:14

Dear Stephen Jenkins, in Stata 16 we can import nltk directly through Python. I'm a tyro in Python, but below is an example (run from the Do-file Editor):

Code:

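python
# Load NLTK (it must already be installed in the Python that Stata uses)
import nltk
text = "Welcome readers. I hope you find it interesting. Please do reply."
len(text)
print(text)

# Split on runs of whitespace; gaps=True means the pattern matches the separators
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\s+', gaps=True)
tokenizer.tokenize("Don't hesitate to ask questions")

# Download a plain-text book from Project Gutenberg and show its first 75 characters
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
raw[:75]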

tree = '''
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
of IN B-PP
Carlyle NNP B-NP
Group NNP I-NP
, , O
a DT B-NP
merchant NN I-NP
banking NN I-NP
concern NN I-NP
. . O
'''
nltk.chunk.conllstr2tree(tree, chunk_types=['NP']).draw()
end
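For anyone who wants to see what the whitespace tokenizer step is doing without installing NLTK, here is a minimal sketch using only the standard-library re module. The helper name tokenize_on_gaps is made up for illustration and is not part of NLTK; the sketch assumes that, as with gaps=True, the pattern describes the separators and empty strings are discarded.

```python
import re

def tokenize_on_gaps(text, pattern=r"\s+"):
    """Split text wherever pattern matches and drop empty strings.

    Hypothetical helper: it mimics splitting on separators, not NLTK itself.
    """
    return [tok for tok in re.split(pattern, text) if tok]

# Same sentence as in the NLTK example above
print(tokenize_on_gaps("Don't hesitate to ask questions"))
```

Inside Stata 16 the same lines should work between python and end, like the block above.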


Nick Cox

Join Date: Mar 2014

Posts: 35698
#7

14 Aug 2019, 10:04

There is a difference here. Official commands and those published through the Stata Journal are assigned keywords and the resulting database is bundled with your Stata as key files accessed by search. How up-to-date that is as far as you are concerned depends on how far you are using an up-to-date Stata.

Beyond that, it's the Wild West, except that a StataCorp web crawler run daily finds what it can, which is why you can (usually) see packages on SSC if you search (the exception being that if the SSC site is down when the crawler is crawling, its contents are not found).

Beyond that, therefore, you're reliant on what programmers said or did. A simpler analogue is that a standard web search for "Stata graphics" won't necessarily find stuff if the authors used words like plot or chart instead. I think your problem is more difficult than that.
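The keyword point can be made concrete with a small sketch. The package names and descriptions below are invented purely for illustration, and the matching is a plain substring test, not how search or findit work internally.

```python
# Hypothetical package descriptions, invented for illustration only
catalog = {
    "pkg_a": "bar chart of category frequencies",
    "pkg_b": "dot plot with confidence intervals",
    "pkg_c": "graphics utilities for maps",
}

def search(catalog, terms):
    """Return the names whose description contains any of the given terms."""
    terms = [t.lower() for t in terms]
    return sorted(
        name for name, desc in catalog.items()
        if any(t in desc.lower() for t in terms)
    )

# A single keyword misses entries whose authors chose other words
print(search(catalog, ["graphics"]))
# Adding likely synonyms widens the net
print(search(catalog, ["graphics", "plot", "chart"]))
```

The practical upshot is the same as for a web search: try several synonyms rather than relying on one keyword.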

This is a tension difficult to resolve. In particular, everyone is abstractly in favour, or so I presume, of users making their additions to Stata visible or invisible in the way they wish, but the price of freedom is here anarchy. Otherwise put, there is so far as I know no-one at StataCorp -- or anywhere else -- trying to keep track by human means of what is publicly available, let alone classifying or cataloguing it systematically.
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#8

14 Aug 2019, 10:12

Thanks, Chen. That worked for me. (My first ever play with Python!) For those watching, I first downloaded and installed Python 3.7 via Anaconda from here (I have Windows 10). It took a while. But once done, I didn't have to do a thing more, Python-wise. Chen's Stata do-file code ran straight away.
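If the code does not run straight away on another machine, one quick check is whether the Python that Stata is using can actually find NLTK. The following is a minimal sketch; is_installed is a hypothetical helper name, and it only reports whether a module can be located, not whether it works.

```python
import importlib.util
import sys

def is_installed(module_name):
    """Return True if this interpreter can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None

# Show which Python is running and whether NLTK is visible to it
print(sys.version)
print("nltk available:", is_installed("nltk"))
```

Run it between python and end in a do-file. Stata 16's python query command also reports which Python installation Stata is linked to.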
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#9

15 Aug 2019, 05:29

Thank you all for the insightful replies, suggestions and tips.

With regard to WordStat for Stata, I'm aware of it. However, since it is not a free package (I gather there is a free one-month trial) and I don't intend to perform content analysis very frequently, I believe it is not worth paying so much ($850 for an academic and $5295 for a commercial purchase, as shown here), all the more so considering that there are R packages freely available.

Hopefully Stata will some day provide a full suite of content analysis tools in a new version.

Last edited by Marcos Almeida; 15 Aug 2019, 05:34.

Best regards,

Marcos
Chen Samulsion

Join Date: Jan 2018

Posts: 923
#10

15 Aug 2019, 07:31

Yes, Stata has added structural equation modeling, latent class analysis, finite mixture models, Python integration, etc. over the past several versions. We hope it can become more and more of a fantastic all-rounder while still being easy to use.
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#11

15 Aug 2019, 09:29

What about -ssc describe txttool- ( Stata Journal, volume 14, number 4: dm0077)? I have not used it, but it looked interesting to me. Is this relevant?
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#12

15 Aug 2019, 09:42

@Mike Lacy: What about -ssc describe txttool- (Stata Journal, volume 14, number 4: dm0077)? I have not used it, but it looked interesting to me. Is this relevant?

It's an interesting program, Mike, and it provides a "bag of words" representation. I haven't practiced much with it, but I believe I'd need some hard fiddling to get the graphs.

Also, up to now, I have failed to find any Stata program that performs sentiment analysis or produces word clouds.
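Since Stata 16 can call Python, the word counts that underlie both a bag of words and a word cloud can be sketched with the standard library alone. This is only an illustration under simple assumptions: bag_of_words is a hypothetical helper, tokens are lower-cased runs of letters and apostrophes, and there is no stop-word removal or stemming as a dedicated tool would offer.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case the text, extract word-like tokens, and count each one."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Sample text, extended from the sentence used earlier in the thread
sample = "Welcome readers. I hope you find it interesting. Please do reply. Do reply soon."
counts = bag_of_words(sample)

# The most frequent words are the ones a word cloud would draw largest
print(counts.most_common(3))
```

Drawing the cloud itself, or scoring sentiment, would still need a dedicated Python or R package.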

Best regards,

Marcos
Attaullah Shah

Join Date: Aug 2014

Posts: 1669
#13

15 Aug 2019, 12:29

Nick Cox

the price of freedom is here anarchy

my quote of the day.

Regards
--------------------------------------------------
Attaullah Shah, PhD.
Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
FinTechProfessor.com
https://asdocx.com
Check out my asdoc program, which sends outputs to MS Word.
For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#14

15 Aug 2019, 12:51

Attaullah Shah Thanks for the appreciation.
Red Owl

Join Date: Nov 2016

Posts: 127
#15

15 Aug 2019, 13:23

Marcos Almeida

Because there is apparently no Stata-focused, full-featured text analysis function and no Stata-focused text analysis program other than the commercial WordStat, you might consider the public domain text analysis software KH Coder at http://khcoder.net/en/ . I have used WordStat since 2005, but my dissertation students and I have also used KH Coder frequently. Frankly, very few of the WordStat functions I have needed are missing from KH Coder.

KH Coder provides word and code frequency analysis (with outcome data exported to Excel), word and code co-location and co-occurrence analysis (with similarity matrix exported to Excel), Key Word in Context (KWIC) analysis, and extensive text search and extraction functions. The program also offers several geospatial multivariate techniques such as multidimensional scaling, hierarchical cluster analysis, word or code co-occurrence network analysis, self-organizing maps, and correspondence analysis. The program has been actively maintained by its creator, Koichi Higuchi, since 2001 and is frequently updated. It is also available in several languages, including Portuguese and Spanish.

Another important feature of KH Coder is the ability to create user-specified coding schemes and to apply its text analytic and graphing tools to user-specified codes rather than only to raw text.

A set of example graphs and other results produced by KH Coder can be found at https://goo.gl/photos/ixn1sTM3jm8o11bP8 .

(By the way, KH Coder is built on an underlying layer of R.)

Cheers,
Red Owl
