
  • How to find all user-written programs related to Text Mining / Content Analysis

    Dear Forum Members,

    I wish to find all available ado-files related to Text Mining / Content Analysis.

    I know there are some handouts on the Web as well as Stata Meeting presentations, but there does not seem to be an up-to-date overview.

    Typing "search content analysis" and "search text mining" in the Command window returned only a few ado-files (3, to be precise).

    I wonder whether there is some way to browse user-written programs by theme somewhere on the Web.

    Thanks in advance.
    Best regards,

    Marcos

  • #2
    I sort of hate it when people just keep files on their own site and don't register them with StataCorp so that findit can find them. If you don't know specifically where to look, I suppose you just have to google around.

    On the other hand, if something is that hard to find, maybe it isn't worth finding.
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    StataNow Version: 19.5 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam


    • #3
      Thank you for the reply, Richard. I'm doing some content analysis and am still resisting a move to other software. Having the programs registered and classified would be very helpful.
      Best regards,

      Marcos


      • #4
        Provalis Research provides WordStat for Stata, which applies text analytics techniques to any string variable stored in a Stata data file. WordStat combines natural language processing, content analysis and statistical techniques to quickly extract topics, patterns and relationships from large amounts of text. See https://provalisresearch.com/product...tat-for-stata/ For all that, I hope Stata will someday develop a thorough module like Python's NLTK (Natural Language Toolkit).
        Last edited by Chen Samulsion; 14 Aug 2019, 07:27.


        • #5
          Useful info. If NLTK is part of Python, then presumably it can be called directly via Stata version 16 (which has full Python integration). Yes?


          • #6
            Dear Stephen Jenkins, in Stata 16, we can import nltk directly through Python. I'm a tyro in Python, but below is an example (run through do-file editor):
            Code:
            python
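            import nltk
            # A short sample string: check its length and echo it
            text="Welcome readers. I hope you find it interesting. Please do reply."
            len(text)
            print(text)
            
            # Split on whitespace; gaps=True means the pattern matches the separators
            from nltk.tokenize import RegexpTokenizer
            tokenizer=RegexpTokenizer(r'\s+',gaps=True)
            tokenizer.tokenize("Don't hesitate to ask questions")
            
            # Download a plain-text book from Project Gutenberg and show its first 75 characters
            from urllib import request
            url = "http://www.gutenberg.org/files/2554/2554-0.txt"
            response = request.urlopen(url)
            raw = response.read().decode('utf8')
            raw[:75]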
            
            tree = '''
            he PRP B-NP
            accepted VBD B-VP
            the DT B-NP
            position NN I-NP
            of IN B-PP
            vice NN B-NP
            chairman NN I-NP
            of IN B-PP
            Carlyle NNP B-NP
            Group NNP I-NP
            , , O
            a DT B-NP
            merchant NN I-NP
            banking NN I-NP
            concern NN I-NP
            . . O
            '''
            # Build a chunk tree from the IOB-tagged text above and draw it (opens a separate window)
            nltk.chunk.conllstr2tree(tree, chunk_types=['NP']).draw()
            end
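
            For anyone without NLTK installed, here is a minimal sketch of the same whitespace tokenization using only the Python standard library, followed by a simple word count. The helper names whitespace_tokens and word_frequencies are made up for illustration, and the word pattern (lower-case letters and apostrophes) is a simplifying assumption, not what NLTK does internally. It should also run between python and end in a Stata 16 do-file.

```python
import re
from collections import Counter

def whitespace_tokens(text):
    """Split on runs of whitespace and drop any empty strings."""
    return [tok for tok in re.split(r"\s+", text) if tok]

def word_frequencies(text):
    """Lower-case the text, keep runs of letters/apostrophes, and count them."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

text = "Welcome readers. I hope you find it interesting. Please do reply."
tokens = whitespace_tokens("Don't hesitate to ask questions")
freqs = word_frequencies(text)

print(tokens)                 # tokens split on whitespace only
print(freqs.most_common(3))   # three (word, count) pairs
```

            Splitting on whitespace keeps punctuation attached to words, which is why the counting step uses a separate, stricter pattern.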


            • #7
              There is a difference here. Official commands and those published through the Stata Journal are assigned keywords and the resulting database is bundled with your Stata as key files accessed by search. How current that database is for you depends on how up to date your copy of Stata is.

              Beyond that, it's the Wild West, except that a StataCorp web crawler run daily finds what it can, which is why you can (usually) see packages on SSC if you search (the exception being that if the SSC site is down when the crawler runs, its contents are not found).

              Beyond that, therefore, you're reliant on what programmers said or did. A simpler analogue is that a standard web search for "Stata graphics" won't necessarily find stuff if the authors used words like plot or chart instead. I think your problem is more difficult than that.

              This is a tension difficult to resolve. In particular, everyone is abstractly in favour, or so I presume, of users making their additions to Stata visible or invisible in the way they wish, but the price of freedom is here anarchy. Otherwise put, there is so far as I know no-one at StataCorp -- or anywhere else -- trying to keep track by human means of what is publicly available, let alone classifying or cataloguing it systematically.


              • #8
                Thanks, Chen. That worked for me. (My first ever play with Python!) For those watching, I first downloaded and installed Python 3.7 via Anaconda from here (I have Windows 10). It took a while. But once done, I didn't have to do a thing more, Python-wise. Chen's Stata do-file code ran straight away.


                • #9
                  Thank you all for the insightful replies, suggestions and tips.

                  With regard to WordStat for Stata, I'm aware of it. However, since it is not a free package (I gather there is a free one-month trial) and I don't intend to perform content analysis very often, I believe it is not worth paying so much ($850 for an academic and $5,295 for a commercial purchase, as shown here), all the more so given that R packages are freely available.

                  Hopefully Stata will some day provide a full suite of content analysis tools in a new version.
                  Last edited by Marcos Almeida; 15 Aug 2019, 05:34.
                  Best regards,

                  Marcos


                  • #10
                    Yes, Stata has added structural equation modeling, latent class analysis, finite mixture models, Python integration and more over the past several versions. We hope it keeps becoming more versatile while remaining easy to use.


                    • #11
                      What about -ssc describe txttool- ( Stata Journal, volume 14, number 4: dm0077)? I have not used it, but it looked interesting to me. Is this relevant?


                      • #12
                        @Mike Lacy: What about -ssc describe txttool- ( Stata Journal, volume 14, number 4: dm0077)? I have not used it, but it looked interesting to me. Is this relevant?
                        It's an interesting program, Mike, and it provides a "bag of words" representation. I haven't practiced much with it, but I believe I'd need some hard fiddling to get the graphs.

                        Also, up to now, I have failed to find any Stata program that performs sentiment analysis or produces word clouds.
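
                        For readers unfamiliar with the term, a "bag of words" is just a table of word counts per document, ignoring word order. Below is a minimal sketch in plain Python of that idea. It is not how txttool is implemented; the function name bag_of_words, the two sample documents and the word pattern are all assumptions made for illustration.

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, matrix): one row of word counts per document."""
    # Count lower-cased words (letters and apostrophes) in each document
    counts = [Counter(re.findall(r"[a-z']+", d.lower())) for d in docs]
    # The vocabulary is the sorted set of all words seen in any document
    vocab = sorted(set().union(*counts))
    # Counter returns 0 for words a document does not contain
    matrix = [[c[w] for w in vocab] for c in counts]
    return vocab, matrix

# Hypothetical example documents
docs = ["Stata reads text", "Text mining in Stata is text analysis"]
vocab, matrix = bag_of_words(docs)

print(vocab)
for row in matrix:
    print(row)
```

                        Each row could then be loaded into Stata as one observation with one variable per word, which is the shape most downstream commands expect.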
                        Best regards,

                        Marcos


                        • #13
                          Nick Cox
                          the price of freedom is here anarchy
                          my quote of the day.
                          Regards
                          --------------------------------------------------
                          Attaullah Shah, PhD.
                          Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
                          FinTechProfessor.com
                          https://asdocx.com
                          Check out my asdoc program, which sends outputs to MS Word.
                          For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.


                          • #14
                            Attaullah Shah Thanks for the appreciation.


                            • #15
                              Marcos Almeida

                              Because there is apparently no Stata-focused, full-featured text analysis function and no Stata-focused text analysis program other than the commercial WordStat, you might consider the public domain text analysis software KH Coder at http://khcoder.net/en/ . I have used WordStat since 2005, but my dissertation students and I have also used KH Coder frequently. Frankly, very few of the WordStat functions I have needed are missing from KH Coder.

                              KH Coder provides word and code frequency analysis (with outcome data exported to Excel), word and code co-location and co-occurrence analysis (with similarity matrix exported to Excel), Key Word in Context (KWIC) analysis, and extensive text search and extraction functions. The program also offers several geospatial multivariate techniques such as multidimensional scaling, hierarchical cluster analysis, word or code co-occurrence network analysis, self-organizing maps, and correspondence analysis. The program has been actively maintained by its creator, Koichi Higuchi, since 2001 and is frequently updated. It is also available in several languages, including Portuguese and Spanish.
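
                              To make the Key Word in Context idea concrete, here is a minimal sketch in plain Python. It only illustrates the concept and says nothing about how KH Coder or WordStat implement it; the function name kwic, the width parameter and the sample sentence are assumptions for illustration.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each match.

    Matching is case-insensitive; `width` is the number of words
    kept on each side of the keyword.
    """
    tokens = re.findall(r"\w+", text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Hypothetical sample text
sample = "Text mining finds patterns in text. Mining large text needs care."
lines = kwic(sample, "text", width=2)

for left, word, right in lines:
    # Right-align the left context so the keywords line up in a column
    print(f"{left:>15} [{word}] {right}")
```

                              Lining the keyword up in a column is what makes a KWIC listing quick to scan for recurring phrasing.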

                              Another important feature of KH Coder is the ability to create user-specified coding schemes and to apply its text analytic and graphing tools to user-specified codes rather than only to raw text.

                              A set of example graphs and other results produced by KH Coder can be found at https://goo.gl/photos/ixn1sTM3jm8o11bP8 .

                              (By the way, KH Coder is built on an underlying layer of R.)

                              Cheers,
                              Red Owl
