  • Text mining for long string variables (opinions from survey)

    Hello,

    I am working with data that is presented in the following way:

    id opinion
    001 I really liked the food, but the fries were coldd
    002 would recommend
    003 The restaurant ambience was nice, but the service left much to be desired w slow and inattentive staff


    And so on. I am dealing with misspelled words, long sentences, etc.

    I would like to extract the 2-3 main keywords from each entry in the opinion column (for example: cold_food, slow_staff, good_ambient).

    Thanks in advance.



  • #2
    You provide very little information, so this is just a guess: download and take a look at -txttool- from the Stata Journal. Use -search- to find and download it. Note that the SJ article is also freely available; you should be able to get it via the link in the top right of the help file (after, obviously, you have downloaded the package).



    • #3
      Originally posted by Rich Goldstein
      Thanks for your reply.

      I have tried to use -txttool-, and I'm following the instructions provided in the guide. However, I cannot put this loop to use:
      Code:
      generate status_numeric = (status=="M")
      quietly foreach x of varlist w_* {
          summarize `x´, meanonly
          if r(mean) > .05 {
              tab `x´ status_numeric, all
              if abs(r(taub)) >.05 {
                  noisily display "`x´" r(taub)
              }
          }
      }

      This is my code, and I keep getting a "varlist required" error:

      Code:
      ds w_*
      di "`w(varlist)'"
      local wvars `w(varlist)'
      
      quietly foreach var of varlist `wvars' {
          summarize `var', meanonly
          if r(mean) > 0.05 {
              tab `var' myvariable_numeric, all
              if abs(r(taub)) > 0.05 {
                  noisily display "`var' `r(taub)'"
              }
          }
      }



      • #4
        Regarding your first piece of code, you say "cannot ... put to use." What do you mean by this? Did it give an error message? Did it not do what you wanted? In any event, one thing that seems odd is that you use "summarize", as would be typical for continuous rather than categorical variables, but then you include those same variables in a two-way -tab-, as would be typical for categorical variables. While it's not impossible that you would want to do this, it's sufficiently unusual to make one think that's not really what you want.

        Regarding your second piece of code: The -ds- command would return r(varlist), but not "w(varlist)", so your local named "wvars" will be empty. When you refer to `wvars' in your -foreach- loop, there will be nothing there, that is, nothing after the -varlist- specification.
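
        A minimal sketch of that point, using toy data. The variable names w_cold, w_slow, and other are hypothetical and only stand in for the real w_* variables:

        ```stata
        * Toy data; all variable names here are made up for illustration.
        clear
        set obs 5
        generate byte w_cold = 0
        generate byte w_slow = 1
        generate byte other  = 2

        ds w_*                      // lists the matching variables and stores them in r(varlist)
        local wvars `r(varlist)'    // r(varlist), not w(varlist)
        display "`wvars'"           // shows the contents of the local, so an empty one is easy to spot
        ```

        If the display line prints nothing, the local is empty and a later -foreach ... of varlist `wvars'- would have nothing after -varlist-.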

        To diagnose a problem such as you have, the -set trace on- command is useful, and I'd suggest you consult -help set trace- for use in the future.
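
        As a rough sketch of how tracing might be wrapped around a loop (w_a and w_b are hypothetical toy variables, not the poster's data):

        ```stata
        * Toy data; w_a and w_b are hypothetical names used only for illustration.
        clear
        set obs 10
        generate w_a = runiform()
        generate w_b = runiform()

        set tracedepth 1      // keep the trace shallow so it stays readable
        set trace on          // echo each command as Stata expands and runs it
        foreach var of varlist w_* {
            summarize `var', meanonly
        }
        set trace off         // switch tracing off again when done
        ```

        If the loop stops with an error, -set trace off- is never reached, so it has to be typed by hand afterwards.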

        Regardless of the preceding, note that in the second piece of code, creating the local wvars was not necessary for your -foreach- loop. The following would have worked:
        Code:
        foreach var of varlist w_* {
        ....
        }
        Finally, although a diagnosis was possible for this situation without seeing a data example, providing one using -dataex-, as indicated in the Statalist FAQ for new members, often gives context to a question and makes it easier to answer, even if it's not absolutely necessary.
        Last edited by Mike Lacy; 11 Feb 2024, 11:50.



        • #5
          Originally posted by Mike Lacy
          Thanks for your reply.

          It did give an error message.

          Regarding the first piece of code, I took that one from the -txttool- guide, as I'm trying to replicate it.

          The code you provided does not work for me, as I get the following error:

          Code:
          . foreach var of varlist w_*{
            2.     summarize `var´, meanonly
            3.     if r(mean) > 0.05 {
            4.         tab `var´ s_precioc_numeric, all
            5.         if abs(r(taub)) > 0.05 {
            6.             display "`var' `r(taub)'"
            7.         }
            8.     }
            9. }
          ` invalid name
          r(198);
          P.S. I know I could have provided an example of my data, but I'm not dealing with public data.



          • #6
            The FAQ deals with the issue of confidential data. What is needed is a realistic example that shows the structure of the data, with values sufficiently similar to the real ones that the need to guess is reduced or even eliminated.



            • #7
              Your "` invalid name" error message probably stems from the presence of the wrong single quote ("tick") mark in
              Code:
              tab `var´ s_precioc_numeric, all
              The character right after "var" is an acute accent ( ´ ); it should be an ordinary apostrophe ( ' ). The same applies to the -summarize- line above it.
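
              With the closing quote corrected, the loop from #5 might look like the sketch below. The toy data are an assumption added so the example runs on its own; w_cold and w_slow are hypothetical names, and only s_precioc_numeric comes from the post:

              ```stata
              * Toy data; replace with the real dataset. w_cold and w_slow are made-up names.
              clear
              set seed 12345
              set obs 100
              generate byte s_precioc_numeric = runiform() < 0.5
              generate byte w_cold = runiform() < 0.3
              generate byte w_slow = runiform() < 0.2

              * Local macros open with a backtick (`) and close with a plain apostrophe (').
              foreach var of varlist w_* {
                  summarize `var', meanonly
                  if r(mean) > 0.05 {
                      tabulate `var' s_precioc_numeric, all
                      if abs(r(taub)) > 0.05 {
                          display "`var' " r(taub)
                      }
                  }
              }
              ```

              Code pasted from a PDF often carries typographic quotes, so retyping the quote marks by hand in the Do-file Editor is a quick way to rule this out.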

              Even with private data, you can provide a data example by using -dataex- to show the structure and then typing in "fake" data of your own.
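
              One possible way to do that, sketched with invented rows (the text below is made up for illustration and only mimics the id/opinion layout from #1):

              ```stata
              * Invented example rows with the same structure as the real data.
              clear
              input str3 id str60 opinion
              "001" "The soup was great but the bread was stale"
              "002" "would come back"
              "003" "Nice terrace, waiters were slow"
              end

              dataex    // prints an -input- block that can be pasted into a forum post
              ```

              Others can then copy the -dataex- output straight into Stata and reproduce the problem without seeing the confidential data.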

