  • Identifying list of Keywords

    Hello,

    I'm interested in being able to run code through the data of a string variable so that it outputs a list of words and how frequently these words appear.

    My case: I am trying to categorize a list of research projects into research categories based on their project titles. For example, I want to be able to search through each project title and identify the common words being used. Let's assume that one of the keywords is "cardiovascular" and it appears 10 times. I want to be able to identify that, create a new variable "research_area", and set it to "health research" if the keyword "cardiovascular" appears in the project title. Note that there is no way for me to know in advance whether that keyword exists (unless I read through each project title).

    I hope I explained myself well.

    Thank you!

  • #2
    I don't understand why "cardiovascular" implies "health research" (and nothing else does???).

    You may need to do a lot of work to get where you want. But this may help a little:

    Here tabsplit is from tab_chi (SSC). Note that this is a small dataset with some standardization of names. In a real-life dataset it won't be true that "Cadillac" is always "Cad.", or the equivalent. Stata will be very literal about differences in spelling or abbreviation. That's where a lot of work is likely to be needed to standardize.

    Code:
    . sysuse auto, clear
    (1978 Automobile Data)
    
    . tabsplit make, sort
    
       Make and |
          Model |      Freq.     Percent        Cum.
    ------------+-----------------------------------
          Buick |          7        4.52        4.52
           Olds |          7        4.52        9.03
          Chev. |          6        3.87       12.90
          Merc. |          6        3.87       16.77
          Pont. |          6        3.87       20.65
          Plym. |          5        3.23       23.87
         Datsun |          4        2.58       26.45
          Dodge |          4        2.58       29.03
             VW |          4        2.58       31.61
            AMC |          3        1.94       33.55
           Cad. |          3        1.94       35.48
          Linc. |          3        1.94       37.42
         Toyota |          3        1.94       39.35
           Audi |          2        1.29       40.65
           Ford |          2        1.29       41.94
          Honda |          2        1.29       43.23
             Le |          2        1.29       44.52
            200 |          1        0.65       45.16
            210 |          1        0.65       45.81
            260 |          1        0.65       46.45
           320i |          1        0.65       47.10
           5000 |          1        0.65       47.74
            510 |          1        0.65       48.39
            604 |          1        0.65       49.03
            810 |          1        0.65       49.68
             88 |          1        0.65       50.32
             98 |          1        0.65       50.97
         Accord |          1        0.65       51.61
          Arrow |          1        0.65       52.26
            BMW |          1        0.65       52.90
         Bobcat |          1        0.65       53.55
            Car |          1        0.65       54.19
          Carlo |          1        0.65       54.84
       Catalina |          1        0.65       55.48
         Celica |          1        0.65       56.13
        Century |          1        0.65       56.77
          Champ |          1        0.65       57.42
       Chevette |          1        0.65       58.06
          Civic |          1        0.65       58.71
           Colt |          1        0.65       59.35
        Concord |          1        0.65       60.00
    Continental |          1        0.65       60.65
        Corolla |          1        0.65       61.29
         Corona |          1        0.65       61.94
         Cougar |          1        0.65       62.58
           Cutl |          1        0.65       63.23
        Cutlass |          1        0.65       63.87
         Dasher |          1        0.65       64.52
          Delta |          1        0.65       65.16
        Deville |          1        0.65       65.81
         Diesel |          1        0.65       66.45
       Diplomat |          1        0.65       67.10
       Eldorado |          1        0.65       67.74
        Electra |          1        0.65       68.39
           Fiat |          1        0.65       69.03
         Fiesta |          1        0.65       69.68
       Firebird |          1        0.65       70.32
            Fox |          1        0.65       70.97
            GLC |          1        0.65       71.61
          Grand |          1        0.65       72.26
        Horizon |          1        0.65       72.90
         Impala |          1        0.65       73.55
        LeSabre |          1        0.65       74.19
         Magnum |          1        0.65       74.84
         Malibu |          1        0.65       75.48
           Mans |          1        0.65       76.13
           Mark |          1        0.65       76.77
        Marquis |          1        0.65       77.42
          Mazda |          1        0.65       78.06
        Monarch |          1        0.65       78.71
          Monte |          1        0.65       79.35
          Monza |          1        0.65       80.00
        Mustang |          1        0.65       80.65
           Nova |          1        0.65       81.29
          Omega |          1        0.65       81.94
           Opel |          1        0.65       82.58
          Pacer |          1        0.65       83.23
        Peugeot |          1        0.65       83.87
        Phoenix |          1        0.65       84.52
           Prix |          1        0.65       85.16
         Rabbit |          1        0.65       85.81
          Regal |          1        0.65       86.45
          Regis |          1        0.65       87.10
        Renault |          1        0.65       87.74
        Riviera |          1        0.65       88.39
        Sapporo |          1        0.65       89.03
       Scirocco |          1        0.65       89.68
        Seville |          1        0.65       90.32
        Skylark |          1        0.65       90.97
         Spirit |          1        0.65       91.61
            St. |          1        0.65       92.26
       Starfire |          1        0.65       92.90
         Strada |          1        0.65       93.55
         Subaru |          1        0.65       94.19
        Sunbird |          1        0.65       94.84
           Supr |          1        0.65       95.48
       Toronado |          1        0.65       96.13
              V |          1        0.65       96.77
     Versailles |          1        0.65       97.42
         Volare |          1        0.65       98.06
          Volvo |          1        0.65       98.71
           XR-7 |          1        0.65       99.35
         Zephyr |          1        0.65      100.00
    ------------+-----------------------------------
          Total |        155      100.00
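
    Once a frequency table like this has suggested a keyword, the flagging step asked about in #1 might look roughly like the sketch below. The variable names title and research_area and the three example titles are assumptions made up for illustration; only the keyword "cardiovascular" and the category "health research" come from the question.

```stata
* Sketch only: title and research_area are hypothetical variable names,
* and the three titles are made-up illustration data.
clear
input str60 title
"Cardiovascular outcomes in older adults"
"Modelling CARDIOVASCULAR risk"
"Soil erosion in river basins"
end

* Start with an empty category, then fill it where the keyword occurs.
* lower() makes the match case-insensitive; strpos() returns 0 if absent.
gen research_area = ""
replace research_area = "health research" if strpos(lower(title), "cardiovascular") > 0

list title research_area
```

    This matches on a substring, so variant spellings or abbreviations of the keyword would still need to be standardized first, and each further keyword would need its own replace line.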

    • #3
      Also, see whether ngram (from http://www.schonlau.net/stata) helps.

      Best
      Daniel

      • #4
        Thank you!
