  • How to find a particular word in a string in Stata

    Hello everyone,
    Is there a command in Stata that searches a string variable for a particular word and returns only the observations containing that word on its own? For example, when I search for "INC" with -regexm-, -strpos- or -strmatch-, Stata returns every observation that contains "INC" as a substring, such as INCOME, but I need only the observations in which "INC" appears as a separate word.
    Thanks in advance.

  • #2
    How about
    Code:
    list stringvar if strops(stringvar, " INC ") | substr(stringvar, 1, 4) == "INC " | substr(stringvar, -4, 4) == " INC"



    • #3
      The question is not completely clear to me. Is "INC" short for "incorporated"? If yes, will it sometimes be immediately followed by a period? If so, Clyde's code will miss those cases. If there is sometimes a period and sometimes not, I would just add conditions to Clyde's code that include the period.
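
      One way the extra conditions might look is sketched below. It assumes the variable is called stringvar, as in #2, that the text is uppercase, and that the command is run from a do-file (the /// line continuation does not work at the interactive prompt). It also adds the case where the whole string is just "INC" or "INC.". It is an illustration, not tested against real data.

      ```stata
      * Sketch: whole-word "INC", with or without a trailing period.
      * stringvar is the variable name used in #2.
      list stringvar if stringvar == "INC" | stringvar == "INC." ///
          | strpos(stringvar, " INC ") | strpos(stringvar, " INC. ") ///
          | substr(stringvar, 1, 4) == "INC " | substr(stringvar, 1, 5) == "INC. " ///
          | substr(stringvar, -4, 4) == " INC" | substr(stringvar, -5, 5) == " INC."
      ```

      Other punctuation after the word, such as "INC," would still be missed and would need its own conditions.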



      • #4
        Yes, INC is short for incorporated, and sometimes there is a period at the end and sometimes not. Can I use the same code for whole words such as "TEAM"?



        • #5
          Yes. If Clyde's code is not clear to you, check the help files for the functions he uses, e.g., strpos (mistyped as "strops" in his code) and substr.



          • #6
            Note that all of the above assumes that your text really is all capitals. If not, you can either write a more complicated condition or, as I suggest, apply the "upper" function before using the suggested code.
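
            A minimal sketch of that suggestion follows. The name s_upper is hypothetical, and stringvar is the variable name from #2. Only the mid-string condition is shown; the start-of-string and end-of-string conditions from #2 would be applied to s_upper in the same way.

            ```stata
            * Sketch: build an uppercase copy, then search the copy.
            * s_upper is a hypothetical variable name.
            generate s_upper = upper(stringvar)
            list stringvar if strpos(s_upper, " INC ")
            ```

            upper() only changes ASCII letters. In Stata 14 and later the same function is also available as strupper(), and ustrupper() handles Unicode text.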



            • #7
              When searching for text on whole word boundaries, I usually avoid the start and end of string corner cases by adding a space at each end. Something like

              Code:
              gen s = " " + stringvar + " "
              list if strpos(s," INC ")
              I also find listsome (from SSC) useful for this type of work.
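
              The padding idea can probably be stretched to cover the "INC." case raised in #3 by turning periods into spaces in the padded copy. This is a sketch under that assumption. The name s2 is hypothetical, and upper() is included only in case the text is not all capitals.

              ```stata
              * Sketch: pad with spaces and replace every period with a space,
              * so "INC" and "INC." both appear as " INC " in the padded copy.
              * s2 is a hypothetical variable name; stringvar is from #2.
              generate s2 = " " + subinstr(upper(stringvar), ".", " ", .) + " "
              list stringvar if strpos(s2, " INC ")
              ```

              The trade-off is that every period is replaced, so a string such as "A.INC" would also match. The original variable is left untouched, so it can still be listed as entered.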



              • #8
                Not to give up too soon on regular expressions.
                Code:
                clear
                input str10 corp
                "INC       "
                "INC.      "
                "INCOME    "
                " INC      "
                " INC.     "
                "ZINC      "
                " INCOME   "
                "       INC"
                "      INC."
                "      ZINC"
                end
                generate m = regexm(corp,"^INC[. ]| INC[. ]| INC[.]?$")
                list, clean
                Code:
                             corp   m  
                  1.   INC          1  
                  2.   INC.         1  
                  3.   INCOME       0  
                  4.    INC         1  
                  5.    INC.        1  
                  6.   ZINC         0  
                  7.    INCOME      0  
                  8.          INC   1  
                  9.         INC.   1  
                 10.         ZINC   0
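
                As a possible alternative, assuming Stata 14 or later, the Unicode regular expression function ustrregexm() accepts \b as a word boundary, which removes the need to spell out the start, middle and end cases. The name m2 is hypothetical, and corp is the variable from the example above. This sketch has not been run against the example data.

                ```stata
                * Sketch, assuming Stata 14 or later: \b marks a word boundary.
                * m2 is a hypothetical variable name; corp is from the example above.
                generate m2 = ustrregexm(corp, "\bINC\b")
                list, clean
                ```

                Unlike the pattern above, this would also accept "INC" followed by other punctuation, such as a comma or a closing parenthesis, which may or may not be wanted.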



                • #9
                  Thanks a lot all of you. Both -regexm- and -strpos- work perfectly.
