Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using substring functions in Stata 16.1

    Dear Statalist,


    I have a string variable "comment" stored as "strL" that contains a mix of numbers, characters and spaces .

    storage display value
    variable name type format label variable label
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    comment strL %9s comment



    I need to extract information from it for the source of data the record comes from, one source can be "SoWMy", the other "NRI", ....etc potentially 10 different data sources. I have about 28,000 records.

    I am a bit lost on which substring command I should use.

    I tried "gen source1 = regexs(1) if(regexm(comment, "SoWMY"))" and got the error "invalid number, outside of allowed range" -

    Any advice would be most appreciated.

    Kind regards,

    Amani

  • #2
    Case insensitive: Either

    Code:
    gen source1= ustrregexm(comment, "(?i)sowmy")
    or

    Code:
    gen source1= strpos(lower(comment), "sowmy")>0
    Last edited by Andrew Musau; 23 Oct 2021, 05:14.

    Comment


    • #3
      Great thanks Andrew, the "ustrregexm" works not the "strpos" one - not sure why - kindly see below

      . gen source1a= ustrregexm(comment, "(?i)SoWMy")

      . gen source1b= strpos(lower(comment), "SoWMy")>0

      . tab1 source1*

      -> tabulation of source1a

      source1a | Freq. Percent Cum.
      ------------+-----------------------------------
      0 | 26,261 95.14 95.14
      1 | 1,341 4.86 100.00
      ------------+-----------------------------------
      Total | 27,602 100.00

      -> tabulation of source1b

      source1b | Freq. Percent Cum.
      ------------+-----------------------------------
      0 | 27,602 100.00 100.00
      ------------+-----------------------------------
      Total | 27,602 100.00


      Cheers,

      Amani

      Comment


      • #4
        Notice the lower case specification in my code:

        gen source1= strpos(lower(comment), "sowmy")>0
        If you want case-sensitive matching, then

        Code:
        gen source1= strpos(comment, "SoWMy")>0
        and correspondingly, remove "(?i)" in the ustrregexm command.

        Comment


        • #5
          Dear Andrew,
          Thanks very much for the tip. I think "ustrregexm" command is more accurate than "strpo" - see me two examples below:


          /* Example 1 */

          . gen source_MOH1=ustrregexm(comment, "MOH")
          . gen source_MOH2=strpos(comment, "MOH")>0
          . tab1 source_MOH1 source_MOH2

          /* That gives identical results */

          -> tabulation of source_MOH1
          source_MOH1 | Freq. Percent Cum.
          ------------+-----------------------------------
          0 | 27,042 97.97 97.97
          1 | 560 2.03 100.00
          ------------+-----------------------------------
          Total | 27,602 100.00

          -> tabulation of source_MOH2
          source_MOH2 | Freq. Percent Cum.
          ------------+-----------------------------------
          0 | 27,042 97.97 97.97
          1 | 560 2.03 100.00
          ------------+-----------------------------------
          Total | 27,602 100.00

          /* Example 2, I use "?i" with ustrregexn" and it picks up more records than "strpos" - I can't see why*/

          . drop source_MOH1

          . gen source_MOH1=ustrregexm(comment, "(?i)MOH")

          . tab1 source_MOH1

          -> tabulation of source_MOH1

          source_MOH1 | Freq. Percent Cum.
          ------------+-----------------------------------
          0 | 26,840 97.24 97.24
          1 | 762 2.76 100.00
          ------------+-----------------------------------
          Total | 27,602 100.00



          Thanks very much for your help.

          Amani

          Comment


          • #6
            Maybe my point in #2 and #4 is best illustrated with an example. Consider the following


            Code:
            . l
            
                 +-----------+
                 |      text |
                 |-----------|
              1. | xyz MOH33 |
              2. | ABC mOh22 |
              3. | efg moh11 |
                 +-----------
            We want to identify the substring "moh". But an issue is that in observation 1, the substring is capitalized, i.e., "MOH". In observation 2, it is a mixture of capital and small letters (also called upper-case and lower-case, respectively), i.e., "mOh". Finally, in observation 3, it is all in small letters, i.e., "moh". Now, what I mean by case-insensitive matching is that I should match the substring regardless of capitalization. This is achieved by either

            Code:
            * Example generated by -dataex-. For more info, type help dataex
            clear
            input str29 text
            "xyz MOH33"
            "ABC mOh22"
            "efg moh11"
            "hij CONTROL"
            end
            
            gen match1= ustrregexm(text, "(?i)moh")
            gen match2= strpos(lower(text), "moh")>0
            Notice the use of the -lower()- function in generating "match2" when using the -strpos()- function. Excluding "(?i)" and the -lower()- function in -ustrregexm()- and -strpos()-, respectively, executes case-sensitive matching, which as your example above shows, yields equivalent results.


            Res.:

            Code:
            . l
            
                 +-------------------------------+
                 |        text   match1   match2 |
                 |-------------------------------|
              1. |   xyz MOH33        1        1 |
              2. |   ABC mOh22        1        1 |
              3. |   efg moh11        1        1 |
              4. | hij CONTROL        0        0 |
                 +-------------------------------+

            Comment

            Working...
            X