Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • substr within a subinstr

    I have a string field city with several internal ", " sub-strings (comma followed by a space), some of which appear at the end of the string. For example:

    Tulsa, OK
    Wichita, KS
    Sioux Falls, SD, (note the trailing space)

    For any row with a trailing comma-space, I want to erase that segment of the string (the comma and space). I expected to achieve the desired functionality with the command below, but it does not seem to be executing as hoped. In fact, it seems to delete the entire string value for those rows meeting the if condition.

    replace city=subinstr(substr(city, -2, 2), ", ", "", .) if substr(hs_address, -2, 2)==", "

    Any thoughts on how this is being mis-specified?





  • #2
    Hi Dan,

    This might work, if I understand your problem correctly. Your condition though, concerns the city variable, correct? Not hs_address?

    Code:
    replace city = substr(city,1,length(city)-2) if substr(city,-2,.)==", "

    Comment


    • #3
      Thanks Reese. Yes, I meant to refer to city, not hs_address.

      I've come up with some alternate solutions (yours may work as well), but my main question deals with the failure of the initial solution to work as intended. Am I misunderstanding how the substr and subinstr commands work?

      Comment


      • #4
        Code:
        clear
        input str20 city
        "Tulsa, OK"       
        "Wichita, KS"     
        "Sioux Falls, SD, "
        end
        gen temp= substr(city, -2, 2)
        gen temp2 =subinstr(substr(city, -2, 2), ", ", "", .)
        leads to:
        Code:
             +----------------------------------+
             |              city   temp   temp2 |
             |----------------------------------|
          1. |         Tulsa, OK     OK      OK |
          2. |       Wichita, KS     KS      KS |
          3. | Sioux Falls, SD,      ,          |
             +----------------------------------+
        So, the bit substr(city, -2, 2) takes the last two letters of the city ("OK", "KS", or ", ")
        The next bit
        subinstr(substr(city, -2, 2), ", ", "", .) takes those two letters, and deletes the bits ", ". At which point you are indeed left with an empty string, as you asked Stata to do.
        Breaking these sort of compound transformations into individual steps can help understand them a bit better. Then when youre happy with the results, stick them all in one line of code.

        Comment


        • #5
          I see. One thing wrong with your initial code is that what you're choosing to substitute into city with substr is substr(city, -2, 2), which is equivalent to ", " for all observations with the trailing comma+space. The remaining ", " is then subject to your specification to remove ", " and you are thus left with nothing.

          If you tried, however, to substitute in the city string minus the last two characters like I noted above (i.e. substr(city,1,length(city)-2)), you'd encounter another problem, because the subinst command replaces the first n occurrences. You'd end up with values like "TulsaOK".

          Hope this helps.

          Reese

          Comment


          • #6
            Ok, this helps clarify things a bit. Thanks much to both of you.

            Comment


            • #7
              Another way to think about it:

              1. In my experience leading and trailing spaces are not informative. I certainly guess that to be true for placenames. So I would be happy to

              Code:
              replace city = trim(city)
              2. Now the problem is just trailing commas. You can remove them like this:

              Code:
              replace city = substr(city, 1, length(city) - 1) if substr(city, -1, 1) == ","
              This is a slight variation on Reese's suggestion in #2. It is slightly more general in catching multiple trailing spaces as you would wish.

              Or yet another way, using Jorrit's data example from #4:

              Code:
              . split city, parse(,) 
              variables created as string: 
              city1  city2
              
              . list 
              
                   +-----------------------------------------+
                   |              city         city1   city2 |
                   |-----------------------------------------|
                1. |         Tulsa, OK         Tulsa      OK |
                2. |       Wichita, KS       Wichita      KS |
                3. | Sioux Falls, SD,    Sioux Falls      SD |
                   +-----------------------------------------+
              
              . egen CITY = concat(city?),  p(,) 
               
              . list 
              
                   +-----------------------------------------------------------+
                   |              city         city1   city2              CITY |
                   |-----------------------------------------------------------|
                1. |         Tulsa, OK         Tulsa      OK         Tulsa, OK |
                2. |       Wichita, KS       Wichita      KS       Wichita, KS |
                3. | Sioux Falls, SD,    Sioux Falls      SD   Sioux Falls, SD |
                   +-----------------------------------------------------------+

              Comment


              • #8
                Thanks. That's another good solution. My initial (and continuing) confusion relates to the failure of the initial code to work as it seems like it should:


                Code:
                replace city=subinstr(substr(city, -2, 2), ", ", "", .) if substr(city, -2, 2)==", "

                It seems to me that the above code is telling Stata to replace all iterations of ", " occurring in the last two characters of city with "", for records wherein said characters are ", ". I'm still not certain why there seems to be a disconnect between what I think I'm asking Stata to do and what it's actually doing. Clearly my understanding of the subinstr function needs brushing up.

                Comment


                • #9
                  Reese explained the main point in #5. Here's another wording. You are

                  1. selecting substr(city, -2, 2) if it's equal to ", "

                  and then

                  2. zapping those characters.

                  So, in those observations you select, the new value is first ", " and then "" (empty).

                  I think what you're misunderstanding is replace. With this syntax replace doesn't leave the rest of the string value unchanged in the observations you select. The expression you feed replace is the (whole of the) definition of the variable's new values in the observations used.

                  What you're reaching for, I think, is
                  Code:
                  replace city=substr(city, 1, length(city) - 2) + subinstr(substr(city, -2, 2), ", ", "", .) if substr(city, -2, 2)==", " 
                  except that this is equivalent to
                  Code:
                  replace city=substr(city, 1, length(city) - 2) if substr(city, -2, 2)==", "
                  which is precisely what was recommended in #2.








                  Comment


                  • #10
                    Ok, it's starting to click now. I think you're right that it's the interaction between replace and subinst that was getting me hung up.

                    Comment


                    • #11
                      One repeated tip to understand code is to work through using display. Mata can also be great, but there is no free lunch there: you have to know a little Mata before you can use it.

                      Here is a dialogue. As we end with an empty string, that's hard to discern from spaces, so an extra device is to wrap the beast in pipe symbols. Also, parentheses are sometimes needed to force display to evaluate the whole of a complicated expression before it shows the result.

                      Code:
                      . di "Sioux Falls, SD, "
                      Sioux Falls, SD,
                      
                      . di substr("Sioux Falls, SD, ", -2, 2)
                      ,
                      
                      . di ("|" + substr("Sioux Falls, SD, ", -2, 2) + "|")
                      |, |
                      
                      . di ("|" + subinstr(substr("Sioux Falls, SD, ", -2, 2), ", ", "", .) + "|")
                      ||

                      Comment

                      Working...
                      X