Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Delete everything after a specific string

    Hi!

    I'm dealing with some data where my variable "address" is contaminated also by the population where my observations live. Here you have some examples (poblacion= village):
    CALLE CONCHA Y TORO 120 POBLACION ARTURO PRAT
    PASAJE ACHAO, POBLACION LA LAGUNA
    VICENTE REYES 889 POBLACION PATRONA DE CHILE

    I need to remove POBLACION and everything that goes after it, so in the first case, I would need to remove POBLACION ARTURO PRAT.

    Is this possible in Stata?

    Daniel.

  • #2
    Code:
    clear 
    input str46 tochange
    "CALLE CONCHA Y TORO 120 POBLACION ARTURO PRAT"
    "PASAJE ACHAO, POBLACION LA LAGUNA"
    "VICENTE REYES 889 POBLACION PATRONA DE CHILE"
    end 
    
    gen toget = trim(substr(tochange, 1, strpos(tochange, "POBLACION") - 1)) 
    
    list toget 
    
         +-------------------------+
         |                   toget |
         |-------------------------|
      1. | CALLE CONCHA Y TORO 120 |
      2. |           PASAJE ACHAO, |
      3. |       VICENTE REYES 889 |
         +-------------------------+

    Comment


    • #3
      Originally posted by Nick Cox View Post
      Code:
      clear
      input str46 tochange
      "CALLE CONCHA Y TORO 120 POBLACION ARTURO PRAT"
      "PASAJE ACHAO, POBLACION LA LAGUNA"
      "VICENTE REYES 889 POBLACION PATRONA DE CHILE"
      end
      
      gen toget = trim(substr(tochange, 1, strpos(tochange, "POBLACION") - 1))
      
      list toget
      
      +-------------------------+
      | toget |
      |-------------------------|
      1. | CALLE CONCHA Y TORO 120 |
      2. | PASAJE ACHAO, |
      3. | VICENTE REYES 889 |
      +-------------------------+
      Thank you Nick. Is there a way to do it without having to generate another variable? Just curiosity.

      Daniel.

      Comment


      • #4
        You can use replace — with the danger that you get it wrong or change your mind.

        Comment


        • #5
          Originally posted by Nick Cox View Post
          You can use replace — with the danger that you get it wrong or change your mind.
          Thank you Nick, as helpful as always.

          Comment


          • #6
            Originally posted by Nick Cox View Post
            You can use replace — with the danger that you get it wrong or change your mind.
            I have tried to add replace since I have just discovered that there are also other population units like "Villa" that should be removed in the same way we did with Poblacion. But it removes all the observations of address, probably because I don't understand quite well the logic behind the command.

            Code:
            replace address=trim(substr(address, 1, strpos(address, ", POBLACION") -1))
            replace address=trim(substr(address, 1, strpos(address, "SECTOR") -1))
            replace address=trim(substr(address, 1, strpos(address, "VILLA") -1))
            Daniel.

            Comment


            • #7
              Indeed. If a string is not found, then strpos() returns 0 so you need to make the replacement conditional on a positive result.

              See

              Code:
              help strpos()

              Comment


              • #8
                Some alternative code using regular expressions:
                Code:
                clear 
                input str46 tochange
                "CALLE CONCHA Y TORO 120 POBLACION ARTURO PRAT"
                "PASAJE ACHAO, POBLACION LA LAGUNA"
                "VICENTE REYES 889 POBLACION PATRONA DE CHILE"
                "MI VILLA NUMERO 990 EN CHILE"
                "PASAJE ACHAO, LA LAGUNA"
                "PASAJE SECTOR ACHAO, LA LAGUNA"
                end 
                
                gen toget = ustrregexra(tochange,"\s(POBLACION|VILLA|SECTOR)\s.+$","")
                so that you get:

                Code:
                . li, noobs sep(0)
                  +-------------------------------------------------------------------------+
                  |                                      tochange                     toget |
                  |-------------------------------------------------------------------------|
                  | CALLE CONCHA Y TORO 120 POBLACION ARTURO PRAT   CALLE CONCHA Y TORO 120 |
                  |             PASAJE ACHAO, POBLACION LA LAGUNA             PASAJE ACHAO, |
                  |  VICENTE REYES 889 POBLACION PATRONA DE CHILE         VICENTE REYES 889 |
                  |                  MI VILLA NUMERO 990 EN CHILE                        MI |
                  |                       PASAJE ACHAO, LA LAGUNA   PASAJE ACHAO, LA LAGUNA |
                  |                PASAJE SECTOR ACHAO, LA LAGUNA                    PASAJE |
                  +-------------------------------------------------------------------------+
                Last edited by Hemanshu Kumar; 12 Nov 2022, 04:18.

                Comment


                • #9
                  Regular expressions are serious players here. For those who find them intimidating or awkward, other functions remain possible.

                  Code:
                  gen where = 0 
                  
                  foreach w in POBLACION VILLA SECTOR {
                  
                        replace where = strpos(tochange, "`w'")  
                  
                  } 
                  
                  gen toget = cond(where > 0, substr(tochange, 1, where - 1), tochange)
                  which assumes that each key word occurs at most once.

                  Comment


                  • #10
                    The code in #9 has two issues:
                    • it will pick up even those situations where the letters are embedded in another word, e.g. "SECTORES"
                    • the loop will reset the value of where to the location of "SECTOR" in the string, meaning that it misses every case except those where the string contains SECTOR
                    The following modification fixes both issues. It also accounts for situations where more than one keyword appears in the string. I retain the code in #9 to compare the results.

                    Code:
                    clear
                    input str46 tochange
                    "CALLE CONCHA Y TORO 120 POBLACION ARTURO PRAT"
                    "PASAJE ACHAO, POBLACION LA LAGUNA"
                    "VICENTE REYES 889 POBLACION PATRONA DE CHILE"
                    "MI VILLA NUMERO 990 EN CHILE"
                    "PASAJE ACHAO, LA LAGUNA"
                    "PASAJE SECTOR ACHAO, LA LAGUNA"
                    "NUEVOS SECTORES EN CHILE"
                    "MI VILLA DE NOVA SECTOR 20 CHILE"
                    "SECTOR 20 CHILE"
                    end 
                    
                    gen where = 0
                    gen where2 = .
                    foreach w in POBLACION VILLA SECTOR {
                          replace where = strpos(tochange, "`w'")
                          replace where2 = min(where2, strpos(tochange, " `w' ")) if strpos(tochange," `w' ") > 0
                          replace where2 = 1 if strpos(tochange,"`w' ") == 1
                    } 
                    
                    replace where2 = 0 if missing(where2)
                    gen toget = cond(where > 0, substr(tochange, 1, where - 1), tochange)
                    gen toget2 = cond(where2 > 0, substr(tochange, 1, where2 - 1), tochange)
                    drop where where2
                    The results are:
                    Code:
                    . li, noobs sep(0)
                    
                      +--------------------------------------------------------------------------------------------------------------------------+
                      |                                      tochange                                           toget                     toget2 |
                      |--------------------------------------------------------------------------------------------------------------------------|
                      | CALLE CONCHA Y TORO 120 POBLACION ARTURO PRAT   CALLE CONCHA Y TORO 120 POBLACION ARTURO PRAT    CALLE CONCHA Y TORO 120 |
                      |             PASAJE ACHAO, POBLACION LA LAGUNA               PASAJE ACHAO, POBLACION LA LAGUNA              PASAJE ACHAO, |
                      |  VICENTE REYES 889 POBLACION PATRONA DE CHILE    VICENTE REYES 889 POBLACION PATRONA DE CHILE          VICENTE REYES 889 |
                      |                  MI VILLA NUMERO 990 EN CHILE                    MI VILLA NUMERO 990 EN CHILE                         MI |
                      |                       PASAJE ACHAO, LA LAGUNA                         PASAJE ACHAO, LA LAGUNA    PASAJE ACHAO, LA LAGUNA |
                      |                PASAJE SECTOR ACHAO, LA LAGUNA                                         PASAJE                      PASAJE |
                      |                      NUEVOS SECTORES EN CHILE                                         NUEVOS    NUEVOS SECTORES EN CHILE |
                      |              MI VILLA DE NOVA SECTOR 20 CHILE                               MI VILLA DE NOVA                          MI |
                      |                               SECTOR 20 CHILE                                                                            |
                      +--------------------------------------------------------------------------------------------------------------------------+
                    The regular expression equivalent, incidentally, is the single line:
                    Code:
                    gen toget3 = ustrregexra(tochange,"(^|\s)(POBLACION|VILLA|SECTOR)\s.+$","")
                    I am myself trying to learn regular expressions, because as awkward (and yes, intimidating!) as they are, they are both extremely powerful and usually lead to much shorter code even when the old string functions can get the job done.
                    Last edited by Hemanshu Kumar; 12 Nov 2022, 05:54.

                    Comment


                    • #11
                      Thank you so much to both!

                      For the other users, I've solved my issue using the following code:


                      Code:
                      foreach var of varlist address{
                      
                      replace `var' = subinstr(`var', "SECTOR", "POBLACION",.)
                      replace `var' = subinstr(`var', "VILLA", "POBLACION",.)    
                      replace `var' = subinstr(`var', "LOCALIDAD", "POBLACION",.)    
                      
                      replace `var' = subinstr(`var', ", POBLACION", "POBLACION",.)    
                      
                      }
                      
                      gen address2=trim(substr(address, 1, strpos(address, "POBLACION") -1))
                      egen address2dummy=group(address2)
                      replace address=address2 if address2dummy!=.
                      First, I rename the other all the location units as LOCATION. Then I apply the code suggested by Nick and I make the replacement (replace) generating an ID and replace when !=.
                      This is probably not the simplest solution but it works!

                      Daniel.

                      Comment


                      • #12
                        Daniel Perez Parra A couple of warnings and a question:
                        • your code will wrongly pick up situations where VILLA, SECTOR or POBLACION are part of a longer word.
                        • your code will not work if any of those words occur at the start of the string.
                        I hope you are sure that neither of these occur in your data.

                        Also, I am not clear why you create address2dummy. Here it seems to be used to identify observations where address2 is non-missing. If this is its only purpose, you can substitute for the last two lines:
                        Code:
                        replace address = address2 if !missing(address2)

                        Comment


                        • #13
                          Originally posted by Hemanshu Kumar View Post
                          Daniel Perez Parra A couple of warnings and a question:
                          • your code will wrongly pick up situations where VILLA, SECTOR or POBLACION are part of a longer word.
                          • your code will not work if any of those words occur at the start of the string.
                          I hope you are sure that neither of these occur in your data.

                          Also, I am not clear why you create address2dummy. Here it seems to be used to identify observations where address2 is non-missing. If this is its only purpose, you can substitute for the last two lines:
                          Code:
                          replace address = address2 if !missing(address2)
                          Indeed, my data doesn't contain any of those 2 cases, but it is good to point that out.

                          Regarding your second comment, you're right. Thank you for providing an simpler code.

                          Comment


                          • #14
                            Thanks for the corrections in #10. The replacement should be conditional on the variable being positive. As for false positives, yes, that could be a problem. One can always imagine small problems. The code in #8 could fail whenever spelling was even slightly different.

                            Comment

                            Working...
                            X