Split after different string

Ylenia Curci

Join Date: Sep 2017

Posts: 72
#1

29 Feb 2024, 14:25

Hello,
I would like to split a string variable right after a certain word appears, but the word is not the same for each observation. This word is recorded in the variable "parcing". The variable I want should look like this:
"hola adios"
"ciao adios"
"yes bye ciao adios hola"
Thank you

clear
input strL content str12 parcing
"hello yes bye ciao hola adios" "ciao"
"hello hola yes ehy bye ciao adios" "bye"
"hello yes bye ciao adios hola" "hello"
end
George Ford

Join Date: Aug 2014

Posts: 3036
#2

29 Feb 2024, 14:58

I'm not sure what you're after, but you'd likely use subinstr with strpos: strpos marks the location of the word, and adding the length of the word gives the position after which to cut.
Ylenia Curci

Join Date: Sep 2017

Posts: 72
#3

29 Feb 2024, 15:28

Thank you George, I thought about that, but I was hoping there was a way to write only one line of code like "split content, p (parsing)", which I know doesn't exist, but you know, hope springs eternal.

Nick Cox

Join Date: Mar 2014
Posts: 35211
#4

29 Feb 2024, 15:45

This does more or less what split does: look for the parsing string and pick what comes before and what comes after.

I added an example where the parsing string is at the end and one where it doesn't occur. You might want different rules for that last case.

Also consider whether you want to trim any leading and trailing spaces.

Code:

clear
input strL content str12 parsing
"hello yes bye ciao hola adios" "ciao"
"hello hola yes ehy bye ciao adios" "bye"
"hello yes bye ciao adios hola" "hello"
"newt toad frog" "frog"
"dinosaur newt toad" "frog"
end

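* position of the first occurrence of parsing in content; 0 if it does not occur
gen where = strpos(content, parsing)
* text before the parsing string; the whole of content if it does not occur
gen before = substr(content, 1, cond(where == 0, ., where - 1))
* text after the parsing string; left empty if it does not occur
gen after = substr(content, where + strlen(parsing), .) if where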

list, sep(0)

     +------------------------------------------------------------------------------------------------------+
     |                           content   parsing   where                before                      after |
     |------------------------------------------------------------------------------------------------------|
  1. |     hello yes bye ciao hola adios      ciao      15        hello yes bye                  hola adios |
  2. | hello hola yes ehy bye ciao adios       bye      20   hello hola yes ehy                  ciao adios |
  3. |     hello yes bye ciao adios hola     hello       1                          yes bye ciao adios hola |
  4. |                    newt toad frog      frog      11            newt toad                             |
  5. |                dinosaur newt toad      frog       0    dinosaur newt toad                            |
     +------------------------------------------------------------------------------------------------------+
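
On the trimming point, here is a minimal sketch. It assumes Stata 14 or later for strtrim() (older versions have trim()), and it reuses two of the example observations so that it runs on its own. The split leaves one space next to the parsing string, which strtrim() removes.

```stata
clear
input strL content str12 parsing
"hello yes bye ciao hola adios" "ciao"
"hello yes bye ciao adios hola" "hello"
end

gen where = strpos(content, parsing)
gen before = substr(content, 1, cond(where == 0, ., where - 1))
gen after = substr(content, where + strlen(parsing), .) if where

* strip the space left on either side of the split point
replace before = strtrim(before)
replace after = strtrim(after)

list, sep(0)
```

Note that strtrim() removes only leading and trailing blanks; blanks between words are untouched.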


Nick Cox

Join Date: Mar 2014

Posts: 35211
#5

01 Mar 2024, 04:59

Two additional notes:

1. The code takes the problem in #1 literally and splits on the first occurrence of the parsing string. Unlike with split, the result will not be three or more variables if the parsing string occurs twice or more.

2. split was added in Stata 8. As documented in the manual entry it builds on my work and also on earlier work jointly with Michael Blasnik. The idea that the parsing string might differ and need to be recorded in a variable didn't arise at the time, or since that I can recall. Be that as it may, a full generalization of split might build on the existing code, but I am not volunteering.
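
On note 1, if the split should instead happen at the last occurrence of the parsing string, one possible sketch uses strrpos(), which I am assuming is available (Stata 14 or later). The data and the variable names where_first, where_last and after_last are made up for illustration and are not from #1.

```stata
clear
input strL content str12 parsing
"ciao hello ciao adios" "ciao"
"newt toad" "frog"
end

* strpos() finds the first occurrence, strrpos() the last; both return 0 if none
gen where_first = strpos(content, parsing)
gen where_last = strrpos(content, parsing)

* text after the last occurrence, trimmed; left empty if parsing does not occur
gen after_last = strtrim(substr(content, where_last + strlen(parsing), .)) if where_last

list, sep(0)
```

Comparing where_first and where_last is also a quick way to flag observations in which the parsing string occurs more than once.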