Help with sequencing a string variable

Dana Sarnak

Join Date: Dec 2020

Posts: 14
#1


07 Feb 2022, 11:38

Hello!
I have a string variable, new, which represents 36 months of a woman's reproductive status, and includes pregnancies, births, contraceptive use, and non-use.

Contraceptive use is indicated by a character for the specific method; the possible values are: 1 2 3 4 5 6 7 8 9 W N L C E S.

Pregnancies are indicated by "P" for each month pregnant, "B" for a birth, and "T" for a termination.

For example, here is one woman's value (read from right to left):

2222222200000LLLLLLBPPPPPPPPP000000

You can see she had 6 months of non-use, followed by 9 months of pregnancy and a birth. After that she used method "L" for 6 months and then did not use for 5 months (0s), and then used method "2" for 8 months.

I am trying to create a variable that indicates how many pregnancies (uninterrupted sequences of Ps) and how many contraceptive use episodes she had. In this case she had 1 pregnancy and 2 use episodes. It could also be 2 new variables: one for the number of pregnancies in the period and one for the number of use sequences in the period. I think R has sequence programming, such as entropy, that can do this easily, but I am having trouble finding a similar program or thinking up a workaround in Stata.

Many many thanks in advance!

Andrew Musau

Join Date: Oct 2014
Posts: 9944

07 Feb 2022, 12:44

You can generalize the following by changing just the highlighted parts:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str100 status
"2222222200000LLLLLLBPPPPPPPPP000000"
end

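* Insert a comma after each run of Ps that is followed by another character
gen counter = ustrregexra(status, "P([^P])", "P,$1")
* Also mark a run of Ps that ends the string
replace counter = ustrregexra(counter, "([P]$)", "$1,")
* Each comma adds one character, so the length difference counts the runs
gen pregnancy = length(counter) - length(status)
drop counter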

Result:

Code:

. l

     +------------------------------------------------+
     |                              status   pregna~y |
     |------------------------------------------------|
  1. | 2222222200000LLLLLLBPPPPPPPPP000000          1 |
     +------------------------------------------------+
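
One way the generalization might look for contraceptive use is sketched below. It swaps P for a character class of the method codes listed in #1. It assumes a use episode is any unbroken run of those codes, so a direct switch from one method to another counts as a single episode. The variable name use_episodes is made up for illustration.

Code:

* Sketch only: same comma trick, with P replaced by the method codes
* (assumes the codes are exactly 1-9 W N L C E S, as listed in #1)
gen counter = ustrregexra(status, "([1-9WNLCES])([^1-9WNLCES])", "$1,$2")
* Also mark a run of method codes that ends the string
replace counter = ustrregexra(counter, "([1-9WNLCES])$", "$1,")
* use_episodes is a hypothetical name
gen use_episodes = length(counter) - length(status)
drop counter

For the example string, commas would be inserted after the last 2 and after the last L, so use_episodes should be 2.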


Nick Cox

Join Date: Mar 2014
Posts: 35207

07 Feb 2022, 13:15

Another approach:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str100 status
"2222222200000LLLLLLBPPPPPPPPP000000"
"PPPPPPPPP000000000PPPPPPPPP00000000"
end

* Requires moss from SSC: ssc install moss
moss status, match("(P+)") regex

* Count the nonempty _match* variables, one per run of Ps found
gen count = 0
foreach v of var _match* {
    replace count = count + 1 if `v' != ""
}
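
The same idea could be extended to give both counts at once, as in the sketch below. It rests on two assumptions worth checking in the help for moss: that moss stores the number of matches in a count variable, and that the prefix() option names the new variables by putting the prefix in front of count, match# and pos#. The prefixes p_ and c_ and the final variable names are made up for illustration. As written, a direct switch between two methods counts as one use episode.

Code:

* Sketch only; run on the original data, before any other moss call
* Runs of P, with results assumed to be stored as p_count, p_match1, ...
moss status, match("(P+)") regex prefix(p_)
* Runs of method codes, assumed to be stored as c_count, c_match1, ...
moss status, match("([1-9WNLCES]+)") regex prefix(c_)

* Hypothetical names for the two counts
rename (p_count c_count) (pregnancies use_episodes)

Using a separate prefix for each call keeps the second call from colliding with variables created by the first.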


Bjarte Aagnes

Join Date: Apr 2014
Posts: 782

08 Feb 2022, 03:39

Code:

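* Recode every method code as C, then collapse each run of repeated characters to one
gen status2 = ustrregexra(ustrregexra(status,"[1-9WNLCES]","C"),"(.)\1+","$1")
* Each remaining C is one use episode; each remaining P is one pregnancy
gen contraceptive = ustrlen(ustrregexra(status2,"[^C]",""))
gen pregnancies   = ustrlen(ustrregexra(status2,"[^P]",""))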

Code:

. list , abbrev(32)

     +-----------------------------------------------------------------------------+
     |                              status   status2   contraceptive   pregnancies |
     |-----------------------------------------------------------------------------|
  1. | 2222222200000LLLLLLBPPPPPPPPP000000    C0CBP0               2             1 |
  2. | PPPPPPPPP000000000PPPPPPPPP00000000      P0P0               0             2 |
     +-----------------------------------------------------------------------------+


Dana Sarnak

Join Date: Dec 2020

Posts: 14
#5

08 Feb 2022, 09:41

These are all great fixes, thank you so so much!!
