Patterns across multiple variables

Igor Paploski

Join Date: Oct 2014

Posts: 174
#1

18 Jun 2018, 15:28

First of all, I apologize for the long post. I'm trying to explain the problem fully to avoid an XY problem.

I'm using Stata 12, which has a built-in limit of 244 characters for string variables. I do plan on upgrading to Stata 15 or 16 (once it is released), which would allow different solutions (since the character limit was increased), but that is just not my reality right now.

I'm working with a genetic sequence database. Each observation is composed of an id variable and a string variable (called "genseq" after genetic sequence) that contains a list of letters (nitrogen bases) that is between 598 and 606 characters long. I need to find specific patterns of letters anywhere in the genetic sequence and record at which position (counted by characters from left to right) in the genetic sequence the pattern began.

Since I cannot work with variables whose content is longer than 244 characters, I used Excel to convert the X-character genseq variable into X variables, each containing a single letter. My database then looks like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str2 id str1(nb1 nb2 nb3 nb4 nb5 nb6 nb7 nb8 nb9 nb10 nb11 nb12 nb13 nb14 nb15 nb16 nb17 nb18 nb19 nb20 nb21 nb22 nb23)
"1a" "A" "T" "G" "A" "C" "G" "C" "G" "T" "T" "A" "A" "T" "G" "C" "T" "C" "G" "A" "C" "C" "G" "C"
"2a" "A" "T" "G" "T" "T" "G" "G" "T" "C" "A" "A" "C" "T" "G" "C" "T" "T" "G" "A" "T" "C" "G" "C"
"3a" "A" "T" "G" "T" "C" "G" "G" "G" "G" "A" "G" "A" "T" "G" "C" "T" "T" "G" "A" "T" "C" "G" "C"
"4a" "A" "T" "G" "T" "T" "A" "C" "G" "C" "G" "T" "G" "A" "A" "C" "G" "C" "G" "T" "C" "C" "G" "C"
"5a" "A" "T" "G" "T" "C" "C" "G" "C" "G" "G" "A" "A" "T" "G" "C" "T" "T" "G" "A" "C" "C" "G" "C"
end

The sequence of letters I'm looking for is:
ACGCGT
(there are other sequences, but the solution for one can probably be extended to the others)

To look for the sequence of letters (ACGCGT), I planned on running code such as:

Code:

forvalues i = 1/23 {
    gen position`i' = 1 if (nb`i'=="A" & nb`i'+1=="C" & nb`i'+2=="G" & nb`i'+3=="C" & nb`i'+4=="G" & nb`i'+5=="T")
}

This would return the position (variable, or column number) at which the sequence began. The code above does not work, and I believe it's because Stata does not understand nb`i'+n.

My question for you is:
1. Can you think of another solution to the problem I have?
2. Is there a way to code for what I'm looking for (values of neighboring variables - nb`i'+n)?
3. While in id 1a the sequence ACGCGT appears only once, in id 4a it appears twice (beginning at nb6 and nb14). The solution code should be able to handle these situations and return every position at which the sequence appears, even if more than once (the code above ideally would do that by creating position6 and position14, both == 1). Can you think of a way of dealing with such situations?

Any suggestion would be appreciated!

Please remember that the example above is a small snippet of my database; the original has ~2200 observations and 607 variables, and the Stata version I have just can't deal with string variables with such long content. This is crucial, otherwise I could look for the specific sequence of letters using the advice shared in this topic, and could perhaps deal with multiple appearances of ACGCGT using advice such as this.

Thanks
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

18 Jun 2018, 18:41

First, but not best, for

Code:

nb`i'+1

write instead

Code:

nb`=`i'+1'
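
The reason is that nb`i'+1 expands to nb1+1, which Stata reads as the string variable nb1 plus the number 1, whereas `=`i'+1' evaluates the arithmetic first and yields nb2. As a minimal sketch, the loop from the question with that substitution applied might look like the following. The upper bound of 18 is an assumption based on the 23-variable example (23 - 6 + 1); the full data would need a correspondingly larger bound. The `///` continuation works in a do-file, not in the Command window.

```stata
* Sketch only: the original loop with `=exp' macro evaluation applied.
* The bound 18 = 23 - 6 + 1 ensures nb`=`i'+5' always refers to an existing variable.
forvalues i = 1/18 {
    gen byte position`i' = 1 if nb`i'=="A" & nb`=`i'+1'=="C" & nb`=`i'+2'=="G" ///
        & nb`=`i'+3'=="C" & nb`=`i'+4'=="G" & nb`=`i'+5'=="T"
}
```

This creates one indicator per possible start position, so an id with two matches has two of the position variables equal to 1.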

Now, starting from your data, here's a different approach that might point you to an easier-to-generalize solution that will handle multiple matches in a single id.

Code:

reshape long nb, i(id) j(seq)
generate ACGCGT = 1
local f 0
foreach base in A C G C G T {
    by id (seq): replace ACGCGT = ACGCGT & nb[_n+`f++']=="`base'"
}

Code:

. list id seq if ACGCGT, noobs

  +----------+
  | id   seq |
  |----------|
  | 1a     4 |
  | 4a     6 |
  | 4a    14 |
  +----------+
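
Since the question mentions other sequences, the same idea can be wrapped in an outer loop over patterns. The following is only a sketch under assumptions: it presumes the data have already been reshaped long as above (variables id, seq, nb), and GGATCC is a hypothetical second pattern used purely for illustration. Each pattern becomes the name of its own indicator variable, which works because the patterns consist of letters only.

```stata
* Sketch only: flag start positions for several patterns in long-format data.
* ACGCGT is from the question; GGATCC is a hypothetical placeholder.
foreach pat in ACGCGT GGATCC {
    generate byte `pat' = 1
    local len = length("`pat'")
    forvalues k = 1/`len' {
        * compare the base k-1 rows ahead with the k-th letter of the pattern
        bysort id (seq): replace `pat' = `pat' & nb[_n + `k' - 1] == substr("`pat'", `k', 1)
    }
}
```

Afterwards, a command such as list id seq if ACGCGT, noobs shows the start positions for that pattern. References past the end of an id's sequence return an empty string under by, so they never match.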

Last edited by William Lisowski; 18 Jun 2018, 19:02.
Igor Paploski

Join Date: Oct 2014

Posts: 174
#3

18 Jun 2018, 20:11

Hi William Lisowski,

Thank you for your suggestions. Both work great, even though the second takes about 5 minutes to reshape the entire database. It's time for a break anyway.

Thanks again.

Best;