First of all, I apologize for the long post. I'm trying to explain the problem in full to avoid an XY situation.
I'm using Stata 12, which has a built-in limit of 244 characters for string variables. I do plan on upgrading to Stata 15 or 16 (once it gets released), which would allow different solutions (since the character limit was increased), but this is just not my reality right now.
I'm working with a genetic sequence database. Each observation is composed of an id variable and a string variable (called "genseq", after genetic sequence) that contains a list of letters (nitrogen bases) between 598 and 606 characters long. I need to find specific patterns of letters anywhere in the genetic sequence and record the position (counted by characters from left to right) at which the pattern begins.
Since I cannot work with variables whose content is longer than 244 characters, I used Excel to convert the X-character genseq variable into X variables, each containing a single letter. My database then looks like this:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str2 id str1(nb1 nb2 nb3 nb4 nb5 nb6 nb7 nb8 nb9 nb10 nb11 nb12 nb13 nb14 nb15 nb16 nb17 nb18 nb19 nb20 nb21 nb22 nb23)
"1a" "A" "T" "G" "A" "C" "G" "C" "G" "T" "T" "A" "A" "T" "G" "C" "T" "C" "G" "A" "C" "C" "G" "C"
"2a" "A" "T" "G" "T" "T" "G" "G" "T" "C" "A" "A" "C" "T" "G" "C" "T" "T" "G" "A" "T" "C" "G" "C"
"3a" "A" "T" "G" "T" "C" "G" "G" "G" "G" "A" "G" "A" "T" "G" "C" "T" "T" "G" "A" "T" "C" "G" "C"
"4a" "A" "T" "G" "T" "T" "A" "C" "G" "C" "G" "T" "G" "A" "A" "C" "G" "C" "G" "T" "C" "C" "G" "C"
"5a" "A" "T" "G" "T" "C" "C" "G" "C" "G" "G" "A" "A" "T" "G" "C" "T" "T" "G" "A" "C" "C" "G" "C"
end
The sequence of letters I'm looking for is:
ACGCGT
(there are other sequences, but the solution for one can probably be extended to the others)
To look for the sequence of letters (ACGCGT), I planned on running code such as:
Code:
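forvalues i = 1/23 {
    gen position`i' = 1 if (nb`i'=="A" & nb`i'+1=="C" & nb`i'+2=="G" & nb`i'+3=="C" & nb`i'+4=="G" & nb`i'+5=="T")
}
This would return the position (variable, or column number) at which the sequence began. The code above does not work, and I believe it's because Stata does not understand nb`i'+n.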
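For reference, here is a minimal sketch of how the "neighboring variable" idea might be written, assuming the nb1-nb23 layout from the dataex example. It relies on Stata's inline expression macro, `=exp', so that nb`=`i'+1' is evaluated to nb2, nb3, and so on before the command runs. The loop bound of 18 (23 minus 5) is an assumption tied to the example data; with the full database it would be the number of nb variables minus 5. This is a sketch, not a confirmed solution.

```stata
* Sketch only: assumes the variables nb1-nb23 from the dataex example.
* The pattern is 6 letters long, so the last possible start is 23 - 5 = 18.
* Run from a do-file, because /// line continuation is not valid interactively.
forvalues i = 1/18 {
    * Each positionN is 1 if ACGCGT starts at column N, and 0 otherwise.
    gen byte position`i' = nb`i' == "A" & nb`=`i'+1' == "C" & ///
        nb`=`i'+2' == "G" & nb`=`i'+3' == "C" & ///
        nb`=`i'+4' == "G" & nb`=`i'+5' == "T"
}
```

Because every start column gets its own indicator, an id in which the pattern occurs more than once would simply have several position variables equal to 1.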
My questions for you are:
1. Can you think of another solution to the problem I have?
2. Is there a way to code for what I'm looking for (values of neighboring variables - nb`i'+n)?
3. While in id 1a the sequence ACGCGT appears only once, in id 4a it appears twice (beginning at nb6 and nb14). The solution code should be able to handle these situations and return every position at which the sequence appears, even if it appears more than once (the code above ideally would do that by creating position6 and position14, which should both be == 1). Can you think of a way of dealing with such situations?
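To make question 3 concrete, the following is a hedged sketch of the kind of output I have in mind: one string variable listing every start position, plus a count of occurrences. The variable names hit_positions, n_hits, and hit are hypothetical, the loop bound of 18 assumes the 23-column example, and the nb`=`i'+1' notation assumes Stata's inline expression macro is used to reach the neighboring variables.

```stata
* Sketch only: hit_positions, n_hits and hit are hypothetical names.
* Assumes nb1-nb23 as in the dataex example; run from a do-file.
gen str1 hit_positions = ""
gen byte n_hits = 0
forvalues i = 1/18 {
    * hit is 1 where ACGCGT starts at column i
    gen byte hit = nb`i' == "A" & nb`=`i'+1' == "C" & ///
        nb`=`i'+2' == "G" & nb`=`i'+3' == "C" & ///
        nb`=`i'+4' == "G" & nb`=`i'+5' == "T"
    * append this start position to the running list for matching rows
    replace hit_positions = hit_positions + " `i'" if hit
    replace n_hits = n_hits + hit
    drop hit
}
replace hit_positions = trim(hit_positions)
list id n_hits hit_positions
```

With the example data, id 4a should end up with n_hits equal to 2 and hit_positions listing both of its start columns, while ids without the pattern keep an empty list.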
Any suggestion would be appreciated!
Please remember that the above example is a small snippet of my database. The original one has ~2200 observations and 607 variables, and the Stata version I have just can't deal with variables with such long content. This is crucial; otherwise I could look for the specific sequence of letters using the advice shared in this topic, and could perhaps deal with multiple appearances of ACGCGT using advice such as this.
Thanks