How can I find the complement of a subset of a DNA sequence using a logical index?

229 views Asked by At

I have a DNA sequence, its length for example is m*4n:

B = 'GATTAACTACACTTGAGGCT...';

I have also a vector of real numbers X = {xi, i = 1..m*4n}, and use mod(X,1) to keep them in the range [0,1]. For example:

X = [0.223 0.33 0.71 0.44 0.91 0.32 0.11 ....... m*4n];

I then need to transform X into a binary vector by applying the function:

f(x)={0  ,0 < X(i,j) ≤ 0.5;  1 ,0.5 < X(i,j) ≤ 1;)

The output according the previous values will be like X = [0010100 ....]. If X(i,j)==1, then B(i,j) is complemented, otherwise it is unchanged. In this context, the complement is the matching base-pair (i.e. A->T, C->G, G->C, and T->A).

This is the code I've tried so far, with no success:

%%maping X chaotic sequence from real numbers to binary sequence using threshold function
 X = v(:,3); 
 X(257)=[];
 disp (X);
 mode (X,1);
 for i=1
    for j=1:256
 if ((X(i,j)> 0) && (X(i,j)<= .5))
     X(i,j) = 0;
 elseif ((X(i,j)> .5) && (X(i,j)<= 1)) 
     X(i,j) = 1;
 end
    end
 end
 disp(X);

How can I properly perform the indexing and complement?

1

There are 1 answers

3
gnovice On BEST ANSWER

Given a sample base-pair sequence stored as a character array:

B = 'GATTAACT';

And a sample vector of numeric values the same length as B:

X = [0.223 0.33 0.71 0.44 0.91 0.32 0.11 1.6];

Then there is a fairly straightforward solution...

First, your use of the mod function implies you want to use only the fractional part of each value in X. This is how you wold do that:

>> X = mod(X, 1)
X =
    0.2230    0.3300    0.7100    0.4400    0.9100    0.3200    0.1100    0.6000

Next, you should give the documentation on vectorization a read. It will teach you that a for loop can be avoided for many operations in MATLAB. In particular, applying a logical test to your vector X can be done like so:

>> index = (X > 0.5)
index =
    0   0   1   0   1   0   0   1

And index is now a logical index the same length as X with ones (i.e. true) for each value greater than 0.5. You now want to get the characters corresponding to those indices in B, change them to their complement, then place them back in B. You can do this using a little trick in MATLAB whereby a character is converted to its ASCII numeric value when used an as index:

>> compMap = '';  % Initialize to an empty string
>> compMap('ACGT') = 'TGCA'
compMap =
                                                                T G   C            A

Notice the characters 'TGCA' get placed in indices 65, 67, 71, and 84 of compMap (i.e. the ASCII values for 'ACGT'). The rest are blanks. Now, you can replace the indexed base-pairs with their complements by simply doing this:

>> B(index) = compMap(B(index))
B =
GAATTACA

Putting this all together, here's the solution:

B = '...';     % Whatever your sequence is
X = [...];     % Whatever your values are
compMap = '';
compMap('ACGT') = 'TGCA';      % Build a complement map
index = (mod(X, 1) > 0.5);     % Get your logical index
B(index) = compMap(B(index));  % Replace with complements