Excluding specific substrings from DNA sequences and shuffling them (i.e. generating a negative set from positive DNA sequences)

Question

Asked by Cina on 17 November 2014 at 08:38

I have a FASTA file containing DNA sequences. I want to generate a negative dataset from my positive data. One way is to remove certain subsequences from the data and then shuffle what remains.
Let's say my dataset is a list:

1)
DNAlst:
ACTATACGCTAATATCGATCTACGTACGATCG
CAGCAGCAGCGAGACTATCCTACCGCA
ATATCGATCGCAAAAATCG

I want to exclude these sequences:

ATAT,CGCA

so the result would be:

ACTATACGCTACGATCTACGTACGATCG
CAGCAGCAGCGAGACTATCCTAC
CGATAAAAATCG

2) Then I want to shuffle each sequence in blocks of a specific length (e.g. 5). That is, the DNA string is cut into pieces of length 5 (5-mers) and the pieces are shuffled. For example:

ATATACGCGAAAAAATCTCTC => result after shuffle by 5 ==> AAAAACTCTCCGCAATATA

I would be thankful if you could tell me how to do this in R.

There is 1 answer

**bartektartanus** · Accepted Answer · 2014-12-23T14:36:17+00:00

Use the stringi package:

library(stringi)

dna <- c("ACTATACGCTAATATCGATCTACGTACGATCG","CAGCAGCAGCGAGACTATCCTACCGCA","ATATCGATCGCAAAAATCG")

# stri_replace_all_regex replaces every occurrence of ATAT or CGCA with an empty string
stri_replace_all_regex(dna, "ATAT|CGCA", "")
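
If the motifs to exclude are held in a vector rather than typed into the pattern by hand, the same removal can be sketched in base R. This is only an illustration: `remove_motifs` is a hypothetical helper name, and it uses base `gsub` in place of stringi. It assumes the motifs contain only plain letters, since they are pasted into a regular expression unescaped.

```r
# Hypothetical helper (not part of stringi): delete every occurrence
# of the given motifs from each sequence.
remove_motifs <- function(seqs, motifs) {
  # Join the motifs into one alternation pattern, e.g. "ATAT|CGCA".
  # Assumption: motifs are plain A/C/G/T strings with no regex metacharacters.
  pattern <- paste(motifs, collapse = "|")
  gsub(pattern, "", seqs)
}

dna <- c("ACTATACGCTAATATCGATCTACGTACGATCG",
         "CAGCAGCAGCGAGACTATCCTACCGCA",
         "ATATCGATCGCAAAAATCG")

cleaned <- remove_motifs(dna, c("ATAT", "CGCA"))
print(cleaned)
```

For the three sequences from the question this gives the strings the asker listed as the expected result. Note that the removal is a single pass: if deleting a motif joins two fragments into a new ATAT or CGCA, that new occurrence is left in place.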

Now the shuffle part. The seq and stri_sub functions will be useful. First we need to cut the DNA sequence into pieces at most 5 characters long. The seq function gives us the starting positions:

seq(1,24,5)
## [1]  1  6 11 16 21
seq(1,27,5)
## [1]  1  6 11 16 21 26

stri_sub extracts substrings of length 5 starting at the indexes generated by seq:

y <- stri_sub(dna[1], seq(from=1,to=stri_length(dna[1]),by=5), length = 5)
y
## [1] "ACTAT" "ACGCT" "AATAT" "CGATC" "TACGT" "ACGAT" "CG"

sample shuffles the vector and stri_flatten pastes the pieces back together into one string. The result is random, so your output will differ from run to run:

stri_flatten(y[sample(length(y))])
## [1] "TACGTACGATCGATCAATATACGCTACTATCG"
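
To apply the cut-and-shuffle steps to every sequence at once, they can be wrapped in a small function. The following is a minimal sketch in base R rather than stringi. `shuffle_chunks` is a hypothetical name, and the seed value is arbitrary and only there to make the run repeatable.

```r
# Hypothetical helper: cut one sequence into pieces of length k
# (the last piece may be shorter) and return the pieces in random order.
shuffle_chunks <- function(x, k = 5) {
  n <- nchar(x)
  if (n <= k) return(x)                      # a single piece: nothing to shuffle
  starts <- seq(1, n, by = k)                # 1, 6, 11, ...
  chunks <- substring(x, starts, pmin(starts + k - 1, n))
  paste(sample(chunks), collapse = "")
}

dna <- c("ACTATACGCTAATATCGATCTACGTACGATCG",
         "CAGCAGCAGCGAGACTATCCTACCGCA",
         "ATATCGATCGCAAAAATCG")

set.seed(1)  # arbitrary seed, for reproducibility only
shuffled <- vapply(dna, shuffle_chunks, character(1), k = 5, USE.NAMES = FALSE)
print(shuffled)
```

Each shuffled string has the same length and base composition as its input. The short trailing piece is shuffled along with the others, so it can end up in the middle of the result. If that is not wanted, it would have to be set aside before the call to `sample`.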
