I have a FASTA file containing DNA sequences. I want to generate a negative dataset from my positive data. One way is to remove some specific subsequences from the data and then shuffle what is left.
Let's say my dataset is a list:
1)
DNAlst:
ACTATACGCTAATATCGATCTACGTACGATCG
CAGCAGCAGCGAGACTATCCTACCGCA
ATATCGATCGCAAAAATCG
I want to exclude these sequences:
ATAT,CGCA
so the result would be:
ACTATACGCTACGATCTACGTACGATCG
CAGCAGCAGCGAGACTATCCTAC
CGATAAAATCG
2)
Then I want to shuffle each sequence in blocks of a specific length (e.g. 5). That is, cut the DNA string into parts of length 5 (5-mers) and shuffle those parts. For example:
ATATACGCGAAAAAATCTCTC => result after shuffle by 5 ==> AAAAACTCTCCGCAATATA
I would be grateful if you could tell me how to do this in R.
Use the stringi package for the first part.
Now the shuffle part. The seq and stri_sub functions will be useful. First we need to cut the DNA sequence into pieces at most 5 characters long: seq gives us the starting points, and stri_sub extracts a substring of length 5 from each index generated by seq. Then sample shuffles the vector of pieces and stri_flatten pastes it back together into one string.
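Here is a minimal sketch of both steps with stringi, not necessarily the exact code originally posted. The helper names remove_motifs and shuffle_kmers are made up for illustration. Joining the excluded sequences into one regular expression and calling stri_replace_all_regex is an assumption about how to do the removal. It is a single left-to-right pass, so any occurrence that only appears after an earlier removal is not removed again.

```r
library(stringi)

dna <- c("ACTATACGCTAATATCGATCTACGTACGATCG",
         "CAGCAGCAGCGAGACTATCCTACCGCA",
         "ATATCGATCGCAAAAATCG")
exclude <- c("ATAT", "CGCA")

# Step 1 (hypothetical helper): delete every occurrence of the excluded
# sequences. The motifs are joined into one alternation, e.g. "ATAT|CGCA".
remove_motifs <- function(x, motifs) {
  pattern <- stri_flatten(motifs, collapse = "|")
  stri_replace_all_regex(x, pattern, "")
}

# Step 2 (hypothetical helper): shuffle one string in blocks of k characters.
shuffle_kmers <- function(s, k = 5) {
  n <- stri_length(s)
  if (n == 0) return(s)
  starts <- seq(1, n, by = k)                       # 1, 6, 11, ...
  pieces <- stri_sub(s, from = starts, length = k)  # last piece may be shorter
  stri_flatten(sample(pieces))                      # permute, then glue together
}

cleaned <- remove_motifs(dna, exclude)
shuffled <- vapply(cleaned, shuffle_kmers, character(1), k = 5,
                   USE.NAMES = FALSE)
```

Call set.seed() before the shuffle if you need a reproducible negative set. If a sequence length is not a multiple of k, the shorter last piece is shuffled along with the others, so it can end up anywhere in the result.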