I have a FASTA file containing DNA sequences. I want to generate a negative dataset from the positive data. One way is to remove some specific subsequences from my data and then shuffle what remains.
Let's say my dataset is a list:
1)
DNAlst:
ACTATACGCTAATATCGATCTACGTACGATCG
CAGCAGCAGCGAGACTATCCTACCGCA
ATATCGATCGCAAAAATCG
I want to exclude these sequences:
ATAT,CGCA
so the result would be:
ACTATACGCTACGATCTACGTACGATCG
CAGCAGCAGCGAGACTATCCTAC
CGATAAAATCG
2) Then I want to shuffle each sequence in pieces of a specific length (e.g. 5). That is, split the DNA string into consecutive 5-mers and shuffle the order of those pieces. For example:
ATATACGCGAAAAAATCTCTC => result after shuffle by 5 ==> AAAAACTCTCCGCAATATA
I would be thankful if you could tell me how to do this in R.
Use the stringi package to remove the unwanted sequences.
Now the shuffle part.
The seq and stri_sub functions will be useful. First we need to 'cut' our DNA sequence into pieces of at most 5 characters. The seq function gives us the starting points, and stri_sub extracts a substring of length 5 from each index generated by seq. Finally, sample will shuffle our vector of pieces and stri_flatten will paste it back together into one string.
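Both steps could be sketched roughly as follows, using the sequences and patterns from the question. The names exclude, shuffle_chunks and k are made up for this illustration, and set.seed is only there to make the shuffle reproducible. Note that the removal is a single regex pass, so an excluded sequence that only appears after another one has been cut out would not be removed.

```r
library(stringi)

# Sequences and patterns taken from the question
DNAlst <- c("ACTATACGCTAATATCGATCTACGTACGATCG",
            "CAGCAGCAGCGAGACTATCCTACCGCA",
            "ATATCGATCGCAAAAATCG")
exclude <- c("ATAT", "CGCA")  # hypothetical name for the sequences to drop

# Step 1: remove every occurrence of the excluded sequences (single pass)
pattern <- stri_flatten(exclude, collapse = "|")  # "ATAT|CGCA"
cleaned <- stri_replace_all_regex(DNAlst, pattern, "")

# Step 2: shuffle one sequence in pieces of at most k characters
# (shuffle_chunks is an illustrative helper, not part of stringi)
shuffle_chunks <- function(dna, k = 5) {
  n <- stri_length(dna)
  if (n == 0) return(dna)                             # nothing to shuffle
  starts <- seq(1, n, by = k)                         # starting index of each piece
  pieces <- stri_sub(dna, from = starts, length = k)  # last piece may be shorter
  stri_flatten(sample(pieces))                        # random order, pasted together
}

set.seed(1)  # only for reproducibility of the example
shuffled <- vapply(cleaned, shuffle_chunks, character(1), USE.NAMES = FALSE)
print(cleaned)
print(shuffled)
```

Because the last piece can be shorter than k, it may end up in the middle of the shuffled string. If it should stay at the end, shuffle only the full-length pieces and append the remainder afterwards.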