I have a FASTA file containing DNA sequences. I want to generate a negative dataset from my positive data. One way is to remove certain subsequences from the data and then shuffle what remains.
Let's say my dataset is a list:

1)
DNAlst:
ACTATACGCTAATATCGATCTACGTACGATCG
CAGCAGCAGCGAGACTATCCTACCGCA
ATATCGATCGCAAAAATCG

I want to exclude these sequences:

ATAT,CGCA

so the result would be:

ACTATACGCTACGATCTACGTACGATCG
CAGCAGCAGCGAGACTATCCTAC
CGATAAAATCG

2) Then I want to shuffle each sequence in pieces of a specific length (e.g. 5). That is, cut the DNA string into 5-mers and shuffle the order of those pieces. For example:

ATATACGCGAAAAAATCTCTC => result after shuffle by 5 ==> AAAAACTCTCCGCAATATA

I would be grateful if you could tell me how to do this in R.

Answer by bartektartanus (best answer):

Use the stringi package:

library(stringi)
dna <- c("ACTATACGCTAATATCGATCTACGTACGATCG","CAGCAGCAGCGAGACTATCCTACCGCA","ATATCGATCGCAAAAATCG")

# stri_replace_all_regex replaces every occurrence of ATAT or CGCA with an empty string
stri_replace_all_regex(dna, "ATAT|CGCA","")
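
If the list of motifs to remove is longer, or stringi is not available, the same removal step can be sketched in base R. This is only an illustration: remove_motifs is a hypothetical helper name, and gsub is swapped in for stri_replace_all_regex. It assumes the motifs are plain DNA letters with no regex metacharacters.

```r
# Hypothetical helper (not from stringi): delete every occurrence of the motifs.
remove_motifs <- function(seqs, motifs) {
  # Join the motifs into one alternation pattern, e.g. "ATAT|CGCA".
  pattern <- paste(motifs, collapse = "|")
  # gsub is vectorised over seqs and removes all non-overlapping matches.
  gsub(pattern, "", seqs)
}

dna <- c("ACTATACGCTAATATCGATCTACGTACGATCG",
         "CAGCAGCAGCGAGACTATCCTACCGCA",
         "ATATCGATCGCAAAAATCG")
cleaned <- remove_motifs(dna, c("ATAT", "CGCA"))
cleaned
```

Note that this is a single pass: if deleting a motif joins two flanks into a new ATAT or CGCA, that new occurrence is kept. Repeat the call until the result stops changing if that matters for your negative set.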

Now the shuffle part. The seq and stri_sub functions will be useful here. First we need to cut the DNA sequence into pieces of at most 5 characters. seq gives us the starting positions:

seq(1,24,5)
## [1]  1  6 11 16 21
seq(1,27,5)
## [1]  1  6 11 16 21 26 

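stri_sub extracts substrings of length 5 starting at the indexes generated by seq (the last piece may be shorter). Here it is applied to the original dna[1] for illustration: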

y <- stri_sub(dna[1], seq(from=1,to=stri_length(dna[1]),by=5), length = 5)
y
## [1] "ACTAT" "ACGCT" "AATAT" "CGATC" "TACGT" "ACGAT" "CG"   
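
The cutting step can also be wrapped in a small function. The sketch below uses base R's substring instead of stri_sub. The name split_kmers is hypothetical, and the empty-string guard is an added assumption: a sequence could end up empty after motif removal, and seq(1, 0, by = 5) would raise an error.

```r
# Hypothetical helper: cut one string into consecutive pieces of width k.
split_kmers <- function(x, k = 5) {
  n <- nchar(x)
  if (n == 0) return(character(0))   # nothing to cut
  starts <- seq(1, n, by = k)        # 1, 1 + k, 1 + 2k, ...
  # Clip the end of the final piece so it never runs past the string.
  substring(x, starts, pmin(starts + k - 1, n))
}

original <- "ACTATACGCTAATATCGATCTACGTACGATCG"
pieces <- split_kmers(original, 5)
pieces
```

Pasting the pieces back together in their original order should reproduce the input, which is a quick sanity check before shuffling.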

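sample shuffles the vector of pieces, and stri_flatten pastes them back together into one string. Because sample is random, your output will differ from the one below unless you call set.seed first.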

stri_flatten(y[sample(length(y))])
## [1] "TACGTACGATCGATCAATATACGCTACTATCG"
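
To shuffle every sequence rather than only dna[1], the steps above could be combined roughly as follows. This is a base R sketch, not the answer's stringi code: shuffle_kmers is a hypothetical name, and the seed value is arbitrary and only there to make the run reproducible.

```r
# Hypothetical helper: shuffle the order of the k-length pieces of one string.
shuffle_kmers <- function(x, k = 5) {
  n <- nchar(x)
  if (n <= k) return(x)              # zero or one piece: nothing to shuffle
  starts <- seq(1, n, by = k)
  pieces <- substring(x, starts, pmin(starts + k - 1, n))
  # sample(length(pieces)) gives a random permutation of the piece indexes.
  paste(pieces[sample(length(pieces))], collapse = "")
}

set.seed(42)  # arbitrary seed, only for reproducibility
dna <- c("ACTATACGCTAATATCGATCTACGTACGATCG",
         "CAGCAGCAGCGAGACTATCCTACCGCA",
         "ATATCGATCGCAAAAATCG")
shuffled <- vapply(dna, shuffle_kmers, character(1), k = 5, USE.NAMES = FALSE)
shuffled
```

The short leftover piece (such as "CG" above) is shuffled along with the full 5-mers, so it can land anywhere in the result. If it should stay at the end, shuffle only the full-length pieces and append the remainder afterwards.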