I have aligned 2 DNAStringSet, and use the Biostring package. I would like to extract all the portions of the sequences that match perfectly and are at least 10 nucleotides (nt) long, as well as the position of the match in both the pattern and the query sequence.
Reproducible example
library('Biostrings')
set.seed(123)
dna_alphabet <- DNA_ALPHABET[1:4]
seq1_random <- sample(dna_alphabet, 100, replace = TRUE)
seq2_random <- sample(dna_alphabet, 100, replace = TRUE)
match_12 <- sample(dna_alphabet, 12, replace = TRUE)
match_15 <- sample(dna_alphabet, 15, replace = TRUE)
match_23 <- sample(dna_alphabet, 23, replace = TRUE)
seq1_random[15:(15+11)] <- match_12
seq2_random[17:(17+11)] <- match_12
seq1_random[41:(41+14)] <- match_15
seq2_random[40:(40+14)] <- match_15
seq1_random[70:(70+22)] <- match_23
seq2_random[73:(73+22)] <- match_23
seq1 <- DNAString(paste(seq1_random, collapse = ""))
seq2 <- DNAString(paste(seq2_random, collapse = ""))
alignment <- pairwiseAlignment(pattern = seq1, subject = seq2, type = "global")
alignment
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: GGGCGCCCGATCCA--CCTATTCGTTTGTCGCACGTCAGGATATTTCACGTGAGTCATCACAAT----TGACAAGACGACCATTTCTAACATGAATTCACTAACGG
subject: ATCATCAGGTTCTGACCCTATTCGTTTGA---ATGCCCTCCCATTTCACGTGAGTCAAGGGAGGCCCGTGGCACTACGACCATTTCTAACATGAATTCGGTAT---
score: -138.1462
Expected Result
> df_results
perfect_match start_pos_pattern end_pos_pattern start_pos_subject end_pos_subject
1 CCTATTCGTTTG 15 26 17 28
2 ATTTCACGTGAGTCA 41 55 40 54
3 ACGACCATTTCTAACATGAATTC 70 92 73 95
My actual alignment is >15,000 nt long, so I need to find an efficient way to proceed. I am more familiar with the {tidyverse} than {data.table} options, if possible.
Current attempt:
What I found so far is: Extract the aligned sequences:
aligned_pattern_str <- as.character(alignment@pattern)
aligned_pattern_str
[1] "GGGCGCCCGATCCA--CCTATTCGTTTGTCGCACGTCAGGATATTTCACGTGAGTCATCACAAT----TGACAAGACGACCATTTCTAACATGAATTCACTAA"
aligned_subject_str <- as.character(alignment@subject)
aligned_subject_str
[1] "ATCATCAGGTTCTGACCCTATTCGTTTGA---ATGCCCTCCCATTTCACGTGAGTCAAGGGAGGCCCGTGGCACTACGACCATTTCTAACATGAATTCGGTAT"
Then: I use a rolling window to compare the first character in aligned_pattern_str and the first character in aligned_subject_str. If they don´t match, I compare the second of each etc until they match. Once they match: I keep comparing the next characters until there is a mismtach. Then, if the match is >10 nucleotide, I report it, otherwise Idont. And I keep proceeding until the entire sequences have been covered.
So my code to that step:
match_start <- NULL
match_length <- 0
matches <- list()
for (i in 1:nchar(aligned_pattern_str)) {
if (substr(aligned_pattern_str, i, i) == substr(aligned_subject_str, i, i) && substr(aligned_pattern_str, i, i) != "-") && {
if (is.null(match_start)) {
match_start <- i
}
match_length <- match_length + 1
} else {
if (!is.null(match_start) && match_length > 10) {
matches[[length(matches) + 1]] <- list(start = match_start, end = i - 1, length = match_length)
}
match_start <- NULL
match_length <- 0
}
}
if (!is.null(match_start) && match_length > 10) {
matches[[length(matches) + 1]] <- list(start = match_start, end = nchar(aligned_pattern_str), length = match_length)
}
for (match in matches) {
cat(sprintf("Match from %d to %d, Length: %d\n", match$start, match$end, match$length))
}
>
Match from 17 to 28, Length: 12
Match from 43 to 57, Length: 15
Match from 76 to 98, Length: 23
Problems
It is not the final answer:
- at this point, I only have the position in the aligned pattern sequence. Which means that the mismatch that occured before the match, noted - -, are counted in the positions. So instead of having the positions 17-28, 40-54, and 73-95 in the subject sequence, the code returns 17-28, 43-57, and 76-98 (cf the expected results versus the results from the current lead).
- the current attempt does not provide the matched positions in the subject sequence.
- Finally, I don´t have a dataframe but a list at this point.
Here is what I eventually found. It might not be the most efficient way to go,especially over long alignments, and it uses a loop, which I was told to avoid in R. But it gets the job done:
First, I extract the aligned sequences:
I create 3 variables:
Then I create a function to correct the positions with the number of gaps:
And I loop over the aligned sequences:
I then print the list of matches:
And I transform the list into a data frame:
The results is: