Subset sequence data in fasta file based on IDs stored in listed data frames

734 views Asked by At

I am trying to subset one FASTA file (containing multiple sequences) into several smaller ones based on IDs I stored in a list of data frames (and

I have a FASTA called fastafile like this:

 fastafile <- dput(fastafile)
structure(list(r1 = "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac", 
    r2 = "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag", 
    r3 = "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgca", 
    r4 = "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcgg", 
    r5 = "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgg", 
    r6 = "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgg"), .Names = c("r1", 
"r2", "r3", "r4", "r5", "r6"))

that I loaded using seqinr package like that:

fastafile <- read.fasta(file = "fastafile.fasta", 
                       seqtype = c("DNA","AA"),
                       as.string = TRUE, set.attributes = FALSE)

I load a table with my IDs and some expression values

GOI <- read.table(header = TRUE, text = "ID        T1        T2
1 r1 1.1 2.1
2 r2 1.2 2.2
3 r3 1.1 2.2
4 r4 1.2 2.1
5 r5 1.1 2.1
6 r6 1.2 2.2")

and split them into manageable subsets

GOI.split <- split(GOI,rep(1:3,each=2))

giving me

> GOI.split
$`1`
  ID  T1  T2
1 r1 1.1 2.1
2 r2 1.2 2.2

$`2`
  ID  T1  T2
3 r3 1.1 2.2
4 r4 1.2 2.1

$`3`
  ID  T1  T2
5 r5 1.1 2.1
6 r6 1.2 2.2

Now I would like to subset my sequences based on the IDs in the GOI.split data frames. In this mock example it should be two sequences per list item. To get the subset for the first one of the listed data frames I can say:

FASTA.1 <- fastafile[c(which(names(fastafile) %in% GOI.split[[1]][,1]))]
# $r1
# [1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac"
# 
# $r2
# [1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag"

(and so on) however I would like to subset for all data frames in one swift move to have a list with my desired fastas (3 list items containing, in this case, 2 sequences each). I tried:

FASTAs <- lapply(fastafile, function(i)
{fastafile[c(which(names(fastafile) %in% GOI.split[[i]][ ,1]))]})

Could somebody please tell me why this is not working and what I have to do instead.

Thanks

1

There are 1 answers

0
Jota On BEST ANSWER

This can be done with the following code:

split(fastafile[GOI$ID], rep(1:3,each=2))


$`1`
$`1`$r1
[1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac"

$`1`$r2
[1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag"


$`2`
$`2`$r3
[1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgca"

$`2`$r4
[1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcgg"


$`3`
$`3`$r5
[1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgg"

$`3`$r6
[1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgg"

As to why your lapply code is not working. One reason is that you are passing in fastafile, and you should be passing in indices.

So you are trying this:

fastafile[c(which(names(fastafile) %in% GOI.split[[fastafile[[1]]]][ ,1]))]
#named list()

When you should do this:

fastafile[c(which(names(fastafile) %in% GOI.split[[1]][ ,1]))]
#$r1
#[1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac"
#
#$r2
#[1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag"

So, to fix it, pass in 1:length(GOI.split) instead of fastafile:

lapply(1:length(GOI.split), function(i)
 {fastafile[c(which(names(fastafile) %in% GOI.split[[i]][ ,1]))]})

[[1]]
[[1]]$r1
[1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac"

[[1]]$r2
[1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag"


[[2]]
[[2]]$r3
[1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgca"

[[2]]$r4
[1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcgg"


[[3]]
[[3]]$r5
[1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgg"

[[3]]$r6
[1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgg"