Subset sequence data in fasta file based on IDs stored in listed data frames

Question

Subset sequence data in fasta file based on IDs stored in listed data frames

742 views Asked by Moritz At 19 June 2015 at 15:52

I am trying to subset one FASTA file (containing multiple sequences) into several smaller ones based on IDs I stored in a list of data frames (and

I have a FASTA called fastafile like this:

 fastafile <- dput(fastafile)
structure(list(r1 = "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac", 
    r2 = "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag", 
    r3 = "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgca", 
    r4 = "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcgg", 
    r5 = "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgg", 
    r6 = "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgg"), .Names = c("r1", 
"r2", "r3", "r4", "r5", "r6"))

that I loaded using seqinr package like that:

fastafile <- read.fasta(file = "fastafile.fasta", 
                       seqtype = c("DNA","AA"),
                       as.string = TRUE, set.attributes = FALSE)

I load a table with my IDs and some expression values

GOI <- read.table(header = TRUE, text = "ID        T1        T2
1 r1 1.1 2.1
2 r2 1.2 2.2
3 r3 1.1 2.2
4 r4 1.2 2.1
5 r5 1.1 2.1
6 r6 1.2 2.2")

and split them into manageable subsets

GOI.split <- split(GOI,rep(1:3,each=2))

giving me

> GOI.split
$`1`
  ID  T1  T2
1 r1 1.1 2.1
2 r2 1.2 2.2

$`2`
  ID  T1  T2
3 r3 1.1 2.2
4 r4 1.2 2.1

$`3`
  ID  T1  T2
5 r5 1.1 2.1
6 r6 1.2 2.2

Now I would like to subset my sequences based on the IDs in the GOI.split data frames. In this mock example it should be two sequences per list item. To get the subset for the first one of the listed data frames I can say:

FASTA.1 <- fastafile[c(which(names(fastafile) %in% GOI.split[[1]][,1]))]
# $r1
# [1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac"
# 
# $r2
# [1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag"

(and so on) however I would like to subset for all data frames in one swift move to have a list with my desired fastas (3 list items containing, in this case, 2 sequences each). I tried:

FASTAs <- lapply(fastafile, function(i)
{fastafile[c(which(names(fastafile) %in% GOI.split[[i]][ ,1]))]})

Could somebody please tell me why this is not working and what I have to do instead.

Thanks

Original Q&A

There are 1 answers

**Jota** · Accepted Answer · 2015-06-19T17:35:22+00:00

This can be done with the following code:

split(fastafile[GOI$ID], rep(1:3,each=2))


$`1`
$`1`$r1
[1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac"

$`1`$r2
[1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag"


$`2`
$`2`$r3
[1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgca"

$`2`$r4
[1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcgg"


$`3`
$`3`$r5
[1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgg"

$`3`$r6
[1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgg"

As to why your lapply code is not working. One reason is that you are passing in fastafile, and you should be passing in indices.

So you are trying this:

fastafile[c(which(names(fastafile) %in% GOI.split[[fastafile[[1]]]][ ,1]))]
#named list()

When you should do this:

fastafile[c(which(names(fastafile) %in% GOI.split[[1]][ ,1]))]
#$r1
#[1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac"
#
#$r2
#[1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag"

So, to fix it, pass in 1:length(GOI.split) instead of fastafile:

lapply(1:length(GOI.split), function(i)
 {fastafile[c(which(names(fastafile) %in% GOI.split[[i]][ ,1]))]})

[[1]]
[[1]]$r1
[1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcac"

[[1]]$r2
[1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgag"


[[2]]
[[2]]$r3
[1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgca"

[[2]]$r4
[1] "acatattggaggccgaaacaatgaggcgtgatcaactcagtatatcgg"


[[3]]
[[3]]$r5
[1] "ctaacctctcccagtgtggaacctctatctcatgagaaagctgggatgg"

[[3]]$r6
[1] "atttcctcctgctgcccgggaggtaacaccctggacccctggagtctgg"

TechQA.

Subset sequence data in fasta file based on IDs stored in listed data frames

There are 1 answers

Related Questions in R

Related Questions in SUBSET

Related Questions in FASTA

Related Questions in SEQ

Popular Questions

Popular Tags

Trending Questions