How to use as.DNAbin{ape} with DNA sequences stored in a dataframe?


I have a dataframe with loci names in one column and DNA sequences in the other. I'm trying to use as.DNAbin{ape} or similar to create a DNAbin object.

Here is some example data:

x <- structure(c("55548", "43297", "35309", "34468", "AATTCAATGCTCGGGAAGCAAGGAAAGCTGGGGACCAACTTCTCTTGGAGACATGAGCTTAGTGCAGTTAGATCGGAAGAGCA", "AATTCCTAAAACACCAATCAAGTTGGTGTTGCTAATTTCAACACCAACTTGTTGATCTTCACGTTCACAACCGTCTTCACGTT", "AATTCACCACCACCACTAGCATACCATCCACCTCCATCACCACCACCGGTTAAGATCGGAAGAGCACACTCTGAACTCCAGTC", "AATTCTATTGGTCATCACAATGGTGGTCCGTGGCTCACGTGCGTTCCTTGTGCAGGTCAACAGGTCAAGTTAAGATCGGAAGA"), .Dim = c(4L, 2L))

If I try y <- as.DNAbin(x), R creates a sort of DNAbin object with 4 DNA sequences (the 4 rows of the example) of length 2 (the two columns, I assume). There are no labels, and of course the base composition doesn't work either.

The documentation is not very clear, but after playing with the woodmouse example data of the package I think that what I need to do is to create a matrix with each base as a column and then use as.DNAbin. I.e. in the above example a 4 x 84 matrix (1 column for locus name and 83 for the sequences?). Any advice on how to do this? Or any better idea?

Thanks

redmode On BEST ANSWER

The first argument of as.DNAbin should be a matrix or a list containing the DNA sequences, or an object of class "alignment". So your idea is right: the matrix needs one base per cell.

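Given that x is the structure from the original post, the code below prepares a 4 x 83 character matrix y. The locus names go into the row names rather than into a column of their own: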

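# Split each sequence into single lowercase bases, one row per sequence
y <- t(sapply(strsplit(x[, 2], ""), tolower))
# Use the locus names as the sequence labels
rownames(y) <- x[, 1]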

Then as.DNAbin(y) shows:

4 DNA sequences in binary format stored in a matrix.

All sequences of same length: 83 

Labels: 55548 43297 35309 34468 

Base composition:
    a     c     g     t 
0.289 0.262 0.205 0.244
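
The matrix approach above relies on all sequences having the same length. If the sequences sit in a real data frame and their lengths differ, the list form mentioned at the start of the answer may be the easier route. The following is only a minimal sketch under assumptions: the data frame `loci` and its column names `locus` and `seq` are hypothetical stand-ins for your own data, and the sequences are shortened for readability.

```r
# Hypothetical data frame standing in for the poster's data:
# one column of locus names, one of sequences (lengths may differ).
loci <- data.frame(
  locus = c("55548", "43297", "35309"),
  seq   = c("AATTCAATGC", "AATTCCTA", "AATTCACCACCA"),
  stringsAsFactors = FALSE
)

# Split each sequence into a vector of single lowercase bases.
bases <- strsplit(tolower(loci$seq), "")

# Name each list element after its locus so the labels carry over.
names(bases) <- loci$locus

# ape is only needed for the final conversion step.
if (requireNamespace("ape", quietly = TRUE)) {
  dna <- ape::as.DNAbin(bases)  # list input, so the result is stored as a list
  print(dna)
}
```

Because the result is stored as a list rather than a matrix, it does not require an alignment. Functions that expect aligned sequences will still need sequences of equal length.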