I am trying to calculate tajD from SNP data (~4000SNPs). I have both fasta and vcf files, and initially tried using my fasta file. With help, I got to the point that i have a script that works to sample from the file, but I needed to split the fasta file into separate populations. I got scared to do this (though I have have to given the error that I am getting now), so I phased my vcf file, and am hoping to using this instead. I'm using the R package pegas, and am getting the following error.
> rm(list=ls(all=T))
> library("ape")
> library("ade4")
> library("adegenet")
> library("pegas")
> b8c18FromVcf <- read.vcf("b8c18_2phased.vcf")
File apparently not yet accessed: Scanning file b8c18_2phased.vcf 3.194102 / 3.194102 Mb Done. Reading 4074 / 4074 loci. Done.
getting haplotypes
> b8c18haplos <- haplotype(b8c18FromVcf)
Analysing individual no. 186 / 186
tajD from haplos
> tajd <- tajima.test(b8c18haplos)
Warning message: In tajima.test(b8c18haplos) : Tajima test requires at least 4 sequences
I will attach links to both my phased and unphased files here. https://drive.google.com/open?id=0B6qb8IlaQGFZTmQ1YXRVbnFSRzA https://drive.google.com/open?id=0B6qb8IlaQGFZQm9HZjZSUkE3NEU
Any thoughts?
Lastly, I am wondering if there is a way to subset populations within the tajD command. I have 7 populations in the same vcf file, and I should calculate tajD for each separately. If not, what is the best tool for subsetting vcfs. I have done loads of googling on this, but none seem to be straightforward.
With thanks,
The developer of pegas responded to me to say that Tajima's D requires DNA sequences. It cannot be executed using a vcf file, even if it is phased.