I have a data table with dbSNP rs ids as rows and samples as columns in this kind of format:
dbSNP Sample Sample Sample Sample Sample Sample
rs10000011 CC CC CC CC TC TC
rs1000002 TC TT CC TT TT TC
rs10000023 TG TG TT TG TG TG
rs1000003 AA AG AG AA AA AG
rs10000041 TT TG TT TT TG GG
rs10000046 GG GG AG GG GG GG
There are over 1,000 samples and hundreds of thousands of loci in this table, and I would like to run a large Principal Component Analysis (with samples colored by population).
In order to do that, I need to code my genotypes first. How would I do this (preferably in R or Python, as the file is too big for JMP Genomics)?
Also, some cells are missing data, indicated by --- or 00. I am going to standardize those to NA or . using a find-and-replace script, but how should I code the genotypes so that R will still be able to run the PCA?
While you could roll your own method, I would suggest using an established R package for both the conversion and the PCA. One option is SNPRelate, which supports LD pruning and PCA (see its tutorial). You'll need to convert your data into plink format first; plink can also recode your genotypes. At the bash command line, for a table of genotypes as shown in the example:
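One way to sketch the conversion is to write plink's transposed text format (.tped/.tfam), which, like your table, has one SNP per row. The file name genotypes.txt is assumed here, and since your table has no chromosome or base-pair positions, zeros are used as placeholders (replace them with real coordinates if you have them):

```shell
# Demo input: a small copy of the table from the question (whitespace separated).
cat > genotypes.txt <<'EOF'
dbSNP Sample1 Sample2 Sample3
rs10000011 CC CC TC
rs1000002 TC --- TT
EOF

# Convert to plink transposed format (.tped): one SNP per row as
# "chr rsID cM bp allele1 allele2 ...". Missing genotypes (--- or 00)
# become "0 0", plink's missing-allele code.
awk 'NR > 1 {
    printf "0 %s 0 0", $1
    for (i = 2; i <= NF; i++) {
        g = ($i == "---" || $i == "00") ? "00" : $i
        printf " %s %s", substr(g, 1, 1), substr(g, 2, 1)
    }
    print ""
}' genotypes.txt > data.tped

# Matching .tfam: one line per sample ("FID IID father mother sex phenotype").
awk 'NR == 1 { for (i = 2; i <= NF; i++) printf "F%d S%d 0 0 0 -9\n", i-1, i-1 }' \
    genotypes.txt > data.tfam
```

Handling the missing values at this stage means you do not need a separate find-and-replace pass first.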
Now using plink (1.90):
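Assuming the converted pair is named data.tped/data.tfam, something like:

```shell
# Convert the transposed text files into binary plink format
# (writes data.bed, data.bim, data.fam)
plink --tfile data --make-bed --out data
```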
This recodes the data as bed, which can then be imported into SNPRelate in R using:
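A sketch of the R side, assuming SNPRelate is installed from Bioconductor and the plink fileset is named data (the pop vector for coloring by population is yours to supply):

```r
library(SNPRelate)

# Import the plink .bed/.bim/.fam set into SNPRelate's GDS format
snpgdsBED2GDS("data.bed", "data.fam", "data.bim", "data.gds")
genofile <- snpgdsOpen("data.gds")

# LD pruning, then PCA on the pruned SNP set
snpset <- snpgdsLDpruning(genofile, ld.threshold = 0.2)
pca <- snpgdsPCA(genofile, snp.id = unlist(snpset))

# First two PCs; color by a population factor 'pop' of your own
plot(pca$eigenvect[, 1], pca$eigenvect[, 2],
     xlab = "PC1", ylab = "PC2")
```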
Plink can also recode as numeric genotypes:
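For example (the fileset name data is assumed from the conversion step):

```shell
# 0/1/2 additive coding with samples as rows (writes data_numeric.raw);
# the AD modifier also adds a heterozygote (dominance) column per SNP,
# and missing genotypes come out as NA, which R reads directly
plink --bfile data --recode AD --out data_numeric
```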
Note that the matrix is the transpose of what you started with and there are heterozygote columns.