Coding genotypes from nucleotides to 0/1/2 in R (or Python)

1.3k views Asked by At

I have a data table with dbSNP rs ids as rows and samples as columns in this kind of format:

     dbSNP Sample Sample Sample Sample Sample Sample
rs10000011     CC     CC     CC     CC     TC     TC
rs1000002      TC     TT     CC     TT     TT     TC
rs10000023     TG     TG     TT     TG     TG     TG
rs1000003      AA     AG     AG     AA     AA     AG
rs10000041     TT     TG     TT     TT     TG     GG
rs10000046     GG     GG     AG     GG     GG     GG

There are over a 1000 samples and 100s of 1000s of loci in this table, and I would like to do a massive Principal Component Analysis (with samples colored based on population).

In order to do that, I need to code my genotypes first. I was wondering, how would I do this (preferably in R/Python, as the file is too big for JMP Genomics)?

Also, I have some spots lacking data, which are indicated by --- or 00. I am going to standardize those to NA or . using a find and replace script, but how do I code it so R will still be able to run the PCA?

1

There are 1 answers

0
Nick Kennedy On

While you could roll your own method to do this, I would suggest that for both the conversion and PCA that you use an established R package. One option would be SNPRelate which supports doing LD pruning and PCA (tutorial). You'll need to convert your data into plink first. Plink can also recode your genotypes.

At bash command line for a table of genotypes as shown in the example:

<snpExample.txt head -n1 |cut -f2- | tr '\t' '\n' | perl -pe 's/(.*)/\1\t\1\t0\t0\t0\t0/' > snp.tfam
<snpExample.txt tail -n+2 | perl -pe's/\t([ACTG\-])([ACTG\-])/\t\1 \2/g;s/^([^ \t]+)/0\t\1\t0\t0/' > snp.tped

Now using plink (1.90):

plink --tfile snp --out snp --make-bed

recodes data as bed for import into SNPRelate in R using:

snpgdsBED2GDS(bed.fn, fam.fn, bim.fn, "test.gds")

Plink can also recode as numeric genotypes:

plink --tfile snp --out snp --recodeAD

Note that the matrix is the transpose of what you started with and there are heterozygote columns.