Extract summary statistic (beta, SE, EA, non-EA, EA frequency) from csv files (extracted from vcf) not working

172 views Asked by At

I download a VCF file + index (https://gwas.mrcieu.ac.uk/datasets/ebi-a-GCST90001942/) I tried to open it on R (see below script). However, R would only load and not open my file. I extracted the VCF as a csv file, and R can it read. However, now I need to obtain the summary statistics (beta, SE, EA, non-EA, EA frequency. I am doing the following

install.packages("tidyverse")
install.packages("remotes")
install.packages("gwasrapidd")

library(tidyverse)
library(remotes)
library(gwasrapidd)

remotes::install_github("mrcieu/ieugwasr")
remotes::install_github("MRCIEU/TwoSampleMR")
remotes::install_github("MRCIEU/gwasvcf")
remotes::install_github("MRCIEU/gwasglue")

library(ieugwasr)
library(TwoSampleMR)
library(gwasvcf)
library(gwasglue)

Using GWAS VCF files

library(gwasvcf)
set_bcftools('/path/to/bcftools')
set_plink('/path/to/plink')

remotes::install_github('mrcieu/genetics.binaRies', force = TRUE)
set_plink()
set_bcftools()


suppressWarnings(suppressPackageStartupMessages({
  library(gwasvcf)
  library(VariantAnnotation)
  library(dplyr)
  library(magrittr)
}))

set_bcftools()

Reading in everything

To read an entire dataset use the readVcf function.

vcffile <- "/Users/path/Desktop/ebi-a-GCST90001942.vcf.gz"
vcf_data <- readVcf(vcffile)

R loads forever and does not open...

2

There are 2 answers

0
Lloyd_LiuSiyi On

I haven't tried gwasvcf package, but did try converting the summary vcf file you provided with MungeSumstats R package and it worked fine.

After installing the MungeSumstats package, you could simply

library(MungeSumstats)
MungeSumstats::format_sumstats("path/ebi-a-GCST90001942.vcf", ref_genome = "GRCh37")

the package automatically converts the vcf into standard summary statistic file.

Dropping 1 duplicate column(s).
1 sample detected: ebi-a-GCST90001942
Constructing ScanVcfParam object.
VCF contains: 15,147,117 variant(s) x 1 sample(s)
Reading VCF file: single-threaded
Converting VCF to data.table.
Expanding VCF first, so number of rows may increase.
Dropping 1 duplicate column(s).
Checking for empty columns.
Unlisting 3 columns.
Time difference of 14.5 mins
VCF data.table contains: 15,141,982 rows x 11 columns.
Time difference of 18 mins
Renaming ID as SNP.
VCF file has -log10 P-values; these will be converted to unadjusted p-values in the 'P' column.
No INFO (SI) column detected.
Standardising column headers.
First line of summary statistics file: 
SNP chr BP  end REF ALT FILTER  AF  ES  SE  LP  P   
Summary statistics report:
   - 15,141,982 rows
   - 15,141,982 unique variants
   - 7 genome-wide significant variants (P<5e-8)
   - 22 chromosomes

the final output is what you expected

Done munging in 37.326 minutes.
Successfully finished preparing sumstats file, preview:
Reading header.
            SNP CHR    BP A1 A2   END FILTER    FRQ    BETA      SE       LP          P
1: rs1238646298   1 10472  G  C 10472   PASS 0.0622 -0.1599 0.08462 1.229440 0.05896034
2: rs1434325972   1 10711  A  G 10711   PASS 0.9958  0.1305 0.22700 0.247567 0.56550051
3: rs1476353024   1 12673  G  A 12673   PASS 0.0006 -0.4995 0.59420 0.397181 0.40069968
4:   rs62028691   1 13118  A  G 13118   PASS 0.9958 -0.1151 0.19130 0.261616 0.54749984
Returning path to saved data.

hope it helps.

0
Giulio Genovese On

You can use data.table and BCFtools/query to load a GWAS-VCF file:

library(data.table)
fmt <- '[%CHROM\\t%POS\\t%ID\\t%ES\\t%SE\\t%ALT\\t%REF\\t%AF\\n]'
hdr <- c('chromosome', 'base_pair_location', 'variant_id', 'beta', 'standard_error', 'effect_allele', 'other_allele', 'effect_allele_frequency')
vcf <- 'ebi-a-GCST90001942.vcf.gz'
cmd <- paste0('bcftools query --format "', fmt, '" "', vcf, '"')
dfrm <- setNames(fread(cmd, sep = '\t', header = FALSE, na.strings = '.', data.table = FALSE), hdr)

It will extract the information of the GWAS-VCF very quickly (<30 seconds for the ebi-a-GCST90001942.vcf.gz GWAS-VCF file):

> head(dfrm)
  chromosome base_pair_location   variant_id    beta standard_error effect_allele other_allele effect_allele_frequency
1          1              10472 rs1307088996 -0.1599        0.08462             C            G                  0.0622
2          1              10473 rs1408062762 -3.8140        8.73300             A            G                      NA
3          1              10711 rs1434325972  0.1305        0.22700             G            A                  0.9958
4          1              12673 rs1476353024 -0.4995        0.59420             A            G                  0.0006
5          1              13118   rs62028691 -0.1151        0.19130             G            A                  0.9958
6          1              13169 rs1436530583  0.3351        0.28880             G            A                  0.9962

It can be easily adapted to your needs. Notice that if your GWAS-VCF file contains more than one phenotype, then you will have to use the BCFtools/query option --samples to extract the correct columns