Parse .bgen files using HAIL without loading data on a single node

Question


143 views · Asked by Sylvi0202 on 10 September 2020 at 10:28

I am trying to parse genomic data that is delivered in .bgen format into a Spark DataFrame using Hail. The file is 150 GB and won't fit into the memory of a single node on my cluster.

I am wondering whether there is a streaming approach that parses the data into my desired target format without loading all of it into memory up front.

I would really appreciate any inputs/ideas! Thanks a lot!


There are 2 answers

**Carl** · Answer 1 · 2020-09-11T15:56:13+00:00

Could you use a stand-alone BGEN reader to get what you need and then convert it to the format you want?

    import numpy as np
    from bgen_reader import open_bgen

    bgen = open_bgen("M:/deldir/genbgen/good/merged_487400x1100000.bgen")
    # Read all samples for the 31 variants at indices 1,000,000 through 1,000,030
    val = bgen.read(np.s_[:,1000000:1000031])
    print(val.shape)  # => (487400, 31, 3)

The 'bgen-reader' library offers a NumPy-inspired API that makes it fast and easy to read slices of BGEN files into NumPy arrays. The first time it opens a file, it creates a metadata file. After that, it opens instantly and reads millions of probabilities per second.
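Because the reader accepts slices, the same call can be repeated over consecutive blocks of variants so that only one block is in memory at a time. The following is a minimal sketch of that idea, not a tested pipeline: `variant_batches` and `process_in_batches` are hypothetical helper names, the batch size is an arbitrary assumption, and the `print` call stands in for whatever writes each block to the target format.

```python
from typing import Iterator, Tuple


def variant_batches(n_variants: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, stop) variant ranges that cover 0..n_variants."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_variants, batch_size):
        yield start, min(start + batch_size, n_variants)


def process_in_batches(bgen_path: str, batch_size: int = 100) -> None:
    """Read a BGEN file one block of variants at a time (hypothetical helper)."""
    # Imported here so the batching helper above works without bgen-reader installed.
    from bgen_reader import open_bgen

    with open_bgen(bgen_path) as bgen:
        for start, stop in variant_batches(bgen.nvariants, batch_size):
            # Same index as np.s_[:, start:stop]: all samples, one block of variants.
            probs = bgen.read((slice(None), slice(start, stop)))
            # probs has shape (nsamples, stop - start, 3); convert or write it out here.
            print(start, stop, probs.shape)
```

Peak memory then scales with the number of samples times the batch size rather than with the size of the whole file, so the batch size can be lowered until a block fits comfortably on one machine.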

I'm happy to help with usage or questions.

Carl

**Daniel King** · Answer 2 · 2023-01-06T21:13:55+00:00

Hail does not load the data into memory, it streams through it. What error did you encounter? The following should work just fine:

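    import hail as hl

    # One-time step: import_bgen needs an index for the BGEN file
    hl.index_bgen('gs://path/to/file.bgen')

    # entry_fields is required; valid values are 'GT', 'GP', and 'dosage'
    mt = hl.import_bgen('gs://path/to/file.bgen', entry_fields=['GT'])
    mt.show()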

You can use to_spark to get a Spark DataFrame. It is a method on Hail Tables, so first convert the Hail MatrixTable to a Table, for example with mt.entries() or mt.rows().
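The steps in this answer can be combined into one function. This is a rough sketch under some assumptions: Hail is running on its Spark backend, the BGEN file embeds sample IDs (otherwise `import_bgen` also needs a `sample_file`), and genotype probabilities (`'GP'`) are the field of interest. The name `bgen_to_spark` is hypothetical.

```python
def bgen_to_spark(bgen_path: str):
    """Return a Spark DataFrame with one row per (variant, sample) entry."""
    # Imported lazily so the module can be loaded on machines without Hail.
    import hail as hl

    # Build the index that import_bgen requires (only needed once per file).
    hl.index_bgen(bgen_path)

    # entry_fields is a required argument; 'GP' is an assumption for this sketch.
    mt = hl.import_bgen(bgen_path, entry_fields=["GP"])

    # Flatten the matrix table into a table of entries, then hand it to Spark.
    return mt.entries().to_spark()
```

The entries table has one row for every variant and sample pair, so it is far larger than the variant table alone. If only variant-level fields are needed, `mt.rows().to_spark()` yields a much smaller DataFrame.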
