I am trying to parse genomic data delivered in .bgen format into a Spark DataFrame using Hail. The file is 150 GB, so it won't fit in memory on a single node of my cluster.
I am wondering whether there is a streaming command or approach that can parse the data into my desired target format without loading it all into memory up front.
I would really appreciate any inputs/ideas! Thanks a lot!
Could you use a stand-alone BGEN reader to get what you need and then move it to the format you want?
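For example, here is a minimal sketch using bgen-reader's `open_bgen` API. The file path and the choice of 31 variants are assumptions for illustration; the printed shape is for a UK-Biobank-sized file with ~487,400 samples:

```python
# Minimal sketch, assuming the bgen-reader package (pip install bgen-reader)
# and a hypothetical file path.
from bgen_reader import open_bgen

with open_bgen("genotypes.bgen") as bgen:  # hypothetical path
    # Read every sample's probabilities for the first 31 variants only;
    # the rest of the 150 GB file is never loaded into memory.
    probs = bgen.read(slice(31))
    print(probs.shape)
# => (487400, 31, 3)   i.e. (samples, variants, genotype probabilities)
```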
The 'bgen-reader' library offers a NumPy-inspired API that makes it fast and easy to read slices of BGEN files into NumPy arrays. The first time it reads a file, it creates a metadata file; after that, it starts instantly and reads millions of probabilities per second.
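To get from NumPy to a Spark DataFrame, one possible pattern (my assumption, not part of bgen-reader) is to read variants in batches and convert each batch via pandas. The batch size, file paths, and long-table layout below are all illustrative, and the three probability columns assume unphased biallelic diploid data:

```python
# Sketch only: stream the file batch-by-batch into Spark-written Parquet,
# so no single batch exceeds node memory.
import numpy as np
import pandas as pd
from bgen_reader import open_bgen
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

with open_bgen("genotypes.bgen") as bgen:  # hypothetical path
    batch = 1000  # variants per batch (tune to available memory)
    for start in range(0, bgen.nvariants, batch):
        stop = min(start + batch, bgen.nvariants)
        probs = bgen.read(slice(start, stop))  # (nsamples, nvariants, 3)
        nsamples, nvariants, _ = probs.shape
        # Flatten the batch into a long (sample, rsid, p0, p1, p2) table.
        pdf = pd.DataFrame({
            "sample": np.repeat(bgen.samples, nvariants),
            "rsid": np.tile(bgen.rsids[start:stop], nsamples),
            "p0": probs[:, :, 0].ravel(),
            "p1": probs[:, :, 1].ravel(),
            "p2": probs[:, :, 2].ravel(),
        })
        # Append each batch to a Parquet sink Spark can read later.
        spark.createDataFrame(pdf).write.mode("append").parquet(
            "genotypes.parquet"  # hypothetical output path
        )
```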
I'm happy to help with usage or questions.