How to extract data from grib files in AWS without downloading?


I'm looking to access a grib file to extract parameters (such as temperature, etc.) from within the cloud without ever having to store the file locally. I've heard this can be done with the cfgrib API, but I can't find any example documentation (I checked the source documentation here, but it doesn't include anything about accessing files in the cloud).

From experience working with pygrib, I know that API reads in a grib file as a bytes representation, and cfgrib appears to handle it similarly. After some research and trial and error, I've come up with this code, which tries to read a byte-string representation of the file:

import boto
import boto3
from botocore.config import Config
from botocore import UNSIGNED
import pygrib
import cfgrib

if __name__ == '__main__':
    # Define boto config
    my_config = Config(
        signature_version=UNSIGNED,
        retries={
            'max_attempts': 10,
            'mode': 'standard'
        }
    )
    
    session = boto3.Session(profile_name='default')
    s3 = session.resource('s3')
    my_bucket = s3.Bucket('nbmdata')
    
    # Get a unique key for each file in s3
    file_keys = []
    for my_bucket_object in my_bucket.objects.all():
        file_keys.append(my_bucket_object.key)
    
    # Extract each file as a binary string (without downloading)
    grib_files = []
    for key in file_keys:
        s3 = boto.connect_s3()
        bucket = s3.lookup('bucket') # Removed bucket name
        key = bucket.lookup(key)
        your_bytes = key.get_contents_as_string(headers={'Range' : 'bytes=73-1024'})
        grib_files.append(your_bytes)
     
    # Interpret binary string into pygrib
    for grib_file in grib_files:
        grbs = pygrib.open(grib_file)

This appears to ALMOST work. I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 7: invalid continuation byte

I get the same error when I try to swap this out with cfgrib. What am I missing here?


There are 2 answers

Thomas Cannon

Try something like this. I was using the GEFS data hosted on AWS instead, and it worked great. I believe the nbmdata is also on AWS and can be found here: https://registry.opendata.aws/noaa-nbm/. No account should be needed, so it is just a matter of changing s3_object to the path/filename of the file you want from https://noaa-nbm-pds.s3.amazonaws.com/index.html

import pygrib
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Unsigned client: no AWS credentials are needed for the public bucket
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
bucket_name = 'noaa-nbm-pds'
s3_object = 'path/to/filename'

# Read the whole object into memory as bytes and parse it with pygrib
obj = s3.get_object(Bucket=bucket_name, Key=s3_object)['Body'].read()
grbs = pygrib.fromstring(obj)

# this should print: <class 'pygrib._pygrib.gribmessage'>
print(type(grbs))
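
Since the question specifically asks about cfgrib: cfgrib reads from a file path rather than from raw bytes, so here is a minimal sketch of one workaround (assuming xarray and cfgrib are installed; the temp-file approach and variable names are my own, not part of the original answer). It spools the downloaded bytes to a temporary file that is deleted as soon as you are done:

import tempfile
import xarray as xr

# Spool the bytes from the get_object call above into a temp file; it is
# removed automatically when the with-block exits, so no permanent local
# copy is ever written. (On Windows you may need delete=False instead.)
with tempfile.NamedTemporaryFile(suffix='.grib2') as tmp:
    tmp.write(obj)
    tmp.flush()
    # indexpath='' stops cfgrib from leaving an .idx sidecar next to the file
    ds = xr.open_dataset(tmp.name, engine='cfgrib',
                         backend_kwargs={'indexpath': ''})
    print(ds.data_vars)  # inspect the available parameters (temperature, etc.)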
Jonathan Leon

I realize you are using boto for that particular get_contents_as_string method, but if you are just trying to get the bytes, will this work? I think the boto method is trying to decode as utf-8, so maybe you need to specify encoding=None to get a bytes array?

But in boto3, I use this without decoding the file streams, and then print the first 50 bytes of each file.

grib_files = []
for key in file_keys:
    # get_object returns the raw body; .read() yields bytes, nothing is decoded
    response = boto3.client('s3').get_object(Bucket='nbmdata', Key=key)
    grib_files.append(response['Body'].read())

for grib in grib_files:
    print(grib[0:50])

b'GRIB\x00\x00\x00\x02\x00\x00\x00\x00\x00\x16\xa7\x7f\x00\x00\x00\x15\x01\x00\x07\x00\x0e\x01\x01\x01\x07\xe5\x05\x1b\x03\x00\x00\x00\x01\x00\x00\x00Q\x03\x00\x009$\xc5\x00\x00\x00'
b'GRIB\x00\x00\x00\x02\x00\x00\x00\x00\x00\x16\x8b\xa8\x00\x00\x00\x15\x01\x00\x07\x00\x0e\x01\x01\x01\x07\xe5\x05\x1b\x03\x00\x00\x00\x01\x00\x00\x00Q\x03\x00\x009$\xc5\x00\x00\x00'

If you try to decode these with utf-8, it throws the same error you are receiving. From here, I don't know how to decode and process them, but maybe this helps?
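
Picking up from those bytes, here is a hedged sketch (my addition, not part of the original answer) of one way to process them: pygrib.fromstring parses a single GRIB message from an in-memory bytes object, so for single-message files you can extract values without ever touching disk. A multi-message file would need to be split into individual messages first.

import pygrib

# grib_files is the list of raw bytes objects built in the snippet above
for grib in grib_files:
    msg = pygrib.fromstring(grib)   # parses the first GRIB message in the bytes
    print(msg.name, msg.validDate)  # parameter name and valid time
    data, lats, lons = msg.data()   # decoded values plus lat/lon grids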