Infinite loop when streaming a .gz file from S3 using boto

I'm attempting to stream a .gz file from S3 using boto and iterate over the lines of the unzipped text file. Mysteriously, the loop never terminates; when the entire file has been read, the iteration restarts at the beginning of the file.

Let's say I create and upload an input file like the following:

> echo '{"key": "value"}' > foo.json
> gzip -9 foo.json
> aws s3 cp foo.json.gz s3://my-bucket/my-location/

and I run the following Python script:

import boto
import gzip

connection = boto.connect_s3()
bucket = connection.get_bucket('my-bucket')
key = bucket.get_key('my-location/foo.json.gz')
gz_file = gzip.GzipFile(fileobj=key, mode='rb')
for line in gz_file:
    print(line)

The result is:

b'{"key": "value"}\n'
b'{"key": "value"}\n'
b'{"key": "value"}\n'
...forever...

Why is this happening? I think there must be something very basic that I am missing.

There are 2 answers

zweiterlinde (best answer)

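Ah, boto. The problem is that Key.read re-downloads the key if you call it again after the key has been read completely once (compare the read and next methods to see the difference). GzipFile keeps reading after the end of the compressed stream to check for another gzip member, so that extra read restarts the download and the same data is decompressed again.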

This isn't the cleanest way to do it, but it solves the problem:

import boto
import gzip

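class ReadOnce(object):
    """Wrap a key so that it keeps reporting EOF after the first EOF."""

    def __init__(self, k):
        self.key = k
        self.has_read_once = False

    def read(self, size=0):
        if self.has_read_once:
            # Already exhausted: never let boto restart the download.
            return b''
        data = self.key.read(size)
        if not data:
            self.has_read_once = True
        return data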

connection = boto.connect_s3()
bucket = connection.get_bucket('my-bucket')
key = ReadOnce(bucket.get_key('my-location/foo.json.gz'))
gz_file = gzip.GzipFile(fileobj=key, mode='rb')
for line in gz_file:
    print(line)
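
To try the fix without touching S3, here is a minimal sketch of the same idea. RestartingSource is a hypothetical stand-in that I made up for illustration (it is not part of boto); it only assumes the behaviour described above, namely that a read after EOF starts over from the first byte. ReadOnce is the wrapper from this answer. Python 3 is assumed, since gzip.compress is used to build the test payload.

```python
import gzip
import io
import itertools


class RestartingSource(object):
    """Hypothetical stand-in for a boto Key (not part of boto).

    Once read() has returned b'' the stream is dropped, so the next read()
    starts again from the first byte, like a fresh download.
    """

    def __init__(self, payload):
        self._payload = payload
        self._stream = None

    def read(self, size=0):
        if self._stream is None:
            self._stream = io.BytesIO(self._payload)  # simulated re-download
        data = self._stream.read() if size == 0 else self._stream.read(size)
        if not data:
            self._stream = None  # forget the finished stream
        return data


class ReadOnce(object):
    """Same wrapper as in the answer: keep reporting EOF after the first EOF."""

    def __init__(self, k):
        self.key = k
        self.has_read_once = False

    def read(self, size=0):
        if self.has_read_once:
            return b''
        data = self.key.read(size)
        if not data:
            self.has_read_once = True
        return data


payload = gzip.compress(b'{"key": "value"}\n')

# Unwrapped: the single line keeps coming back, so only take three of them.
unwrapped = gzip.GzipFile(fileobj=RestartingSource(payload), mode='rb')
repeated = list(itertools.islice(unwrapped, 3))

# Wrapped: iteration stops after the one real line.
wrapped = gzip.GzipFile(fileobj=ReadOnce(RestartingSource(payload)), mode='rb')
lines = list(wrapped)

print(repeated)
print(lines)
```

The islice call is only there so the unwrapped case terminates; without it the loop would run forever, as in the question.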

Pierre D

Thanks zweiterlinde for the wonderful insight and excellent answer provided.

I was looking for a way to read a gzip-compressed S3 object directly into a pandas DataFrame. With the ReadOnce wrapper from that answer, it takes two lines:

import pandas as pd  # gzip, ReadOnce and bucket come from the answer above

with gzip.GzipFile(fileobj=ReadOnce(bucket.get_key('my/obj.tsv.gz')), mode='rb') as f:
    df = pd.read_csv(f, sep='\t')