I'm attempting to stream a .gz file from S3 using boto and iterate over the lines of the unzipped text file. Mysteriously, the loop never terminates; when the entire file has been read, the iteration restarts at the beginning of the file.
Let's say I create and upload an input file like the following:
> echo '{"key": "value"}' > foo.json
> gzip -9 foo.json
> aws s3 cp foo.json.gz s3://my-bucket/my-location/
and I run the following Python script:
import boto
import gzip
connection = boto.connect_s3()
bucket = connection.get_bucket('my-bucket')
key = bucket.get_key('my-location/foo.json.gz')
gz_file = gzip.GzipFile(fileobj=key, mode='rb')
for line in gz_file:
print(line)
The result is:
b'{"key": "value"}\n'
b'{"key": "value"}\n'
b'{"key": "value"}\n'
...forever...
Why is this happening? I think there must be something very basic that I am missing.
Ah, boto. The problem is that the read method redownloads the key if you call it after the key has been completely read once (compare the read and next methods to see the difference).
This isn't the cleanest way to do it, but it solves the problem: