S3: "The read operation timed out" while reading Common Crawl data

To read a few files from Common Crawl, I wrote this script:

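import sys

import boto
import warc
from boto.s3.key import Key
from gzipstream import GzipStreamFile

for line in sys.stdin:
    # Each input line is the S3 key of one WARC file
    line = line.strip()
    # Connect to AWS anonymously and open the public Common Crawl bucket
    conn = boto.connect_s3(anon=True, host='s3.amazonaws.com')
    pds = conn.get_bucket('commoncrawl')
    k = Key(pds)
    k.key = line

    # Stream-decompress the gzipped WARC file straight from S3
    f = warc.WARCFile(fileobj=GzipStreamFile(k))
    skipped_doc = 0
    for num, record in enumerate(f):
        # analysis code
        pass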

Each line read from stdin is the S3 key of one WARC file. When I run this script to analyze 5 files, I get this exception:

Traceback (most recent call last):
  File "./warc_mapper_full.py", line 42, in <module>
    for num, record in enumerate(f):
  File "/usr/lib/python2.7/site-packages/warc/warc.py", line 393, in __iter__
    record = self.read_record()
  File "/usr/lib/python2.7/site-packages/warc/warc.py", line 364, in read_record
    self.finish_reading_current_record()
  File "/usr/lib/python2.7/site-packages/warc/warc.py", line 358, in finish_reading_current_record
    self.current_payload.read()
  File "/usr/lib/python2.7/site-packages/warc/utils.py", line 59, in read
    return self._read(self.length)
  File "/usr/lib/python2.7/site-packages/warc/utils.py", line 69, in _read
    content = self.buf + self.fileobj.read(size)
  File "/home/hpcnl/Documents/kics/current_work/aws/tasks/warc-analysis/src/gzipstream/gzipstream/gzipstreamfile.py", line 67, in read
    result = super(GzipStreamFile, self).read(*args, **kwargs)
  File "/home/hpcnl/Documents/kics/current_work/aws/tasks/warc-analysis/src/gzipstream/gzipstream/gzipstreamfile.py", line 48, in readinto
    data = self.read(len(b))
  File "/home/hpcnl/Documents/kics/current_work/aws/tasks/warc-analysis/src/gzipstream/gzipstream/gzipstreamfile.py", line 38, in read
    raw = self.stream.read(io.DEFAULT_BUFFER_SIZE)
  File "/usr/lib/python2.7/site-packages/boto/s3/key.py", line 400, in read
    data = self.resp.read(size)
  File "/usr/lib/python2.7/site-packages/boto/connection.py", line 413, in read
    return http_client.HTTPResponse.read(self, amt)
  File "/usr/lib64/python2.7/httplib.py", line 602, in read
    s = self.fp.read(amt)
  File "/usr/lib64/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
  File "/usr/lib64/python2.7/ssl.py", line 736, in recv
    return self.read(buflen)
  File "/usr/lib64/python2.7/ssl.py", line 630, in read
    v = self._sslobj.read(len or 1024)
ssl.SSLError: ('The read operation timed out',)

I have run it many times, and the exception above occurs every time. What is causing the problem?
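The traceback shows the timeout being raised inside the streaming read from S3 (`self._sslobj.read`), so one way to check whether the failure is transient is to wrap the per-file work in a retry. The following is only a minimal sketch under that assumption, not a confirmed fix: `retry_on_timeout` and `is_timeout` are hypothetical helper names that do not come from boto, warc, or gzipstream, and the attempt count and delay are arbitrary.

```python
import socket
import ssl
import time


def is_timeout(exc):
    """Return True if exc looks like a socket/SSL read timeout."""
    # Python 3 raises socket.timeout; Python 2.7 raises ssl.SSLError with
    # the message 'The read operation timed out', as in the traceback.
    return isinstance(exc, socket.timeout) or "timed out" in str(exc)


def retry_on_timeout(func, attempts=3, delay=1.0):
    """Call func(); if it fails with a read timeout, wait and call it again.

    Hypothetical helper: gives up and re-raises after `attempts` tries.
    Errors that are not timeouts are re-raised immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except (socket.timeout, ssl.SSLError) as exc:
            if not is_timeout(exc) or attempt == attempts:
                raise
            # Back off a little longer after each failed attempt
            time.sleep(delay * attempt)
```

In the script above, the body of the outer loop could be moved into a function (say, a hypothetical `process_key(line)`) and called as `retry_on_timeout(lambda: process_key(line))`. Each retry would reopen the key and start that WARC file from the beginning, because this sketch does not resume a partially read stream. If the timeout still occurs on every attempt, the cause is more likely a slow consumer or network path than a one-off hiccup.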
