To read a few files from Common Crawl, I have written this script:
    import sys

    import boto
    import warc
    from boto.s3.key import Key
    from gzipstream import GzipStreamFile

    for line in sys.stdin:
        line = line.strip()
        # Connect to AWS and read a dataset
        conn = boto.connect_s3(anon=True, host='s3.amazonaws.com')
        pds = conn.get_bucket('commoncrawl')
        k = Key(pds)
        k.key = line
        # Stream the gzipped WARC file directly from S3
        f = warc.WARCFile(fileobj=GzipStreamFile(k))
        skipped_doc = 0
        for num, record in enumerate(f):
            pass  # analysis code
Each line of standard input is the key of a WARC file. When I run this script to analyze 5 files, I get this exception:
    Traceback (most recent call last):
      File "./warc_mapper_full.py", line 42, in <module>
        for num, record in enumerate(f):
      File "/usr/lib/python2.7/site-packages/warc/warc.py", line 393, in __iter__
        record = self.read_record()
      File "/usr/lib/python2.7/site-packages/warc/warc.py", line 364, in read_record
        self.finish_reading_current_record()
      File "/usr/lib/python2.7/site-packages/warc/warc.py", line 358, in finish_reading_current_record
        self.current_payload.read()
      File "/usr/lib/python2.7/site-packages/warc/utils.py", line 59, in read
        return self._read(self.length)
      File "/usr/lib/python2.7/site-packages/warc/utils.py", line 69, in _read
        content = self.buf + self.fileobj.read(size)
      File "/home/hpcnl/Documents/kics/current_work/aws/tasks/warc-analysis/src/gzipstream/gzipstream/gzipstreamfile.py", line 67, in read
        result = super(GzipStreamFile, self).read(*args, **kwargs)
      File "/home/hpcnl/Documents/kics/current_work/aws/tasks/warc-analysis/src/gzipstream/gzipstream/gzipstreamfile.py", line 48, in readinto
        data = self.read(len(b))
      File "/home/hpcnl/Documents/kics/current_work/aws/tasks/warc-analysis/src/gzipstream/gzipstream/gzipstreamfile.py", line 38, in read
        raw = self.stream.read(io.DEFAULT_BUFFER_SIZE)
      File "/usr/lib/python2.7/site-packages/boto/s3/key.py", line 400, in read
        data = self.resp.read(size)
      File "/usr/lib/python2.7/site-packages/boto/connection.py", line 413, in read
        return http_client.HTTPResponse.read(self, amt)
      File "/usr/lib64/python2.7/httplib.py", line 602, in read
        s = self.fp.read(amt)
      File "/usr/lib64/python2.7/socket.py", line 380, in read
        data = self._sock.recv(left)
      File "/usr/lib64/python2.7/ssl.py", line 736, in recv
        return self.read(buflen)
      File "/usr/lib64/python2.7/ssl.py", line 630, in read
        v = self._sslobj.read(len or 1024)
    ssl.SSLError: ('The read operation timed out',)
I have run it many times, and the exception above occurs every time. Where is the problem?
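The traceback shows the timeout being raised while the record stream is still reading from the S3 socket. While the root cause is unclear, one workaround that may be worth trying is to retry a whole file when the read times out. The following is only a minimal sketch under assumptions: `process_with_retry` and `process_key` are hypothetical names, not part of boto, warc, or gzipstream, and `process_key` is assumed to open the key and iterate over its records from the beginning each time it is called.

```python
import ssl
import time


def process_with_retry(process_key, key_name, attempts=3, delay=1.0):
    """Call process_key(key_name), retrying when the read times out.

    process_key is a hypothetical callable that opens the key and
    processes every record in it, starting from the first record.
    """
    for attempt in range(1, attempts + 1):
        try:
            return process_key(key_name)
        except ssl.SSLError:
            # After the last attempt, let the caller see the error.
            if attempt == attempts:
                raise
            # Wait a little longer before each new attempt.
            time.sleep(delay * attempt)
```

Because each retry starts the file again from the first record, any per-file counters such as `skipped_doc` would need to be reset inside `process_key`. This does not explain why the read stalls; it only keeps one stalled read from ending the whole run.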