How do you get Google App Engine to gunzip during download?


I am trying to get Google App Engine to gunzip my .gz blob file (single file compressed) automatically by setting the response headers as follows:

class download(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self, resource):
        resource = str(urllib.unquote(resource))
        blob_info = blobstore.BlobInfo.get(resource)
        self.response.headers['Content-Encoding'] = str('gzip')
        # self.response.headers['Content-type'] = str('application/x-gzip')
        self.response.headers['Content-type'] = str(blob_info.content_type)
        self.response.headers['Content-Length'] = str(blob_info.size)
        cd = 'attachment; filename=%s' % (blob_info.filename)
        self.response.headers['Content-Disposition'] = str(cd)
        self.response.headers['Cache-Control'] = str('must-revalidate, post-check=0, pre-check=0')
        self.response.headers['Pragma'] = str(' public')
        self.send_blob(blob_info)

When this runs, the file is downloaded without the .gz extension, but the downloaded file is still gzipped. The size of the downloaded file matches the size of the .gz file on the server, and I can confirm it is compressed by manually gunzipping it. I am trying to avoid that manual gunzip step.

I am trying to get the blob file to automatically gunzip during the download. What am I doing wrong?

By the way, the gzip file contains only a single file. On my self-hosted (non-Google) server, I could accomplish the automatic gunzip by setting the same response headers, although my code there is written in PHP.

UPDATE:

I rewrote the handler to serve the data from the bucket. However, this produces an HTTP 500 error, and the file is only partially downloaded before the failure. The rewrite is as follows:

class download(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self, resource):
        resource = str(urllib.unquote(resource))
        blob_info = blobstore.BlobInfo.get(resource)
        file = '/gs/mydatabucket/%s' % blob_info.filename
        print file
        self.response.headers['Content-Encoding'] = str('gzip')
        self.response.headers['Content-Type'] = str('application/x-gzip')
        # self.response.headers['Content-Length'] = str(blob_info.size)
        cd = 'filename=%s' % (file)
        self.response.headers['Content-Disposition'] = str(cd)
        self.response.headers['Cache-Control'] = str('must-revalidate, post-check=0, pre-check=0')
        self.response.headers['Pragma'] = str(' public')
        self.send_blob(file)

This downloads 540,672 bytes of the 6,094,848-byte file to the client before the server terminates the transfer and issues a 500 error. When I run 'file' on the partially downloaded file from the command line, Mac OS correctly identifies the format as an 'SQLite 3.x database' file. Any idea why the server returns the 500 error? How can I fix the problem?


There are 2 answers

2
someone1 On BEST ANSWER

You should first check to see if your requesting client supports gzipped content. If it does support gzip content encoding, then you may pass the gzipped blob as is with the proper content-encoding and content-type headers, otherwise you need to decompress the blob for the client. You should also verify that your blob's content_type isn't gzip (this depends on how you created your blob to begin with!)
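The check described above can be sketched in plain Python. This is only an illustration under stated assumptions: `client_accepts_gzip` and `body_and_headers` are hypothetical helper names, the `Accept-Encoding` parsing is simplified (it ignores the `*` wildcard), and the whole blob is inflated in memory. It does not use `send_blob`, so the bytes would have to be read and written by the handler itself.

```python
import gzip
import io


def client_accepts_gzip(accept_encoding):
    """Return True if an Accept-Encoding value lists gzip with q > 0.

    Simplified parsing: the '*' wildcard is ignored.
    """
    for part in (accept_encoding or '').split(','):
        token, _, params = part.strip().partition(';')
        if token.strip().lower() != 'gzip':
            continue
        params = params.strip()
        if params.lower().startswith('q='):
            try:
                return float(params[2:]) > 0
            except ValueError:
                return False
        return True
    return False


def body_and_headers(gz_bytes, content_type, accept_encoding):
    """Pick the response body and headers for a blob that is stored gzipped."""
    if client_accepts_gzip(accept_encoding):
        # Pass the compressed bytes through; the client inflates them.
        headers = {'Content-Type': content_type, 'Content-Encoding': 'gzip'}
        return gz_bytes, headers
    # The client did not ask for gzip: inflate on the server instead.
    plain = gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)).read()
    return plain, {'Content-Type': content_type}
```

Note that `content_type` here should describe the uncompressed data; if the blob's own content type is a gzip type, the client will treat the download as an archive rather than inflate it.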

You may also want to look at Google Cloud Storage, as it handles gzip transfer automatically, as long as you compress the data before storing it and set the proper content-encoding and content-type metadata on the object.

See this SO question: Google cloud storage console Content-Encoding to gzip

Or the GCS Docs: https://cloud.google.com/storage/docs/gsutil/addlhelp/WorkingWithObjectMetadata#content-encoding
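As a rough sketch of the "compress before storing" step, the helper below gzips the raw bytes and builds the metadata to attach to the object. The name `prepare_gzip_object` is hypothetical, and the upload call itself is deliberately left out because it depends on which GCS client you use.

```python
import gzip
import io


def prepare_gzip_object(raw_bytes, content_type):
    """Gzip raw_bytes and return (compressed_bytes, metadata) for storage."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
        gz.write(raw_bytes)
    metadata = {
        # Content-Type describes the *uncompressed* data, not the gzip wrapper.
        'Content-Type': content_type,
        # Content-Encoding records that the stored bytes are gzip-compressed.
        'Content-Encoding': 'gzip',
    }
    return buf.getvalue(), metadata
```

The point of the split is that the content type stays whatever the original file is, and only the content encoding says gzip.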

You can use GCS as easily as (if not more easily than) the Blobstore in App Engine, and it seems to be the preferred storage layer going forward. I say this because the Files API, which made Blobstore interaction easier, has been deprecated, while the GCS libraries have advanced a great deal and now offer an API similar to Python's basic file interface.

UPDATE:

Since the objects are stored in GCS, you can use 302 redirects to point users to the files rather than relying on the Blobstore API. This removes any uncertainty about whether the Blobstore API and GAE deliver your stored objects with the content-type and content-encoding you intended. For objects with a public-read ACL, you may simply redirect users to either storage.googleapis.com/<bucket>/<object> or <bucket>.storage.googleapis.com/<object>. Alternatively, if you'd like application logic to dictate access, keep the objects' ACL private and use GCS Signed URLs to create short-lived URLs for the 302 redirect.
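For the public-read case, building the two URL forms might look like the sketch below. `public_gcs_urls` is a hypothetical helper, the https scheme is an assumption, and signed URLs are not covered here.

```python
try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3


def public_gcs_urls(bucket, object_name):
    """Return both public URL forms for a public-read GCS object."""
    # Percent-encode the object name but keep '/' so nested names stay readable.
    path = quote(object_name, safe='/')
    return (
        'https://storage.googleapis.com/%s/%s' % (bucket, path),
        'https://%s.storage.googleapis.com/%s' % (bucket, path),
    )
```

Either URL can then be used as the target of the 302 redirect.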

It's worth noting that if you want users to be able to upload objects via GAE, you'd still use the Blobstore API to handle storing the file in GCS, but you'd have to modify the object after it is uploaded to ensure that proper gzip compression and content-encoding metadata are used.

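import urllib
from google.appengine.ext.webapp import blobstore_handlers

class legacy_download(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self, resource):
        # Redirect to the object's public GCS URL instead of calling send_blob.
        filename = str(urllib.unquote(resource))
        url = 'https://storage.googleapis.com/mybucket/' + filename
        self.redirect(url)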
0
Christiaan On

GAE already serves everything using gzip if the client supports it. So I think what's happening after your update is that the browser expects there to be more of the file, but GAE thinks it's already at the end of the file since it's already gzipped. That's why you get the 500. (if that makes sense)

Anyway, since GAE already handles compression for you, the easiest way is probably to put uncompressed files in GCS and let the Google infrastructure compress them automatically when you serve them.