How to start a download and render a response without hitting disk?

So I have a scientific data Excel file validation form in django that works well. It works iteratively. Users can upload files as they accumulate new data that they add to their study. The DataValidationView inspects the files each time and presents the user with an error report that lists issues in their data that they must fix.

We realized recently that a number of errors (but not all) can be fixed automatically, so I've been working on a way to generate a copy of the file with a number of fixes. So we rebranded the "validation" form page as a "build a submission page". Each time they upload a new set of files, the intention is for them to still get the error report, but also automatically receive a downloaded file with a number of fixes in it.

I learned just today that there's no way to both render a template and kick off a download at the same time, which makes sense. However, I had been planning to not let the generated file with fixes hit the disk.

Is there a way to present the template with the errors and automatically trigger the download without previously saving the file to disk?

This is my form_valid method currently (without the triggered download, but I had started to do the file creation before I realized that both downloading and rendering a template wouldn't work):

    def form_valid(self, form):
        """
        Upon valid file submission, adds validation messages to the context of
        the validation page.
        """

        # This buffers errors associated with the study data
        self.validate_study()

        # This generates a dict representation of the study data with fixes and
        # removes the errors it fixed
        self.perform_fixes()

        # This sets self.results (i.e. the error report)
        self.format_validation_results_for_template()

        # HERE IS WHERE I REALIZED MY PROBLEM.  I WANTED TO CREATE A STREAM HERE
        # TO START A DOWNLOAD, BUT REALIZED I CANNOT BOTH PRESENT THE ERROR REPORT
        # AND START THE DOWNLOAD FOR THE USER

        return self.render_to_response(
            self.get_context_data(
                results=self.results,
                form=form,
                submission_url=self.submission_url,
            )
        )

Before I got to that problem, I was compiling some pseudocode to stream the file... This is totally untested:

import pandas as pd
from django.http import HttpResponse
from io import BytesIO

def download_fixes(self):
    excel_file = BytesIO()
    xlwriter = pd.ExcelWriter(excel_file, engine='xlsxwriter')

    df_output = {}
    for sheet in self.fixed_study_data.keys():
        df_output[sheet] = pd.DataFrame.from_dict(self.fixed_study_data[sheet])
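        df_output[sheet].to_excel(xlwriter, sheet_name=sheet)

    # close() writes the workbook into the buffer; ExcelWriter.save() was
    # removed in pandas 2.0, so it is not called here
    xlwriter.close()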

    # important step, rewind the buffer or when it is read() you'll get nothing
    # but an error message when you try to open your zero length file in Excel
    excel_file.seek(0)

    # set the mime type so that the browser knows what to do with the file
    response = HttpResponse(excel_file.read(), content_type='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')

    # set the file name in the Content-Disposition header
    response['Content-Disposition'] = 'attachment; filename=myfile.xlsx'

    return response

So I'm thinking either I need to:

  1. Save the file to disk and then figure out a way to make the results page start its download
  2. Somehow embed the data in the results template and use javascript to turn it into a file download stream
  3. Save the file somehow in memory and trigger its download from the results template?

What's the best way to accomplish this?

UPDATED THOUGHTS:

I recently had done a simple trick with a tsv file where I embedded the file content in the resulting template with a download button that used javascript to grab the innerHTML of the tags around the data and start a "download".

I thought, if I encode the data, I could likely do something similar with the excel file content. I could base64 encode it.

I reviewed past study submissions. The largest one was 115kb. That size is likely to grow by an order of magnitude, but for now 115kb is the ceiling.
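
As a rough check on the size concern: base64 emits 4 output characters for every 3 input bytes, so an embedded payload is about a third larger than the file itself. A minimal sketch, assuming the 115kb figure means 115,000 bytes (the helper name `base64_size` is made up for illustration):

```python
import base64
import math


def base64_size(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes of input."""
    return 4 * math.ceil(n_bytes / 3)


# Stand-in for the largest past submission; the content does not affect the size
payload = bytes(115_000)
encoded = base64.b64encode(payload)

# Both numbers should agree: the measured length and the 4/3 formula
print(len(encoded), base64_size(len(payload)))
```

By the same formula, a file ten times larger would add roughly 1.5 MB of text to the rendered page.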

I googled to find a way to embed the data in the template and I got this:

import base64
with open(image_path, "rb") as image_file:
    image_data = base64.b64encode(image_file.read()).decode('utf-8')
ctx["image"] = image_data
return render(request, 'index.html', ctx)

I recently was playing around with base64 encoding in javascript for some unrelated work, which leads me to believe that embedding is do-able. I could even trigger it automatically. Anyone have any caveats to doing it this way?

Update

I have spent all day trying to implement @Chukwujiobi_Canon's suggestion, but after working through a lot of errors and things I'm inexperienced with, I am stuck. A new tab is opened (but it's empty) and a file is downloaded, but it won't open (and there's an error in the browser console saying "Frame load interrupted").

I implemented the django code first and I think it is working correctly. When I submit the form without the javascript, the browser downloads the multipart stream, and it looks as expected:

--3d6b6a416f9b5
Content-Type: application/octet-stream
Content-Range: bytes 0-9560/9561

PK?N˝Ö€]'[Content_Types].xm...

...

--3d6b6a416f9b5
Content-Type: text/html
Content-Range: bytes 0-16493/16494


<!--use Bootstrap CSS and JS 5.0.2-->
...

</html>

--3d6b6a416f9b5--

Here's the javascript:

validation_form = document.getElementById("submission-validation");

// Take over form submission
validation_form.addEventListener("submit", (event) => {
    event.preventDefault();
    submit_validation_form();
});
async function submit_validation_form() {
    // Put all of the form data into a variable (formdata)
    const formdata = new FormData(validation_form);
    try {
        // Submit the form and get a response (which can only be done inside an async function)
        let response;
        response = await fetch("{% url 'validate' %}", {
            method: "post",
            body: formdata,
        })
        let result;
        result = await response.text();
        const parsed = parseMultipartBody(result, "{{ boundary }}");
        parsed.forEach(part => {
            if (part["headers"]["content-type"] === "text/html") {
                const url = URL.createObjectURL(
                    new Blob(
                        [part["body"]],
                        {type: "text/html"}
                    )
                );
                window.open(url, "_blank");
            }
            else if (part["headers"]["content-type"] === "application/octet-stream") {
                console.log(part)
                const url = URL.createObjectURL(
                    new Blob(
                        [part["body"]],
                        {type: "application/octet-stream"}
                    )
                );
                window.location = url;
            }
        });
    } catch (e) {
        console.error(e);
    }
}
function parseMultipartBody (body, boundary) {
    return body.split(`--${boundary}`).reduce((parts, part) => {
        if (part && part !== '--') {
            const [ head, body ] = part.trim().split(/\r\n\r\n/g)
            parts.push({
                body: body,
                headers: head.split(/\r\n/g).reduce((headers, header) => {
                    const [ key, value ] = header.split(/:\s+/)
                    headers[key.toLowerCase()] = value
                    return headers
                }, {})
            })
        }
        return parts
    }, [])
}

The server console output looks fine, but so far, the outputs are non-functional.
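
One possible contributor, which nothing above confirms, is that reading a binary body as text is lossy: bytes that are not valid UTF-8 get replaced during decoding and cannot be recovered afterwards. The sketch below shows the effect in Python on made-up bytes that begin with the zip signature used by .xlsx files; it illustrates the general behaviour and is not a diagnosis of the code above.

```python
# Made-up sample: the "PK" zip signature followed by arbitrary binary bytes
original = b"PK\x03\x04\x14\x00\xff\xfe\x9d\xd6"

# Decode as UTF-8 text, replacing invalid sequences, then encode back to bytes
as_text = original.decode("utf-8", errors="replace")
round_tripped = as_text.encode("utf-8")

# The replacement character U+FFFD now stands in for the undecodable bytes
print("\ufffd" in as_text)
print(original == round_tripped)
```

If this is what is happening, reading the response as bytes rather than text before splitting it into parts would avoid the corruption.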

There are 2 answers

9
Chukwujiobi Canon On BEST ANSWER

For posterity, here is a guide to an HTTP/1.1 multipart/byteranges response implemented in Django. For more information on multipart/byteranges, see RFC 7233.

The format of a multipart/byteranges payload is as follows:

HTTP/1.1 206 Partial Content
Content-Type: multipart/byteranges; boundary=3d6b6a416f9b5

--3d6b6a416f9b5
Content-Type: application/octet-stream
Content-Range: bytes 0-999/2000

<octet stream data 1>

--3d6b6a416f9b5
Content-Type: application/octet-stream
Content-Range: bytes 1000-1999/2000

<octet stream data 2>

--3d6b6a416f9b5
Content-Type: application/json
Content-Range: bytes 0-441/442

<json data>

--3d6b6a416f9b5
Content-Type: text/html
Content-Range: bytes 0-543/544

<html string>

--3d6b6a416f9b5--

You get the idea. The first two parts carry the same binary data split into two streams, the third is a JSON string sent in one stream, and the fourth is an HTML string sent in one stream.

In your case, you are sending a File together with your HTML template.

from io import BytesIO, StringIO
from django.template.loader import render_to_string
from django.http import StreamingHttpResponse


def stream_generator(streams):
    boundary = "3d6b6a416f9b5"
    for stream in streams:
        if isinstance(stream, BytesIO):
            data = stream.getvalue()
            content_type = 'application/octet-stream'
        elif isinstance(stream, StringIO):
            data = stream.getvalue().encode('utf-8')
            content_type = 'text/html'
        else:
            continue
        
        stream_length = len(data)
        yield f'--{boundary}\r\n'
        yield f'Content-Type: {content_type}\r\n'
        yield f'Content-Range: bytes 0-{stream_length-1}/{stream_length}\r\n'
        yield f'\r\n'
        yield data
        yield f'\r\n'

    yield f'--{boundary}--\r\n'

def multi_stream_response(request):
    streams = [
        excel_file, # The File provided in the OP. It is a BytesIO object.
        StringIO(render_to_string('index.html', request=request))
    ]
    return StreamingHttpResponse(stream_generator(streams), content_type='multipart/byteranges; boundary=3d6b6a416f9b5')

See this example [stackoverflow] on parsing a multipart/byteranges on the client.
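
To check the payload that `stream_generator` produces, for instance in a test, it can help to split it back into parts. The sketch below is one way to do that in Python, working on bytes throughout so binary parts are not altered. The function name `parse_multipart_byteranges` is hypothetical, and the sketch assumes the boundary string never occurs inside a part's body.

```python
def parse_multipart_byteranges(payload: bytes, boundary: str) -> list[dict]:
    """Split a multipart/byteranges body into parts without decoding the bodies."""
    delimiter = b"--" + boundary.encode("ascii")
    parts = []
    for chunk in payload.split(delimiter):
        # Skip the empty preamble and the "--" that closes the final boundary
        if chunk.strip() in (b"", b"--"):
            continue
        # Drop only the CRLF after the boundary line and the CRLF before the
        # next boundary; never strip() the body, because it may be binary
        if chunk.startswith(b"\r\n"):
            chunk = chunk[2:]
        if chunk.endswith(b"\r\n"):
            chunk = chunk[:-2]
        head, _, body = chunk.partition(b"\r\n\r\n")
        headers = {}
        for line in head.decode("ascii").split("\r\n"):
            key, _, value = line.partition(":")
            headers[key.strip().lower()] = value.strip()
        parts.append({"headers": headers, "body": body})
    return parts


# Usage with a tiny hand-built payload in the format shown above
demo = (
    b"--3d6b6a416f9b5\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Range: bytes 0-12/13\r\n"
    b"\r\n"
    b"<html></html>\r\n"
    b"--3d6b6a416f9b5--\r\n"
)
for part in parse_multipart_byteranges(demo, "3d6b6a416f9b5"):
    print(part["headers"]["content-type"], len(part["body"]))
```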

2
hepcat72 On

@Chukwujiobi_Canon's answer is excellent and scalable, though it took me a whole day to almost get it working, and it's still not quite there. I expect it may take me another day to perfect it. However, given that my files are under 1 MB in size, I decided to explore my original thought: embed the file content as base64 in the rendered page (hidden) and trigger its download automatically in javascript.

It took me under an hour, it is fully functional, and it took very little code. Granted, some of that code was re-used from the work on the other solution.

Here is how I generate the file content. I included the method that takes a pandas-style dict and writes it to a pandas ExcelWriter that uses the xlsxwriter engine (pip install xlsxwriter).

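    import base64
    from io import BytesIO

    import pandas as pd
    import xlsxwriter  # not called directly; pandas needs it installed for engine='xlsxwriter'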

    def form_valid(self, form):

        # This buffers errors associated with the study data
        self.validate_study()

        # This generates a dict representation of the study data with fixes and
        # removes the errors it fixed
        self.perform_fixes()

        # This sets self.results (i.e. the error report)
        self.format_validation_results_for_template()

        study_stream = BytesIO()

        xlsxwriter = self.create_study_file_writer(study_stream)

        xlsxwriter.close()
        # Rewind the buffer so that when it is read(), you won't get an error about opening a zero-length file in Excel
        study_stream.seek(0)

        study_data = base64.b64encode(study_stream.read()).decode('utf-8')
        study_filename = self.animal_sample_filename
        if self.animal_sample_filename is None:
            study_filename = "study.xlsx"

        return self.render_to_response(
            self.get_context_data(
                results=self.results,
                form=form,
                submission_url=self.submission_url,
                study_data=study_data,
                study_filename=study_filename,
            ),
        )

    def create_study_file_writer(self, stream_obj: BytesIO):
        xlsxwriter = pd.ExcelWriter(stream_obj, engine='xlsxwriter')

        # This iterates over the desired order of the sheets and their columns
        for order_spec in self.get_study_sheet_column_display_order():

            sheet = order_spec[0]
            columns = order_spec[1]

            # Create a dataframe and add it as an excel object to an xlsxwriter sheet
            pd.DataFrame.from_dict(self.dfs_dict[sheet]).to_excel(
                excel_writer=xlsxwriter,
                sheet_name=sheet,
                columns=columns
            )

        return xlsxwriter

This is the tag in the template where I render the data

<pre style="display: none" id="output_study_file">{{study_data}}</pre>

This is the javascript that "downloads" the file:

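document.addEventListener("DOMContentLoaded", function(){
    // The hidden <pre> tag that holds the base64-encoded file content
    const study_file_content_tag = document.getElementById("output_study_file");
    // If there is a study file that was produced
    if ( study_file_content_tag && study_file_content_tag.textContent.trim() ) {
        browserDownloadExcel('{{ study_filename|escapejs }}', study_file_content_tag.textContent.trim())
    }
})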

function browserDownloadExcel (filename, base64_text) {
    const element = document.createElement('a');
    element.setAttribute(
        'href',
        'data:application/vnd.openxmlformats-officedocument.spreadsheetml.sheet;base64,' + encodeURIComponent(base64_text)
    );
    element.setAttribute('download', filename);
    element.style.display = 'none';
    document.body.appendChild(element);
    element.click();
    document.body.removeChild(element);
}
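
For reference, the `href` that `browserDownloadExcel` assembles is an ordinary data URI, and the same string can be produced in Python, for example to inspect it in a test. This is only a sketch: `build_data_uri` and `XLSX_MIME` are made-up names, and `quote(..., safe="")` is assumed to match `encodeURIComponent` only for base64 input, where `+`, `/` and `=` are the characters that need escaping.

```python
import base64
from urllib.parse import quote, unquote

XLSX_MIME = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def build_data_uri(file_bytes: bytes, mime_type: str = XLSX_MIME) -> str:
    """Return a base64 data URI for the given bytes."""
    b64_text = base64.b64encode(file_bytes).decode("ascii")
    # Percent-encode "+", "/" and "=", the base64 characters that are not URL-safe
    return f"data:{mime_type};base64,{quote(b64_text, safe='')}"


# Round trip: the bytes recovered from the URI match what went in
sample = b"PK\x03\x04 example bytes"
uri = build_data_uri(sample)
encoded_part = uri.split(",", 1)[1]
print(base64.b64decode(unquote(encoded_part)) == sample)
```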