Zip Create Process with Node Express of large ZIP packages


Goal

We are standing up a low-volume site where users (browser clients) select image files (284 KB per file) and then ask a Node Express server to bundle them into a ZIP for download to the web client.

Issues & Design Constraints

  • The resultant ZIP might be on the order of 50 MB to 5 GB. Therefore we would like to give the user a running progress bar while the ZIP is being constructed. (We assume the browser will give running updates on the progress of the actual download.)
  • We expect a low volume of requests (1-2 requests at a time). However, we do not want to completely tie up our 4-core server processor, so we want to minimize synchronous calls that block the Express server.
  • Given the size of the ZIP, we cannot expect it to be assembled entirely in memory.
  • Are there any other issues we should worry about?

Question

We assume that running 7zip as a child process is bad, since we would not get any running status as to how many of the 258KB files had been added to the ZIP.

So which of the following packages are Node/ExpressJS friendly, given the design constraints and goals listed above?

What I am seeing above is that most packages first collect the files, then finalize them in memory, and then pipe them to the HTTP response (probably not good for 5 GB of data, or am I missing something?). Some seem to be able to use disk, but the question is whether one gets update events as each file is added.

Others seem to be fully async, and I don't see how you would get a running progress value as each file is added to the ZIP package.


There are 2 answers

6
Dr.YSG On BEST ANSWER

Of the packages listed above, most were not appropriate:

  • JSZIP is mainly for the browser
  • EasyZip is a Node wrapper for JSZIP, but it does not provide progress notifications during creation
  • Express-Zip is an in-memory, Express-friendly RES solution (but probably would not handle the size of the ZIP we are talking about)
  • ZIP-Stream is the underlying utility underneath Archiver. Archiver has the queuing services, so one should just use Archiver
  • YAZL might work, but the interface is more complex for progress tracking than Archiver

We chose Archiver, since it had most of the features desired:

  • Express Friendly
  • low memory footprint
  • as fast as 7ZIP for the particular image archives we create (we don't need to compress, files are large, etc.). You might see a 25% hit in performance for other types of archives
  • It does not let you append to existing archives (that was one feature we wanted), but adm-zip might fill that gap
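As a rough illustration of the Archiver approach, here is a minimal sketch of streaming a ZIP straight into an Express response while counting entries. It assumes the caller creates the archive with the `archiver` package (for example `archiver('zip', { store: true })`, where `store` skips compression) and passes it in. The names `streamZip`, `files` (objects with `path` and `name`), and `onProgress` are hypothetical, not part of Archiver.

```javascript
// Sketch only. Assumes `archive` was created by the caller, e.g.:
//   const archiver = require('archiver');
//   const archive = archiver('zip', { store: true }); // store = no compression
// `files` ({ path, name }) and `onProgress` are hypothetical names.
function streamZip(archive, files, res, onProgress) {
  let added = 0;

  // 'entry' fires each time one input file has been appended to the archive.
  archive.on('entry', () => {
    added += 1;
    onProgress({ added, total: files.length });
  });

  // Surface failures instead of leaving the download hanging.
  archive.on('error', (err) => res.destroy(err));

  res.setHeader('Content-Type', 'application/zip');
  res.setHeader('Content-Disposition', 'attachment; filename="images.zip"');

  // Bytes flow to the client as they are produced, so the whole ZIP
  // is never held in memory at once.
  archive.pipe(res);

  // Queue every file; Archiver reads them one after another.
  for (const file of files) {
    archive.file(file.path, { name: file.name });
  }

  // Signal that no more entries will be queued.
  return archive.finalize();
}
```

Because the response body is the ZIP itself, the `onProgress` callback would have to report to the browser over some separate channel; which one is left open here.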

As for the 7zip solution, we tend not to like reading the entrails of a standard output stream from a spawned child process:

  • It is messy to find strings in the stream
  • It causes context switches to read the stream
  • You end up with a brittle solution that depends on exactly what the output stream emits (e.g. in the case of 7zip, the counter sometimes leaps by 30% and sometimes by 1%), among other sources of brittleness
8
jfriend00 On

"We assume that running 7zip as a child process is bad, since we would not get any running status as to how many of the 258KB files had been added to the ZIP."

That appears to be a false assumption.

A command line like this will show progress for each file added to the archive on stdout as each new file is added:

7z a -bsp1 -bb3 test.7z *

So, you can launch that from node.js using the child_process module, and you should be able to capture the stdout progress as it happens. You will need to use spawn, not exec, so that you get the stdout data live as it is produced.
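A minimal sketch of that idea follows, using `spawn` from Node's built-in `child_process` module with the same flags as the command above. The helper names `zipWith7z`, `latestPercent`, and `onPercent` are hypothetical, and the parser assumes only that progress output contains a figure such as `47%`; the exact layout of 7z's output is not guaranteed here and may differ between versions.

```javascript
const { spawn } = require('child_process');

// Pull the most recent "NN%" figure out of a chunk of 7z output.
// Assumption: progress text contains a percentage such as " 42%";
// the match is deliberately loose because the layout may vary.
function latestPercent(text) {
  const matches = text.match(/(\d{1,3})%/g);
  if (!matches) return null;
  return parseInt(matches[matches.length - 1], 10);
}

// Hypothetical helper: archivePath, inputs and onPercent are illustrative names.
function zipWith7z(archivePath, inputs, onPercent) {
  // spawn (not exec) delivers stdout while the child is still running.
  const child = spawn('7z', ['a', '-bsp1', '-bb3', archivePath, ...inputs]);

  child.stdout.setEncoding('utf8');
  child.stdout.on('data', (chunk) => {
    const pct = latestPercent(chunk);
    if (pct !== null) onPercent(pct);
  });

  return new Promise((resolve, reject) => {
    child.on('error', reject); // e.g. 7z is not installed or not on PATH
    child.on('close', (code) =>
      code === 0 ? resolve() : reject(new Error(`7z exited with code ${code}`)));
  });
}
```

A caller might use it as `zipWith7z('test.7z', ['a.jpg', 'b.jpg'], (pct) => console.log(pct))`, forwarding each percentage to whatever drives the progress bar.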

Running this as a child process will keep your nodejs process free to serve other requests and will allow the child process to manage its own memory, independent of nodejs.

The 7zip program handles extremely large archives and files with appropriate memory usage. With the right flags to get progress to stdout and running it as a child process, it appears to meet all your requirements.