ZipArchive does not flush zip item immediately

I am creating a zip file using ZipArchive + FileStream. When a new item is added to the zip file, I would like to flush/write the newly added item to the underlying zip stream.

The code below does not flush the individual zip items. The whole zip gets written to output.zip only when the FileStream is disposed.

var files = Directory.GetFiles("C:\\Temp","*.pdf");
using (var output = new FileStream("c:\\temp\\output.zip", FileMode.Create, FileAccess.Write))
{
    using (System.IO.Compression.ZipArchive zip = new ZipArchive(output, ZipArchiveMode.Create, true))
    {                    
        foreach (var file in files)
        {
            using (var internalFile = new FileStream(file, FileMode.Open))
            {
                
                var zipItem = zip.CreateEntry(Path.GetFileName(file));
                         
                using (var entryStream = zipItem.Open())
                {
                    await internalFile.CopyToAsync(entryStream).ConfigureAwait(false);
                }
            }
                                    
            await output.FlushAsync();

            // after each file flush the output stream.
            // expectation at this point, individual zip item will be written to physical file.
            // however I don't see the file size changes in windows explorer.
        } // put breakpoint here
    }
} // The whole output gets flushed at this point, when the FileStream is disposed

There are 4 answers

Answer from GregHNZ

I'm going to say "this is by design".

It certainly looks like it will be hard to get any different behaviour.

The reason this might be of value from a design point of view relates to how the zip process works. It identifies repeating sequences of bytes and, rather than writing a sequence out several times, writes it once; whenever that sequence of bytes is needed again, it writes a reference rather than the entire sequence. That's how the zip file gets to be smaller than the original file. (Caveat: that's my understanding, in lay terms, and it's been a long time since I looked at the zip algorithm.)

So it's 'of value' to have the whole file available before it writes, to optimise the identification of duplicate sequences of bytes.
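To make that idea concrete, here is a small self-contained sketch of my own (using DeflateStream, the compressor ZipArchive uses per entry) showing that data full of repeated sequences deflates to far fewer bytes than random data; the sizes in the comments are rough expectations, not exact figures:

using System;
using System.IO;
using System.IO.Compression;

static long DeflatedSize(byte[] data)
{
    using var compressed = new MemoryStream();
    // leaveOpen: true so compressed.Length can still be read after the
    // DeflateStream is disposed (disposing it flushes the final block).
    using (var deflate = new DeflateStream(compressed, CompressionLevel.Optimal, leaveOpen: true))
    {
        deflate.Write(data, 0, data.Length);
    }
    return compressed.Length;
}

var repetitive = new byte[100_000];           // all zeros: one long repeated sequence
var random = new byte[100_000];
new Random(42).NextBytes(random);             // effectively no repeats to reference

Console.WriteLine($"repetitive: {DeflatedSize(repetitive)} bytes"); // a few hundred bytes
Console.WriteLine($"random:     {DeflatedSize(random)} bytes");     // close to 100,000 bytes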

Here is what looks to be the ZipArchive code, from the dotnet runtime GitHub repo:

https://github.com/dotnet/runtime/blob/6072e4d3a7a2a1493f514cdf4be75a3d56580e84/src/libraries/System.IO.Compression/src/System/IO/Compression/ZipArchive.cs

(It might not be the latest, or the actual version you're running though).

It looks like compression is done from the private void WriteFile() method; certainly that's where the Seek(0) happens. That method is only referenced from the Dispose() method.

Your code is calling FlushAsync() on your output stream, which is a standard IO FileStream. When you call FlushAsync() it will write all of the bytes that the ZipArchive object has given it. Unfortunately, that will be zero bytes.

You could try disposing the ZipArchive after each object is written, but I think that would not be a very happy experiment. I suspect it would rewrite the entire stream each time, rather than individually adding new elements (but I'm not sure).
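For reference, here is a rough sketch of that experiment (my own illustration, not a recommendation): it reopens the archive with ZipArchiveMode.Update for every file, since Create mode cannot be layered onto a stream that already contains an archive. Update mode needs a readable, writable, seekable stream, so the FileStream flags differ from the question's code:

using System;
using System.IO;
using System.IO.Compression;

var files = Directory.GetFiles("C:\\Temp", "*.pdf");
File.Delete("c:\\temp\\output.zip"); // start clean; no-op if the file doesn't exist

foreach (var file in files)
{
    // Update mode requires read + write + seek on the underlying stream.
    using var output = new FileStream("c:\\temp\\output.zip",
        FileMode.OpenOrCreate, FileAccess.ReadWrite);
    using var zip = new ZipArchive(output, ZipArchiveMode.Update);

    var zipItem = zip.CreateEntry(Path.GetFileName(file));
    using (var entryStream = zipItem.Open())
    using (var internalFile = File.OpenRead(file))
    {
        internalFile.CopyTo(entryStream);
    }

    // Disposing the ZipArchive here writes a complete, valid zip (entries plus
    // central directory) to disk on every iteration, but Update mode re-writes
    // the existing entries each time, so it gets slower as the archive grows.
}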

Answer from malat

I am seeing a different behavior than you do today. Here is my small code:

var files = Directory.GetFiles(dir);
using (var output = new FileStream(@"c:\temp\output.zip", FileMode.Create, FileAccess.Write))
{
    using (var zip = new ZipArchive(output, ZipArchiveMode.Create))
    {
        foreach (var file in files)
        {
            using var internalFile = File.Open(file, FileMode.Open, FileAccess.Read);
            //var zipItem = zip.CreateEntry(Path.GetFileName(file), CompressionLevel.NoCompression);
            var zipItem = zip.CreateEntry(Path.GetFileName(file), CompressionLevel.Fastest);

            using (var entryStream = zipItem.Open())
            {
                internalFile.CopyTo(entryStream, 4096 /* FileStream.DefaultBufferSize */);
            }
        } // put breakpoint here, and iterate 8 times approx.
    }
}

If I make my debugger stop after 8 iterations (8 files pushed into the zip stream), I can see that the size of the target zip file increases a bit.

Answer from Emperor Eto

With ZipArchive and ZipArchiveEntry, the compressed data is indeed written to the underlying FileStream when the ZipArchiveEntry's stream is disposed. It does not wait until the underlying archive stream is closed before compressing and saving.

There's a very easy way to verify this:

        // ...
        Console.WriteLine($"Underlying stream position: {output.Position}");
        await output.FlushAsync();
        // ...

You'll see the FileStream position increases steadily with each file, as expected. This is, in fact, the best that ZipArchiveEntry can realistically do.

The real question is why the OS isn't reporting the increased file size to you even when FlushAsync is called. This gets into the more complex topic of write caching, which is not unique to ZipArchive and in fact has nothing to do with it. In brief, guaranteeing that bits have been written to hardware, and even knowing whether they have been at all, can be very tricky, though you can push things along to an extent.

For example, there is a Flush overload unique to FileStream, FileStream.Flush(bool flushToDisk) that you can try instead of normal Flush. On Windows this forces a call to the Win32 FlushFileBuffers while on Unix it calls fsync. Note the underlying OS write functions are invoked upon Flush regardless of the flushToDisk parameter - so the data has left your process by that point and is in the hands of the OS, which might buffer to disk any time it likes thereafter. In fact in my own testing in Windows, the Zip file size was increasing with each file without using the flushToDisk parameter (though sometimes 1-2 seconds later), and I see another answer notes the same.
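As an illustration, here is a minimal sketch of that approach (synchronous, since there is no FlushAsync overload that takes flushToDisk; the paths are the placeholders from the question):

using System;
using System.IO;
using System.IO.Compression;

var files = Directory.GetFiles("C:\\Temp", "*.pdf");
using var output = new FileStream("c:\\temp\\output.zip", FileMode.Create, FileAccess.Write);
using var zip = new ZipArchive(output, ZipArchiveMode.Create, leaveOpen: true);

foreach (var file in files)
{
    var zipItem = zip.CreateEntry(Path.GetFileName(file));
    using (var entryStream = zipItem.Open())
    using (var internalFile = File.OpenRead(file))
    {
        internalFile.CopyTo(entryStream);
    }

    // Flush(true) asks the OS to write its buffers to the device
    // (FlushFileBuffers on Windows, fsync on Unix). Plain Flush()/FlushAsync()
    // only hands the data to the OS.
    output.Flush(flushToDisk: true);
    Console.WriteLine($"Stream position after {Path.GetFileName(file)}: {output.Position}");
}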

However if you're on a slower spinning disk, or a network share, or just have quirky hardware, your OS may in fact wait to flush its buffers until they're full or you force it to. And even then, you should not expect it to always do so instantaneously.

To sum up, you should not be concerned about the lack of a visible file size increase as you add files to the archive; ZipArchive is behaving as you would expect it to, compressing files as they are added, not all at the end.

Answer from Alexander Burov

I think this is just an optimization in Windows Explorer: it sees that the file is open for writing and doesn't react to every change. You can check the file's Properties to make sure it's not empty.

Let's add logging of the current stream position to see how many bytes were written; then we can put breakpoints right after the logging calls and open the file's Properties to check the file size.

var files = Directory.GetFiles("C:\\Temp", "*.pdf");
using (var output = new FileStream("c:\\temp\\output.zip", FileMode.Create, FileAccess.Write, FileShare.ReadWrite))
{
    using (ZipArchive zip = new ZipArchive(output, ZipArchiveMode.Create, true))
    {
        foreach (var file in files)
        {
            // file processing is unchanged

            await output.FlushAsync();
            Console.WriteLine($"File {file} processed, stream size is {output.Position} bytes");

        } // put breakpoint here
    }

    await output.FlushAsync();
    Console.WriteLine($"ZipArchive disposed, stream size is {output.Position} bytes");
} // put breakpoint here

In my case they match perfectly: the stream position from the log equals the file size shown in Properties.

We can notice that some extra bytes are written when we dispose the ZipArchive object:

File C:\Temp\1.pdf processed, stream size is 34649 bytes
File C:\Temp\2.pdf processed, stream size is 68574 bytes
File C:\Temp\3.pdf processed, stream size is 157989 bytes
ZipArchive disposed, stream size is 158164 bytes

In my case it's 175 bytes of archive metadata, such as the central directory listing the files in the archive.
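
As a quick cross-check (a sketch assuming the same output.zip path as above), those trailing bytes are what make the file readable as a complete archive once the ZipArchive has been disposed:

using System;
using System.IO.Compression;

// Read the finished archive back; the Entries collection comes from the
// central directory that was written when the ZipArchive was disposed.
using var readBack = ZipFile.OpenRead("c:\\temp\\output.zip");
foreach (var entry in readBack.Entries)
{
    Console.WriteLine($"{entry.FullName}: {entry.CompressedLength} of {entry.Length} bytes");
}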