Programmatically Extract Single Specific File From 7zip Archive - Java - Linux


I would really appreciate your input on the below scenario please.

The requirements:

  • I have a 7zip archive file containing several thousand files.

  • I have a Java application running on Linux that is required to retrieve individual files from the 7zip archive.

  • I would like to retrieve a file from the archive by its path (e.g. my7zFile.7z/file1.pdf) without having to iterate through all the files in the archive and comparing file names.

  • I would like to avoid having to extract all files from the archive before running the search (the uncompressed archive is several TB).

I had a look at the 7zip Java binding (7-Zip-JBinding), specifically the IInArchive interface. Its only extract method seems to work via file index, not via file name:

http://sevenzipjbind.sourceforge.net/javadoc/net/sf/sevenzipjbinding/IInArchive.html

Do you know of any other libraries that could help me with this use case or am I overlooking a way of doing this with 7zip jbinding?

Thank you

Kind regards,

Tobi


There are 2 answers

0
Benjamin Close On BEST ANSWER

Sadly it appears the API doesn't provide enough to fulfill all your requirements. In order to extract a single file it appears you need to walk the archive index. The simplified interface to the archive makes this much easier:

The ISimpleInArchive interface provides:

ISimpleInArchiveItem[]  getArchiveItems()  

This allows you to retrieve a list of the items in the archive. The ISimpleInArchiveItem interface provides the method:

java.lang.String    getPath()

Hence you can walk the archiveItems comparing on path. Granted this is against your requirements.

However, note that this walks the index table and does not extract the files until requested. Once you have the item you're after, you can use:

ExtractOperationResult  extractSlow(ISequentialOutStream SequentialOutStream) 

on the item you have found to actually extract it.
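If many lookups are needed, the walk over the item list only has to happen once: the paths can be collected into a map and reused. The following is a minimal sketch of that idea in plain Java. ArchivePathIndex is a hypothetical helper, not part of the 7zip binding, and the list of paths stands in for the getPath() values gathered in a single pass over getArchiveItems(). It also assumes that an item's position in that array matches the file index the extract methods expect.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of the 7zip binding): maps archive paths to item indices.
final class ArchivePathIndex {
    private final Map<String, Integer> indexByPath = new HashMap<>();

    // 'paths' stands in for the getPath() values collected in one pass over
    // getArchiveItems(); position i is assumed to be the index of item i.
    ArchivePathIndex(List<String> paths) {
        for (int i = 0; i < paths.size(); i++) {
            // putIfAbsent keeps the first index if a path occurs more than once
            indexByPath.putIfAbsent(paths.get(i), i);
        }
    }

    // Returns the item index for the given path, or -1 if the path is unknown.
    int indexOf(String path) {
        return indexByPath.getOrDefault(path, -1);
    }
}
```

Building the map costs one pass over the archive's item list; every later lookup by path is then a hash lookup instead of another loop with string comparisons.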

Looking at the 7z file format (note this is not the official site of 7zip), the header information is all at the end of the file with the Signature header at the start of the file giving an offset to the start of the header info. So provided the SevenZip bindings are written nicely, your search will at most read the start of the file (SignatureHeader) to find the offset to the HeaderInfo section, then walk the HeaderInfo section in order to build up the file list required in getArchiveItems(). Only once you have the item you need will it shift back to the index of the actual stream for the file you want extracted (most likely when you call extractSlow).
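To make the layout described above concrete, here is a rough sketch that reads the offset out of the signature header. It assumes the commonly documented 32-byte layout (6-byte signature, 2-byte version, 4-byte CRC, then an 8-byte little-endian NextHeaderOffset, an 8-byte NextHeaderSize and a 4-byte CRC). The class name SevenZSignatureHeader is made up for illustration, the CRC fields are not verified, and this is not how the bindings themselves are necessarily implemented.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative parser for the fixed-size header at the start of a .7z file.
final class SevenZSignatureHeader {
    static final int SIZE = 32; // assumed size of the signature header in bytes
    private static final byte[] MAGIC = {'7', 'z', (byte) 0xBC, (byte) 0xAF, 0x27, 0x1C};

    final long nextHeaderOffset; // offset of the header info, relative to byte 32
    final long nextHeaderSize;   // size of the header info in bytes

    private SevenZSignatureHeader(long nextHeaderOffset, long nextHeaderSize) {
        this.nextHeaderOffset = nextHeaderOffset;
        this.nextHeaderSize = nextHeaderSize;
    }

    // Parses the first 32 bytes of an archive; the CRC fields are skipped, not checked.
    static SevenZSignatureHeader parse(byte[] first32) {
        if (first32.length < SIZE) {
            throw new IllegalArgumentException("need at least " + SIZE + " bytes");
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (first32[i] != MAGIC[i]) {
                throw new IllegalArgumentException("not a 7z signature");
            }
        }
        ByteBuffer buf = ByteBuffer.wrap(first32).order(ByteOrder.LITTLE_ENDIAN);
        // bytes 6-7: version, bytes 8-11: CRC, then the two 64-bit fields
        return new SevenZSignatureHeader(buf.getLong(12), buf.getLong(20));
    }

    // Absolute file position at which the header info starts.
    long headerPosition() {
        return SIZE + nextHeaderOffset;
    }
}
```

With those two values a reader can seek straight to the header info near the end of the file, without touching the compressed data in between.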

So whilst not all your requirements are met, the overhead of the search/compare required is limited to only searching the header info of the archive.

0
Wasi Ahmad On

I once wrote some code to read all the files and folders in a zip file. I had a deep hierarchy of text files and folders inside the zip file. I am not sure whether it will help you or not, but here is the skeleton of the code.

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// filepath is the path of the zip file; try-with-resources closes the file afterwards
try (ZipFile zipFile = new ZipFile(filepath)) {
    Enumeration<? extends ZipEntry> entries = zipFile.entries();

    while (entries.hasMoreElements()) {
        ZipEntry entry = entries.nextElement();
        if (entry.isDirectory()) { // found a directory inside the zip file
            // write your code here
        } else {
            try (InputStream stream = zipFile.getInputStream(entry);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(stream))) {
                // write your code to read the content of the file
            }
        }
    }
}

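You can modify the code to reach your desired file in the zip. If you already know the exact entry name, ZipFile.getEntry(String name) returns that entry without an explicit loop; otherwise you have to walk through all the entries of the archive. Note that ZipFile.entries() enumerates the entries recorded in the archive's central directory rather than performing a depth-first (DFS) traversal of a folder tree, so do not rely on a particular traversal order. Also keep in mind that this applies to zip files only, not to 7z archives. You will find detailed, relevant examples on the web.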
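For plain zip archives (not 7z), the standard library can also look an entry up by name. The sketch below uses ZipFile.getEntry for that. The class and method names are made up for illustration, and InputStream.readAllBytes assumes Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Illustrative helper: reads one named entry from a zip file without an explicit loop.
final class ZipSingleEntry {
    // Returns the entry's bytes, or null if the archive has no entry with that name.
    static byte[] read(Path zipPath, String entryName) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zip.getEntry(entryName); // lookup by name
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return in.readAllBytes(); // requires Java 9+
            }
        }
    }
}
```

Entry names use forward slashes, so a nested file would be requested as, for example, ZipSingleEntry.read(path, "dir/b.txt").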