Set UTF-8 pathname header in libarchive

SUMMARY

How can I write a zip file using libarchive in C++, such that path names will be UTF-8 encoded? With UTF-8 path names, special characters will be decoded correctly when using OS X / Linux / Windows 8 / 7-Zip / WinZip.

DETAILS

I am trying to write a zip archive using libarchive, compiling with Visual C++ 2013 on Windows.

I would like to be able to add files with non-ASCII chars (e.g. äöü.txt) to the zip archive.

There are four functions to set the pathname header in libarchive:

void archive_entry_set_pathname(struct archive_entry *, const char *);
void archive_entry_copy_pathname(struct archive_entry *, const char *);
void archive_entry_copy_pathname_w(struct archive_entry *, const wchar_t *);
int  archive_entry_update_pathname_utf8(struct archive_entry *, const char *);

Unfortunately, none of them seem to work.

In particular, I have tried:

const char* myUtf8Str = ...
archive_entry_update_pathname_utf8(entry, myUtf8Str);
// this sounded like the most straightforward solution

and

const wchar_t* myUtf16Str = ...
archive_entry_copy_pathname_w(entry, myUtf16Str);
// UTF-16 encoded strings seem to be the default on Windows

In both cases, the resulting zip archive does not show the file names correctly in either Windows Explorer or 7-Zip.

I am certain that my input strings are encoded correctly, since I convert them from Qt QString instances that work perfectly well in other parts of my code:

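const QByteArray utf8Bytes = filename.toUtf8();        // named objects keep the buffers alive;
const char* myUtf8Str = utf8Bytes.constData();         // pointers into temporaries would dangle
const std::wstring wideStr = filename.toStdWString();
const wchar_t* myUtf16Str = wideStr.c_str();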

For instance, this works even for another call to libarchive, when creating the zip file:

archive_write_open_filename_w(archive, zipFile.toStdWString().c_str());
// creates a zip archive file where the non-ASCII
// chars are encoded correctly, e.g. äöü.zip

I have also tried to change the options for libarchive, as suggested by this example:

archive_write_set_options(a, "hdrcharset=UTF-8");

But this call fails, so I assume that I have to set some other option, but I'm running out of ideas...

UPDATE 2

I have done some more reading about the zip format. It allows writing file names in UTF-8, such that OS X / Linux / Windows 8 / 7-Zip / WinZip will always decode them correctly, see e.g. here.
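
For reference, the zip specification (APPNOTE) signals UTF-8 file names through bit 11 of the general purpose bit flag in each header. Below is a minimal sketch, independent of libarchive, that checks this bit in a local file header held in memory. The function name localHeaderDeclaresUtf8 is made up for illustration, and the sketch assumes the buffer starts at (or is offset to) a local file header.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if a ZIP local file header at `offset` has general purpose
// bit 11 (the "language encoding" flag) set, i.e. the name is declared UTF-8.
// Hypothetical helper for inspecting an archive; not a libarchive function.
bool localHeaderDeclaresUtf8(const std::vector<std::uint8_t>& buf,
                             std::size_t offset = 0) {
    // The fixed part of a local file header is 30 bytes long.
    if (buf.size() < offset + 30) return false;

    // Signature 0x04034b50, stored little-endian as "PK\x03\x04".
    if (buf[offset] != 0x50 || buf[offset + 1] != 0x4B ||
        buf[offset + 2] != 0x03 || buf[offset + 3] != 0x04) {
        return false;
    }

    // The general purpose bit flag is a 2-byte little-endian field at offset 6.
    const std::uint16_t flags = static_cast<std::uint16_t>(
        buf[offset + 6] | (buf[offset + 7] << 8));

    return (flags & 0x0800) != 0;  // bit 11
}
```

Running a check like this on the first header of a generated archive would show whether the writer actually declared the names as UTF-8 or just stored raw bytes in some local code page.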

This is what I want to achieve using libarchive, i.e. I would like to pass it my UTF-8 encoded pathname and have it store that in the zip file without doing any conversion.

I have added the "set locale" approach as an (unsatisfying) answer.

There are 2 answers

ValarDohaeris On

This is a workaround that will store path names using the system's locale settings, i.e. the resulting zip file can be decoded correctly on the same system, but is not portable.

This is not satisfying; I am only posting it to show that it is not what I am looking for.

Set the global locale to "" as explained here:

std::locale::global(std::locale(""));

and then read it back:

std::locale loc;
std::cout << loc.name() << std::endl;
// output: English_United States.1252
// may of course be different depending on system settings

Then set the pathname using archive_entry_update_pathname_utf8.

The zip file now contains file names encoded with Windows-1252, so my Windows can read them, but they appear as garbage on e.g. Linux.
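
To illustrate why the names turn into garbage elsewhere, here is a small sketch comparing the bytes of the same character in the two encodings. The helper names encodeUtf8 and encodeCp1252Latin are hypothetical, and the sketch only covers code points up to U+FFFF (and, for Windows-1252, only the range where it coincides with Unicode).

```cpp
#include <string>

// Encode one Unicode code point (U+0000..U+FFFF only in this sketch) as UTF-8.
std::string encodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 1 byte
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Windows-1252 matches Unicode for ASCII and for U+00A0..U+00FF, so those
// code points become a single byte of the same value. Anything else is
// replaced by '?' here, because this sketch has no full mapping table.
std::string encodeCp1252Latin(char32_t cp) {
    const bool sameAsUnicode = cp < 0x80 || (cp >= 0xA0 && cp <= 0xFF);
    return std::string(1, sameAsUnicode ? static_cast<char>(cp) : '?');
}
```

For ä (U+00E4), encodeUtf8 yields the two bytes C3 A4, while the Windows-1252 form is the single byte E4. A reader that expects UTF-8 cannot decode a lone E4 followed by other Windows-1252 letters, which is why the names look broken on Linux.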

Future

There is a libarchive issue for UTF-8 filenames. The whole story is quite complicated, but it sounds like they may add better UTF-8 support in libarchive 4.0.

Harald Koch On

I got UTF-8 filenames working in ZIP archives with libarchive-3.3.3, using this exact flow (the order of the calls matters):

entry = archive_entry_new();
archive_entry_set_pathname_utf8(entry, utf8Filename);
archive_entry_set_pathname(entry, utf8Filename);

If the two calls are swapped (archive_entry_set_pathname before archive_entry_set_pathname_utf8), the entries are garbled in Windows Explorer's ZIP functionality. This worked for me with German umlauts and should work for any UTF-8 character. It also worked for 2-byte and 3-byte UTF-8 characters (NFC/NFD).

Addition: the process must be run in an environment with the LANG variable set to a UTF-8 capable locale (e.g. "LANG=de_DE.UTF-8" in my case). Without this environment, the process won't generate correct UTF-8 characters.
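
Since the result depends on the locale of the process, it may help to check it at startup. The following is only a sketch using the standard library. The helper names localeNameIsUtf8 and processLocaleIsUtf8 are made up, and matching on the locale name string is a heuristic (an assumption on my part), not something libarchive provides.

```cpp
#include <algorithm>
#include <cctype>
#include <clocale>
#include <string>

// Heuristic: does a locale name such as "de_DE.UTF-8" or "en_US.utf8"
// mention a UTF-8 codeset? Takes the name by value to lowercase a copy.
bool localeNameIsUtf8(std::string name) {
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
    return name.find("utf-8") != std::string::npos ||
           name.find("utf8") != std::string::npos;
}

// Adopt the LC_CTYPE locale from the environment (LANG, LC_ALL, ...) and
// report whether its name looks like a UTF-8 locale.
bool processLocaleIsUtf8() {
    const char* name = std::setlocale(LC_CTYPE, "");
    return name != nullptr && localeNameIsUtf8(name);
}
```

Calling processLocaleIsUtf8() before creating any archive entries would at least give a warning when the program is started without a UTF-8 locale, e.g. on a Windows system that reports a name like "English_United States.1252".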