Decode utf-8 in tarfile

I have a tar file whose entries have file names containing multibyte (Japanese) characters. I am using libarchive to untar the file. The file names inside the tar file are encoded in UTF-8. When I untar the file, the resulting file names always lose the multibyte characters.

I wrote a Python script that produces the result I want, and it works:

#!/usr/bin/python27

import tarfile
import pdb
def transform(data):
    u = data.decode('utf8')
    pdb.set_trace()
    #return u.encode('utf8')
    return u

tar = tarfile.open('abc.tar')
for m in tar.getmembers():
    print m.name
    m.name = transform(m.name)
    #print m.name

tar.extractall()
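
For comparison, here is a rough Python 3 sketch of the same idea. It assumes the archive stores its names as UTF-8 bytes (as mine does) and relies on the `encoding` argument that `tarfile.open` passes through to `TarFile`, so the names come back already decoded. The helper names `list_member_names` and `build_sample_tar` are made up for illustration, and the archive is built in memory only so the example is self-contained.

```python
import io
import tarfile


def list_member_names(tar_bytes, encoding="utf-8"):
    """Return the member names of an in-memory tar archive, decoded with `encoding`."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r", encoding=encoding) as tar:
        return [member.name for member in tar.getmembers()]


def build_sample_tar(name, payload):
    """Build a one-file GNU-format tar archive in memory (illustrative fixture only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.GNU_FORMAT,
                      encoding="utf-8") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


if __name__ == "__main__":
    # The Japanese name is stored as UTF-8 bytes in the tar header.
    sample = build_sample_tar("日本語.txt", b"hello")
    for member_name in list_member_names(sample):
        print(member_name)
```

With a real archive on disk, the equivalent call would be `tarfile.open('abc.tar', encoding='utf-8')` followed by `extractall()`, with no manual `decode` step.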

However, I want to achieve the same in C++. This is an extract of the C++ code:

while (entry = tar_file->nextEntry()) {
    fs::path filepath = path / entry->getFileName();  // the UTF-8 characters are lost here
    // So I tried the following: first query the required buffer size, in wide characters
    int wchars_num = MultiByteToWideChar(CP_ACP, 0, filepath.string().c_str(), -1, NULL, 0);
    wchar_t* wstr = new wchar_t[wchars_num];

    // I also tried UTF-8 in place of CP_ACP
    MultiByteToWideChar(CP_ACP, 0, filepath.string().c_str(), -1, wstr, wchars_num);
    // But this did not help