Is there any function like iconv in Python?


I have some CSV files that I need to convert from Shift-JIS to UTF-8.

Here is my PHP code, which successfully transcodes them to readable text.

$str = utf8_decode($str);
$str = iconv('shift-jis', 'utf-8'. '//TRANSLIT', $str);
echo $str;

My problem is how to do the same thing in Python.

3

There are 3 answers

4
Azurtree On BEST ANSWER

I don't know PHP, but does this work?

mystring.decode('shift-jis').encode('utf-8')

Also, I assume the CSV content comes from a file. There are a few options for opening a file in Python.

with open(myfile, 'rb') as fin:

would be the first option; you get the data exactly as it is stored, as raw bytes.

with open(myfile, 'r') as fin:

would be the default way of opening a file.

I also tried this on my computer with a Shift-JIS text file, and the following code worked:

with open("shift.txt", "rb") as fin:
    text = fin.read()

text.decode('shift-jis').encode('utf-8')

The result was the following UTF-8 (without any errors):

' \xe3\x81\xa6 \xe3\x81\xa7 \xe3\x81\xa8'

OK, I have validated my solution :)

The first character is indeed the right one: "\xe3\x81\xa6" is the byte sequence E3 81 A6, which is the UTF-8 encoding of て (U+3066). So it gives the correct result.


You can try yourself at this URL
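In Python 3 only bytes objects have .decode(), so the same conversion could look roughly like the sketch below. This is not from the code I ran above: convert_file and the file names are made up for illustration, and it assumes the input really is Shift-JIS. The demo writes its own small sample file so it can run anywhere.

```python
import os
import tempfile


def convert_file(src_path, dst_path, src_encoding="shift_jis", dst_encoding="utf-8"):
    """Read a text file as src_encoding and write it back out as dst_encoding."""
    # newline="" leaves line endings untouched, which matters for CSV data.
    with open(src_path, "r", encoding=src_encoding, newline="") as fin:
        text = fin.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as fout:
        fout.write(text)


# Demo on a temporary file (hypothetical names, sample data made up here).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "shift.txt")
    dst = os.path.join(tmp, "utf8.txt")
    with open(src, "wb") as f:
        f.write("て,で,と\r\n".encode("shift_jis"))
    convert_file(src, dst)
    with open(dst, "rb") as f:
        converted = f.read()  # raw UTF-8 bytes of the converted file
```

For files too large to read in one go, the same idea works line by line, since iterating over fin decodes lazily.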

0
Zinob On

It would be helpful if you could post the string that you are trying to convert, since this error suggests some problem with the input data. Older versions of PHP failed silently on broken input strings, which makes this hard to diagnose.

According to the documentation, this might also be due to differences between Shift-JIS dialects. Try using 'shift_jisx0213' or 'shift_jis_2004' instead.
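A minimal sketch of trying those codecs in turn, in Python 3 syntax. decode_with_fallback is a hypothetical helper name, not a library function; the codec names are the ones shipped with Python. The sample bytes are made by encoding ① with shift_jis_2004, a character that the plain 'shift_jis' codec does not cover.

```python
def decode_with_fallback(data, encodings=("shift_jis", "shift_jis_2004", "shift_jisx0213")):
    """Return (text, codec_name) for the first codec that decodes data cleanly."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # this dialect rejected the bytes, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


# Sample input: a character outside the plain Shift-JIS repertoire.
sample = "①".encode("shift_jis_2004")
text, used = decode_with_fallback(sample)
```

Note that a codec decoding without errors does not prove it is the right one; it only means the bytes are valid in that dialect.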

If using another dialect does not work, you might get away with asking Python to fail silently by using .decode('shift-jis', 'ignore') or .decode('shift-jis', 'replace').
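To illustrate the difference between the error handlers, here is a small Python 3 sketch. The broken input is made up for the example: valid Shift-JIS for て followed by a lone lead byte with no second byte after it.

```python
# Valid Shift-JIS for "て" plus a dangling lead byte (0x82) with nothing after it.
data = "て".encode("shift_jis") + b"\x82"

try:
    data.decode("shift_jis")  # default handler is "strict"
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = data.decode("shift_jis", "ignore")    # silently drops the bad byte
replaced = data.decode("shift_jis", "replace")  # marks the bad spot with U+FFFD
```

'replace' is usually the safer of the two, because the U+FFFD marker shows where data was lost, while 'ignore' leaves no trace.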

0
Jasen On

For when Python's built-in encodings are insufficient, there's an iconv package on PyPI.

pip install iconv

Unfortunately, the documentation is nonexistent.

There's also iconv_codecs

pip install iconv_codecs

For example:

>>> import iconv_codecs
>>> iconv_codecs.register('ansi_x3.110-1983')
>>> "foo".encode('ansi_x3.110-1983')