I'm trying to convert a localization file that contains Chinese characters so that the Chinese characters are converted into Latin-1 encoding.
However, when I run the Python script I get this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xb9 in position 0: ordinal not in range(128)
Here's my Python script. It essentially just takes the user's input to pick the file to convert, then converts the file (all lines that start with a [ or are empty should be skipped). The part that needs to be converted is always at index 1 in the list produced by splitting on =.
# coding: utf8
# Enter file name
file_name = raw_input('Enter File Path/Name To Convert: ')
# Open the file we write to...
write_file = open(file_name + "_temp", 'w+')
# Open the file we read from...
read_file = open(file_name)
with open(file_name) as file_to_write:
    for line in file_to_write:
        # We ignore any line that starts with a [ or is empty...
        if line and line[0:1] != '[':
            split_string = line.split("=")
            if len(split_string) == 2:
                write_file.write(split_string[0] + "=" + split_string[1].encode('gbk').decode('latin1') + "\n")
            else:
                write_file.write(line)
        else:
            write_file.write(line)
# Close the file we write to...
write_file.close()
# Close the file we read from...
read_file.close()
An example config file is...
[Example]
Password=密碼
The converted output should be...
[Example]
Password=±K½X
The Latin-1 encoding cannot represent Chinese characters. The best you can get, if all you have for output is Latin-1, is escape sequences.
You are using Python 2.x. Python 3.x opens files as text and automatically decodes the bytes it reads into (unicode) strings.
In Python 2, when you read a file you get raw bytes: you are responsible for decoding those bytes to text (unicode objects in Python 2.x), processing them, and re-encoding them to the desired encoding when writing the information out to another file.
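That bytes → unicode → bytes round trip can be sketched as follows (a minimal illustration using a bytes literal, which behaves the same under Python 3; the UTF-8 input encoding is an assumption about the source file):

```python
# Raw bytes as they would be read from a file in Python 2
# (here a bytes literal): the UTF-8 encoding of "密碼" ("password").
raw = b'\xe5\xaf\x86\xe7\xa2\xbc'

# Step 1: decode the bytes to a unicode string, naming the input encoding.
text = raw.decode('utf-8')  # the two-character string 密碼

# Step 2: re-encode to the target encoding when writing out.
# Latin-1 has no Chinese characters, so a plain .encode('latin1') raises
# UnicodeEncodeError; an error handler substitutes escape sequences instead.
out = text.encode('latin1', errors='backslashreplace')
print(out)  # b'\\u5bc6\\u78bc'
```

The same two-step rule applies to every byte string your script touches, not only the ones you expect to contain Chinese text.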
So, the line that reads:

    write_file.write(split_string[0] + "=" + split_string[1].encode('gbk').decode('latin1') + "\n")

should instead decode the raw bytes first and then encode to the target encoding, something like this (assuming the source file is UTF-8; use the actual encoding of your input):

    write_file.write(split_string[0] + "=" + split_string[1].decode("utf-8").encode("latin1", errors="backslashreplace") + "\n")
Now, note the errors="backslashreplace" parameter on the encode call. What I said above remains true: latin1 is a character set of only 256 code points (fewer than 200 of them printable characters). It does contain the Latin letters and the most used accented characters ("á é í ó ú ç ã ñ"... etc.), some punctuation and math symbols, but no characters for other scripts. If you have to represent those as text, you should use the utf-8 encoding instead, and configure whatever software consumes the generated file to use that encoding.
That said, what you are doing is just a horrible practice. Unless you are opening a really nightmarish file which is known to contain text in different encodings, you should just decode all of the text to unicode and then re-encode all of it, not just the part of the data which is meant to have non-ASCII characters. Unless other parts of the original file contain characters that are invalid in the source encoding, your inner loop could just as well be:

    for line in file_to_write:
        write_file.write(line.decode("utf-8").encode("latin1", errors="backslashreplace"))
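For comparison, here is a hedged sketch of how the whole conversion could look in Python 3, where open() itself does the decoding and encoding per file (the utf-8 input encoding and the "_temp" output name mirror the question's script; adjust both to your actual setup):

```python
def convert(file_name):
    # Python 3: open() decodes on read and encodes on write; we just
    # name the encodings.  errors='backslashreplace' turns characters
    # Latin-1 cannot represent into \uXXXX escape sequences instead of
    # raising UnicodeEncodeError.
    with open(file_name, encoding='utf-8') as src, \
         open(file_name + '_temp', 'w', encoding='latin1',
              errors='backslashreplace') as dst:
        for line in src:
            dst.write(line)
```

With an input line such as "Password=密碼", the temp file ends up containing "Password=\u5bc6\u78bc", and no manual encode/decode calls are needed in the loop at all.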
As for your "example output": that is just the *very same* file, i.e. the same bytes as in the first file. The program displaying the line "Password=密碼" is "seeing" the file through a Chinese encoding (the sample bytes match Big5, a traditional-Chinese encoding, rather than the GBK your script names), and the other program is "seeing" the exact same bytes but interpreting them as Latin-1. You should not have to convert from one to the other at all.
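This claim is easy to check: the question's expected output is exactly what the Chinese-encoded bytes of 密碼 look like when read as Latin-1 (a quick sketch; note the sample bytes decode under Big5, not under the GBK named in the question's script):

```python
# The question's expected output "±K½X", viewed as raw bytes.
data = b'\xb1K\xbdX'

# Decoded two ways, the same four bytes yield both strings from the post:
print(data.decode('big5'))    # the Chinese-aware program's view: 密碼
print(data.decode('latin1'))  # the Latin-1 program's view: ±K½X
```

So no conversion step is needed: writing the bytes out unchanged and telling each program which encoding to assume gives both "views" of the file.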