How do I figure out what encoding was used to produce some garbled Chinese text?

I have some text that was translated from English into Simplified Chinese. However, when I received the file back, the characters were garbled. For example, one line reads "ÎªÁËÓÐÐ§¡¢¸ßÐ§µØÊµÏÖÄ¿±ê£¬Äú×îÐèÒªµÄÊÇÊ²Ã´£¿" instead of the Chinese characters I would expect.

I've tried pasting the above string into a Python interpreter, converting it to Unicode, and decoding with various Chinese character sets, to no avail. Does anyone have insight on this? Thank you.

There is 1 answer

Josh Lee On BEST ANSWER

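chardet can detect this. The garbled text is what you get when GB2312 bytes are decoded as Latin-1, so encode it back to Latin-1 to recover the raw bytes, let chardet guess the encoding, then decode:

>>> import chardet
>>> s = "ÎªÁËÓÐÐ§¡¢¸ßÐ§µØÊµÏÖÄ¿±ê£¬Äú×îÐèÒªµÄÊÇÊ²Ã´£¿"
>>> raw = s.encode('latin-1')  # undo the wrong decode to get the original bytes
>>> chardet.detect(raw)
{'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
>>> raw.decode('gb2312')
'为了有效、高效地实现目标，您最需要的是什么？'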
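
If chardet is not available, a brute-force check may also work: re-encode the garbled text with the encoding you suspect was wrongly applied, then try a short list of candidate encodings and see which ones decode cleanly. The following is only a rough sketch under that assumption. The helper name `try_decodings` and the candidate list are made up for illustration and are not part of chardet or the answer above.

```python
def try_decodings(garbled, wrong_encoding="latin-1",
                  candidates=("gb2312", "gbk", "gb18030", "big5", "utf-8")):
    """Return {encoding: text} for each candidate that decodes without error."""
    # Undo the mistaken decode to get back the original bytes.
    # Assumption: the text was wrongly decoded as `wrong_encoding`.
    raw = garbled.encode(wrong_encoding)
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; skip it.
            pass
    return results


# Build a garbled sample the same way the file was damaged:
# GB2312 bytes mistakenly decoded as Latin-1.
garbled = "为了有效".encode("gb2312").decode("latin-1")
for encoding, text in try_decodings(garbled).items():
    print(encoding, repr(text))
```

A clean decode does not prove the encoding is right, since several encodings can accept the same bytes and produce different text, so the candidates still need to be read by someone who knows the language.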