I have one source of data that I don't control, and it sends strings in different encodings; I have no way to know the encoding in advance! I need to detect the encoding so that I can decode each string correctly and store it in a format that I understand and control, let's say UTF-8.
For example:
- "J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"
should read
- "J'ai déjà un problème, après... je ne sais pas"
What I have tried:
> stringToTest="J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"
# there is no decode for string, directly, but one can try
> stringToTest.encode().decode()
"J'ai dÃ©jÃ\xa0 un problÃ¨me, aprÃ¨s... je ne sais pas"
# which does not help :)
From trial and error, I found that the encoding is 'iso-8859-1'
> stringToTest.encode('iso-8859-1').decode()
"J'ai déjà un problème, après... je ne sais pas"
What I want/need is to find the 'iso-8859-1' automatically!
I tried to use chardet!
> import chardet
> chardet.detect(stringToTest)
Traceback (most recent call last):
File "/snap/pycharm-community/188/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<input>", line 1, in <module>
File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 34, in detect
'{0}'.format(type(byte_str)))
TypeError: Expected object of type bytes or bytearray, got: <class 'str'>
But since it is a string, chardet does not accept it! And, I am ashamed to admit, I can't manage to convert the string into something that chardet accepts!
> test1=b"J'ai déjà un problème, après... je ne sais pas"
File "<input>", line 1
SyntaxError: bytes can only contain ASCII literal characters.
# Ok str and unicode are similar things... but who knows?!?!
> test1=u"J'ai déjà un problème, après... je ne sais pas"
> chardet.detect(test1)
Traceback (most recent call last):
File "/snap/pycharm-community/188/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<input>", line 1, in <module>
File "/usr/lib/python3/dist-packages/chardet/__init__.py", line 34, in detect
'{0}'.format(type(byte_str)))
TypeError: Expected object of type bytes or bytearray, got: <class 'str'>
# NOP
> bytes(stringToTest)
Traceback (most recent call last):
File "/snap/pycharm-community/188/plugins/python-ce/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<input>", line 1, in <module>
TypeError: string argument without an encoding
Why not unidecode?!?
from unidecode import unidecode
unidecode(stringToTest)
'J\'ai dA(c)jA un problA"me, aprA"s... je ne sais pas'
The string in the question is an example of mojibake: encoded text (bytes) which has been decoded with the wrong encoding. In this particular case, the text was originally encoded as UTF-8 and then wrongly decoded as ISO-8859-1 (latin-1). Re-encoding the string as ISO-8859-1 recreates the UTF-8 bytes, and decoding those bytes from UTF-8 (the default in Python 3) produces the expected result.
If you are receiving these mojibake strings from an external source, you can safely encode them using ISO-8859-1 to recreate the original bytes, provided they were mis-decoded as ISO-8859-1 in the first place. The bytes (the encoded text) may then be passed to chardet for analysis.
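A minimal sketch of that round trip, assuming the strings were mis-decoded as ISO-8859-1. The helper name fix_mojibake is hypothetical (it is not part of any library), and the chardet call at the end is optional and only runs if the package is installed:

```python
def fix_mojibake(text, wrong_encoding="iso-8859-1", right_encoding="utf-8"):
    """Undo one round of wrong decoding (hypothetical helper).

    Returns the text unchanged if it cannot be re-encoded with
    wrong_encoding or the resulting bytes are not valid right_encoding.
    """
    try:
        raw = text.encode(wrong_encoding)   # recreate the original bytes
        return raw.decode(right_encoding)   # decode them the right way
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


expected = "J'ai déjà un problème, après... je ne sais pas"

# Reproduce the mojibake: UTF-8 bytes wrongly decoded as ISO-8859-1.
garbled = expected.encode("utf-8").decode("iso-8859-1")

# Encoding with ISO-8859-1 gives back the original UTF-8 bytes.
raw = garbled.encode("iso-8859-1")

repaired = fix_mojibake(garbled)
print(repaired)

# chardet wants bytes, not str, so pass it the recreated bytes.
try:
    import chardet
    print(chardet.detect(raw))
except ImportError:
    pass  # chardet is a third-party package and may not be installed
```

The try/except in the helper is a design choice for mixed input: a string that was never garbled usually fails the UTF-8 decode step and is returned as-is, but this is a heuristic, not a guarantee.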