Conversion between ISO-8859-2 and UTF-8 in Python


I'm wondering how I can convert ISO-8859-2 (Latin-2) characters (I mean integer or hex values that represent ISO-8859-2 encoded characters) to UTF-8 characters.

What I need to do in my Python project:

  1. Receive hex values from a serial port; they are characters encoded in ISO-8859-2.
  2. Decode them, that is, get "standard" Python unicode strings from them.
  3. Prepare and write an XML file.

Using Python 3.4.3

>>> txt_str = "ąęłóźć"
>>> txt_str.decode('ISO-8859-2')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'

The main problem is still preparing valid input for the "decode" method (it works in Python 2.7.10, and that's the one I'm using in this project). How do I prepare a valid string from decimal values, which are Latin-2 code numbers?

Note that it would be extremely complicated to receive UTF-8 characters from the serial port, because of the devices I'm using and communication protocol limitations.

Sample data, on request:

68632057
62206A75
7A647261
B364206F
20616775
777A616E
616A2061
6A65696B
617A20B6
697A7970
6A65B361
70697020
77F36469
62202C79
6E647572
75206A65
7963696C
72656D75
6A616E20
73726F67
206A657A
65647572
77207972
73772065
00000069

This is some sample data: ISO-8859-2 characters packed into uint32 values, 4 chars per int.

The bit of code that handles the unpacking:

l = l[7:].replace(",", "").replace(".", "").replace("\n","").replace("\r","") # crop string from uart, only data left
vl = [l[0:2], l[2:4], l[4:6], l[6:8]] # list of bytes
vl = vl[::-1] # reverse them - now in actual order

To get integer values out of the hex strings I can simply use:

int_vals = [int(hs, 16) for hs in vl]
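
Given a list of integers like this, one possible way to get a text string is to pack them into a bytearray and decode that. This is only a sketch: the values below are the bytes of the sample word B364206F, already reversed into reading order, and the approach is assumed to run unchanged on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: turn Latin-2 code numbers into a text string.
# int_vals holds the bytes of the sample word B364206F in reading order.
int_vals = [0x6F, 0x20, 0x64, 0xB3]

# bytearray() accepts a list of ints on both Python 2.7 and Python 3;
# decode() maps each Latin-2 byte to its Unicode character.
text = bytearray(int_vals).decode('iso-8859-2')

print(repr(text))
```

The result is a unicode string (unicode in Python 2, str in Python 3), so it can be encoded to UTF-8 later with text.encode('utf-8').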

There are 3 answers

user2046193 On

This topic is closed. Working code that handles what needs to be done:

x = 177  # 0xB1, the ISO-8859-2 code for "ą"
x.to_bytes(1, byteorder='big').decode("ISO-8859-2")  # int.to_bytes is Python 3 only
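
The same int.to_bytes call can also unpack a whole 32-bit word from the sample data in the question in one step, since asking for little-endian byte order puts the lowest byte first. A minimal sketch, assuming Python 3; decode_word is a hypothetical helper name, not part of any library:

```python
def decode_word(hex_word, encoding="iso-8859-2"):
    """Decode one 8-digit hex word that holds four characters, lowest byte first."""
    value = int(hex_word, 16)
    # Little-endian output order matches the byte reversal done by hand in the question.
    raw = value.to_bytes(4, byteorder="little")
    return raw.decode(encoding)


print(decode_word("68632057"))  # W ch
print(decode_word("B364206F"))  # o dł
```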
Alastair McCormack On

Your example doesn't work because you've tried to use a str to hold bytes. In Python 3 you must use byte strings.
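
To illustrate the difference, here is a small sketch (assuming Python 3 and a UTF-8 source file) that goes from a str to ISO-8859-2 bytes and back, then on to UTF-8. Only the bytes object has a decode method:

```python
txt_str = "ąęłóźć"

# str -> bytes: one byte per character in ISO-8859-2
latin2_bytes = txt_str.encode("iso-8859-2")
print(latin2_bytes)  # b'\xb1\xea\xb3\xf3\xbc\xe6'

# bytes -> str: this is the call that failed on the str object
decoded = latin2_bytes.decode("iso-8859-2")

# str -> UTF-8 bytes: two bytes per character for these letters
utf8_bytes = decoded.encode("utf-8")
print(len(latin2_bytes), len(utf8_bytes))  # 6 12
```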

In reality, if you're using PySerial then you'll be reading byte strings anyway, which you can convert as required:

import serial  # PySerial

with serial.Serial('/dev/ttyS1', 19200, timeout=1) as ser:
    s = ser.read(10)
    # Py3: s == bytes
    # Py2.x: s == str
    my_unicode_string = s.decode('iso-8859-2')

If your ISO-8859-2 data is actually sent as an ASCII hex representation of the bytes, then you have to apply an extra layer of decoding:

import codecs
import serial  # PySerial

with serial.Serial('/dev/ttyS1', 19200, timeout=1) as ser:
    hex_repr = ser.read(10)
    # Py3: hex_repr == bytes
    # Py2.x: hex_repr == str

    # Decodes hex representation to bytes
    # E.g. b"A3" -> b'\xa3'
    hex_decoded = codecs.decode(hex_repr, "hex")
    my_unicode_string = hex_decoded.decode('iso-8859-2')

Now you can pass my_unicode_string to your favourite XML library.
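
As a rough sketch of that last step, the standard library's xml.etree.ElementTree can serialize the string as UTF-8. The element name "message", the file name output.xml and the stand-in value of my_unicode_string are assumptions made for the example only:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Stand-in for text decoded from the serial port.
my_unicode_string = b"d\xb3uga nazwa".decode("iso-8859-2")

root = ET.Element("message")  # hypothetical element name
root.text = my_unicode_string

# With an explicit encoding, tostring() returns bytes encoded as UTF-8.
xml_bytes = ET.tostring(root, encoding="utf-8")

# Hypothetical output location; binary mode because xml_bytes is already encoded.
path = os.path.join(tempfile.gettempdir(), "output.xml")
with open(path, "wb") as f:
    f.write(xml_bytes)
```

The conversion to UTF-8 happens at serialization time, so the string itself never needs to be re-encoded by hand.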

Mark Tolonen On

Interesting sample data. Ideally your sample data should be a direct print of the raw data received from PySerial. If you actually are receiving the raw bytes as 8-digit hexadecimal values, then:

#!python3
from binascii import unhexlify
data = b''.join(unhexlify(x)[::-1] for x in b'''\
68632057
62206A75
7A647261
B364206F
20616775
777A616E
616A2061
6A65696B
617A20B6
697A7970
6A65B361
70697020
77F36469
62202C79
6E647572
75206A65
7963696C
72656D75
6A616E20
73726F67
206A657A
65647572
77207972
73772065
00000069'''.splitlines())

print(data.decode('iso-8859-2'))

Output:

W chuj bardzo długa nazwa jakiejś zapyziałej pipidówy, brudnej ulicyumer najgorszej rudery we wsi

Google Translate of Polish to English:

The dick very long name some zapyziałej Small Town , dirty ulicyumer worst hovel in the village
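
One detail in the sample data: the last word, 00000069, is padded with zero bytes, so the decoded string ends with three NUL characters that do not show up when printed. NUL is not a legal character in XML, so it is probably worth stripping them before writing the file. A minimal sketch, assuming Python 3 and using only the last two sample words:

```python
from binascii import unhexlify

words = ["73772065", "00000069"]  # last two words of the sample data

# Reverse each 4-byte word into reading order, then join and decode.
raw = b"".join(unhexlify(w)[::-1] for w in words)
text = raw.decode("iso-8859-2")
print(repr(text))   # 'e wsi\x00\x00\x00'

# Drop the padding left over from the final, partly filled word.
clean = text.rstrip("\x00")
print(repr(clean))  # 'e wsi'
```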