Unpack fixed-width Unicode file lines with special characters: Python UnicodeDecodeError


I am trying to parse each line of a database file to get it ready for import. It has fixed-width lines, but the widths are in characters, not in bytes. I have coded something based on Martineau's answer, but I am having trouble with the special characters.

Sometimes they break the expected width, and other times they throw a UnicodeDecodeError. I believe the decode error could be fixed, but can I keep using struct.unpack and still decode the special characters correctly? I think the problem is that they are encoded as multiple bytes, which throws off the expected field widths, since struct counts bytes and not characters.
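To illustrate the mismatch (a quick interpreter check of my own, not from the file): an accented character occupies one character but two bytes in UTF-8, so each one shifts the byte offsets by one.

>>> len("ô")                  # one character...
1
>>> "ô".encode("utf-8")       # ...but two bytes in UTF-8
b'\xc3\xb4'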

import os, csv

def ParseLine(arquivo):
    import struct
    # "1x" skips each '|' delimiter; the fields are 12, 18 and 16 bytes wide
    format = "1x 12s 1x 18s 1x 16s"
    expand = struct.Struct(format).unpack_from
    # encode the str line to bytes for struct, then decode each field back to str
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
    for line in arquivo:
        fields = unpack(line)
        yield [x.strip() for x in fields]

Caminho = r"C:\Sample"
os.chdir(Caminho)

with open("Sample data.txt", 'r') as arq:
    with open("Out.csv", "w", newline='') as sai:
        Write = csv.writer(sai, delimiter=";", quoting=csv.QUOTE_MINIMAL).writerows
        for line in ParseLine(arq):
            Write([line])

Sample data:

|     field 1|      field 2     |     field 3    |
| sreaodrsa  | raesodaso t.thl o| .tdosadot. osa |
| resaodra   | rôn. 2x  17/220V | sreao.tttra v  |
| esarod sê  | raesodaso t.thl o| .tdosadot. osa |
| esarod sa í| raesodaso t.thl o| .tdosadot. osa |

Actual output:

field 1;field 2;field 3
sreaodrsa;raesodaso t.thl o;.tdosadot. osa
resaodra;rôn. 2x  17/22;V | sreao.tttra

In the output, lines 1 and 2 are as expected. Line 3 has the wrong field widths, probably due to the multibyte ô. Line 4 throws the following exception:

Traceback (most recent call last):
  File "C:\Sample\FindSample.py", line 18, in <module>
    for line in ParseLine(arq):
  File "C:\Sample\FindSample.py", line 9, in ParseLine
    fields = unpack(line)
  File "C:\Sample\FindSample.py", line 7, in <lambda>
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
  File "C:\Sample\FindSample.py", line 7, in <genexpr>
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 11: unexpected end of data
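The mechanism seems to be that the fixed byte count can land in the middle of a multibyte sequence, leaving a dangling lead byte that cannot be decoded. A quick reconstruction using field 1 of the last sample row (the byte value reported for my real file differs, but the failure mode is the same):

>>> field = " esarod sa í"         # 12 characters, but 13 bytes in UTF-8
>>> field.encode("utf-8")
b' esarod sa \xc3\xad'
>>> field.encode("utf-8")[:12]     # a 12-byte struct field cuts the 2-byte 'í' in half
b' esarod sa \xc3'
>>> field.encode("utf-8")[:12].decode()
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 11: unexpected end of data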

I will need to perform specific operations on each field, so I can't use a re.sub on the whole file as I was doing before. I would like to keep this code, as it seems efficient and is on the brink of working. If there is a much more efficient way to parse, though, I could give it a try. I need to keep the special characters.

1 Answer

Martijn Pieters (accepted answer):

Indeed, the struct approach falls down here because it expects fields to be a fixed number of bytes wide, while your format uses a fixed number of codepoints.

I'd not use struct here at all. Your lines are already decoded to Unicode values; just use slicing to extract your data:

def ParseLine(arquivo):
    # character offsets of the three fields, skipping the '|' delimiters
    slices = [slice(1, 13), slice(14, 32), slice(33, 49)]
    for line in arquivo:
        yield [line[s].strip() for s in slices]

This deals entirely in characters in an already decoded line, rather than bytes. If you have field widths instead of indices, the slice() objects could also be generated:

def widths_to_slices(widths):
    pos = 0
    for width in widths:
        pos += 1  # delimiter
        yield slice(pos, pos + width)
        pos += width

def ParseLine(arquivo):
    widths = (12, 18, 16)
    for line in arquivo:
        yield [line[s].strip() for s in widths_to_slices(widths)]
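A quick sanity check (not part of the original answer) that the generated slices match the hand-written ones above:

>>> [(s.start, s.stop) for s in widths_to_slices((12, 18, 16))]
[(1, 13), (14, 32), (33, 49)]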

Demo:

>>> sample = '''\
... |     field 1|      field 2     |     field 3    |
... | sreaodrsa  | raesodaso t.thl o| .tdosadot. osa |
... | resaodra   | rôn. 2x  17/220V | sreao.tttra v  |
... | esarod sê  | raesodaso t.thl o| .tdosadot. osa |
... | esarod sa í| raesodaso t.thl o| .tdosadot. osa |
... '''.splitlines()
>>> def ParseLine(arquivo):
...     slices = [slice(1, 13), slice(14, 32), slice(33, 49)]
...     for line in arquivo:
...         yield [line[s].strip() for s in slices]
... 
>>> for line in ParseLine(sample):
...     print(line)
... 
['field 1', 'field 2', 'field 3']
['sreaodrsa', 'raesodaso t.thl o', '.tdosadot. osa']
['resaodra', 'rôn. 2x  17/220V', 'sreao.tttra v']
['esarod sê', 'raesodaso t.thl o', '.tdosadot. osa']
['esarod sa í', 'raesodaso t.thl o', '.tdosadot. osa']
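To plug this back into the question's script, only ParseLine needs to change; opening the files with an explicit encoding= (assuming the data is UTF-8, as the traceback suggests) is also a good idea, since the default otherwise depends on the platform. A sketch under those assumptions:

import csv
import os

def ParseLine(arquivo):
    # slice the already-decoded str line by character positions
    slices = [slice(1, 13), slice(14, 32), slice(33, 49)]
    for line in arquivo:
        yield [line[s].strip() for s in slices]

Caminho = r"C:\Sample"
os.chdir(Caminho)

# encoding="utf-8" is an assumption; adjust to match the actual file
with open("Sample data.txt", "r", encoding="utf-8") as arq:
    with open("Out.csv", "w", newline="", encoding="utf-8") as sai:
        writer = csv.writer(sai, delimiter=";", quoting=csv.QUOTE_MINIMAL)
        for fields in ParseLine(arq):
            writer.writerow(fields)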