I am trying to parse each line of a database file to get it ready for import. The file has fixed-width lines, but the widths are in characters, not bytes. I have coded something based on Martineau's answer, but I am having trouble with the special characters.
Sometimes they break the expected width; at other times they just throw a UnicodeDecodeError. I believe the decode error could be fixed, but can I keep using this struct.unpack
and still decode the special characters correctly? I think the problem is that they are encoded as multiple bytes, which throws off the expected field widths, since those are in bytes rather than characters.
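The byte/character mismatch is easy to see in isolation (this snippet is just an illustration, not part of my script):

```python
s = "rôn. 2x 17/220V"              # field 2 from the sample data below
print(len(s))                      # 15 characters
print(len(s.encode("utf-8")))      # 16 bytes: "ô" encodes to two bytes
```

Here is my code: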
import os, csv

def ParseLine(arquivo):
    import struct, string
    format = "1x 12s 1x 18s 1x 16s"
    expand = struct.Struct(format).unpack_from
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
    for line in arquivo:
        fields = unpack(line)
        yield [x.strip() for x in fields]

Caminho = r"C:\Sample"
os.chdir(Caminho)
with open("Sample data.txt", 'r') as arq:
    with open("Out" + ".csv", "w", newline='') as sai:
        Write = csv.writer(sai, delimiter=";", quoting=csv.QUOTE_MINIMAL).writerows
        for line in ParseLine(arq):
            Write([line])
Sample data:
| field 1| field 2 | field 3 |
| sreaodrsa | raesodaso t.thl o| .tdosadot. osa |
| resaodra | rôn. 2x 17/220V | sreao.tttra v |
| esarod sê | raesodaso t.thl o| .tdosadot. osa |
| esarod sa í| raesodaso t.thl o| .tdosadot. osa |
Actual output:
field 1;field 2;field 3
sreaodrsa;raesodaso t.thl o;.tdosadot. osa
resaodra;rôn. 2x 17/22;V | sreao.tttra
In the output we see that lines 1 and 2 are as expected. Line 3 has the wrong widths, probably due to the multibyte ô. Line 4 throws the following exception:
Traceback (most recent call last):
  File "C:\Sample\FindSample.py", line 18, in <module>
    for line in ParseLine(arq):
  File "C:\Sample\FindSample.py", line 9, in ParseLine
    fields = unpack(line)
  File "C:\Sample\FindSample.py", line 7, in <lambda>
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
  File "C:\Sample\FindSample.py", line 7, in <genexpr>
    unpack = lambda line: tuple(s.decode() for s in expand(line.encode()))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 11: unexpected end of data
I will need to perform specific operations on each field, so I can't use a re.sub
on the whole file as I was doing before. I would like to keep this code, as it seems efficient and is on the brink of working. If there is a much more efficient way to parse, though, I could give it a try. I need to keep the special characters.
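For what it's worth, the exception can be reproduced in isolation: any fixed byte offset that lands inside a multibyte UTF-8 sequence fails to decode in exactly this way (a minimal illustration; the string here is made up):

```python
data = "esarod sê x".encode("utf-8")  # "ê" occupies two bytes (0xC3 0xAA)
chunk = data[:9]                      # a fixed byte width ending inside "ê"
try:
    chunk.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)  # ... unexpected end of data
```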
Indeed, the struct approach falls down here because it expects fields to be a fixed number of bytes wide, while your format uses a fixed number of codepoints. I'd not use struct here at all. Your lines are already decoded to Unicode values, so just use slicing to extract your data. This deals entirely in characters in an already decoded line, rather than bytes. If you have field widths instead of indices, the slice() objects could also be generated from those widths.
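The answer's original code was lost when this was copied, so the following is a reconstruction of the slicing approach it describes. The character widths (12, 18, 16, each preceded by one "|" separator) are taken from the question's struct format string; the helper names and the demo line are my own:

```python
def make_slices(widths, sep=1):
    # Build slice() objects from character widths, skipping `sep`
    # separator characters ("|") before each field.
    slices = []
    pos = 0
    for width in widths:
        pos += sep                       # skip the leading "|"
        slices.append(slice(pos, pos + width))
        pos += width
    return slices

fieldslices = make_slices([12, 18, 16])  # widths from "1x 12s 1x 18s 1x 16s"

def ParseLine(arquivo):
    # Each line is already a decoded str, so slicing counts characters;
    # multibyte characters like "ô" no longer shift the field boundaries.
    for line in arquivo:
        yield [line[s].strip() for s in fieldslices]

# Demo with a line padded to the fixed character widths:
line = "|{:<12}|{:<18}|{:<16}|".format("resaodra", "rôn. 2x 17/220V", "sreao.tttra v")
print(next(ParseLine([line])))
# ['resaodra', 'rôn. 2x 17/220V', 'sreao.tttra v']
```

Because str.format pads by character count, the "ô" in field 2 no longer disturbs the widths the way it did with byte-based unpacking.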