Unicode decode error with .csv file


I have what is probably a really basic Python question.

I'm trying to write a script that eliminates a bunch of blank rows in some .csv files, and the script I've written works on about 90% of my files, but a few throw the following error at me:

Traceback (most recent call last):
  File "/Users/stephensmith/Documents/Permits/deleterows.py", line 17, in <module>
    deleteRow(file, "output/" + file)
  File "/Users/stephensmith/Documents/Permits/deleterows.py", line 8, in deleteRow
    for row in csv.reader(input):
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/encodings/utf_8_sig.py", line 69, in _buffer_decode
    return codecs.utf_8_decode(input, errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 6540: invalid start byte

Here's my code:

import csv
import os

def deleteRow(in_fnam, out_fnam):
    input = open(in_fnam, 'r')
    output = open(out_fnam, 'w')
    writer = csv.writer(output)
    for row in csv.reader(input):
        if any(row):
            writer.writerow(row)
    input.close()
    output.close()

for file in os.listdir("/Users/stephensmith/Documents/Permits/"):
    print(file)
    if file.endswith(".csv"):
        deleteRow(file, "output/" + file)

I've tried adding encoding='utf-8', encoding='ascii', and encoding='latin1' to both of my open() statements, but no luck. :-( Any idea what I'm doing wrong? The .csv files were created with Excel for Mac 2011, if that helps at all.


There are 3 answers

triphook

Perhaps you could try looping through the csv files that are crashing with something like:

with open(file, 'rb') as f:  # binary mode, so the undecodable byte can't crash the read
    for line in f:
        print(repr(line))

to see if any suspicious characters are popping up.

If you can identify the suspicious character this way, say the byte \xa2 from your traceback keeps popping up, you can clean the file by rewriting it with that character removed:

with open(file, 'rb') as f:
    # open the output for writing; file[:-len(".csv")] drops the extension
    # (rstrip(".csv") would wrongly strip any trailing c/s/v/. characters)
    with open(file[:-len(".csv")] + "_fixed.csv", 'wb') as g:
        for line in f:
            g.write(line.replace(b'\xa2', b''))

then try again with the cleaned file.

Serge Ballesta

It is an encoding issue. The input csv file is not utf-8 encoded, which is what your Python platform expects by default. The problem is that without knowing the encoding, or having an example of an offending line, I really cannot guess it.

It is normal that encoding='utf8' and encoding='ascii' both break, because the offending byte is 0xa2, which is outside the ascii range (<= 0x7f) and not a valid utf-8 start byte. But it is really weird that encoding='latin1' gives the same error at the same place, because 0xa2 is ¢ in latin1 and should decode without complaint.

IMHO, per this other SO post, you could try encoding='windows-1252', if it is supported by your platform.

If it still does not work, you should try to identify the offending lines for latin1:

import sys

class special_opener:
    def __init__(self, filename, encoding):
        self.fd = open(filename, 'rb')  # binary mode: decode manually, line by line
        self.encoding = encoding
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        self.fd.close()
        return False
    def close(self):
        # needed because deleteRow() below calls input.close()
        self.fd.close()
    def __next__(self):
        line = next(self.fd)
        try:
            return line.decode(self.encoding).strip('\r\n') + '\n'
        except Exception as e:
            print("Offending line : ", line, file=sys.stderr)
            raise e
    def __iter__(self):
        return self

def deleteRow(in_fnam, out_fnam):
    input = special_opener(in_fnam, 'latin1')
    output = open(out_fnam, 'w')
    writer = csv.writer(output)
    for row in csv.reader(input):
        if any(row):
            writer.writerow(row)
    input.close()
    output.close()

The special_opener should output something like:

Offending line :  b'a,\xe9,\xe8,d\r\n'
Traceback (most recent call last):
    ...

(this line is valid latin1, I got it with special_opener(file, 'utf8'))

Then you will be able to post the offending line here.

ritiek

I had a similar issue. In my case the csv wasn't very large, and opening it in LibreOffice Calc and saving it back fixed it.