Using encode UTF-8 on File.read()


I am trying to upload a CSV file to a PostgreSQL database, and it fails with the error shown at the end of my question. The reason is that the file contains non-ASCII characters and is encoded in Windows-1252.

This is the line where I decode the file as UTF-8. However, I would like to accept any encoding and either convert it to UTF-8 or set the encoding to UTF-8 when reading the file, and then decode with the line below. I am not using open() because I had problems with it; instead I am using InMemoryUploadedFile.read() (https://docs.djangoproject.com/en/2.2/ref/files/uploads/#django.core.files.uploadedfile.UploadedFile.read)

csv_file.seek(0)
file = csv_file.read().decode('utf-8').splitlines()
reader = csv.reader(file)

This is the error, and it is caused by this word: dümpe

'utf-8' codec can't decode byte 0xb3 in position 13969: invalid start byte

Any help would be appreciated.

2 Answers

1
AKX On Best Solutions

You can use the errors parameter of .decode() to either drop bytes that cannot be decoded (errors='ignore') or substitute a replacement character for them (errors='replace').

csv_file.seek(0)
file = csv_file.read().decode('utf-8', errors='ignore').splitlines()
reader = csv.reader(file)

It would be better, of course, to fix the original file so that it is actual, correct UTF-8.
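To make the difference between the two error handlers concrete, here is a small sketch. The sample bytes are made up for illustration and assume the text was saved as Windows-1252 (which Python calls cp1252):

```python
# Hypothetical sample: "dümpe" as a Windows-1252 editor would save it.
raw = "dümpe".encode("cp1252")   # b'd\xfcmpe'

# 'ignore' silently drops the byte that is not valid UTF-8.
ignored = raw.decode("utf-8", errors="ignore")

# 'replace' substitutes U+FFFD, so the damage stays visible.
replaced = raw.decode("utf-8", errors="replace")

print(ignored)         # dmpe
print(repr(replaced))  # 'd\ufffdmpe'
```

Either way the original character is lost, which is why fixing the file or decoding it with its real encoding is preferable.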

1
Antonis Christofides On

Python's bytes is a series of bytes, whereas str is a string of characters. This means that each item of a bytes object is a byte; whereas each item of a string object is a character.

This:

s = "dümpe"

creates a string of characters s. The second character of s, i.e. s[1], is ü.

Now I hear you wondering: the second character of s is ü assuming what encoding? You are asking the wrong question. Strings of characters are strings of characters, not strings of bytes. Strings do not have an encoding, they are just strings of characters.

Of course internally Python holds strings in an internal representation, but you don't need to care about that any more than you need to care about how it stores the number 3.14159. This is an implementation detail.

When you tell Python some_bytes_object.decode('utf-8'), this means "take this sequence of bytes, assume it is a string encoded in UTF-8, and get me that string".
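As a quick illustration of that distinction, the same string can be turned into different byte sequences depending on the encoding, and decoding each with the matching encoding gives the same string back. The variable names are only for this example:

```python
s = "dümpe"                          # a str: five characters

utf8_bytes = s.encode("utf-8")       # six bytes; ü takes two bytes
cp1252_bytes = s.encode("cp1252")    # five bytes; ü takes one byte

print(utf8_bytes)     # b'd\xc3\xbcmpe'
print(cp1252_bytes)   # b'd\xfcmpe'

# Decoding with the encoding that was actually used recovers the string.
assert utf8_bytes.decode("utf-8") == s
assert cp1252_bytes.decode("cp1252") == s
```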

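In your case, all you need to do is .decode('windows-1252') (Python also accepts the alias 'cp1252'; 'win-1252' is not a recognized codec name and raises LookupError). If you want your program to accept any kind of encoding, you need to find a way for your program to get the information about what encoding each file has.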
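If the encoding cannot be obtained from the uploader, one common workaround is to try a short list of candidates in order. The sketch below is a heuristic, not real detection: the helper name decode_upload and the candidate list are assumptions for this example. UTF-8 is tried first because Windows-1252 text containing non-ASCII characters almost never happens to be valid UTF-8, and cp1252 is Python's name for Windows-1252.

```python
import csv

def decode_upload(raw, encodings=("utf-8-sig", "cp1252")):
    """Decode bytes with the first candidate encoding that succeeds.

    Hypothetical helper: the candidate list is a guess, so a file in
    some other encoding may decode without error but yield wrong text.
    """
    for name in encodings:
        try:
            # 'utf-8-sig' also strips a leading byte order mark, if any.
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")

# Made-up sample standing in for csv_file.read().
sample = b"name\r\nd\xfcmpe\r\n"
rows = list(csv.reader(decode_upload(sample).splitlines()))
print(rows)  # [['name'], ['dümpe']]
```

In the view from the question this would be used as csv.reader(decode_upload(csv_file.read()).splitlines()).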

If this explanation is not clear enough, my series of blog posts on demystifying encodings can help.