Using Chardet to find encoding of very large file


I'm trying to use Chardet to deduce the encoding of a very large file (>4 million rows) in tab-delimited format.

At the moment, my script struggles, presumably due to the size of the file. I'd like to narrow it down to loading only the first x rows of the file, but I've had difficulty when trying to use readline().

The script as it stands is:

import chardet
import os

filepath = os.path.join(r"O:\Song Pop\01 Originals\2017\FreshPlanet_SongPop_0517.txt")
rawdata = open(filepath, 'rb').readline()

print(rawdata)
result = chardet.detect(rawdata)
print(result)

It works, but it only reads the first line of the file. My foray into using simple loops to call readline() more than once didn't work so well (perhaps because the script opens the file in binary mode).

The output for a single line is {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

I was wondering whether increasing the number of lines it reads would improve the encoding confidence.
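To illustrate, something along these lines is what I'm aiming for (a rough sketch only; 50 is an arbitrary sample size):

import chardet
from itertools import islice

# iterating a binary file yields one bytes line per iteration,
# so this joins the first 50 rows into a single bytes sample
with open(filepath, 'rb') as f:
    sample = b''.join(islice(f, 50))
print(chardet.detect(sample))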

Any help would be greatly appreciated.


There are 5 answers

Lovethenakedgun (best answer)

I'm by no means particularly experienced with Chardet, but came across this post while debugging an issue of my own, and was surprised that it didn't have any answers. Sorry if this is too late to be of any help to the OP, but for anyone else who stumbles across this:

I'm not sure whether reading in more of the file would improve the guessed encoding, but all you'd need to do to test it is:

import chardet

testStr = b''
count = 0
with open('Huge File!', 'rb') as f:
    line = f.readline()
    while line and count < 50:  # set based on how many lines you want to check
        testStr += line
        count += 1
        line = f.readline()
print(chardet.detect(testStr))

In my instance, I had a file that I believed contained multiple encodings, and wrote the following to test it line by line. Edit: I later found that a line-by-line approach also seems to cause Chardet to suggest some false positives.

import chardet

with open('Huge File!', 'rb') as f:
    line = f.readline()
    curChar = chardet.detect(line)
    print(curChar)
    while line:
        # print only when the detected encoding changes between lines
        if curChar != chardet.detect(line):
            curChar = chardet.detect(line)
            print(curChar)
        line = f.readline()
Luiz Mitidiero

Unfortunately, Chardet is too slow for files this size, and detecting from a sample of lines can produce false positives.

The best replacement for Chardet I have found is the charset-normalizer library. It offers a drop-in replacement for Chardet's detect, but it also takes a different approach: the best encoding is the one that actually decodes the content. It is easy to migrate to and seems to be the better approach.
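For illustration, a minimal sketch (the file name is a placeholder; from_path reads the file in chunks rather than loading it whole):

from charset_normalizer import from_path

# from_path returns a list of candidate matches; best() picks the
# most likely one, or None if nothing decodes cleanly
best_guess = from_path('Huge File!').best()
if best_guess is not None:
    print(best_guess.encoding)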

coderguy
import chardet

# filepath as defined in the question
with open(filepath, 'rb') as rawdata:
    # detect from the first 100,000 bytes instead of the whole file
    result = chardet.detect(rawdata.read(100000))
print(result)
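Sampling a fixed number of bytes keeps memory use flat no matter how large the file is; the trade-off is that an unusual byte sequence appearing after the sampled region can be missed.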
Alon Barad

Another example that avoids loading the file into memory, using the python-magic package:

import magic


def detect(file_path):
    # libmagic samples the start of the file, so the whole file
    # is never read into memory
    return magic.Magic(mime_encoding=True).from_file(file_path)
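Usage is then simply (the file name is a placeholder):

print(detect('example_file.csv'))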

Felix Martinez

Another example with UniversalDetector:

#!/usr/bin/env python
from chardet.universaldetector import UniversalDetector


def detect_encode(file):
    detector = UniversalDetector()
    with open(file, 'rb') as f:
        for row in f:
            detector.feed(row)
            # done is set once the detector is confident enough
            if detector.done:
                break

    detector.close()
    return detector.result


if __name__ == '__main__':
    print(detect_encode('example_file.csv'))

It breaks out of the loop as soon as confidence reaches 1.0, which makes it useful for very large files.