Memory error when working with large text corpus


I have a large text file (~450 MB: 129,000 lines, 457,000,000 characters). When I try to work with this file, after a while a MemoryError is raised. Here is my code:

docDict = {}
ind = 1

with open('somefile.txt',encoding='utf-8') as f:
    for line in f:
        data = line.split(' ')
        docDict[ind] = data
        ind+=1

I saw this, but I read the file line by line.


2 Answers

Alexis Tacnet

The MemoryError is raised because, even though you read your file line by line, you store its contents in your dictionary docDict, and therefore in memory.

I don't know what you plan to do with this dictionary, but I recommend processing each line right after you read it and keeping only the result, either in a variable (if the processing compresses the data a lot) or directly in a file or database.
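For example, something along these lines (the word count and the results.txt output file are just placeholders for whatever you actually want to compute and store):

with open('somefile.txt', encoding='utf-8') as f, \
     open('results.txt', 'w', encoding='utf-8') as out:
    for ind, line in enumerate(f, start=1):
        words = line.split(' ')
        # placeholder processing: here just the word count; replace with
        # whatever you actually need from each line
        result = len(words)
        # write the (much smaller) result out instead of keeping every
        # split line in memory
        out.write('{} {}\n'.format(ind, result))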

Hope I helped! See ya!

japreiss

To test the overhead of the data structure in your code, I wrote the following test program. It assumes your text file is N megabytes of ASCII with relatively short lines. (I had to change N from 450 to 150 after my physical memory ran out.)

import sys

MB = 1024 * 1024

# Python 2: one short ASCII line, repeated enough times to total `megs` megabytes
line = "the quick brown fox jumps over the lazy dog"
megs = 150
nlines = (megs * MB) / len(line)

# mimic the original code: map a line index to the list of split words
d = {}
for i in xrange(nlines):
    d[i] = line.split(' ')

# sizes as reported by sys.getsizeof (each call counts only the object itself,
# not the objects it references)
dict_size = sys.getsizeof(d)
list_size = sum(sys.getsizeof(a) for a in d.items())
item_size = sum(sum(sys.getsizeof(s) for s in a) for a in d.items())

print " dict:", dict_size / float(MB), "MB"
print "lists:", list_size / float(MB), "MB"
print "items:", item_size / float(MB), "MB"
print "total:", (dict_size + list_size + item_size) / float(MB), "MB"

with the result:

 dict: 192.00 MB
lists: 251.16 MB
items: 669.77 MB
total: 1112.9 MB

Watching Activity Monitor, I saw the Python process exceed 2 gigabytes of memory usage, so there is also some memory not accounted for; artifacts of the malloc implementation are one possibility.

I implemented the same program in C++:

#include <string>
#include <vector>
#include <unordered_map>

int main()
{
    int const MB = 1024 * 1024;

    // one short line and its pre-split words, standing in for line.split(' ')
    std::string const line = "the quick brown fox jumps over the lazy dog";
    std::vector<std::string> const split = {
        "the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"
    };

    int const megs = 150;
    int const nlines = (megs * MB) / line.size();

    // same structure as the Python dict: line index -> vector of word strings
    std::unordered_map<int, std::vector<std::string>> d;
    for (int i = 0; i < nlines; ++i) {
        d[i] = split;
    }
}

Compiled with clang++ -O3, this used about 1GB of memory. C++ doesn't have sys.getsizeof() so it requires a bit more work to break down the memory usage, and I didn't do that work.

Twice the memory of the equivalent C++ is actually a pretty good result for Python, so I'm removing my pre-edit comments about the CPython implementation.

I think your main problem is storing the line as an array of short strings. Is it possible for you to store the lines as whole strings and split them as needed, but not all at once?
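Roughly like this, just as a sketch (the split-on-demand at the end is only an illustration of the access pattern):

docDict = {}
with open('somefile.txt', encoding='utf-8') as f:
    for ind, line in enumerate(f, start=1):
        docDict[ind] = line            # one string per line, no list of words

# later, split only the line you actually need, when you need it
words = docDict[1].split(' ')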

What is the end goal of your program?