Why is my C module leaking memory?

500 views Asked by At

I'm reading lists from a large file, which I eventually want to store as array.arrays. Because

map(int, line.split())

is very slow, I wrote a small C module which does strtok and a faster version of atoi:

inline long
minhashTables_myatoi(const char* s)
{
    int r;
    for (r = 0; *s; r = r * 10 + *s++ - '0');
    return r;
}

static PyObject*
minhashTables_ints(PyObject *self, PyObject *args)
{
    char* s;
    Py_ssize_t slen;

    if(!PyArg_ParseTuple(args, "s#", &s, &slen))
        return NULL;

    long* buf = malloc(sizeof(long) * (slen+1)/2);

    const char* tok = strtok(s, " ");
    buf[0] = minhashTables_myatoi(tok);
    Py_ssize_t i;
    for(i = 1; (tok = strtok(NULL, " ")) != NULL; i++)
        buf[i] = minhashTables_myatoi(tok);

    Py_ssize_t buflen = i;
    PyObject* list = PyList_New(buflen);
    PyObject *o;
        for(i = 0; i < buflen; i++)
    {
        o = PyInt_FromLong(buf[i]);
        PyList_SET_ITEM(list, i, o);
    }
    free(buf);

    return list;
}

So my python script calls ints() with a string and passes it to the array.array constructor and saves the resulting array in a list.

My problem is, that now the script leaks memory, which it did not with the map instead of the ints() function, of course.

Also using my own version of Pythons int() using a C module does not leak memory.

Thanks for your help!

Edit: To valgrind the module I used this script:

import minhashTables

data = ' '.join(map(str, range(10)))
print 'start'
foo = minhashTables.ints(data)
del data
del foo
print 'stop'

And I run valgrind --tool=memcheck --leak-check=full --show-reachable=yes python test.py, but there so no output from valgrind between start and stop, through there are tons before and afterwards.

Edit: Code for confirming it's leaking: import minhashTables

for i in xrange(1000000000):
    data = ' '.join(map(str, range(10, 10000)))
    foo = minhashTables.ints(data)

I have to recreate the string, because strtok changes it. By the way, copying the string into another memory location doesn't change the behavior.

3

There are 3 answers

3
Simonne On

Try this

inline long
    minhashTables_myatoi(const char* s)
    {
        long result=0;
        while((*s)!='\0'){
            result = result * 10 + (*s- '0');
            s++;
        }
        return result;
    }
10
GrahamS On

I suggest you take a look at Valgrind - it is a very useful tool for getting to the bottom of memory leaks in C.

4
GrahamS On

Do you really need to malloc off space for all those longs?

I'm not familiar with the Python/C API, so this might be terrible advice, but can't you just iterate over the string and append each long you find onto the list as you go?

i.e. take this code:

static const char* const testString = "12 345  67  8 910 11 1213 141516, 1718";

int main()
{
    const char* i = testString;
    long parseLong = 0;
    int gotLong = 0;

    for (;*i;++i)
    {
        if ('0' <= *i && *i <= '9')
        {
            parseLong = (parseLong * 10) + (*i - '0');
            gotLong = 1;
        }
        else if (gotLong)
        {
            printf("Got: %d\n", parseLong);
            parseLong = 0;
            gotLong = 0;
        }
    }

    if (gotLong)
        printf("Got: %d\n", parseLong);
}

And then replace the printf with some suitable pythony-goodness like PyList_Append().

As well as avoiding malloc, using less memory and being able to safely operate directly on the constant Python string, this code also handles corner cases like empty strings, multiple spaces and other separators between numbers.


Edit: Counting the Longs If you want to count the number of longs first, so you can alloc the correct length of Python List then you could just add in something like this:

    for (i = testString;*i;++i)
    {
        const int isdigitoflong = isdigit(*i);

        if (!gotLong && isdigitoflong)
            longCount++;

        gotLong = isdigitoflong;
    }

Which should be relatively quick.


Edit 2: Better Parser
Here's a slightly better version of the parser above that's a bit more compact, doesn't need gotLong and doesn't have to repeat code to deal with the final long:

    for (i = testString;*i;++i)
    {
        if (isdigit(*i))
        {
            do {
                parseLong = (parseLong * 10) + (*i - '0');
            } while (*++i && isdigit(*i));

            printf("Got: %d\n", parseLong);
            parseLong = 0;
        }
    }