Why is this script slowing down per item with increased amount of input?


Consider the following program:

#!/usr/bin/env pypy

import json
import cStringIO
import sys

def main():
    BUFSIZE = 10240
    f = sys.stdin
    decoder = json.JSONDecoder()
    io = cStringIO.StringIO()

    do_continue = True
    while True:
        read = f.read(BUFSIZE)
        if len(read) < BUFSIZE:
            do_continue = False
        io.write(read)
        try:
            data, offset = decoder.raw_decode(io.getvalue())
            print(data)
            rest = io.getvalue()[offset:]
            if rest.startswith('\n'):
                rest = rest[1:]
            decoder = json.JSONDecoder()
            io = cStringIO.StringIO()
            io.write(rest)
        except ValueError, e:
            #print(e)
            #print(repr(io.getvalue()))
            continue
        if not do_continue:
            break

if __name__ == '__main__':
    main()

And here's a test case:

$ yes '{}' | pv | pypy  parser-test.py >/dev/null

As you can see, the script above slows down as you feed it more input. This also happens with CPython. I tried to profile the script using mprof and cProfile, but found no hint as to why. Does anybody have a clue?

There are 2 answers

d33tah (best answer)

Apparently the repeated string operations on the growing buffer slowed it down. Instead of:

        data, offset = decoder.raw_decode(io.getvalue())
        print(data)
        rest = io.getvalue()[offset:]
        if rest.startswith('\n'):
            rest = rest[1:]

It is better to do:

        data, offset = decoder.raw_decode(io.read())
        print(data)
        rest = io.getvalue()[offset:]
        io.truncate()
        io.write(rest)
        if rest.startswith('\n'):
            io.seek(1)
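
One thing worth checking in the original loop is that it decodes at most one document per read. With input like `yes '{}'`, each 10240-byte chunk adds thousands of documents while only one is removed, so the buffer that gets copied on every pass keeps growing. The following is a minimal sketch, not the original poster's code, that drains every complete document from the buffer before reading more. The names `iter_json_documents` and `chunks` are made up for illustration, it uses a plain string instead of cStringIO, and it assumes top-level values are objects or arrays (a bare number split across two chunks would be decoded too early).

```python
import json


def iter_json_documents(chunks):
    """Yield each JSON document found in an iterable of text chunks.

    Hypothetical helper for illustration: `chunks` is any iterable of
    strings, for example successive results of f.read(BUFSIZE).
    """
    decoder = json.JSONDecoder()
    buf = ''
    for chunk in chunks:
        buf += chunk
        pos = 0
        while True:
            # Skip newlines and other whitespace between documents.
            while pos < len(buf) and buf[pos].isspace():
                pos += 1
            if pos == len(buf):
                break
            try:
                # raw_decode accepts a start index, so no slice is
                # needed for each document.
                data, pos = decoder.raw_decode(buf, pos)
            except ValueError:
                # Incomplete document: keep it and wait for more input.
                break
            yield data
        # Slice once per chunk, keeping only the unparsed tail.
        buf = buf[pos:]
```

With a real stream, `chunks` could be something like `iter(lambda: sys.stdin.read(10240), '')`, which keeps calling `read` until it returns an empty string at end of input. An incomplete trailing document is silently dropped in this sketch.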

eleventhend

You may want to close your StringIO at the end of the iteration (after writing).

io.close()

The memory buffer of a StringIO is freed once it is closed; otherwise it stays allocated. This would explain why each additional input is slowing your script down.
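
As a small sketch of that suggestion, assuming the standard `io.StringIO` class (used here in place of cStringIO so the snippet runs on Python 3): anything still needed from the buffer has to be copied out before `close()`, because the contents are no longer readable afterwards. The variable names are illustrative only. Note that rebinding the name to a new StringIO already drops the last reference to the old one, so it is not certain that an explicit `close()` changes the timing.

```python
import io

buf = io.StringIO()
buf.write('{}\n{"partial"')

# Copy out whatever is still needed *before* closing; the first three
# characters ('{}' plus the newline) stand for an already parsed document.
leftover = buf.getvalue()[3:]
buf.close()  # the buffer's memory can be released from here on

# After close(), reading the contents is an error.
try:
    buf.getvalue()
    raised = False
except ValueError:
    raised = True
```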