Can reference counting be relied on to close a file in Python?


In this question "Generating an MD5 checksum of a file", I had this code:

import hashlib
def hashfile(afile, hasher, blocksize=65536):
    buf = afile.read(blocksize)
    while len(buf) > 0:
        hasher.update(buf)
        buf = afile.read(blocksize)
    return hasher.digest()

[(fname, hashfile(open(fname, 'rb'), hashlib.sha256())) for fname in fnamelst]

I was criticized for opening a file inside a list comprehension, and one person opined that with a long enough list I would run out of open file handles. Interfaces were suggested that had hashfile take a filename argument and use with internally, which would significantly reduce its flexibility.

Were these necessary? Was I really doing something that wrong?

Testing out this code:

#!/usr/bin/python3

import sys
from pprint import pprint # Pretty printing

class HereAndGone(object):
    def __init__(self, i):
        print("%d %x -> coming into existence." % (i, id(self)),
              file=sys.stderr)
        self.i_ = i
    def __del__(self):
        print("%d %x <- going away now." % (self.i_, id(self)),
              file=sys.stderr)

def do_nothing(hag):
    return id(hag)

l = [(i, do_nothing(HereAndGone(i))) for i in range(0, 10)]

pprint(l)

results in this output:

0 7f0346decef0 -> coming into existence.
0 7f0346decef0 <- going away now.
1 7f0346decef0 -> coming into existence.
1 7f0346decef0 <- going away now.
2 7f0346decef0 -> coming into existence.
2 7f0346decef0 <- going away now.
3 7f0346decef0 -> coming into existence.
3 7f0346decef0 <- going away now.
4 7f0346decef0 -> coming into existence.
4 7f0346decef0 <- going away now.
5 7f0346decef0 -> coming into existence.
5 7f0346decef0 <- going away now.
6 7f0346decef0 -> coming into existence.
6 7f0346decef0 <- going away now.
7 7f0346decef0 -> coming into existence.
7 7f0346decef0 <- going away now.
8 7f0346decef0 -> coming into existence.
8 7f0346decef0 <- going away now.
9 7f0346decef0 -> coming into existence.
9 7f0346decef0 <- going away now.
[(0, 139652050636528),
 (1, 139652050636528),
 (2, 139652050636528),
 (3, 139652050636528),
 (4, 139652050636528),
 (5, 139652050636528),
 (6, 139652050636528),
 (7, 139652050636528),
 (8, 139652050636528),
 (9, 139652050636528)]

It's obvious that each HereAndGone object is being created and destroyed as each element of the list comprehension is constructed. Python reference counting frees the object as soon as there are no references to it, which happens immediately after the value for that list element is computed.

Of course, maybe some other Python implementations don't do this. Is a Python implementation required to do some form of reference counting? The documentation of the gc module certainly makes it sound like reference counting is a core feature of the language.

And, if I did do something wrong, how would you suggest I re-write it to retain the succinct clarity of a list comprehension and the flexibility of an interface that works with anything that can be read like a file?

Omnifarious (accepted answer):

Someone pointed to the Data Model section of the Python Language Reference where it says very explicitly "Do not depend on immediate finalization of objects when they become unreachable (so you should always close files explicitly).". So, this makes it clear that the code is relying on behavior that's not guaranteed.
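
For instance, guaranteed closing can be had by wrapping each open in a with block around a call to the original hashfile. A minimal sketch, assuming the hashfile and fnamelst from the question (hash_one is a hypothetical helper name):

import hashlib

def hash_one(fname):
    # The with block closes the file deterministically on every Python
    # implementation, not just refcounting ones like CPython.
    with open(fname, 'rb') as afile:
        return hashfile(afile, hashlib.sha256())

hashes = [(fname, hash_one(fname)) for fname in fnamelst]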

And even if it were guaranteed, the code would still be fragile. It invisibly relies on the file never being referenced by a data structure that has a circular reference, or by one whose lifetime extends beyond the hashing of an individual file. Who knows what might happen to the code in the future, and whether someone will remember this key detail?
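
A minimal sketch of that failure mode (the Node class and file names are hypothetical): two objects that refer to each other keep their reference counts above zero, so only the cycle collector, not reference counting, eventually finalizes them and closes their files.

import gc

class Node(object):
    def __init__(self, afile):
        self.afile = afile  # holds an open file
        self.peer = None

a = Node(open('file_a', 'rb'))
b = Node(open('file_b', 'rb'))
a.peer, b.peer = b, a  # create a reference cycle

del a, b      # reference counts never reach zero, so the files stay open
gc.collect()  # only now does the cycle collector reclaim them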

The question is what to do about it. The hashfile function in the question is nicely flexible, and it seems a shame to fiddle with its interface by making it take a filename and open the file itself, killing that flexibility. I feel the solution is to rethink the interface a bit and make it even more general:

def hash_bytestr_iter(hasher, bytesiter, ashexstr=False):
    for block in bytesiter:
        hasher.update(block)
    return (hasher.hexdigest() if ashexstr else hasher.digest())

def iter_and_close_file(afile, blocksize=65536):
    with afile:
        block = afile.read(blocksize)
        while len(block) > 0:
            yield block
            block = afile.read(blocksize)

One could just make the original hashfile use the passed-in afile as a context manager, but I feel that subtly breaks expectations. It makes hashfile close the file, while its name promises only that it will compute a hash, not that it will close anything.

And I suspect there are a lot of cases where you have blocks of bytes and would like to hash them all as if they were part of one continuous block or stream. Hashing an iterator over blocks of bytes is even more general than hashing a file.

And similarly, I think there are a lot of cases where you want to iterate over a file-like object and then close it. That makes both of these functions re-usable and general.
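
With those two pieces, the original list comprehension keeps its shape (assuming the fnamelst from the question):

import hashlib

[(fname, hash_bytestr_iter(hashlib.sha256(), iter_and_close_file(open(fname, 'rb'))))
 for fname in fnamelst]

Since hash_bytestr_iter consumes the iterator completely, iter_and_close_file's with block exits and closes each file before the next one is opened.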