How to make sure a NetCDF file is closed in Python?


It's probably simple, but I haven't been able to find a solution online. I'm trying to work with a series of datasets stored as NetCDF files. I open each one, read in some key points, then move on to the next file. I constantly hit mmap errors, and the script slows down as more files are read in. I believe this may be because the NetCDF files are not being properly closed by the .close() command.

I've been testing this:

from scipy.io.netcdf import netcdf_file as ncfile
f=ncfile(netcdf_file,mode='r')
f.close()

then if I try

>>>f
<scipy.io.netcdf.netcdf_file object at 0x24d29e10>

and

>>>f.variables['temperature'][:]
array([ 1234.68034431,  1387.43136567,  1528.35794546, ...,  3393.91061952,
    3378.2844357 ,  3433.06715226])

So it appears the file is still open? What does close() actually do? How do I know it has worked? Is there a way to close/clear all open files from Python?

Software: Python 2.7.6, scipy 0.13.2, netcdf 4.0.1

Best answer, by hpaulj:

The code for f.close is:

Definition: f.close(self)
Source:
    def close(self):
        """Closes the NetCDF file."""
        if not self.fp.closed:
            try:
                self.flush()
            finally:
                self.fp.close()

f.fp is the file object. So

In [451]: f.fp
Out[451]: <open file 'test.cdf', mode 'wb' at 0x939df40>

In [452]: f.close()

In [453]: f.fp
Out[453]: <closed file 'test.cdf', mode 'wb' at 0x939df40>

But from playing around with f, I see that I can still create dimensions and variables after closing it, although f.flush() raises an error.
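One way to check whether close() has worked is to look at f.fp.closed, as the session above does. Here is a minimal sketch, assuming a SciPy version where netcdf_file can be imported from scipy.io (the question imports it from scipy.io.netcdf). The file name and variable contents are made up for illustration:

```python
import os
import tempfile

import numpy as np
from scipy.io import netcdf_file

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "test.nc")

f = netcdf_file(path, "w")
f.createDimension("n", 3)
var = f.createVariable("temperature", "f8", ("n",))
var[:] = np.array([1.0, 2.0, 3.0])

closed_before = f.fp.closed  # state of the underlying file object before close()
f.close()
closed_after = f.fp.closed   # state of the underlying file object after close()

print(closed_before, closed_after)
```

The repr of f itself does not change after close(), so f.fp is the thing to inspect.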

It does not look like netcdf_file uses mmap during data writes, only during reads.

def _read_var_array(self):
            ....
            if self.use_mmap:
                mm = mmap(self.fp.fileno(), begin_+a_size, access=ACCESS_READ)
                data = ndarray.__new__(ndarray, shape, dtype=dtype_,
                        buffer=mm, offset=begin_, order=0)
            else:
                pos = self.fp.tell()
                self.fp.seek(begin_)
                data = fromstring(self.fp.read(a_size), dtype=dtype_)
                data.shape = shape
                self.fp.seek(pos)

I don't have much experience with mmap. It looks like it sets up a mmap object based on a block of bytes in the file, and uses that as the data buffer for the variable. I don't know what happens to that access if the underlying file is closed. I wouldn't be surprised if there is some sort of mmap error.

If the file is opened with mmap=False, then the whole variable is read into memory, and accessed like a regular numpy array.

mmap : None or bool, optional
    Whether to mmap `filename` when reading.  Default is True
    when `filename` is a file name, False when `filename` is a
    file-like object

My guess is that if you open a file without specifying the mmap mode, read a variable from it, and then close the file, it is unsafe to reference that variable and its data later. Any reference that requires loading more data could result in an mmap error.

But if you open the file with mmap=False, you should be able to slice the variable even after closing the file.
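A minimal sketch of the mmap=False case, assuming a SciPy version where netcdf_file is importable from scipy.io. The write_sample helper, the file name and the data are hypothetical and exist only to make the example self-contained. The variable object is fetched before close(), since some SciPy versions clear f.variables when the file is closed:

```python
import os
import tempfile

import numpy as np
from scipy.io import netcdf_file


def write_sample(path):
    """Write a tiny NetCDF file with one 'temperature' variable (made-up data)."""
    with netcdf_file(path, "w") as f:
        f.createDimension("n", 5)
        var = f.createVariable("temperature", "f8", ("n",))
        var[:] = np.arange(5, dtype="f8")


path = os.path.join(tempfile.mkdtemp(), "sample.nc")
write_sample(path)

# mmap=False reads each variable fully into memory instead of mapping the file.
f = netcdf_file(path, "r", mmap=False)
temperature = f.variables["temperature"]  # grab the variable before closing
f.close()

# The data lives in an ordinary in-memory array, so slicing still works.
values = temperature[:]
print(f.fp.closed, values)
```

The trade-off is memory: with mmap=False every variable in the file is loaded when the file is opened, which may matter for large datasets.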

I don't see how the mmap for one file or variable could interfere with access to other files and variables. But I'd have to read more about mmap to be sure of that.

And from the netcdf_file docs:

Note that when netcdf_file is used to open a file with mmap=True (default for read-only), arrays returned by it refer to data directly on the disk. The file should not be closed, and cannot be cleanly closed when asked, if such arrays are alive. You may want to copy data arrays obtained from mmapped Netcdf file if they are to be processed after the file is closed, see the example below.
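The example that the docs refer to is not reproduced here. A sketch along the same lines, applied to the question's loop over several files, might look like the following. It assumes netcdf_file is importable from scipy.io and supports the with statement; write_sample, read_variable and the file names are hypothetical helpers for illustration only:

```python
import os
import tempfile

import numpy as np
from scipy.io import netcdf_file


def write_sample(path, values):
    """Write a tiny NetCDF file holding made-up 'temperature' data."""
    with netcdf_file(path, "w") as f:
        f.createDimension("n", len(values))
        var = f.createVariable("temperature", "f8", ("n",))
        var[:] = values


def read_variable(path, name):
    """Return an in-memory copy of one variable; the file is closed on exit."""
    with netcdf_file(path, "r") as f:  # mmap is left at its default here
        # .copy() detaches the data from the mmapped file, so no array
        # that points at the file is still alive when it is closed.
        return f.variables[name][:].copy()


tmpdir = tempfile.mkdtemp()
paths = []
for i in range(3):
    p = os.path.join(tmpdir, "sample_%d.nc" % i)
    write_sample(p, np.arange(4, dtype="f8") + i)
    paths.append(p)

results = [read_variable(p, "temperature") for p in paths]
print([r.tolist() for r in results])
```

Because each file is opened, copied from and closed inside read_variable, no mmapped arrays should outlive their file as the loop moves on to the next one.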