Read binary file which has different datatypes


I'm trying to read a binary file produced by Fortran into Python. The file contains a mix of integers, reals and logicals. At the moment I read the first few numbers correctly with:

x = np.fromfile(filein, dtype=np.int32, count=-1)
firstint= x[1]
...

(np is numpy). But the next item is a logical, followed later by more integers and then reals. How can I read a file that mixes types like this?

Answer by Joe Kington (best answer)

Typically, when you're reading in values such as this, they're in a regular pattern (e.g. an array of C-like structs).

Another common case is a short header of various values followed by a block of homogeneously typed data.

Let's deal with the first case first.

Reading in Regular Patterns of Data Types

For example, you might have something like:

float, float, int, int, bool, float, float, int, int, bool, ...

If that's the case, you can define a dtype to match the pattern of types. In the case above, it might look like:

dtype=[('a', float), ('b', float), ('c', int), ('d', int), ('e', bool)]

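(Note: there are many different ways to define the dtype. For example, you could also write that as np.dtype('f8,f8,i8,i8,?'), which gives the same layout on platforms where int is 64 bits, with automatically generated field names. See the documentation for numpy.dtype for more information.)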
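
As a small sketch of that note, the two spellings below describe the same byte layout and differ only in their field names. The explicit 'f8' and 'i8' codes are used in both (rather than float and int) so that the sizes don't depend on the platform; whether those sizes match your file is an assumption you'd need to check.

```python
import numpy as np

# Explicit field names with fixed-size type codes.
named = np.dtype([('a', 'f8'), ('b', 'f8'), ('c', 'i8'), ('d', 'i8'), ('e', '?')])

# Comma-separated string form; NumPy names the fields f0, f1, ... itself.
compact = np.dtype('f8,f8,i8,i8,?')

# Both describe the same number of bytes per record.
print(named.itemsize, compact.itemsize)
print(named.names)
print(compact.names)
```

The named form is usually easier to work with later, since data['a'] says more than data['f0'].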

When you read your array in, it will be a structured array with named fields. You can later split it up into individual arrays if you'd prefer. (e.g. series1 = data['a'] with the dtype defined above)

The main advantage of this is that reading in your data from disk will be very fast. Numpy will simply read everything into memory, and then interpret the memory buffer according to the pattern you specified.

The drawback is that structured arrays behave a bit differently than regular arrays. If you're not used to them, they'll probably seem confusing at first. The key part to remember is that each item in the array is one of the patterns that you specified. For example, for what I showed above, data[0] might be something like (4.3, -1.2298, 200, 456, False).

Reading in a Header

Another common case is that you have a header with a known format and then a long series of regular data. You can still use np.fromfile for this, but you'll need to parse the header separately.

First, read in the header. You can do this in several different ways (e.g. have a look at the struct module in addition to np.fromfile, though either will probably work well for your purposes).
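
For instance, a header could be unpacked with the struct module along these lines. This is only a sketch: the header layout (a 32-bit count, a 64-bit scale factor and a one-byte flag), the variable names and the file name with_header.bin are assumptions chosen for illustration, and the example writes its own sample file first.

```python
import os
import struct
import tempfile

import numpy as np

# Hypothetical header: int32 count, float64 scale, 1-byte flag.
# The '<' prefix means little-endian with no padding between fields.
header_format = '<id?'
header_size = struct.calcsize(header_format)

# Write a small sample file: the header followed by three 32-bit floats.
path = os.path.join(tempfile.mkdtemp(), 'with_header.bin')
with open(path, 'wb') as f:
    f.write(struct.pack(header_format, 3, 0.5, True))
    f.write(np.arange(3, dtype='<f4').tobytes())

with open(path, 'rb') as f:
    # Reading exactly header_size bytes leaves the file positioned at the data.
    count, scale, flag = struct.unpack(header_format, f.read(header_size))
    data = np.fromfile(f, dtype='<f4', count=count)

print(count, scale, flag)
print(data)
```

struct is convenient when the header mixes several types in no repeating pattern; np.fromfile with a structured dtype and count=1 would be another way to get the same values.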

After that, when you pass the file object to fromfile, the file's internal position (i.e. the position controlled by f.seek) will be at the end of the header and the start of the data. If the rest of the file is a homogeneously typed array, a single call to np.fromfile(f, dtype) is all you need.

As a quick example, you might have something like the following:

import numpy as np

# Let's say we have a file with a 512 byte header, the 
# first 16 bytes of which are the width and height 
# stored as big-endian 64-bit integers.  The rest of the
# "main" data array is stored as little-endian 32-bit floats

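# Open in binary mode ('rb'); text mode can corrupt or reject binary data
with open('data.dat', 'rb') as f:
    width, height = np.fromfile(f, dtype='>i8', count=2)
    # Seek to the end of the header and ignore the rest of it
    f.seek(512)
    # '<f4' is an explicitly little-endian 32-bit float
    data = np.fromfile(f, dtype='<f4')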

# Presumably we'd want to reshape the data into a 2D array:
data = data.reshape((height, width))