Python equivalent for std::basic_istream. (Pybind11, how to read from file or input stream)


I am using Pybind11/Nanobind to write Python bindings for my C++ libraries.

One of my C++ functions takes an argument of type std::istream&, e.g.:

std::string cPXGStreamReader::testReadStream(std::istream &stream)
{           
    std::ostringstream contentStream;
    std::string line;
    
    while (std::getline(stream, line)) {
        contentStream << line << '\n'; // Append line to the content string
    }
    
    return contentStream.str(); // Convert contentStream to string and return
}

What kind of argument do I need to pass in Python which corresponds to this?

I have tried passing s, where s is created in either of these ways:

s = open(r"test_file.pxgf", "rb")

# and
s = io.BytesIO(b"some initial binary data: \x00\x01")

to no avail. I get the error

TypeError: test_read_file(): incompatible function arguments. The following argument types are supported:
    1. (self: pxgf.PXGStreamReader, arg0: std::basic_istream<char,std::char_traits<char> >) -> str

Invoked with: <pxgf.PXGStreamReader object at 0x000002986CF9C6B0>, <_io.BytesIO object at 0x000002986CF92250>

Did you forget to `#include <pybind11/stl.h>`? Or <pybind11/complex.h>,
<pybind11/functional.h>, <pybind11/chrono.h>, etc. Some automatic

Answer by Dan Mašek

Pybind11 doesn't provide support for stream arguments out of the box, so a custom implementation needs to be developed. We will need some kind of an adapter that will allow us to create an istream, which reads from a Python object. There are two options that come to mind:

  1. Derive from std::streambuf and use that with std::istream (related Q&A).
  2. Using Boost.IOStreams, create a custom Source to be used with boost::iostreams::stream (which you can pass to your function without any modifications).

I'll focus on the latter option, and restrict the solution only to file-like objects (i.e. derived from io.IOBase) used for input.


Using Boost IOStreams

Source Implementation

We need to create a class satisfying the "Source" model of Boost IOStreams (there is a handy tutorial in the documentation about this subject). Basically something with the following characteristics:

struct Source {
    typedef char        char_type;
    typedef source_tag  category;
    std::streamsize read(char* s, std::streamsize n) 
    {
        // Read up to n characters from the input 
        // sequence into the buffer s, returning   
        // the number of characters read, or -1 
        // to indicate end-of-sequence.
    }
};
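
To make the read() contract concrete, here is a minimal sketch that uses neither Boost nor Python. The names string_source and drain are hypothetical and only serve the illustration. The struct mirrors the shape of a Source but omits the category typedef, so it is not meant to be plugged into boost::iostreams::stream as-is. drain imitates a consumer that keeps calling read() until it sees -1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <ios>
#include <string>
#include <utility>

// Hypothetical stand-in for a Source: serves bytes from an in-memory string.
struct string_source {
    typedef char char_type;

    explicit string_source(std::string data)
        : data_(std::move(data)), pos_(0)
    {
    }

    // Copy up to n characters into s. Returns the number of characters
    // copied, or -1 once the whole sequence has been handed out.
    std::streamsize read(char_type* s, std::streamsize n)
    {
        assert(n >= 0);
        std::size_t const remaining = data_.size() - pos_;
        if (remaining == 0) {
            return -1; // End-of-sequence
        }
        std::size_t const count = std::min(remaining, static_cast<std::size_t>(n));
        std::copy(data_.data() + pos_, data_.data() + pos_ + count, s);
        pos_ += count;
        return static_cast<std::streamsize>(count);
    }

private:
    std::string data_;
    std::size_t pos_;
};

// Pull everything out of a source through a deliberately tiny buffer,
// calling read() repeatedly until it reports end-of-sequence.
template <typename Source>
std::string drain(Source& source)
{
    std::string result;
    char chunk[4];
    std::streamsize got;
    while ((got = source.read(chunk, sizeof(chunk))) != -1) {
        result.append(chunk, static_cast<std::size_t>(got));
    }
    return result;
}
```

Calling drain on a string_source built from "One\nTwo\nThree" should give back the same text, delivered in pieces of at most four characters. The Python-backed source described next follows the same contract; only the place the bytes come from differs.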

Constructor

In the constructor, we should store a reference to the Python object (pybind11::object) used as a data source. We should also verify that the data source object is actually file-like.

py::object io_module = py::module::import("io");
py::object text_io_base = io_module.attr("TextIOBase");
py::object buffered_io_base = io_module.attr("BufferedIOBase");
py::object raw_io_base = io_module.attr("RawIOBase");
if (!(py::isinstance(source_obj, text_io_base)
    || py::isinstance(source_obj, buffered_io_base)
    || py::isinstance(source_obj, raw_io_base))) {
    throw std::invalid_argument("source_obj is not file-like");
}

Finally, we can cache the read attribute of the file-like object in another pybind11::object member variable, in order to avoid a lookup every time we want to call it.

read_source_obj_ = source_obj_.attr("read"); // Cache the function

read(...)

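This function needs to read up to the requested number of bytes into the provided buffer, and signal End-Of-File when it's reached. The simplest approach is to cast the result of the Python file-like's read() method to std::string, and then copy its contents to the read buffer. It works for both binary and text IO. Be aware that a text stream's read(n) counts characters rather than bytes, so non-ASCII text may encode to more than n bytes of UTF-8, which is more than the buffer s can hold.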

py::object result = read_source_obj_(n);
auto const payload = result.cast<std::string>();
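if (payload.empty()) {
    return -1; // EOF
}
if (payload.size() > static_cast<std::size_t>(n)) {
    // Text streams count characters, so the UTF-8 data can exceed n bytes
    throw std::length_error("read() returned more data than requested");
}
std::copy(payload.begin(), payload.end(), s);
return static_cast<std::streamsize>(payload.size());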

However, this involves an unnecessary copy. If you're using a recent enough C++ standard (C++17 or later), you could cast to std::string_view instead. Otherwise, a somewhat lower-level approach may be used (based on the implementation of pybind11::string):

py::object result = read_source_obj_(n);
if (PyUnicode_Check(result.ptr())) {
    // Strings need to be converted to UTF8 first, giving us a bytes object
    result = py::reinterpret_steal<py::object>(PyUnicode_AsUTF8String(result.ptr()));
    if (!result) {
        throw py::error_already_set();
    }
}

assert(py::isinstance<py::bytes>(result));

// Get the internal buffer of the bytes object, to avoid an extra copy
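char_type* buffer;
py::ssize_t buffer_size; // Plain ssize_t is POSIX-only and missing on MSVC
if (PYBIND11_BYTES_AS_STRING_AND_SIZE(result.ptr(), &buffer, &buffer_size)) {
    throw py::error_already_set();
}
if (buffer_size == 0) {
    return -1; // EOF
}
if (buffer_size > n) {
    // Text streams count characters, so the UTF-8 data can exceed n bytes
    throw std::length_error("read() returned more data than requested");
}
std::copy(buffer, buffer + buffer_size, s);
return buffer_size;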

Sample Function Binding

Let's use a free-standing function modeled after your example to test this:

std::string test_read_stream(std::istream& stream)
{
    std::ostringstream result;
    std::string line;
    while (std::getline(stream, line)) {
        result << line << '\n';
    }
    return result.str();
}

Creating a Pybind11 typecaster for streams looks like opening yet another can of worms, so let's skip that. Instead, let's use a rather simple lambda when defining the function bindings. The lambda only needs to construct a boost::iostreams::stream from our source and call the wrapped function.

m.def("test_read_stream", [](py::object file_like) -> std::string {
    boost::iostreams::stream<python_filelike_source> wrapper_stream(file_like);
    return test_read_stream(wrapper_stream);
});

Complete Source Code

Here it's all together:

#include <algorithm>
#include <cassert>
#include <iosfwd>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>

#include <boost/iostreams/categories.hpp>
#include <boost/iostreams/stream.hpp>

#include <pybind11/pybind11.h>
#include <pybind11/embed.h>

namespace py = pybind11;

class python_filelike_source
{
public:
    typedef char char_type;
    typedef boost::iostreams::source_tag category;

    python_filelike_source(py::object& source_obj)
        : source_obj_(source_obj)
    {
        py::object io_module = py::module::import("io");
        py::object text_io_base = io_module.attr("TextIOBase");
        py::object buffered_io_base = io_module.attr("BufferedIOBase");
        py::object raw_io_base = io_module.attr("RawIOBase");
        if (!(py::isinstance(source_obj, text_io_base)
            || py::isinstance(source_obj, buffered_io_base)
            || py::isinstance(source_obj, raw_io_base))) {
            throw std::invalid_argument("source_obj is not file-like");
        }
        read_source_obj_ = source_obj_.attr("read"); // Cache the function
    }

    std::streamsize read(char_type* s, std::streamsize n)
    {
        if (n <= 0) {
            return 0;
        }

        py::object result = read_source_obj_(n);
        if (PyUnicode_Check(result.ptr())) {
            // Strings need to be converted to UTF8 first, giving us a bytes object
            result = py::reinterpret_steal<py::object>(PyUnicode_AsUTF8String(result.ptr()));
            if (!result) {
                throw py::error_already_set();
            }
        }

        assert(py::isinstance<py::bytes>(result));

        // Get the internal buffer of the bytes object, to avoid an extra copy
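        char_type* buffer;
        py::ssize_t buffer_size; // Plain ssize_t is POSIX-only and missing on MSVC
        if (PYBIND11_BYTES_AS_STRING_AND_SIZE(result.ptr(), &buffer, &buffer_size)) {
            throw py::error_already_set();
        }
        if (buffer_size == 0) {
            return -1; // EOF
        }
        if (buffer_size > n) {
            // Text streams count characters, so the UTF-8 data can exceed n bytes
            throw std::length_error("read() returned more data than requested");
        }
        std::copy(buffer, buffer + buffer_size, s);
        return buffer_size;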
    }

private:
    py::object source_obj_; // Held by value, so the source keeps the Python object alive
    py::object read_source_obj_;
};


std::string test_read_stream(std::istream& stream)
{
    std::ostringstream result;
    std::string line;

    while (std::getline(stream, line)) {
        result << line << '\n';
    }

    return result.str();
}

PYBIND11_EMBEDDED_MODULE(testmodule, m)
{
    m.def("test_read_stream", [](py::object file_like) -> std::string {
        boost::iostreams::stream<python_filelike_source> wrapper_stream(file_like);
        return test_read_stream(wrapper_stream);
    });
}

int main()
{
    py::scoped_interpreter guard{};

    try {
        py::exec(R"(\
import testmodule
import io

print("BytesIO:")
s = io.BytesIO(b"One\nTwo\nThree")
print(testmodule.test_read_stream(s))

print("StringIO:")
s = io.StringIO("Foo\nBar\nBaz")
print(testmodule.test_read_stream(s))

print("String:")
try:
    print(testmodule.test_read_stream("abcd"))
except Exception as exc:
    print(exc)

print("File:")
with open("example_file.txt", "w") as f:
    f.write("TEST1\nTEST2\nTEST3")
with open("example_file.txt", "r") as f:
    print(testmodule.test_read_stream(f))
)");
    } catch (py::error_already_set& e) {
        std::cerr << e.what() << "\n";
    }
}

Running it produces the following output:

BytesIO:
One
Two
Three

StringIO:
Foo
Bar
Baz

String:
source_obj is not file-like
File:
TEST1
TEST2
TEST3

Note: Due to buffering inherent in istream, it will usually read from the Python object in chunks (e.g. here it asks for 4096 bytes every time). If you make partial reads of the stream (e.g. just get a single line), you will need to re-adjust the file-like's read position explicitly to reflect the number of bytes actually consumed.
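
To illustrate the read-ahead described in the note, here is a minimal standard-library sketch. It uses no Boost or Python. counting_chunk_buf and unconsumed_after_first_line are hypothetical names, and the chunk size is a parameter of the sketch rather than the 4096 bytes mentioned above. The buffer hands out an in-memory string in fixed-size chunks and counts what it has pulled. After one std::getline, in_avail() reports how many pulled bytes the caller has not consumed yet.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <ios>
#include <istream>
#include <streambuf>
#include <string>
#include <utility>

// Hypothetical buffer: serves an in-memory string in fixed-size chunks and
// counts how many bytes it has pulled from that "source" so far.
class counting_chunk_buf : public std::streambuf
{
public:
    counting_chunk_buf(std::string data, std::size_t chunk_size)
        : data_(std::move(data)), chunk_size_(chunk_size), pulled_(0)
    {
        assert(chunk_size_ > 0);
    }

    std::size_t pulled() const { return pulled_; }

protected:
    int_type underflow() override
    {
        if (gptr() < egptr()) {
            return traits_type::to_int_type(*gptr());
        }
        if (pulled_ == data_.size()) {
            return traits_type::eof();
        }
        // Expose the next chunk as the get area.
        std::size_t const count = std::min(chunk_size_, data_.size() - pulled_);
        char* const begin = &data_[pulled_];
        setg(begin, begin, begin + count);
        pulled_ += count;
        return traits_type::to_int_type(*gptr());
    }

private:
    std::string data_;
    std::size_t chunk_size_;
    std::size_t pulled_;
};

// Read a single line, then report how many bytes were pulled from the
// source but not consumed by the caller. A real file-like object would
// have to be moved back by this amount.
std::streamsize unconsumed_after_first_line(std::string const& text,
                                            std::size_t chunk_size)
{
    counting_chunk_buf buf(text, chunk_size);
    std::istream stream(&buf);
    std::string line;
    std::getline(stream, line);
    return stream.rdbuf()->in_avail();
}
```

With the 13-byte text "One\nTwo\nThree" and a chunk size of 8, the first line consumes 4 bytes while 8 have been pulled, leaving 4 unconsumed. That difference is what would need to be compensated for on the Python side after a partial read.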


Not Using Boost

Pybind11's source code contains an output streambuf implementation that writes to Python streams, which can serve as a decent starting point. I managed to find an existing streambuf implementation for input in the BlueBrain/nmodl repository on GitHub. You would use it in the following manner (probably as part of a wrapper lambda used to dispatch your C++ function):

py::object text_io_base = py::module::import("io").attr("TextIOBase");
if (py::isinstance(object, text_io_base)) {
    py::detail::pythonibuf<py::str> buf(object);
    std::istream istr(&buf);
    return testReadStream(istr);
} else {
    py::detail::pythonibuf<py::bytes> buf(object);
    std::istream istr(&buf);
    return testReadStream(istr);
}
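
For a rough idea of what such an input streambuf does internally, here is a minimal sketch. It is not the nmodl implementation and does not touch Python. callback_ibuf, read_all_lines and demo_read are hypothetical names, and a std::function stands in for the call to the Python object's read(n), under the assumption that it returns an empty string at end of file.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream>
#include <streambuf>
#include <string>
#include <utility>

// Hypothetical input buffer. read_chunk stands in for the Python object's
// read(n): it returns up to n bytes, or an empty string at end of file.
class callback_ibuf : public std::streambuf
{
public:
    typedef std::function<std::string(std::size_t)> read_fn;

    explicit callback_ibuf(read_fn read_chunk, std::size_t chunk_size = 4096)
        : read_chunk_(std::move(read_chunk)), chunk_size_(chunk_size)
    {
        assert(chunk_size_ > 0);
    }

protected:
    // Called by the stream whenever the get area has been used up.
    int_type underflow() override
    {
        if (gptr() < egptr()) {
            return traits_type::to_int_type(*gptr());
        }
        chunk_ = read_chunk_(chunk_size_);
        if (chunk_.empty()) {
            return traits_type::eof();
        }
        setg(&chunk_[0], &chunk_[0], &chunk_[0] + chunk_.size());
        return traits_type::to_int_type(*gptr());
    }

private:
    read_fn read_chunk_;
    std::size_t chunk_size_;
    std::string chunk_; // Owns the bytes that the get area points into
};

// Same shape as the function from the question.
std::string read_all_lines(std::istream& stream)
{
    std::ostringstream result;
    std::string line;
    while (std::getline(stream, line)) {
        result << line << '\n';
    }
    return result.str();
}

// Feed an in-memory text through the buffer in chunk_size pieces.
std::string demo_read(std::string const& text, std::size_t chunk_size)
{
    std::size_t pos = 0;
    callback_ibuf buf([&](std::size_t n) {
        std::string piece = text.substr(pos, n);
        pos += piece.size();
        return piece;
    }, chunk_size);
    std::istream stream(&buf);
    return read_all_lines(stream);
}
```

In a real binding, the callback would call the cached read attribute of the file-like object and convert the result to bytes, as the Boost-based source above does. The chunk is kept in a member because the get area set up by setg() points into it and must stay valid until the next underflow().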