Protocol Buffer ParseFromString function not reading complete binary file in Python


I am testing out Protocol Buffers: I read a CSV file, serialize each row, write the output to a binary file, and then read that file back using ParseFromString. Serializing and writing work, but reading either raises an index-out-of-bounds exception or returns only the last record in the binary file, skipping everything before it.

My message is simple. It has two fields, time and meterusage.

syntax="proto3";

message excelData {

string time=1;
string meterusage=2;
}

The code that serializes the rows and writes them to a binary file is below:

import metric_pb2
from csv import reader

excel_data = metric_pb2.excelData()

# Both files are closed automatically when the with blocks exit
with open('out.bin', 'wb') as f:
    with open('data.csv', 'r') as read_obj:
        csv_reader = reader(read_obj)
        header = next(csv_reader)  # skip the header row
        if header is not None:
            for row in csv_reader:
                excel_data.time = row[0]
                excel_data.meterusage = row[1]
                f.write(excel_data.SerializeToString())

The troublesome part is below:

Approach 1: This returns only the last record in the binary file and skips everything before it.

[Screenshot: just one row in the result set as opposed to the entire binary file]

excel_data=metric_pb2.excelData()

with open('out.bin', 'rb') as f:
    content=f.read()
    excel_data.ParseFromString(content)
    print(excel_data.time)
    print(excel_data.meterusage)

Approach 2: If I read the serialized binary file row by row, the way I read the CSV file above, I get an index-out-of-bounds error. My guess is that the error occurs because the binary file contains byte data rather than strings.

What is the correct way to read this binary file using message.ParseFromString()? Neither reading it in a loop nor reading it as a whole works. A snapshot of the binary file I created is below:

[Screenshot: binary output]

Answer by DazWilkin (best answer)

Were you successful?

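Here's a hacky solution for you that (per the Protobuf techniques for streaming multiple messages) writes each message's (variable!) length as a byte before the record. Protobuf messages are not self-delimiting, so when several are concatenated, ParseFromString treats the whole buffer as a single message in which later values of a field overwrite earlier ones. That is why Approach 1 printed only the last row.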
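To see why a length prefix is needed at all, it may help to look at what back-to-back records look like to a parser. The sketch below does not use the protobuf library; it hand-encodes the two string fields of excelData and walks the bytes with a toy decoder. The helpers encode_record and parse_as_one_message are hypothetical, the row values are made up, and the decoder only handles short string fields, so treat it as an illustration rather than how ParseFromString is implemented.

```python
def encode_record(time: str, meterusage: str) -> bytes:
    """Hand-encode the two string fields of excelData (values under 128 bytes)."""
    out = b""
    for field_number, text in ((1, time), (2, meterusage)):
        value = text.encode("utf-8")
        # Tag byte = (field number << 3) | wire type 2 (length-delimited)
        out += bytes([(field_number << 3) | 2, len(value)]) + value
    return out


def parse_as_one_message(data: bytes) -> dict:
    """Walk every field in the buffer, keeping the latest value per field."""
    fields = {}
    pos = 0
    while pos < len(data):
        field_number = data[pos] >> 3
        size = data[pos + 1]
        start = pos + 2
        # A later occurrence of the same field replaces the earlier one
        fields[field_number] = data[start:start + size].decode("utf-8")
        pos = start + size
    return fields


# Two records written back to back with no delimiter (illustrative values)
blob = (encode_record("2021-01-01 00:00:00", "54.65")
        + encode_record("2021-01-01 00:15:00", "55.18"))

result = parse_as_one_message(blob)
print(result)
```

Nothing in the byte stream marks where the first record ends, so only the second record's values survive.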

Writer

import metric_pb2
import sys

from csv import reader

excel_data = metric_pb2.excelData()

# Both files are closed automatically when the with blocks exit
with open('out.bin', 'wb') as f:
    with open('data.csv', 'r') as read_obj:
        csv_reader = reader(read_obj)
        header = next(csv_reader)  # skip the header row
        if header is not None:
            for row in csv_reader:
                excel_data.time = row[0]
                excel_data.meterusage = row[1]
                data = excel_data.SerializeToString()
                # Write the message's integer length as a single byte
                # (this caps each message at 255 bytes)
                f.write(len(data).to_bytes(1, sys.byteorder))
                # Write the message itself as bytes
                f.write(data)

Produces:

00000000: 1c 0a13 3230 3231 2d30 312d 3031 2030 303a 3030 3a30 3012 0535 342e 3635  ...2021-01-01 00:00:00..54.65
0000001d: 1c 0a13 3230 3231 2d30 312d 3031 2030 303a 3030 3a30 3012 0535 352e 3138  ...2021-01-01 00:00:00..55.18
0000003a: 1b 0a13 3230 3231 2d30 312d 3031 2030 303a 3030 3a30 3012 0435 352e 38    ...2021-01-01 00:00:00..55.8

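NOTE: 1c == 28 is the length of each of the first two messages (54.65 and 55.18 are five characters), and 1b == 27 is the length of the third (55.8 is four characters).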
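The 28 and 27 can be checked by hand. Each string field costs one tag byte, one length byte, and the UTF-8 bytes of the value, so a record is 2 + 19 + 2 + 5 = 28 bytes, or 27 when the meter reading has four characters. The sketch below redoes that arithmetic in plain Python without the protobuf library; encode_string_field and encode_record are hypothetical helpers that only cover short string fields.

```python
def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode one string field (wire type 2) with a value shorter than 128 bytes."""
    value = text.encode("utf-8")
    # These limits keep both the tag and the length to a single byte
    assert field_number < 16 and len(value) < 128
    tag = (field_number << 3) | 2
    return bytes([tag, len(value)]) + value


def encode_record(time: str, meterusage: str) -> bytes:
    """Concatenate field 1 (time) and field 2 (meterusage)."""
    return encode_string_field(1, time) + encode_string_field(2, meterusage)


long_record = encode_record("2021-01-01 00:00:00", "54.65")
short_record = encode_record("2021-01-01 00:00:00", "55.8")

print(len(long_record), hex(len(long_record)))    # 28 0x1c
print(len(short_record), hex(len(short_record)))  # 27 0x1b
print(long_record[:2].hex())                      # 0a13, the bytes after 1c in the dump
```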

Reader

import metric_pb2
import sys

excel_data = metric_pb2.excelData()

with open('out.bin', 'rb') as f:
    while True:
        # Read the one-byte length prefix; an empty read means end of file
        prefix = f.read(1)
        if not prefix:
            break
        # Convert the prefix to an integer and read that many message bytes
        size = int.from_bytes(prefix, sys.byteorder)
        data = f.read(size)

        excel_data.ParseFromString(data)
        print("[{time}] {meterusage}".format(
            time=excel_data.time,
            meterusage=excel_data.meterusage))

Produces:

[2021-01-01 00:00:00] 54.65
[2021-01-01 00:00:00] 55.18
[2021-01-01 00:00:00] 55.8
[2021-01-01 00:00:00] 56.0
[2021-01-01 00:00:00] 63.52
[2021-01-01 00:00:00] 78.1
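
One caveat with the single-byte length used above: it can only describe messages of up to 255 bytes, and (300).to_bytes(1, sys.byteorder) raises OverflowError. A possible way around that, sketched here as an assumption rather than as part of the original solution, is to write the length as a base-128 varint, the same encoding protobuf uses for field lengths inside a message. The helpers encode_varint, read_varint, write_delimited and read_delimited are hypothetical, and the payloads are placeholder bytes standing in for SerializeToString() output.

```python
import io


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def read_varint(stream):
    """Read one varint; return None at a clean end of stream."""
    result = 0
    shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            if shift == 0:
                return None
            raise EOFError("truncated length prefix")
        result |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            return result
        shift += 7


def write_delimited(stream, payload: bytes) -> None:
    """Write the payload length as a varint, then the payload."""
    stream.write(encode_varint(len(payload)))
    stream.write(payload)


def read_delimited(stream):
    """Yield each payload written by write_delimited."""
    while True:
        size = read_varint(stream)
        if size is None:
            return
        payload = stream.read(size)
        if len(payload) < size:
            raise EOFError("truncated payload")
        yield payload


# Placeholder payloads, including an empty one and one longer than 255 bytes
records = [b"short", b"", b"x" * 300]
buffer = io.BytesIO()
for record in records:
    write_delimited(buffer, record)

buffer.seek(0)
recovered = list(read_delimited(buffer))
print(recovered == records)      # True
print(encode_varint(300).hex())  # ac02
```

In the real program each payload would be excel_data.SerializeToString(), and each payload yielded by read_delimited would be passed to ParseFromString. Because the end of file is detected on the prefix rather than on the payload, an empty message does not end the loop early.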