How to write/read binary files that represent objects?


I'm new to Java programming, and I ran into this problem:

I'm creating a program that reads a .csv file, converts its lines into objects, and then manipulates those objects. More specifically, the application reads every line, assigns it an index, and stores certain values from each line in tries. The application can then look up indexes through the values stored in the tries and retrieve the full information of the corresponding line.

My problem is that, even though I've been researching the last couple of days, I don't know how to write these structures in binary files, nor how to read them. I want to write the lines (with their indexes) in a binary indexed file and read only the exact index that I retrieved from the TRIEs.

For writing the tree, I was looking for something like this (in C):

fwrite(tree, sizeof(struct TrieTree), 1, file);

For the "binary indexed file", I was thinking of writing objects the same way as the tries, and maybe reading each object in turn until I reach the corresponding index, but this probably wouldn't be very efficient.

To recap, I need help writing and reading objects in binary files, and suggestions on how to create an indexed file.


There are 2 answers

Marged On BEST ANSWER

I think that, for starters, you are best off doing this with serialization.

Here is just one example from Stack Overflow: What is object serialization?

(I don't think copying and pasting the code here makes sense; please follow the link to read it.)
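Still, as a rough picture of what serialization looks like in practice, here is a minimal sketch using ObjectOutputStream and ObjectInputStream from the standard library. The CsvLine class and the save/load helper names are made up for illustration and are not taken from the question or the linked answer; a trie class could be written the same way as long as it (and its nodes) implement Serializable.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

class SerializationSketch {
    // Hypothetical type standing in for one parsed CSV line.
    static class CsvLine implements Serializable {
        private static final long serialVersionUID = 1L;
        final int index;
        final String text;

        CsvLine(int index, String text) {
            this.index = index;
            this.text = text;
        }
    }

    // Writes any Serializable object (and everything it references) to a file.
    static void save(Object obj, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(obj);
        }
    }

    // Reads back the single object stored by save().
    static Object load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("line", ".bin");
        save(new CsvLine(7, "a,b,c"), file);
        CsvLine back = (CsvLine) load(file);
        System.out.println(back.index + " -> " + back.text);
        Files.delete(file);
    }
}
```

Note that this reads and writes a whole object graph at once, which is closer to the fwrite idea from the question than to an indexed file.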

Admittedly, this does not yet solve your index creation problem.
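One possible direction for the index part, offered only as a sketch: java.io.RandomAccessFile can report the current byte offset while writing and seek to an offset while reading, so each record's offset can be remembered at write time and used later to read just that record. The class and method names below are hypothetical, and the sketch assumes each line fits in a single writeUTF record (at most 65535 encoded bytes).

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IndexedFileSketch {
    // Writes each line as a length-prefixed record and returns the byte
    // offset at which every record starts (position i = line index i).
    static List<Long> writeRecords(Path file, List<String> lines) throws IOException {
        List<Long> offsets = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            raf.setLength(0); // start from an empty file
            for (String line : lines) {
                offsets.add(raf.getFilePointer());
                raf.writeUTF(line); // 2-byte length followed by the encoded text
            }
        }
        return offsets;
    }

    // Jumps straight to the given offset and reads only that record.
    static String readRecord(Path file, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset);
            return raf.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("lines", ".bin");
        List<Long> offsets = writeRecords(file, Arrays.asList("id,name", "1,Ada", "2,Linus"));
        // Read line index 2 without touching the earlier records.
        System.out.println(readRecord(file, offsets.get(2)));
        Files.delete(file);
    }
}
```

With this layout, the tries could store either the line index (looked up in the offsets list) or the byte offset itself; the offsets list would also need to be persisted, for example with serialization.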

John On

Here is an alternative to Java native serialization, Google Protocol Buffers.

Most of this answer consists of direct quotes from the documentation, so be sure to follow the link at the end of the answer if you are interested in more details.

What is it:

Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler.

In other words, you can serialize your structures in Java and deserialize them in .NET, Python, etc. Java's native serialization does not give you this.

Performance:

This may vary according to the use case, but in principle GPB should be faster, as it's built with performance and interchangeability in mind. Here is a Stack Overflow link discussing Java native serialization vs. GPB:

High performance serialization: Java vs Google Protocol Buffers vs ...?

How does it work:

You specify how you want the information you're serializing to be structured by defining protocol buffer message types in .proto files. Each protocol buffer message is a small logical record of information, containing a series of name-value pairs. Here's a very basic example of a .proto file that defines a message containing information about a person:

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}

Once you've defined your messages, you run the protocol buffer compiler for your application's language on your .proto file to generate data access classes. These provide simple accessors for each field (like name() and set_name(); in the generated Java classes these appear as getName() on the message and setName() on its builder) as well as methods to serialize/parse the whole structure to/from raw bytes.

You can then use this class in your application to populate, serialize, and retrieve Person protocol buffer messages. You might then write some code like this:

Person john = Person.newBuilder()
    .setId(1234)
    .setName("John Doe")
    .setEmail("jdoe@example.com")
    .build();
// Person is the class generated by the protocol buffer compiler
FileOutputStream output = new FileOutputStream(args[0]);
john.writeTo(output);
output.close();

Read all about it here: https://developers.google.com/protocol-buffers/

You could look at GPB as an alternative format to XSD describing XML structures, just more compact and with faster serialization.