Dump a JSON file into the form of a binary file


I have a very big JSON file, close to 1.5 GB, which is roughly in this format:

[
  {
    "a": 123,
    "b": 143,
    "c": "111",
    "d": 0,
    "e": 0
  },
  {
    "a": 1455,
    "b": 1899,
    "c": "1334",
    "d": 0,
    "e": 0
  }
]

So I wanted to find a way to convert this to a binary form and read the data inside this JSON file efficiently. Is there any way to achieve this using Scala?

1 answer

Answer by stefanobaghino:

I'll give you one possible answer and then go into the details of choosing a serialization format, which can hopefully serve as a guide, since you are the one aware of the bigger picture.

There are multiple binary serialization formats in existence, each with its own strengths and weaknesses to consider, which is why giving you "one right answer" is not even worth attempting.

One suggestion to get you started is to evaluate Apache Avro, a widespread binary format often associated with Apache Kafka and with large-scale data processing systems in general.

Here is a short example of how you could go about parsing the original JSON file with Circe and then using avro4s to serialize to Avro:

import io.circe.generic.auto._, io.circe.syntax._
import io.circe._, io.circe.parser._

final case class Item(a: Int, b: Int, c: String, d: Int, e: Int)

// `json` is the raw JSON string (defined as a lazy val in the Scastie
// playground linked below); parse it and decode it into a Vector[Item]
val items = parse(json).flatMap(_.as[Vector[Item]]).toTry.get

import com.sksamuel.avro4s._

// Derive the Avro schema from the case class and write the items in
// Avro's data (container) format, which embeds the schema with the data
val schema = AvroSchema[Item]
val baos = new java.io.ByteArrayOutputStream()
val os = AvroOutputStream.data[Item].to(baos).build()
os.write(items)
os.flush()
os.close()
baos.close()

You can play around with this code here on Scastie. Note that:

  1. I kept `json` as a lazy val at the end of the playground: Avro needs to store the data schema alongside the data itself, which means that for a couple of objects the self-describing JSON format is actually more compact, so I had to include more objects to make sure I wasn't misrepresenting the advantages and disadvantages of the format
  2. Here I'm serializing to a byte array so that the linked playground can give you a sense of the savings you can achieve with a binary format; in reality you would most likely write to some sort of persistent storage
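For the reading side of the question, the same library can deserialize the bytes back into case classes. Continuing from the snippet above (so `Item`, `schema`, and `baos` are assumed to be in scope), a sketch using avro4s's `AvroInputStream` — check the exact builder signature against the avro4s version you use:

```scala
import com.sksamuel.avro4s._

// Read the records back from the bytes produced above.
// AvroInputStream.data mirrors AvroOutputStream.data; in the data
// (container) format the writer schema travels with the bytes.
val is = AvroInputStream.data[Item].from(baos.toByteArray).build(schema)
val readBack = is.iterator.toVector
is.close()
```

Because the iterator is lazy, you can also process records one at a time instead of materializing the whole `Vector`, which matters for a 1.5 GB dataset.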

As mentioned, the one above is one possible answer. In reality choosing a data serialization format is not trivial and requires a good understanding of the trade-offs involved, considering that once one is chosen, depending on the systems involved, migrating away from it could be a long, difficult, and risky operation.

If what you are interested in is just saving space, you might want to see if simply compressing the JSON does the job for you, knowing that this trades space savings for computing time that might otherwise be spent in other ways. If the structure of the file is very predictable, a compressed JSON file might be able to compete with a compressed binary file. JSON also has a number of advantages with regard to the availability of tools and the ability to inspect the files themselves without any specialized tool apart from a text editor.
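To illustrate, compression on repetitive JSON needs nothing beyond the JDK's `java.util.zip`. A minimal sketch, using a small repeated payload as a stand-in for the real 1.5 GB file:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

// Hypothetical sample payload standing in for the real file: highly
// repetitive structure, which is exactly where gzip shines
val json = """[{"a":123,"b":143,"c":"111","d":0,"e":0}]""" * 1000
val raw = json.getBytes("UTF-8")

// Compress
val baos = new ByteArrayOutputStream()
val gzip = new GZIPOutputStream(baos)
gzip.write(raw)
gzip.close()
val compressed = baos.toByteArray

// Decompress and verify the round-trip is lossless
val gin = new GZIPInputStream(new ByteArrayInputStream(compressed))
val restored = new String(gin.readAllBytes(), "UTF-8")

println(s"raw: ${raw.length} bytes, compressed: ${compressed.length} bytes")
```

With data this repetitive the compressed size is a small fraction of the raw size, and the files stay plain JSON once decompressed, so every existing tool keeps working.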

If you are still convinced that you want to move to a binary serialization format there are many in existence that were born out of specific necessities, and you might want to look into them one by one to assess whether they fit your project. I'll name a few of the most widely used ones:

  • Avro
  • Protocol Buffers
  • Thrift
  • FlatBuffers
  • Cap'n Proto
  • BSON
  • Parquet
  • CBOR
  • MessagePack

This page on Wikipedia might get you started by giving you a general idea of a few characteristics of each. A few questions you might want to ask yourself are:

  • what kind of support is there for the language(s) that my team works with?
  • how is the format generally used and why?
  • how does the format allow me to think about schema evolution and backward/forward compatibility?
  • is the format self-describing or does it need to refer to an externally stored schema?