I have to store a lot of data in HDFS on dedicated storage servers. This is essentially an archive of historical data. The data being stored is row oriented and has tens of different kinds of fields. Some of them are Strings, some are Integers, and there are also a few Floats, Shorts, ArrayLists and a Map.
The idea is that the data will be scanned from time to time by a MapReduce or Spark job.
Currently I am storing the data as SequenceFiles, with NullWritable as the key and a custom WritableComparable class as the value. This custom class has all of the fields defined.
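For reference, here is a stripped-down sketch of the current setup (the class and field names are simplified stand-ins for the real ones, which have many more fields):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.WritableComparable;

// Simplified stand-in for the real record class.
public class ArchiveRecord implements WritableComparable<ArchiveRecord> {
    private long eventId;
    private float score;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(eventId);
        out.writeFloat(score);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        eventId = in.readLong();
        score = in.readFloat();
    }

    @Override
    public int compareTo(ArchiveRecord other) {
        return Long.compare(eventId, other.eventId);
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Every record is appended under a NullWritable key.
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(new Path("/archive/part-00000.seq")),
                SequenceFile.Writer.keyClass(NullWritable.class),
                SequenceFile.Writer.valueClass(ArchiveRecord.class))) {
            writer.append(NullWritable.get(), new ArchiveRecord());
        }
    }
}
```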
I would like to achieve two goals. One is to reduce the size of the data, as it is getting really big; I have to add new servers every few weeks and the costs are constantly growing. The other is to make it easier to add new fields; in the current state, if I want to add a new field I have to rewrite all of the old data.
I tried to achieve this by using an EnumMap inside this class. It gave quite good results: it allows new fields to be added easily, and the size of the data was reduced by 20% (the reason being that many fields in a record are often empty). But the code I wrote looks awful, and it gets even uglier when I also try to put Lists and Maps into the EnumMap. It's fine for data of a single type, but trying to combine all of the fields is a nightmare.
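To illustrate the pattern (the enum and fields below are made-up examples, and this only handles scalar types; the real mess starts when Lists and Maps have to go through the same switch):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.EnumMap;
import java.util.Map;

import org.apache.hadoop.io.Writable;

public class SparseRecord implements Writable {
    // Hypothetical fields; only non-empty ones are serialized,
    // which is where the size reduction comes from.
    enum Field { EVENT_ID, NAME, SCORE }

    private final EnumMap<Field, Object> values = new EnumMap<>(Field.class);

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeByte(values.size());                // number of populated fields
        for (Map.Entry<Field, Object> e : values.entrySet()) {
            out.writeByte(e.getKey().ordinal());     // field tag
            switch (e.getKey()) {                    // per-field type dispatch
                case EVENT_ID: out.writeLong((Long) e.getValue()); break;
                case NAME:     out.writeUTF((String) e.getValue()); break;
                case SCORE:    out.writeFloat((Float) e.getValue()); break;
            }
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        values.clear();
        int count = in.readByte();
        for (int i = 0; i < count; i++) {
            Field f = Field.values()[in.readByte()];
            switch (f) {
                case EVENT_ID: values.put(f, in.readLong()); break;
                case NAME:     values.put(f, in.readUTF()); break;
                case SCORE:    values.put(f, in.readFloat()); break;
            }
        }
    }
}
```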
So I looked at some other popular formats. I have tried Avro and Parquet, but the size of the data is almost exactly the same as the SequenceFiles with the custom class, before the EnumMap change. So they solve the problem of adding new fields without rewriting old data, but I feel there is more potential to optimize the size.
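To be clear, the schema evolution part does work: a reader schema that adds a field with a default can read files written before that field existed. A minimal sketch, with a made-up schema and path:

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

public class ReadWithEvolvedSchema {
    public static void main(String[] args) throws IOException {
        // Hypothetical evolved schema: "source" was added later with a
        // default, so older files can still be read against it.
        Schema readerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"ArchiveRecord\",\"fields\":["
            + "{\"name\":\"eventId\",\"type\":\"long\"},"
            + "{\"name\":\"name\",\"type\":[\"null\",\"string\"],\"default\":null},"
            + "{\"name\":\"source\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

        DatumReader<GenericRecord> datumReader = new GenericDatumReader<>(readerSchema);
        // DataFileReader picks up the writer schema from the file itself
        // and resolves it against the reader schema above.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(new File("/archive/old-data.avro"), datumReader)) {
            for (GenericRecord rec : reader) {
                System.out.println(rec.get("eventId") + " -> " + rec.get("source")); // null for old files
            }
        }
    }
}
```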
One more thing I still have to check is the time it takes to load the data (this will also tell me whether it's OK to use bzip2 compression, or whether I have to go back to gzip for performance reasons). But before I proceed with that, I was wondering if someone could suggest another solution or a hint.
Thanks in advance for all comments.
Most of your approach seems good. I'll just add a few thoughts of my own in this answer.
None of the types you have mentioned are any more complex than the data types Spark supports, so I wouldn't bother changing the data types in any way.
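For example, a Spark SQL schema can express all of them directly (field names below are just illustrative):

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Example schema mirroring the field types listed in the question.
StructType schema = new StructType(new StructField[] {
    new StructField("name",  DataTypes.StringType,  true, Metadata.empty()),
    new StructField("count", DataTypes.IntegerType, true, Metadata.empty()),
    new StructField("score", DataTypes.FloatType,   true, Metadata.empty()),
    new StructField("flag",  DataTypes.ShortType,   true, Metadata.empty()),
    new StructField("tags",  DataTypes.createArrayType(DataTypes.StringType), true, Metadata.empty()),
    new StructField("attrs", DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType), true, Metadata.empty())
});
```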
By adding servers, are you also adding compute? Storage should be relatively cheap, and I'm wondering whether your new servers come with compute you don't really need; you should only be paying to store and retrieve data. Consider a simple object store like S3, which only charges for storage space and gives a free quota of access requests (GET/PUT/POST) - I believe about 1,000 requests are free, and storage costs only ~$10 per terabyte per month.
If your use case involves writing to the files more often than reading them, I'd recommend not storing them on HDFS; it is better suited to write-once, read-many applications. That said, I'd recommend starting with Parquet, since I think you will need a file format that allows slicing and dicing the data. Avro is also a good choice, as it supports schema evolution, but it's a better fit when you have complex structures where you need to specify the schema and want easy serialization/deserialization to and from Java objects.
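As a rough sketch of what that slicing and dicing buys you (paths and column names below are hypothetical), a Parquet scan that touches only a couple of columns reads only those columns from disk:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("archive-scan").getOrCreate();

// Column pruning: only the referenced columns are read, and the
// filter can be pushed down to the Parquet reader.
Dataset<Row> records = spark.read().parquet("hdfs:///archive/records");
records.select("eventId", "name")
       .filter("score > 0.5")
       .write()
       .parquet("hdfs:///archive/filtered");
```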
Bzip2 gives the highest compression but is also the slowest, so I'd recommend it only if the data isn't used or queried frequently. Gzip compresses slightly less than bzip2 but is considerably faster. Also consider Snappy, which balances speed and storage; inside container formats such as Parquet or Avro it still yields splittable files, which is useful for MapReduce jobs.
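Whichever codec you end up with, it's just a configuration switch, so benchmarking a few of them is cheap. For example, with Parquet output in Spark (paths below are hypothetical):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("codec-test").getOrCreate();

// Session-wide default codec for Parquet output (snappy is the usual default):
spark.conf().set("spark.sql.parquet.compression.codec", "gzip");

// Or per write:
Dataset<Row> records = spark.read().parquet("hdfs:///archive/records");
records.write().option("compression", "snappy").parquet("hdfs:///archive/records-snappy");
```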