I'm looking into how BigTable compresses my data.
I've loaded 1.5 GB into a single table: about 500k rows, each containing one column, and on average each cell holds about 3 KB. In further tests, more columns will be added to these rows, containing similar data of similar size.
The data in each cell is currently a JSON-serialized array of dictionaries (10 elements on average), like:
[{
"field1": "100.10",
"field2": "EUR",
"field3": "10000",
"field4": "0",
"field5": "1",
"field6": "1",
"field7": "0",
"field8": "100",
"field9": "110.20",
"field10": "100-char field",
"dateField1": "1970-01-01",
"dateField2": "1970-01-01",
"dateTimeField": "1970-01-01T10:10:10Z"
},{
"field1": "200.20",
"field2": "EUR",
"field3": "10001",
"field4": "0",
"field5": "1",
"field6": "0",
"field7": "0",
"field8": "100",
"field9": "220.30",
"field10": "100-char field",
"dateField1": "1970-01-01",
"dateField2": "1970-01-01",
"dateTimeField": "1970-01-01T20:20:20Z"
}, ...]
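For illustration, here is a minimal sketch of how a cell like this could be written with the Python client library; the project, instance, table, and column family names are placeholders, not my real setup:

import json
from google.cloud import bigtable

# Placeholders -- not the actual project/instance/table names.
client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("my-table")  # assumes a column family "cf1" already exists

# The whole JSON array (~10 dictionaries, ~3 KB) goes into a single cell.
payload = json.dumps([{"field1": "100.10", "field2": "EUR", "field3": "10000"}])

row = table.direct_row(b"row-00000001")
row.set_cell("cf1", b"json", payload.encode("utf-8"))
row.commit()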
The BigTable console shows me that the cluster holds 1.2 GB. It thus compressed the 1.5 GB I inserted to roughly 80% of the original size. Gzipping a typical string as it is stored in the cells, however, gives me a compressed size of about 20% of the original.
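A quick way to reproduce the gzip comparison on a sample cell (the record below is a made-up stand-in for a real element, so the exact ratio will differ):

import gzip
import json

# Stand-in for one real cell: an array of ~10 dictionaries like the example above.
# Repeating an identical record makes the result optimistic; real cells vary per element.
record = {
    "field1": "100.10", "field2": "EUR", "field3": "10000",
    "field10": "x" * 100,
    "dateField1": "1970-01-01",
    "dateTimeField": "1970-01-01T10:10:10Z",
}
cell = json.dumps([record] * 10).encode("utf-8")

compressed = gzip.compress(cell)
print(f"{len(cell)} B raw -> {len(compressed)} B gzipped "
      f"({len(compressed) / len(cell):.0%} of original)")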
This compression performance from BigTable therefore seems low to me, given that the data I'm inserting contains a lot of repetitive values (e.g. the dictionary keys). I understand that BigTable trades off compression for speed, but I'd hoped it would perform better on this data.
Is a compression ratio of 80% OK for data like the above, or are lower values to be expected? Are there any techniques to improve the compression, apart from remodeling the data I'm uploading?
Thanks!
Lower values are definitely expected. We've found and fixed a bug related to the use of compression in Cloud Bigtable, and the fix is now live in production.
For data such as the example you posted, you should now see a much higher compression ratio and lower disk usage!