Is storing 86k super columns (with 1-10 small columns each) per row a good idea in Cassandra?


tldr: Is ~90,000 super columns with 1 to 10 columns each too many in one row? How about ~1500? Column values are about 6 bytes each.

full question:

I am researching various data stores for time series data. Column oriented databases such as Cassandra and HBase look to be a very good fit.

The requirements are to store millions of series of data coming in at (minimum) a 1 minute interval. Ideally we would be able to support a 1 second interval if the business needs demand it (and they probably will).
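For scale, here is a rough write-rate estimate for one million series (an assumed round figure within the "millions" range stated above):

```python
series = 1_000_000

# Approximate cluster-wide write rate at each sampling interval.
writes_per_sec_1min = series / 60   # one write per series per minute
writes_per_sec_1s = series / 1      # one write per series per second

print(f"{writes_per_sec_1min:,.0f} writes/s at 1-minute interval")   # ~16,667
print(f"{writes_per_sec_1s:,.0f} writes/s at 1-second interval")     # 1,000,000
```

Dropping the interval from a minute to a second multiplies the sustained write load by 60, which is worth keeping in mind when sizing the cluster.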

The advice offered in this blog post, and the approach used by OpenTSDB, makes a ton of sense.

Essentially, keys are the series id concatenated with the first timestamp of the day, and a column is created for each measurement in the day. That comes to about 86,400 columns per row (one per second).
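A minimal sketch of that key scheme, with hypothetical names and formats (the actual encoding in the blog post and OpenTSDB may differ):

```python
from datetime import datetime, timezone

def row_key(series_id: str, ts: datetime) -> str:
    """Row key = series id + epoch seconds of that day's midnight (hypothetical format)."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return f"{series_id}:{int(midnight.timestamp())}"

def column_name(ts: datetime) -> int:
    """Column name = offset in seconds into the day, giving up to 86,400 columns per row."""
    return ts.hour * 3600 + ts.minute * 60 + ts.second

ts = datetime(2012, 5, 1, 13, 45, 30, tzinfo=timezone.utc)
print(row_key("cpu.load.host42", ts))  # cpu.load.host42:1335830400
print(column_name(ts))                 # 49530
```

One row then holds a full day of one series, and a day's data is fetched with a single row read.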

However immutability/versioning of the data is quite important. Business needs dictate the ability to update series values while retaining full history of the data.

Exploring Cassandra's super columns to provide another dimension for versioning the values results in 86,400 super columns. Each super column would contain one column when the value is first created (keyed, possibly, by a TimeUUID), with one more column added on each "update". Updates will occur regularly, but only to limited subsets of series and values; under ideal conditions there will be no updates at all. This means each super column should not have a huge amount of data to load, and most access will be only to the most recent value.
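The versioning scheme described above can be modelled in memory as a two-level map, row → super column → versioned columns (a sketch only; helper names are made up, and a real client library would map these levels onto the Thrift API):

```python
import uuid

# row maps second-of-day offset -> {version TimeUUID: value}
row = {}

def write(row, second_offset, value):
    """Each write adds a new versioned column inside the super column,
    so full history is retained rather than overwritten."""
    versions = row.setdefault(second_offset, {})
    versions[uuid.uuid1()] = value  # uuid1 is a TimeUUID (timestamp-ordered)

def latest(row, second_offset):
    """Most recent value = column whose TimeUUID has the greatest timestamp."""
    versions = row[second_offset]
    newest = max(versions, key=lambda u: u.time)
    return versions[newest]

write(row, 49530, 6.25)
write(row, 49530, 6.30)   # an "update" adds a second version
print(latest(row, 49530)) # 6.3
```

Reading "the most recent value" is then a max over the version keys within one super column, while the older versions stay in place for audit.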

So to come back to the question:

Is there a performance hit or issue I am overlooking in using that many (86k) super columns per row?

1 Answer

jbellis (accepted answer):

Conservatively, taking 100K super columns at 1K per super column comes out to 100MB per row, which is well within what Cassandra can handle.
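The back-of-envelope arithmetic checks out (using the answer's own conservative 1 KB-per-super-column estimate):

```python
supercolumns_per_row = 100_000
bytes_per_supercolumn = 1_024  # conservative 1K estimate from the answer

row_size_mb = supercolumns_per_row * bytes_per_supercolumn / 1_000_000
print(f"{row_size_mb:.1f} MB per row")  # 102.4 MB per row, i.e. ~100MB
```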

Another factor to consider is how many rows you have. "One big row" is a bad data model, since the row is the unit of partitioning. As long as you have many more rows than nodes, you should be fine.