Cassandra column family for multiple usecases good/bad

395 views Asked by At

I have a data structure to store different areas/locations, area is part of a city, city is part of a state, state is part of country.

couple of usecases to query are

U1 : Given a location get city, state, country. U2 : Given a Country, State get list of cities, areas

I understand the data to be stored in cassandra should be optimized for the read path.

Key for U1 to be {location, city, state, country}

Key for U2 to be {country, state, city, location}

Would like to know if there is exist a better way other than storing values in two column families

2

There are 2 answers

0
Nageswara Rao On BEST ANSWER

That's the correct way to model, and you can update both tables at once using a logged batch. This is pretty standard modeling. --Ryan Svihla - Solution Architect - datastax

0
Evan Volgas On

Okay so the first thing you'll want to do is start with the questions you want to answer. You did that, and I think you're basically on the right track... but you might need to tweak a few things.

To get started, definitely take a look at this article http://opensourceconnections.com/blog/2013/07/24/understanding-how-cql3-maps-to-cassandras-internal-data-structure/ Cassandra Cli is being depreciated, but it's still useful to see how data are actually stored in the database.

In particular, you'll want to note that there are potentially two parts to every primary key: the partitioning key and the clustering columns (eg http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/create_table_r.html?scroll=reference_ds_v3f_vfk_xj__using-a-composite-partition-key).

The partitioning key determines which node in a Cassandra cluster will actually have the data. The clustering columns, if any are defined, essentially determine the sort order. They also affect which sort of range queries you can use.

All of that said, you have essentially two questions for your table: " Given a Country...." and "Given a location..."

So, under these circumstances, you might want to consider primary keys that look like

(location, country, state, city) -- location is the partitioning key, (country, state, city) are the clustering columns (you could run SELECT * FROM table WHERE location = X and country = Y but you could not run SELECT * FROM table WHERE location = X and city = Y using an index like this... take a look at http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/select_r.html?scroll=reference_ds_d35_v2q_xj__filtering-data-using-where if this doesn't make sense to you)

And (country, state, city, location) -- here country is the partition key and (state ,city, location) are the clustering columns

Now, all of that said, you might be able to get away with using only the first table and create a secondary index on country. It certainly seems possible, even though it wouldn't be as fast as using two tables. And there are definitely some additionl concerns you'll want to consider about secondary indices (eg http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html) but I doubt if the table you are describing is frequently updated or that there will be extremely high cardinality on country.

So... all that said, consider using two tables with location as your partitioning key in the first case and country as your partitioning key in the second. Also consider only using the first and creating a secondary index on country. The later option might be easier for you to maintain, but it's not gonna be as fast. In does seem like the sort of problem in which a secondary index would be reasonable though. I think it would ultimately depend on your latency requirements.