I have a column family in Cassandra in which I am going to store something like this:
BundleName      | Version
----------------|--------
FrameworkBundle | 1.0.0
BundleA         | 1.0.0
BundleB         | 1.0.0
BundleC         | 1.0.0
BundleD         | 1.0.0
I am using the Astyanax client to retrieve the data from the Cassandra database. I am going to have a method that retrieves the data from Cassandra:
public Map<String, String> getFromDatabase() {
    // 1) The first time, return everything in the map
    // 2) After that, return only the bundles whose version has changed, if any
}
This method should return everything as a Map, something like this:
Key as FrameworkBundle and Value as 1.0.0
Key as BundleA and Value as 1.0.0
Key as BundleB and Value as 1.0.0
....
and likewise for the other bundles.
Now what I need is:
- The first time I run my application, it should return everything in the map, just like above.
- A background thread will check the Cassandra database every 15 minutes for new bundle versions. If any bundle has a new version, the method should return only that bundle's name and its new version; if no version has changed, it should return nothing. The same check then repeats every 15 minutes.
In other words, I want everything returned only the first time; after that, I don't want anything returned unless a bundle version has changed.
I am not sure whether Cassandra can provide this information directly, without my writing some logic to extract it.
What's the best and most efficient way to do this in Cassandra? I don't want to retrieve all the data from the Cassandra database every 15 minutes and then run some logic to find out which bundle versions changed.
Well, Cassandra is something like a key/value store, so to make this work you need a sensible row key. You always need the row key when you submit a (column range) query. Neither bundle name nor version is a very good row key, since you would need to know them in advance. Do you have some kind of application categorization or other feature that you could use for partitioning?
For instance, if you had an application type id (commercial, open source, private...) as another field, you could easily create a table where your clustering/column key is a timestamp and your row key is the application type id. Whenever there is a new version, insert the version number under application type / timestamp. Then do a range query on the timestamp.
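As a minimal sketch of the layout described above, the table definition could be kept as a CQL string in Java. The table name `bundles` and the column names `type`, `added`, `bundle_name` and `version` are assumptions made for illustration only; they do not come from the original post.

```java
// Hypothetical schema for the time-ordered layout: names are illustrative assumptions.
final class BundleSchema {
    private BundleSchema() {}

    // Row (partition) key: type. Clustering (column) key: added, a timeuuid,
    // so the releases of one type are stored in time order within a single row.
    static final String CREATE_TABLE =
        "CREATE TABLE bundles ("
        + " type int,"
        + " added timeuuid,"
        + " bundle_name text,"
        + " version text,"
        + " PRIMARY KEY (type, added)"
        + ")";
}
```

With this key layout, a time-range condition on `added` is only valid together with an equality condition on `type`, which is why the row key has to be known when querying.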
If you run for the first time and want to know all new releases, you run:
Since there have been no inserts so far.
Then you (or some other process) insert a new release. Assuming you have a couple of software categories stored in a field named "type", with values such as "Frameworks" or "Open Source" or whatever fits your use case, you could insert data like this:
This stores a new column (under the column key value of now()) in partition 1 (for type, our sharding key).
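A minimal sketch of building such an insert as a CQL string, assuming a hypothetical table `bundles` with columns `type` (row key), `added` (timeuuid column key), `bundle_name` and `version`. The helper name `insertRelease` is likewise made up for this example.

```java
// Sketch only: table, column and method names are illustrative assumptions.
final class BundleInsert {
    private BundleInsert() {}

    // Builds an INSERT whose column key is generated on the server by now(),
    // so every release gets its own time-ordered column in the row for `type`.
    static String insertRelease(int type, String bundleName, String version) {
        return "INSERT INTO bundles (type, added, bundle_name, version) VALUES ("
            + type + ", now(), '" + escape(bundleName) + "', '" + escape(version) + "')";
    }

    // CQL string literals escape a single quote by doubling it.
    private static String escape(String s) {
        return s.replace("'", "''");
    }
}
```

For example, `BundleInsert.insertRelease(1, "BundleA", "1.0.1")` yields a statement that writes BundleA 1.0.1 into partition 1.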
Fifteen minutes later, if you want to know all new releases over the last 15 minutes, you run:
You would need one query per type. The TimeUUID type would guarantee that inserts remain collision free.
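The per-type range query could be sketched as follows. This assumes a hypothetical table `bundles` whose row key is `type` and whose column key `added` is a timeuuid; `minTimeuuid` is the CQL function that turns a timestamp into the smallest possible timeuuid for that instant, which gives a usable lower bound. Passing a very early instant (for example the epoch) as the lower bound is one way to get everything on the first run.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch only: table, column and method names are illustrative assumptions.
final class BundleRangeQuery {
    private BundleRangeQuery() {}

    // CQL timestamp literal in UTC, e.g. 2013-06-01 12:00:00+0000
    private static final DateTimeFormatter CQL_TIMESTAMP =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ssZ").withZone(ZoneOffset.UTC);

    // Selects the releases of one type whose timeuuid sorts after the lower bound for `since`.
    static String releasesSince(int type, Instant since) {
        return "SELECT bundle_name, version FROM bundles WHERE type = " + type
            + " AND added > minTimeuuid('" + CQL_TIMESTAMP.format(since) + "')";
    }
}
```

A background job would call `releasesSince` once per type, passing the time of its previous check as `since`.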
If you are worried about rows getting too long (more than 2 billion columns), you could use buckets to limit row length.
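One possible way to bucket, sketched here under the assumption that one bucket per type and calendar month is fine-grained enough: derive the row key from the type plus the month of the release. The `type:yyyy-MM` key format and the helper name `bucketKey` are invented for this example.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch only: the key format and names are illustrative assumptions.
final class BundleBuckets {
    private BundleBuckets() {}

    private static final DateTimeFormatter MONTH =
        DateTimeFormatter.ofPattern("yyyy-MM").withZone(ZoneOffset.UTC);

    // Combines the type with the UTC month of the release, so each row
    // only ever holds one month of releases for one type.
    static String bucketKey(int type, Instant when) {
        return type + ":" + MONTH.format(when);
    }
}
```

The trade-off is that a range query spanning a month boundary has to be issued against both buckets.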
To insert in Astyanax using CQL3 queries, you can use
where cql is your CQL query string and CF_BUNDLES is an instance of ColumnFamily.
To fetch data in Astyanax using the CQL query defined above, you can use
which enables you to iterate over the results.
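Putting the pieces together, the polling behaviour asked for in the question could be sketched like this. The database call is hidden behind a hypothetical `ReleaseSource` interface so the example stays self-contained; with Astyanax, an implementation would run the time-range CQL query and copy each returned row into the map. The names `ReleaseSource` and `BundlePoller` are assumptions, not part of Astyanax.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical abstraction over the Cassandra query (not an Astyanax type):
// returns bundle name -> version for every release added after `since`.
interface ReleaseSource {
    Map<String, String> releasesSince(Instant since);
}

final class BundlePoller {
    private final ReleaseSource source;
    // Starting at the epoch means the first poll sees every release.
    private Instant lastPoll = Instant.EPOCH;

    BundlePoller(ReleaseSource source) {
        this.source = source;
    }

    // First call: everything. Later calls: only releases added since the
    // previous call, or an empty map if nothing changed.
    // `now` should be captured by the caller just before invoking this method.
    synchronized Map<String, String> getFromDatabase(Instant now) {
        Map<String, String> changes = new LinkedHashMap<>(source.releasesSince(lastPoll));
        lastPoll = now;
        return changes;
    }
}
```

A scheduled task running every 15 minutes would call `getFromDatabase(Instant.now())` and act only when the returned map is non-empty.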