How to get data from Cassandra every 15 minutes but return only the information that changed?


I have a Column family in Cassandra in which I am going to store something like this-

BundleName    |     Version
----------------------------
FrameworkBundle    1.0.0
BundleA            1.0.0
BundleB            1.0.0
BundleC            1.0.0
BundleD            1.0.0

I am using Astyanax client to retrieve the data from Cassandra database. I am going to have some method which will retrieve the data from Cassandra-

public Map<String, String> getFromDatabase() {

    // 1) For the first time, return me everything in the map
    // 2) Second time, it should return me only the change if there is any bundle version change

}

Now this method should return me everything as the Map, something like this-

Key as FrameworkBundle and Value as 1.0.0
Key as BundleA and Value as 1.0.0
Key as BundleB and Value as 1.0.0
....
And for other Bundles like above

Now what I need is-

  1. For the first time when I am running my application, it should return me everything in the map just like above.
  2. A background thread will check the Cassandra database every 15 minutes for new bundle versions. If any bundle has a new version, it should return just that bundle name and its new version; if no version has changed, it should return nothing. The same process then repeats every 15 minutes.

In other words, I want to return everything only the first time; after that, I don't want to return anything unless a bundle version has changed.

I am not sure whether Cassandra can provide directly the information on this without writing some sort of logic to get the information I need.

What's the best and most efficient way to do this in Cassandra? I don't want to retrieve all the data from the Cassandra database every 15 minutes and then run some logic to find out which bundle versions changed.


There is 1 answer

John On BEST ANSWER

Well, Cassandra is something like a key/value store, so in order to make this happen, you need a sensible row key. You always need the row key when you submit a (column range) query. Neither bundle name nor version is a very good row key, since you would need to know them in advance. Do you have some kind of application categorization or other feature that you could use for partitioning?

For instance, if you had application type id (commercial, open source, private...) as another field, you could easily create a table where your clustering/column key is a timestamp. Your row key could be your application type id. Whenever there is a new version, insert the version number to application / timestamp. Then, do a range query using the timestamp.

  CREATE TABLE Bundles (
    bundle varchar,
    type varchar,
    ts timeuuid,
    version varchar,
    PRIMARY KEY (type, ts)
  );

If you run for the first time and want to know all new releases, you run:

cqlsh:test> SELECT * FROM Bundles WHERE 
    ...        type = 'OSS' and
    ...        ts < maxTimeuuid('2013-08-27 09:00:00');

(empty resultset)

Since there have been no inserts so far.

Then you (or some other process) insert a new release. Assuming you have a couple of software categories stored in "type" (for example "Frameworks", "Open Source", or whatever fits your use case), you could insert data like this:

cqlsh:test> INSERT INTO Bundles (bundle, type, ts, version) 
 VALUES ('SomeFramwork', 'OSS', now(), '0.1.0a');

This stores a new column (under the column key value of now()) in the partition for type 'OSS' (type is our sharding key).

Fifteen minutes later, if you want to know all new releases over the last 15 minutes, you run:

    cqlsh:test> SELECT type, dateOf(ts), bundle, version FROM Bundles WHERE
     type = 'OSS' and
     ts > minTimeuuid('2013-08-27 09:00:00')
     and ts < maxTimeuuid('2013-08-27 09:15:00');

     type | dateOf(ts)               | bundle       | version
    ------+--------------------------+--------------+---------
      OSS | 2013-08-27 09:14:27+0200 | SomeFramwork |  0.1.0a

You would need one query per type. The TimeUUID type guarantees that inserts remain collision-free.
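As a rough illustration of running one query per type, the sketch below builds the range query shown above for each type and a given polling window. The class and method names (WindowQueries, cqlTimestamp, buildQuery, buildQueries) are hypothetical, and it assumes timestamps are written in UTC and that type names come from a small fixed list in your own code.

```java
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper: builds one CQL range query per type for a polling window.
final class WindowQueries {

    // Renders epoch milliseconds as a CQL timestamp literal in UTC (an assumption;
    // the cqlsh examples above use local time).
    static String cqlTimestamp(long epochMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ssZ", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date(epochMillis));
    }

    // Builds the range query for one type, in the same shape as the query above.
    static String buildQuery(String type, long startMillis, long endMillis) {
        String safeType = type.replace("'", "''"); // escape quotes inside the CQL string literal
        return "SELECT type, dateOf(ts), bundle, version FROM Bundles"
                + " WHERE type = '" + safeType + "'"
                + " AND ts > minTimeuuid('" + cqlTimestamp(startMillis) + "')"
                + " AND ts < maxTimeuuid('" + cqlTimestamp(endMillis) + "')";
    }

    // One query per type, all covering the same window.
    static List<String> buildQueries(List<String> types, long startMillis, long endMillis) {
        List<String> queries = new ArrayList<String>();
        for (String type : types) {
            queries.add(buildQuery(type, startMillis, endMillis));
        }
        return queries;
    }

    public static void main(String[] args) {
        long end = System.currentTimeMillis();
        long start = end - 15 * 60 * 1000L; // the last 15 minutes
        for (String q : buildQueries(Arrays.asList("OSS", "Commercial"), start, end)) {
            System.out.println(q);
        }
    }
}
```

Because both bounds are inclusive of their boundary millisecond, two adjacent windows can return the same row twice if it was written exactly on the boundary; for a bundle-to-version map that duplicate is harmless.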

If you are worried about rows getting too long (more than 2 billion columns), you could use buckets to limit row length.
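One possible way to bucket, sketched below, is to append a calendar month to the type so that each row only holds one month of releases. This is an assumption rather than part of the schema above: the row key would have to change to something like "OSS:2013-08", and the names BucketKeys, bucketKey and bucketsFor are made up for illustration.

```java
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Date;
import java.util.List;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical helper: derives month-sized bucket keys such as "OSS:2013-08".
final class BucketKeys {

    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // Row key for the bucket that contains the given instant (UTC months assumed).
    static String bucketKey(String type, long epochMillis) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM", Locale.ROOT);
        fmt.setTimeZone(UTC);
        return type + ":" + fmt.format(new Date(epochMillis));
    }

    // All bucket keys a query window touches, oldest first.
    static List<String> bucketsFor(String type, long startMillis, long endMillis) {
        Calendar cal = Calendar.getInstance(UTC, Locale.ROOT);
        cal.setTimeInMillis(startMillis);
        // Rewind to the first instant of the starting month.
        cal.set(Calendar.DAY_OF_MONTH, 1);
        cal.set(Calendar.HOUR_OF_DAY, 0);
        cal.set(Calendar.MINUTE, 0);
        cal.set(Calendar.SECOND, 0);
        cal.set(Calendar.MILLISECOND, 0);

        List<String> keys = new ArrayList<String>();
        while (cal.getTimeInMillis() <= endMillis) {
            keys.add(bucketKey(type, cal.getTimeInMillis()));
            cal.add(Calendar.MONTH, 1); // step to the start of the next month
        }
        return keys;
    }
}
```

A 15-minute window normally touches a single bucket and touches two only when it spans a month boundary, so the polling query would run once per returned key.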

To insert in Astyanax using CQL3 queries, you can use

    keyspace.prepareQuery(CF_BUNDLES).withCql(cql).execute();

where cql is your CQL query string and CF_BUNDLES is an instance of ColumnFamily.

To fetch data in Astyanax with the CQL query defined above, you can use

    CqlResult<String, String> result = keyspace
            .prepareQuery(CF_BUNDLES)
            .withCql(cql)
            .execute()
            .getResult();

which enables you to iterate over the results.
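To tie this back to the getFromDatabase() method in the question, here is a minimal sketch of the polling logic on the client side. It leaves Astyanax out entirely: ReleaseSource is a hypothetical interface that a real implementation would back with the range query above, and InMemorySource is a stand-in so the sketch runs on its own. Unlike the method in the question, it takes the current time as a parameter to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical poller: remembers the last poll time and returns only newer releases.
final class BundlePoller {

    /** One row of the Bundles table, with ts reduced to epoch milliseconds. */
    static final class Release {
        final String bundle;
        final String version;
        final long tsMillis;

        Release(String bundle, String version, long tsMillis) {
            this.bundle = bundle;
            this.version = version;
            this.tsMillis = tsMillis;
        }
    }

    /** Hypothetical seam; a real implementation would run the CQL range query. */
    interface ReleaseSource {
        /** Releases with fromMillis <= ts < toMillis, oldest first. */
        List<Release> releasesBetween(long fromMillis, long toMillis);
    }

    /** In-memory stand-in for Cassandra; add() must be called in timestamp order. */
    static final class InMemorySource implements ReleaseSource {
        private final List<Release> rows = new ArrayList<Release>();

        void add(String bundle, String version, long tsMillis) {
            rows.add(new Release(bundle, version, tsMillis));
        }

        public List<Release> releasesBetween(long fromMillis, long toMillis) {
            List<Release> out = new ArrayList<Release>();
            for (Release r : rows) {
                if (r.tsMillis >= fromMillis && r.tsMillis < toMillis) {
                    out.add(r);
                }
            }
            return out;
        }
    }

    private final ReleaseSource source;
    private long lastPollMillis = 0L; // 0 means "never polled": the first window covers everything

    BundlePoller(ReleaseSource source) {
        this.source = source;
    }

    /** Returns bundle -> version for releases since the previous call; empty if none. */
    synchronized Map<String, String> getFromDatabase(long nowMillis) {
        Map<String, String> changed = new LinkedHashMap<String, String>();
        for (Release r : source.releasesBetween(lastPollMillis, nowMillis)) {
            changed.put(r.bundle, r.version); // a later release overwrites an earlier one
        }
        lastPollMillis = nowMillis;
        return changed;
    }
}
```

The first call returns everything only if every bundle's current version was inserted into the table as a release at some point; later calls return an empty map when nothing was inserted since the previous poll.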