Cassandra pattern for modeling columns with mutable keys

Question

Cassandra pattern for modeling columns with mutable keys

691 views Asked by DanielSmedegaardBuus At 06 December 2013 at 10:38

This is my first week with Cassandra :)

I'm doing a gamey sort of application which should be able to service millions of users across the world, and Cassandra seems like a good bet for the data store.

Modeling for fast reads is fun! But I also have some concerns about whether or not I'm going down the right path with regards to some things, and this is one of those things.

Consider this CF:

CREATE TABLE leaderboard (
    user_id timeuuid,
    year int,
    month int,
    week int,
    country varchar,
    name varchar,
    above_threshold boolean,
    predictions_count int,
    correct_predictions_count int,
    accuracy_percentage float,
    PRIMARY KEY ((year, month, week, country), above_threshold,
    accuracy_percentage, correct_predictions_count, user_id)
) WITH CLUSTERING ORDER BY (above_threshold DESC, accuracy_percentage
  DESC, correct_predictions_count DESC);
CREATE INDEX user_id ON leaderboard(user_id);

As its name states, it stores leaderboard data, using a bitmap index pattern to allow fetching leaderboards for any year and/or month and/or week globally or in a specific country, e.g.

SELECT userId, name, country FROM leaderboard WHERE year=2013 AND
month=0 AND week=0 AND country='' LIMIT 10;

This would fetch the top ten global leaderboard. The data stored here is the accuracy of a user's predicted outcomes on a number of events. The "above_threshold" boolean signifies whether the current score qualifies as a rankable score. Put simply, a 100% average based on just one prediction normally won't qualify the user to rank on the first place — he or she would have to have made more predictions to qualify. On the other hand, at the beginning of a new time period, we don't want to have an empty leaderboard until users have made enough predictions to qualify for ranking there, so I've added the "above_threshold" flag and made it the primary decisor in the compound column key so that non-qualifying ranks are allowed in the top of the leaderboard so long as there aren't enough qualifying ranks to fill it.

Anyway! :) Given the quite large truth table for the bitmap index row key, there's gonna be quite a few writes here when data is inserted and changed. I know writes are fast, so I'm not too concerned about that.

My concern has to do with updating existing scores. When an event takes place, my application will distribute scores to all those users that have made a prediction on that event. This means updating the leaderboard entries in a large part of the columns for each of those users. And furthermore, since we're not allowed to change the primary key, I'll have to delete the old scores, then insert the new ones.

This seems a bit ... ugly to me. I can't think of a better way to do it, though. Logically, it'd be nice to have rows as in the above table, and then a column key per 1/10th of a percentage point (i.e. 100.0, 99.9 ... 0.0) with each column containing the number of users currently at that score and then a map of all those users' ids and names. But adding and removing users from that map would be a performance nightmare, not to mention the 64K items limit on Cassandra collections. Querying for a top-something leaderboard would also be ugly as far as I can predict.

So what do you think? Have I overcomplicated things here, am I still to relational in my mindset, or is this acceptable Cassandra-style data modeling for such a usecase?

UPDATE; as requested, here's the current implementation. It's a bit different from my initial stab at it, as I needed something atomic for safely counting, i.e. counters, and mixing counters with non-counters outside the primary key in a CF is (was?) a no-go. It's been a while since I last dealt with this code, so I'm hazy about all the details. Either way, this is the schema definition for my user scoring CF:

CREATE TABLE "scoreCounters" (
    "userId" text,
    "year" int,
    "month" int,
    "week" int,
    "day" int,
    "friendId" text,
    "games" counter,
    "gamesWon" counter,
    "predictions" counter,
    "hits" counter,
    "globalRank" counter,
    "countryRank" counter,
    PRIMARY KEY ("userId", "year", "month", "week", "day", "friendId")
) WITH
CLUSTERING ORDER BY ("year" ASC, "month" ASC, "week" ASC, "day" ASC) AND
COMPACTION = { 'class': 'LeveledCompactionStrategy' } AND
GC_GRACE_SECONDS = 864000 AND
COMMENT = '
= All the counters belonging to a user and all of her friends over the span of
  her history with us, ordered and clustered by periods in time in a matrix
  primary key. When a counter is incremented or decremented, it is done in five
  places, covering all-time, yearly, monthly, weekly, and daily scores. Zeroes
  are used for "wildcard" positions in the matrix.

  This is replicated to the partition of every known friend, making fetching the
  leaderboard of any one user for any particular period in time a single SELECT.

  Note that we do not store names here, as they would be required to be part of
  the clustering key, effectively making them immutable. 

  Also note that the date parts are UTC dates, so roll over is at midnight UTC.

  IMPORTANT: Week 1 will more often than not appear twice in a year, once in
  January, and once in December. This means that weekly counters on week 1 in
  the month of December may still be incremented until the end of the week in
  the year that follows. Similarly, weekly scores in the first week in January
  may already contain counters collected in the same week in December the
  previous year.

  The same holds for week 52 or 53, when it continues into the following year.

  This way, you never have to do two reads to get the weekly score. But on the
  other hand, you should know that if you add all weekly scores together, you
  will get a higher count than what is accumulated for that year.

  You should also make sure that if you iterate over the counters for each week
  in a year, the first week may not be 1, but 52 or 53 (in January), and the
  last one might not be 52 or 53.

* games, gamesWon, predictions, and hits (# correct predictions) are the
  basis of all ranking and scoring statistics.

* globalRank and countryRank are the most recently calculated ranks for the
  user on the global and regional leaderboards.
';

Actually, I have two CFs, and I don't recall if I use both or them or just the following... Sorry for the half-assed addendum here :/

CREATE TABLE "userCounters" (
    "userId" text,
    "year" int,
    "month" int,
    "week" int,
    "day" int,
    "games" counter,
    "gamesWon" counter,
    "predictions" counter,
    "hits" counter,
    "globalRank" counter,
    "countryRank" counter,
    PRIMARY KEY ("userId", "year", "month", "week", "day")
) WITH
CLUSTERING ORDER BY ("year" DESC, "month" DESC, "week" DESC, "day" DESC) AND
COMPACTION = { 'class': 'LeveledCompactionStrategy' } AND
GC_GRACE_SECONDS = 864000 AND
COMMENT = '
= All the counters belonging to a user, ordered and clustered by periods in time
  in a matrix primary key. When a conuter is incremented or decremented, it is
  done in five places, covering all-time, yearly, monthly, weekly, and daily
  scores. Zeroes are used for "wildcard" positions in the matrix.

  Note that the date parts are UTC dates, so roll over is at midnight UTC.

  IMPORTANT: Week 1 will more often than not appear twice in a year, once in
  January, and once in December. This means that weekly counters on week 1 in
  the month of December may still be incremented until the end of the week in
  the year that follows. Similarly, weekly scores in the first week in January
  may already contain counters collected in the same week in December the
  previous year.

  The same holds for week 52 or 53, when it continues into the following year.

  This way, you never have to do two reads to get the weekly score. But on the
  other hand, you should know that if you add all weekly scores together, you
  will get a higher count than what is accumulated for that year.

  You should also make sure that if you iterate over the counters for each week
  in a year, the first week may not be 1, but 52 or 53 (in January), and the
  last one might not be 52 or 53.

* games, gamesWon, predictions, and hits (# correct predictions) are the
  basis of all ranking and scoring statistics.

* globalRank and countryRank are the most recently calculated ranks for the
  user on the global and regional leaderboards.
';

Here's the node.js code used to increment the counters (prepares queries for async and node-cassandra-cql:

/**
 * Generates queries for incrementing (or decrementing, when given negative values) user counters.
 * 
 * As noted in the schema, certain weeks (1, 52, and 53) will sometimes span two years, in December and January.
 * 
 * To ensure that we can always read once to get the counters we want in the time frame that we want, we update
 * weekly counters on such weeks in both years. Or, to put it very precisely: If we're in December, week 1, or
 * January, week 52 or 53, we update the counters on that week on both sides of New Year — just the weekly
 * counters in the other year, mind you, not the monthly nor yearly ones.
 * 
 * @param   {string}    userId      User's id
 * @param   {object}    counters    Hash of counter keys to increment and the integer amounts by which to do so.
 * @param   {Date}      date        Date instance representing the time for when to increment the counters.
 * @param   {function}  queries     (optional) array to add queries to if you prefer that to handling the return value.
 * @return  {Array}                 Queries array (same as the queries argument if given).
 */
module.exports.incrementCounters = function (userId, date, counters, queries) {

    var query  = 'UPDATE "userCounters" SET ',
        userId = { hint: 'text', value: userId },
        zero   = { hint: 'int', value: 0 },
        year   = { hint: 'int', value: date.getUTCFullYear() },
        month  = { hint: 'int', value: date.getUTCISOMonth() },
        week   = { hint: 'int', value: date.getUTCWeek() },
        day    = { hint: 'int', value: date.getUTCISODay() };

    if (!(queries instanceof Array))
        queries = [];

    for (var counter in counters)
        query += '"'+counter+'" = "'+counter+'" + '+counters[counter]+',';

    query = query.substr(0, query.length-1)+' WHERE "userId" = ? AND year = ? AND month = ? AND week = ? AND day = ?';

    queries.push({ query: query, params: [userId, zero, zero, zero, zero] });
    queries.push({ query: query, params: [userId, year, zero, zero, zero] });
    queries.push({ query: query, params: [userId, year, month, zero, zero] });
    queries.push({ query: query, params: [userId, year, month, week, zero] });
    queries.push({ query: query, params: [userId, year, month, week, day] });

    /* Check if we should also add to this week for last year. */
    if (month.value === 1 && week.value > 51) {
        log.debug && log.debug('Incrementing weekly user score counters for this week next year, as the current week spans New Year');

        queries.push({
            query  : query,
            params : [userId, { hint: 'int', value: year.value-1 }, { hint: 'int', value: 12 }, week, zero]
        });
    }
    /* Or for the coming year. */
    if (month.value === 12 && week.value === 1) {
        log.debug && log.debug('Incrementing weekly user score counters for this week last year, as the current week spans New Year');

        queries.push({
            query  : query,
            params : [userId, { hint: 'int', value: year.value+1 }, { hint: 'int', value: 1 }, week, zero]
        });
    }

    return queries;
};

Original Q&A

TechQA.

Cassandra pattern for modeling columns with mutable keys

There are 0 answers

Related Questions in CASSANDRA

Related Questions in DATA-MODELING

Related Questions in CLUSTERED-INDEX

Popular Questions

Popular Tags

Trending Questions