Appropriate way to denormalize Cassandra tables to support similar queries with additional parameters


My data model is really simple. It models visits to web pages.

This is what my Visit model looks like (the syntax is express-cassandra schema syntax):

fields: {
    id: {
        type: 'uuid',
        rule: {
            required: true,
            message: 'id is required'
        }
    },
    userId: {
        type: 'int',
        rule: {
            required: true,
            message: 'userId is required'
        }
    },
    dateOfVisit: {
        type: 'timestamp',
        rule: {
            required: true,
            message: 'dateOfVisit is required'
        }
    },
    urlPort: 'int',
    urlHost: {
        type: 'text',
        rule: {
            required: true,
            message: 'urlHost is required'
        }
    },
    urlPath: 'text',
    urlQuery: 'text',
    urlProtocol: {
        type: 'text',
        rule: {
            required: true,
            message: 'urlProtocol is required'
        }
    },
    urlHash: 'text',
    pageTitle: 'text'
},
key: [['id'], 'dateOfVisit'],
clustering_order: {'dateOfVisit': 'desc'}

I have a few questions about this model:

Question #1:

I'm pretty sure I want to store the parts of the URL instead of storing the whole URL as a single string because it allows me to more easily run queries for visits to a specific domain, visits to a specific path within a domain, visits to secure pages vs. visits to insecure pages, links from secure pages to insecure pages (or the inverse), etc.

But would it be better to store the parts of the URL as A) individual columns or B) a single Map column?

Also, will I have to create additional tables with different primary keys just to support the various queries on different parts of the URL?

Question #2:

I want to query the data in a number of different ways:

  • Get all visits across all users
  • Get all visits for a single user
  • Get all visits on a given day or bucketed by hour within a given day
  • Get all visits to a given domain
  • Count all visits to a given domain grouped by path

So, given the various different types of queries, how should I store this model?

I currently have essentially the exact same fields stored in multiple tables, just with different primary keys: one table with just (id) to support "get all visits across all users", another table with (id, userId) to support "get all visits for a specific user", etc.

This just feels like it creates multiple copies of the data just to support essentially the same query but with one additional condition to the where clause.

Is there a better way to model this?

There is 1 answer

Answered by Nadav Har'El

On question 1: since the components of the URL always have the same keys (host, port, path, etc.), it is more efficient to store them as individual columns rather than as a map. This is especially true in Cassandra 3 (or the upcoming Scylla 3.0), where the new, more efficient file format doesn't repeat the column names for every row. A map, which could in theory have different keys in every row, still has to store its keys with each row.
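To illustrate the individual-columns approach, here is a minimal sketch that splits a raw URL into the `url*` fields of the Visit model from the question. It relies only on the standard `URL` class available globally in Node.js; the helper name `urlToVisitColumns` and the example URL are made up for illustration, and storing `null` for a default port is just one possible convention.

```javascript
// Hypothetical helper (not part of express-cassandra): maps a raw URL onto the
// url* columns of the Visit model using the standard WHATWG URL class.
function urlToVisitColumns(rawUrl) {
    const parsed = new URL(rawUrl);
    return {
        // URL.protocol ends with ':' (e.g. 'https:'), so strip it
        urlProtocol: parsed.protocol.replace(/:$/, ''),
        urlHost: parsed.hostname,
        // URL.port is '' when the URL uses the scheme's default port
        urlPort: parsed.port === '' ? null : Number(parsed.port),
        urlPath: parsed.pathname,
        // search and hash keep their leading '?' / '#', or are '' if absent
        urlQuery: parsed.search,
        urlHash: parsed.hash
    };
}

const columns = urlToVisitColumns('https://example.com:8443/docs/intro?lang=en#top');
console.log(columns);
```

With the parts in separate columns, any one of them (for example `urlHost`) can later serve as the key of a view or as an indexed column, which a single map column would not allow as directly.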

On question 2: instead of maintaining several tables yourself (and always worrying whether their contents are consistent), you could use the Materialized Views feature (again, added in Cassandra 3 and in Scylla 3), which maintains these different tables for you. This still requires the additional disk space for all these tables, but it simplifies your application.

Another option is secondary indexes, which do not duplicate all the data but instead create additional index tables that make it possible to find the original rows in the base table. For example, such an auxiliary table would be used to get, given a URL path, the list of visits (keys to your original table) that have this path. You don't need to maintain this table yourself: all you need to do is ask for a particular column to be indexed, and Cassandra will maintain the index table automatically and use it in queries that search for a certain value of this column.
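As a rough sketch of how both suggestions might look in the question's express-cassandra schema: the `materialized_views` and `indexes` options below follow express-cassandra's schema format as I understand it, so check them against the version you use. The view names `visits_by_user` and `visits_by_host`, the trimmed field list, and the helper `missingBaseKeyColumns` are hypothetical. The helper only checks one Cassandra rule for views: every primary key column of the base table must also appear in the view's key.

```javascript
// Sketch only: a trimmed Visit schema with two views and one secondary index.
const visitSchema = {
    fields: {
        id: 'uuid',
        userId: 'int',
        dateOfVisit: 'timestamp',
        urlHost: 'text',
        urlPath: 'text'
        // remaining url* fields and pageTitle omitted for brevity
    },
    key: [['id'], 'dateOfVisit'],
    clustering_order: { dateOfVisit: 'desc' },
    materialized_views: {
        // "Get all visits for a single user"
        visits_by_user: {
            select: ['*'],
            key: [['userId'], 'dateOfVisit', 'id'],
            clustering_order: { dateOfVisit: 'desc', id: 'asc' }
        },
        // "Get all visits to a given domain"
        visits_by_host: {
            select: ['*'],
            key: [['urlHost'], 'dateOfVisit', 'id'],
            clustering_order: { dateOfVisit: 'desc', id: 'asc' }
        }
    },
    // Secondary index: lets queries filter on urlPath without a second copy of the data
    indexes: ['urlPath']
};

// Hypothetical helper: lists base-table key columns missing from a view's key.
// An empty result means the view satisfies that particular rule.
function missingBaseKeyColumns(schema, viewName) {
    const baseKey = schema.key.flat();
    const viewKey = schema.materialized_views[viewName].key.flat();
    return baseKey.filter((column) => !viewKey.includes(column));
}

console.log(missingBaseKeyColumns(visitSchema, 'visits_by_user'));
console.log(missingBaseKeyColumns(visitSchema, 'visits_by_host'));
```

Each view here adds only one non-key column (`userId` or `urlHost`) to its key, which is the most Cassandra allows for a materialized view. A query such as "count visits to a domain grouped by path" would need both `urlHost` and `urlPath` in the key, so it would still call for a separate table or for the index on `urlPath`.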