Efficient distributed counting

3.7k views Asked by At

I have a series of events flowing through a system (e.g a pizza ordering system) and I want to count certain properties of each event through time. For example, I might want to see how many unique people ordered pepperoni pizza in the last 5 minutes, or how many pizzas John Doe ordered in the past week.

It is a LOT of events, so we're using something like Cassandra or HBase because even the counts can't be stored in memory. Also, since we need to keep track of set membership (in order to count unique people ordering a particular kind of pizza, for example), it gets bigger.

We could store a list of orders and then query to count, but this is slow. And we mostly don't care who ordered pepperoni pizza, just how many unique orders were made, and in a given time window.

What's the best way to store this information, for example in Cassandra, such that the information can be retrieved in some time intervals?

I tried at first to use Redis + bloom filters, but storing a bloom filter bit vector would require transactions to avoid race conditions, so then I used redis sets.

Then I realized the whole thing was too big to just be in memory, so I decided to switch to a disk-backed store. However, there are no native sets like in redis.

I looked at sketches / streaming algos like HyperLogLog but the conclusion was that to save the hyperloglog object, I need to store the bit array (or pickle the object or whatever)...is that kosher, and what are the best practices for this, if this is indeed the solution?

I was tempted to save each event individually with a timestamp, then query and count on demand, but this is slow. I'm looking for something better, if it exists.

Example Requests:

  • How many unique people had a pepperoni pizza order in the past 10 minutes
  • How many unique pepperoni pizzas were ordered by some person John Doe in the past 30 minutes
2

There are 2 answers

0
Sam On BEST ANSWER

There are a few ways to approach this problem from what I have learned.

  1. Use locking + set membership / counting data structure e.g hyperloglog or bloom filter. As long as there's not that much fighting over a particular lock, things should be okay.
  2. Use a database that has built-in sets/collections support. They pretty much implement #1 internally.
2
Zoltán Haindrich On

my guesses:

  • cassandra supports counters - i think i saw some incr operation which should work concurrently - by using free running counter on your event, you just need to setup something which samples all counters at specified intervals (5 min?) then you can give estimations between two samples (http://wiki.apache.org/cassandra/Counters)
  • cassandra can timeout a column..i never really used it, but it might worth a try