Disclaimer: I'm not an expert in real-time architectures; I just want to put forward a couple of personal considerations and see what others would suggest or point out.
Let's imagine we'd like to design a real-time analytics system. Following Nathan Marz's definition of the Lambda architecture, in order to serve the data we would need a batch layer (e.g. Hadoop) that continuously recomputes views from a master dataset containing all of the data, and a so-called speed layer (e.g. Storm) that constantly processes the data not yet covered by the batch views (the events that arrived after the last full recomputation of the batch layer). You query the system by merging the results of the two layers together.
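To make that query-time merge concrete, here is a minimal sketch (the two lookup helpers are hypothetical stand-ins for whatever datastores back each layer, not a real API):

```python
# Sketch of a Lambda-style query that merges the batch view with the speed-layer view.

def get_batch_view(page: str) -> int:
    """Page-view count precomputed by the batch layer (stale by design)."""
    return 0  # stub: would read from the batch-serving database

def get_realtime_view(page: str) -> int:
    """Page-view count accumulated by the speed layer since the last batch run."""
    return 0  # stub: would read from the real-time datastore

def query_page_views(page: str) -> int:
    # The answer the system serves is the merge of the two partial results.
    return get_batch_view(page) + get_realtime_view(page)
```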
The rationale behind this choice makes perfect sense to me, and it's a combination of software engineering and systems engineering observations. Having an ever-growing master dataset of immutable, timestamped facts makes the system resilient to human errors in computing the views (if you make a mistake, you just fix the code and recompute the views in the batch layer) and enables the system to answer virtually any query that might come up in the future. Also, such a datastore only needs to support batch inserts and random reads, whereas the datastore for the speed/real-time layer has to support random reads and random writes efficiently, which increases its complexity.
My objection, and the trigger for this discussion, is that in certain scenarios this approach might be overkill. For the sake of discussion, let's make a few simplifying assumptions:
- Let's assume that in our analytics system we can define beforehand an immutable set of use cases/queries that our system needs to be able to serve, and that they won't change in the future.
- Let's assume that we have a limited amount of resources (engineering power, infrastructure, etc.) to implement it. Storing the whole set of elementary events coming into our system, instead of directly precomputing views/aggregates, may just be too expensive.
- Let's assume that we successfully minimize the impact of human mistakes (...).
The system would still need to be scalable and handle ever-increasing traffic and data. Given these observations, I'd like to know what would stop us from designing a fully stream-oriented architecture. What I imagine is an architecture where the events (e.g. page views) are pushed into a stream, which could be RabbitMQ + Storm or Amazon Kinesis, and where the consumers of that stream directly update the needed views through random writes/updates to a NoSQL database (e.g. MongoDB).
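As a rough sketch of what one such consumer could look like (MongoDB via pymongo; the `read_events()` generator, collection and field names are made up for the example and would stand in for a Storm bolt or a Kinesis reader):

```python
# Rough sketch of a stream consumer that incrementally updates a precomputed view.

from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
views = client.analytics.page_views_per_hour  # hypothetical view collection

def read_events():
    # Hypothetical stream source; in practice this would be a Storm bolt's
    # input tuples or a Kinesis GetRecords loop.
    yield {"page": "/home", "ts": datetime.utcnow()}

def handle_event(event):
    hour = event["ts"].replace(minute=0, second=0, microsecond=0)
    # Atomic increment; upsert creates the view document the first time.
    views.update_one(
        {"page": event["page"], "hour": hour},
        {"$inc": {"count": 1}},
        upsert=True,
    )

for event in read_events():
    handle_event(event)
```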
To a first approximation, it looks to me like such an architecture could scale horizontally. Storm can be clustered, and the expected throughput of a Kinesis stream can be provisioned upfront (in terms of shards). More incoming events would mean more stream consumers, and since they are independent of one another, nothing stops us from adding more. Regarding the database, sharding it with a proper policy would let us distribute the increasing number of writes across an increasing number of shards. To keep reads from being affected, each shard could have one or more read replicas. In terms of reliability, Kinesis promises to reliably store your messages for up to 24 hours, and a distributed RabbitMQ (or whichever queueing system you choose), with proper use of acknowledgement mechanisms, could probably satisfy the same requirement.
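On the acknowledgement point, this is roughly what I have in mind with RabbitMQ (a minimal sketch using the pika client; the queue name and the `update_view` function are placeholders): a message is only acked after the view update succeeds, so if a consumer dies mid-processing the message is redelivered rather than lost.

```python
# Minimal sketch of at-least-once consumption from RabbitMQ with manual acks.

import pika

def update_view(body):
    # Placeholder for the real view update (e.g. the MongoDB upsert above).
    print("processing", body)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="page_view_events", durable=True)

def on_message(ch, method, properties, body):
    update_view(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after the write succeeded

channel.basic_consume(queue="page_view_events",
                      on_message_callback=on_message,
                      auto_ack=False)
channel.start_consuming()
```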
Amazon's documentation on Kinesis deliberately (I believe) avoids locking you into a specific architectural solution, but my overall impression is that they would like to push developers toward simplifying the Lambda architecture into a fully stream-based solution similar to the one I've described. To be slightly more compliant with the Lambda architecture requirements, nothing stops us from having, in parallel with the consumers that constantly update our views, a set of consumers that process the incoming events and store them as atomic immutable units in a different datastore, which could be used in the future to produce new views (via Hadoop, for instance) or to recompute faulty data; a sketch of such a consumer follows.
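This second, independent consumer would simply append each event untouched. Here is a sketch assuming MongoDB again as the append-only master dataset (the collection name is made up; flat files on HDFS/S3 would serve the same purpose):

```python
# Sketch of a consumer that archives each event as an immutable fact,
# running in parallel with the view-updating consumers.

from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
master_dataset = client.analytics.raw_events  # write-once, never updated

def archive_event(event):
    # Store the event exactly as received, plus an ingestion timestamp,
    # so any future view can be recomputed from this collection.
    master_dataset.insert_one({
        "event": event,
        "ingested_at": datetime.now(timezone.utc),
    })
```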
What's your opinion on this reasoning? I'd like to know in which scenarios a purely stream-based architecture would fail to scale, and whether you see any other pros/cons of a Lambda architecture versus a stream-based one.