Getting up-to-date answers about recent events, and then asking the same question of older data, is really hard to do with open source tools. The Lambda Architecture is a somewhat complicated way of doing this kind of large-scale, relatively fast, open source data processing.
The key point is that you have two layers: a speed layer that can answer questions ‘quickly’ (seconds behind reality), and a batch processing layer that works on a half-hour to two-hour timescale. Both do the same work and are expected to produce the same answers to the same questions. The idea is that after an hour or so the information in the speed layer is discarded, and any future requests are sent to the batch layer, where the same question can now be asked and will get the same answer.
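As a rough illustration of that routing rule, here is a minimal Python sketch. The one-hour retention value and the `choose_layer` helper are assumptions made for the example, not part of any particular Lambda implementation.

```python
from datetime import datetime, timedelta, timezone

# Assumed cutoff: speed-layer data is discarded "after an hour or so".
SPEED_LAYER_RETENTION = timedelta(hours=1)


def choose_layer(window_start, now, retention=SPEED_LAYER_RETENTION):
    """Pick the layer that should answer a question about data from window_start.

    Hypothetical helper: recent data is only in the speed layer, while older
    data has been discarded there and must come from the batch layer.
    """
    age = now - window_start
    return "speed" if age < retention else "batch"


now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
print(choose_layer(now - timedelta(minutes=5), now))  # speed
print(choose_layer(now - timedelta(hours=3), now))    # batch
```

The catch is that whichever layer is chosen, the caller expects the same answer, which is where the trouble starts.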
Cool eh? Well, actually, “No”. What, specifically, is wrong with it?
Firstly, you have to write everything twice. In the official “Lambda Architecture,” data is sent to both the speed layer and the batch layer as it is created, and all the logic is implemented twice, using different tools. The batch layer, by definition, takes a while to produce results, so the speed layer does the same work in order to answer questions about in-flight events and recent activity.
Doing everything twice with different toolkits inevitably means the numbers won’t reliably tie. When you’re dealing with billions of incoming real-world events there will be oddities and edge cases that aren’t going to be covered in the specifications and will be left to developer initiative, the default behavior of the toolkit or possibly just blind luck to handle. When you only do this once, people see only ‘one view of truth’ and thus frequently neither know nor care about esoteric scenarios. But when the numbers don’t add up because you’re doing it twice, people in corner offices start asking questions.
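To make that concrete, here is a small, made-up Python example. The spec is simply "total spend per user" and says nothing about retried deliveries. Both functions and the sample events are hypothetical; each is a defensible reading of the spec, yet they disagree.

```python
from collections import defaultdict

# Hypothetical event stream; a network retry has delivered event "e2" twice.
events = [
    {"id": "e1", "user": "alice", "amount": 10},
    {"id": "e2", "user": "alice", "amount": 5},
    {"id": "e2", "user": "alice", "amount": 5},
    {"id": "e3", "user": "bob", "amount": 7},
]


def speed_layer_totals(stream):
    """Streaming-style implementation: add each event as it arrives."""
    totals = defaultdict(int)
    for event in stream:
        totals[event["user"]] += event["amount"]
    return dict(totals)


def batch_layer_totals(stream):
    """Batch-style implementation: de-duplicate by event id, then add."""
    unique = {event["id"]: event for event in stream}
    totals = defaultdict(int)
    for event in unique.values():
        totals[event["user"]] += event["amount"]
    return dict(totals)


print(speed_layer_totals(events))  # {'alice': 20, 'bob': 7}
print(batch_layer_totals(events))  # {'alice': 15, 'bob': 7}
```

Neither developer did anything wrong, but alice's total changes the moment her data ages out of the speed layer.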
Change management is also complicated, as you have to carefully co-ordinate a rolling change as the data moves downstream.
Then you have to physically run everything twice, which means significantly increased cloud hosting costs.
How did we get here?
I would argue that the “Lambda Architecture” is a consequence of people building data lakes and then wanting to use the data to ‘influence the moment’ – change or react to events that are happening right now. If you’ve already built a giant HDFS data lake then the idea of writing a nice, small, layer to keep an eye on ‘now’ and then discard the results makes sense. But when you look at it from an ‘empty whiteboard’ perspective it seems more than a little odd, doesn’t it? Why do everything twice if you don’t have to?
The crucial factor could be that at the time the Lambda Architecture was envisaged, the concept of Translytics/HTAP hadn’t taken off. Translytics/HTAP refers to an emerging class of database products that can do transactions and count at the same time. To be precise, they not only keep track of the kind of real-time, rolling aggregates and totals that the Lambda speed layer covers, but also make them visible to users on demand. Bear in mind that a legacy SQL RDBMS will be awful at doing transactions and tracking aggregate values at the same time, and open source ‘plumbing’ products frequently lack both query capability and the ability to do complex logic.
The good news is that VoltDB is an ideal ‘fast layer’ for a kind of modified Lambda Architecture. Being a full-featured database, it’s great for ‘cooking’ the data and answering questions about it. But it also has the capability to send that data to a downstream system using export tables, and then purge it after a delay long enough to ensure that it’s visible in both places. And all the fancy logic will be in one place. Try VoltDB now, free for 30 days.
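For readers who want to see the shape of that flow, here is a toy, in-memory Python sketch. It is not VoltDB's API: the `FastLayer` class, its method names and the purge delay are all invented for illustration, and it assumes events arrive in timestamp order. The point is only that the logic runs once, the result is queryable immediately, every row is handed downstream, and local rows are purged after a delay.

```python
from collections import deque


class FastLayer:
    """Toy stand-in for a single fast layer (hypothetical, not VoltDB's API)."""

    def __init__(self, purge_delay_seconds):
        self.purge_delay = purge_delay_seconds
        self.rows = deque()   # (timestamp, user, amount), oldest first
        self.totals = {}      # rolling aggregate, visible on demand
        self.exported = []    # stands in for the downstream system

    def ingest(self, ts, user, amount):
        # The business logic runs exactly once, here.
        self.rows.append((ts, user, amount))
        self.totals[user] = self.totals.get(user, 0) + amount
        # Hand the cooked row to the downstream system.
        self.exported.append((ts, user, amount))

    def purge(self, now):
        # Drop local rows old enough to be visible downstream already.
        while self.rows and now - self.rows[0][0] >= self.purge_delay:
            _, user, amount = self.rows.popleft()
            self.totals[user] -= amount


layer = FastLayer(purge_delay_seconds=3600)
layer.ingest(0, "alice", 10)
layer.ingest(1800, "alice", 5)
layer.purge(now=4000)
print(layer.totals["alice"])  # 5
print(len(layer.exported))    # 2
```

Because the downstream copy is produced by the same code path that maintains the rolling totals, there is no second implementation for the numbers to disagree with.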