First off: I’m not here to bash NoSQL. NoSQL came into existence for a good reason — your generic legacy SQL RDBMS is often slow and expensive when compared to NoSQL alternatives. But much of the discussion I’ve seen about NoSQL, NewSQL, Legacy RDBMS and other technologies gets hung up on such issues as the use of SQL instead of focusing on what I would argue is the single most important issue we face —the fundamental barrier to scalability—is contention for shared data. Once you accept this your architectural choices become much clearer.
What does ‘contention for shared data’ mean?
With the emergence of the IoT, we have reached the point where the majority of database interactions are now with other computer programs, not humans. This has significant implications:
- Latency expectations will become much more demanding, as not only does software hate waiting more than a couple of milliseconds it becomes more complex when it has to.
- Volumes will increase by 10X – 100X, as millions of tiny devices communicate with each other in real time and every other aspect of our lives is managed and optimized on a millisecond by millisecond basis.
- Accuracy will become ever more important because we are now making decisions that impact real world physical resources.
Making instant, accurate decisions on an individual basis at mass scale is technically demanding. Access to the data associated with an individual or their IoT device faces the same challenges as a traditional OLTP system, but with roughly 100X the volume. When you consider that some of this data will be subject to very frequent changes (such as using credit) it becomes apparent that you need an appropriate architecture. There are a limited number of ways you solve this problem…
- Use a session to make a series of discrete changes to the data and then either commit or rollback.
- Move the data to the client, which changes it and sends it back.
- Move the problem to the data, do what needs to be done, send an answer back.
For trivial examples they all work about equally well. But as the application complexity increases problems appear.
Method #1: Using a session to do a series of reads and changes which are then committed or rolled back
This is the basis of traditional JDBC style interaction. A key premise of this approach is that until you commit or rollback no other session can see what you have done. Another aspect is that each SQL request generally requires its own network round trip. So from a client perspective, performance degrades as complexity increases, even if each of the SQL statements is trivial.
From the server side things get unpleasant, fast. For early single CPU servers, keeping track of what each of the fifty or so sessions could see and do was relatively easy, as in practice only one session could be on the CPU at once. Multi core CPUs are a double edged sword for a legacy RDBMS. They may provide extra CPU power, but when handling a session a given CPU core now has to anticipate that one of the 500 or so other sessions might be running on another core at the same instant and is about to do or see something that it shouldn’t. As a result, locking and latching now take up roughly 30% of all the available CPU time. When you start trying to do this with a cluster of several machines, the overheads involved skyrocket, as instead of inter-process communication and shared latches, we’re now dealing with network trips to another cluster member to request the use of a resource.
This is the reason why horizontal scalability often doesn’t work for a legacy RDBMS if you have a write-centric workload. Note that the issue is not the use of SQL, but that at any moment in time, each session has its own collection of uncommitted changes that collectively create a massive overhead.
Method #2: Use a NoSQL approach where you move the data to the client, which changes it and sends it back.
From the very start, KV stores were written with clusters in mind, and as a consequence, they nearly always use some form of partitioning algorithm to map the key to one or more physical servers. This bypasses the clustering overheads seen in a legacy RDBMS. Neither do KV stores generally provide read consistency or locking, which means that another large chunk of CPU time previously devoted to latching and locking goes away. Instead, you provide a key and retrieve a value, which could be anything at all. Developers love KV stores because they get to decide what the value is without having to define a schema. But while KV stores are very efficient for simple transactions from a raw CPU usage perspective, they struggle as the application complexity and speed increases.
The first problem is that there is no mechanism to control access to data.
At any moment in time, there could be a number of updated values for a given key on their way back to finish an update. For many applications this has a nugatory impact, but if something of value is being used up, spent, or allocated, we run the risk of either having an update be instantly overwritten, or in the “best case scenario,” be rejected. In real life we will usually see a small number of records that everyone wants to update, and for a KV store this can present a real challenge, as it may take an arbitrary number of attempts to make an update work if a large number of clients are trying to access it at once. The bottom line is that this approach has the same scalability issues as session-based interaction, but they manifest themselves as aborted transactions, network bandwidth consumption, and erratic latency instead of server side locking errors and CPU usage.
The second problem is one of ‘emergent complexity’
What happens when you need more than 1 key value pair to implement your use case? Suddenly life has become very hard for the client, as it not only has to organize multiple atomic updates of KV pairs but it also has to have contingency plans for if some – but not all – fail for some reason. Add in the ‘secret sauce’ of eventual consistency and life becomes really hard.
The bottom line: KV stores will work well for small applications, but as complexity increases they end up as bad as the traditional legacy RDBMS approach, but with the pain manifesting itself as complexity and inaccuracy, instead of poor CPU performance and high licence fees.
The third method
Stay tuned for the third method, why it’s better, and what the implications are for modern database and application architecture coming in a post later this week. Let us know what you think of part 1 in the comments below.