Trying to choose a database to solve a problem (or a whole set of them)? Here’s a quick rundown of the advantages – and disadvantages – of NoSQL versus NewSQL. Choosing the right tool for the job at hand is 80 percent of getting to a solution; the other 20 percent is really understanding the problem you’re trying to solve.
I’ll start by pointing out that the term NoSQL is about as descriptive as categorizing unicycles and wheelbarrows as “NoTwoWheels”. In truth, NoSQL is a broad category collecting disparate technologies beneath an ambiguous umbrella. The term offers little help to the practitioner trying to find the right tool for the right job.
So let’s break it down with an eye on what we really care about as builders-of-all-things-bits: what problems can I solve with NoSQL? Equally important, where is NoSQL a bad fit? Where do the different technologies show their strengths?
Call me Always, or call me Consistent: availability first
Say you have gigabytes to petabytes of data. New data is added regularly and, once added, is relatively static. A database that archives sensor readings or ad-impression displays is a good example. You want to store this in a cloud (maybe a private cloud) and are willing to tolerate the programming challenges of eventual consistency (made easier because most updates are idempotent anyway) in exchange for distributed access, multi-data-center replication, and the highest possible availability. Your application-to-database interactions are simple “CREATE” and “GET” patterns that don’t require traditional transactions. The most important consideration is that the database is always available to accept new content and can always provide content when queried, even if that content is not the most recent version written.
These are all strong indicators to consider eventual-consistency stores like Cassandra or Riak or SaaS offerings like Amazon’s S3.
Call me Always patterns
- Idempotent writes: log and journal oriented updates.
- Availability trumps both correctness and ease of the programming model.
- Multi-region active/active replication for availability is required.
- Long-term storage of large datasets is necessary.
- Eventual-consistency algorithms allow implementations to deliver the highest availability across multiple data centers.
- Eventual-consistency-based systems scale update workloads better than traditional OLAP RDBMSs, while also scaling to very large datasets.
- These systems are fundamentally not transactional (ACID). If they advertise otherwise, beware the over-reaching claim.
- OLAP-style queries require a lot of application code. While the write-scaling advantages are appealing vs. OLAP stores (like Vertica or Greenplum), you sacrifice declarative ad hoc queries – important for historical analytical exploration.
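To make the idempotent-write and eventual-convergence idea concrete, here is a toy sketch (not any real database's implementation) of a last-write-wins register, one common conflict-resolution strategy in eventually consistent stores. The class and key names are illustrative only.

```python
import time

class LWWReplica:
    """Toy last-write-wins register, a common eventual-consistency strategy."""
    def __init__(self):
        self.store = {}  # key -> (timestamp, value)

    def put(self, key, value, ts=None):
        """Idempotent write: replaying the same (key, value, ts) has no extra effect."""
        ts = ts if ts is not None else time.time()
        current = self.store.get(key)
        if current is None or ts >= current[0]:
            self.store[key] = (ts, value)

    def get(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

    def merge(self, other):
        """Anti-entropy exchange: after merging both ways, replicas converge."""
        for key, (ts, value) in other.store.items():
            self.put(key, value, ts)

# Two replicas accept writes independently, then converge.
a, b = LWWReplica(), LWWReplica()
a.put("sensor:42", {"temp": 21.5}, ts=100)
b.put("sensor:42", {"temp": 22.0}, ts=200)  # later write wins
a.merge(b); b.merge(a)
assert a.get("sensor:42") == b.get("sensor:42") == {"temp": 22.0}
```

Note that because `put` is idempotent, a replica can safely replay a write it has already seen – exactly the property that makes log- and journal-oriented updates a good fit for these systems.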
Unstructured Melody: raw storage capacity
You have a growing corpus of static data. However, you don’t need to GET documents regularly. Instead, your data scientists want to run arbitrary analyses against nearly unlimited data. Even when the data is structured (XML, JSON, logs), you don’t want to enforce a relational schema on write. Rather, you’d prefer to specify a schema on read.
This is the classic HDFS + MapReduce use case. Emerging technologies like Impala (based on ideas from Google’s Dremel, the engine behind BigQuery) make querying structured data sets even easier – but the storage remains file-system based and distributed. The queries (or MapReduce jobs) that run are implicitly batch-oriented and executed against historical data.
Unstructured melody patterns
- You don’t want to pick a schema up-front.
- READ work is analyzing large portions of the data set.
- Stored data is never changed-in-place.
- You can store pretty much anything on low-cost commodity disk.
- You can group, extract, and transform data arbitrarily with custom MapReduce jobs.
- Analysis is not computationally efficient and requires custom code. For cases where data is structured, Impala, Spark and other SQL-on-Hadoop technologies are emerging.
- Stored data is effectively read-only
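The schema-on-read idea above can be sketched in a few lines of plain Python (a stand-in for a real MapReduce job – the record format and field names are invented for illustration): raw records are stored as-is, and structure is imposed only when an analysis runs.

```python
from collections import Counter
import json

# Raw, heterogeneous records stored as-is; no schema enforced on write.
raw_records = [
    '{"event": "impression", "ad_id": 7}',
    '{"event": "click", "ad_id": 7}',
    'garbage line that fails to parse',
    '{"event": "impression", "ad_id": 9}',
]

def map_phase(line):
    """Schema on read: interpret each record only when the job runs."""
    try:
        record = json.loads(line)
        yield (record["event"], 1)
    except (ValueError, KeyError):
        return  # skip malformed rows instead of rejecting them at write time

def reduce_phase(pairs):
    counts = Counter()
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

pairs = [pair for line in raw_records for pair in map_phase(line)]
print(reduce_phase(pairs))  # {'impression': 2, 'click': 1}
```

The malformed line is simply skipped at read time – a schema-on-write system would have rejected it at ingest, which is exactly the rigidity this architecture avoids.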
Magical Document Tour: general-purpose document stores
You have data that is naturally document-oriented – or consists of very wide, very sparse rows. You’re looking for a database that stores options, features, settings, profile attributes and content on a per-user basis – content with many possible, optional pieces of metadata. Data access is single-document oriented. You have minimal need to analyze or perform transactional decisions and can tolerate a latency hit if a cold record needs to be read from disk.
Many document- and column-oriented NoSQL databases address this space – think MongoDB, HBase and Couchbase as examples. These span two data models: key/value document models and column-family models. Both offer some degree of atomicity on a per-record basis.
Magical Document patterns
- Structured data, but sparse schema or inefficient to relationally normalize
- Minimal need for analytics, especially cross-record summaries
- Easy to conceptualize – atomic documents provide an easy starting point
- Horizontal scalability as data accumulates
- Weak cross-document/cross-record query capability
- Missing rich, multi-statement transactions
- General purpose – weaker availability than eventually consistent systems, less scalable to large volumes than HDFS, and slower (and not transactional) compared with new in-memory RDBMS platforms.
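A minimal sketch (not any particular product's API – the class, document IDs, and field names are all invented) shows the two properties the patterns above emphasize: sparse per-document schemas and single-document atomicity with no cross-document transactions.

```python
import threading

class TinyDocStore:
    """Toy document store: sparse per-record fields, single-document atomicity."""
    def __init__(self):
        self._docs = {}
        self._lock = threading.Lock()

    def upsert(self, doc_id, fields):
        """Atomically merge fields into one document; no cross-document transactions."""
        with self._lock:
            doc = self._docs.setdefault(doc_id, {})
            doc.update(fields)

    def get(self, doc_id):
        with self._lock:
            return dict(self._docs.get(doc_id, {}))

store = TinyDocStore()
# Sparse schema: each user document carries only the attributes it needs.
store.upsert("user:1", {"name": "Ada", "theme": "dark"})
store.upsert("user:2", {"name": "Lin", "newsletter": True, "locale": "en-GB"})
store.upsert("user:1", {"theme": "light"})  # atomic single-document update
print(store.get("user:1"))  # {'name': 'Ada', 'theme': 'light'}
```

Notice there is no way to update `user:1` and `user:2` in one atomic operation – that limitation is the "missing rich, multi-statement transactions" bullet above, and it is what the next category addresses.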
Going for speed: Fast in-memory SQL (aka NewSQL)
You have gigabytes to terabytes of data that needs high-speed transactional access. You have an incoming event stream (think sensors, mobile phones, network access points) and need per-event transactions to compute responses and analytics in real time. Your problem follows a pattern of “ingest, analyze, decide,” where the analytics and the decisions must be calculated per-request and not post-hoc in batch processing.
New in-memory databases like VoltDB offer high-speed ACID transactions, non-proprietary SQL and relational DDL interfaces with low-overhead indexing and materialized view support for fast lookup, millisecond response/decision, and summary.
Fast SQL patterns
- Rich transactions at scale
- Performance over volume: Hundreds of thousands to millions of transactions per second, up to a few terabytes of total data
- Real time dataset-wide summary and aggregation
- Ad-hoc query capability
- Minimize application complexity with in-database transactions
- Familiar SQL and standard tooling
- Fast transactional throughput scaled horizontally across many machines
- Gain throughput and data-set wide aggregations, but give up general OLAP-style queries
- In-memory architecture inappropriate for volumes exceeding a few terabytes.
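The “ingest, analyze, decide” pattern can be sketched with Python’s built-in `sqlite3` as a stand-in for an in-memory SQL engine (the table, device names, and threshold are invented for illustration): each incoming event is one ACID transaction that both records the event and reads an aggregate to drive a decision.

```python
import sqlite3

# In-memory SQL stand-in for the "ingest, analyze, decide" pattern.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (device TEXT, reading REAL)")

def ingest_analyze_decide(device, reading, limit=100.0):
    with db:  # one transaction per incoming event
        # Ingest: record the event.
        db.execute("INSERT INTO events VALUES (?, ?)", (device, reading))
        # Analyze: per-request aggregate, not post-hoc batch processing.
        (total,) = db.execute(
            "SELECT SUM(reading) FROM events WHERE device = ?", (device,)
        ).fetchone()
    # Decide: the response is computed inside the request path.
    return "throttle" if total > limit else "ok"

print(ingest_analyze_decide("ap-17", 60.0))  # ok
print(ingest_analyze_decide("ap-17", 55.0))  # throttle (running total 115 > 100)
```

In a real NewSQL system this insert-plus-aggregate step would run at hundreds of thousands of events per second across many machines; the point here is only the per-event transactional shape.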
And more …
Lucene, Solr and Elasticsearch offer wonderful text and document indexing functionality – for example, to implement real-time search as users enter terms.
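The core idea behind these engines is the inverted index – a map from each term to the documents containing it. A toy sketch (the sample documents are invented; real engines add tokenization, stemming, and ranking):

```python
from collections import defaultdict

# Minimal inverted index: term -> set of document IDs containing it.
docs = {
    1: "real time search as users enter terms",
    2: "graph databases organize data by relationships",
}
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """Intersect posting lists: documents containing every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(search("search terms"))  # {1}
```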
Graph databases like Neo4J, Titan and Tagged organize data by relationships instead of by row or document, enabling powerful traversal and graph query capabilities.
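What "organize data by relationships" buys you is cheap traversal. A small sketch of the idea using a plain adjacency list and breadth-first search (the social-graph data is invented; real graph databases offer declarative query languages for this):

```python
from collections import deque

# Adjacency-list graph: relationships are first-class, as in a graph database.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}

def reachable_within(graph, start, hops):
    """Breadth-first traversal: which nodes can 'start' reach within N hops?"""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    seen.discard(start)
    return seen

print(sorted(reachable_within(follows, "alice", 2)))  # ['bob', 'carol', 'dave']
```

Expressing the same "friends of friends" query relationally would require self-joins that grow with traversal depth – the workload where graph stores shine.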
The froth in the data management space is substantial – and our tendency to talk in terms of categories (SQL, NoSQL, NewSQL) vs. problems makes it hard for practitioners to understand what’s in the toolbox. Hopefully this summary helps. The many new databases available are not all alike – and recognizing how the DNA behind each helps or hinders problem solvers is the key to success.