Over the last decade, SQL has fallen out of fashion and is now seeing a resurgence. As applications become more complex, people realize they need a data manipulation language, and SQL is the obvious choice as it’s a proven and well-known mechanism for manipulating related data structures. In this blog post, we discuss how VoltDB has implemented materialized views, and why they are fast and incredibly useful.
What’s the difference between a materialized view and a normal view?
There are two kinds of views: Normal views and materialized views. VoltDB uses materialized views, but because that’s the only kind of view we use, we tend to refer to them as just ‘views’.
A ‘normal’ view is an SQL SELECT statement pretending to be a table. In most cases, views are read-only. This is useful because real-world SELECT statements involving multiple tables can be remarkably complicated, and a materialized view means that the complex logic only has to be implemented once and can then be used multiple times within an application.
A materialized view is like a ‘normal’ view, but with the results cached.
Why were earlier implementations of ‘normal’ views problematic?
Views were initially popular but fell out of favor over time. The views most people wanted included features such as GROUP BY and SUM(), and this meant that the view had to traverse a large amount of data each time it was accessed.
So, if I had a view that calculated ‘sales per customer’ and wanted to know sales for customer ‘x’, when I accessed the view I would calculate sales for all of the customers, and then filter out the one row for customer ‘x’.
This is obviously very slow. Some database vendors tried to get around this by using clever query optimizers that merged the view’s SQL with the query you were trying to run, but by this time people had started to perceive views as being more trouble than they were worth.
Why materialized views aren’t easy to do well
Given that users wanted views that did GROUP BY and SUM(), but ‘normal’ views were slow, the obvious fix was to cache the results of a view’s SELECT statement, and this is where materialized views came from.
However, in a legacy RDBMS, moving to materialized views replaces the poor SELECT performance on the view with really poor UPDATE and INSERT performance on the underlying table.
Imagine a table with 300 million rows that tracks the color of a car someone drives. Inserting a row into that table is relatively easy, as we have one row per person and conflicts are thus unlikely. But if we add a materialized view to GROUP BY ‘car_color’ we suddenly find our inserts unable to finish until they have queued up to update the row in the materialized view for ‘red’, ‘blue’, or any other popular color.
This overhead is bad in a single-node system and catastrophic in a clustered environment, as the row for ‘blue’ may be on a different server, and we find that what started out as a single row insert is now a complicated two-phase commit transaction.
Why is VoltDB’s implementation of materialized views so good?
VoltDB was created by people who wanted to build a really fast OLTP system, but who also wanted to be able to calculate aggregates and totals at the same time. It would help to take a deeper dive into VoltDB’s architecture, but for the purposes of this blog, what you need to know is:
- VoltDB horizontally partitions the data by CPU core, and has a dedicated thread for each CPU core that handles all the requests, sequentially. This avoids contention when we try to update materialized view totals.
- Materialized views in VoltDB are tables that are automatically updated as transactions happen.
- As each request is processed, VoltDB identifies any changes needed to its local part of the materialized view, and makes them. So if the view is called SALES_BY_CUSTOMER, each partition will have its own subset, and when a sale happens it will do an update at the same time as it inserts a row into SALES.
- Updating a materialized view for each and every transaction sounds like it should be slow, but in practice each view slows down SQL by about 1-5% per transaction.
- If I update table ‘X’ and it has a view ‘Y, then the changes to ‘X’ and ‘Y’ are part of the same transaction. Materialized views in VoltDB never need to be recalculated and always reflect current reality.
- The cost to read a materialized view in VoltDB is therefore almost the same as the cost of reading a table: <1 millisecond.
As you can see, materialized views aren’t easy to do well, but if you can do them well they provide true instant visibility into aggregate data, making your life just that much easier, but let’s now get into the specific uses cases for VoltDB materialized views.
Materialized views: the streaming aggregation use case
One common pattern is to use materialized views to aggregate high-speed event streams by seconds and then use SQL queries to further aggregate by minutes, hours, or other units. This allows fast aggregation of time-series data at different levels of granularity.
Here’s an example excerpted from the windowing application included in the VoltDB download kit that calculates an average from a user-specified time range:
|CREATE TABLE timedata
uuid VARCHAR(36) NOT NULL, val BIGINT NOT NULL,
update_ts TIMESTAMP NOT NULL
CREATE VIEW agg_by_second
AS SELECT TRUNCATE(SECOND, update_ts), COUNT(*), SUM(val)
GROUP BY TRUNCATE(SECOND, update_ts);
— Find the average value over all tuples across all partitions for the last
— N seconds, where N is a parameter the user supplies.
CREATE PROCEDURE Average AS
SELECT SUM(sum_values) / SUM(count_values)
WHERE second_ts >= TO_TIMESTAMP(SECOND, SINCE_EPOCH(SECOND, NOW) – ?);
VoltDB SQL supports functions to convert timestamps to SECOND, MINUTE, HOUR, DAY, DAYOFMONTH, DAYOFYEAR, MONTH, WEEK, WEEKOFYEAR, WEEKDAY, and YEAR, all of which can be mixed and matched with view definitions and queries.
Materialized views: the scaling use case
Since materialized views scale just like the rest of VoltDB, you can distribute the cost of calculating summaries across a fast incoming write/update workload and make it very cheap to rapidly read summaries of state.
In the example above, if tuples are being inserted at a rate of 15k/sec and there are four partitions, then to compute the average for the last ten seconds, VoltDB would need to scan 150k rows. Using a materialized view, it needs to scan 1 row per second times the number of partitions, or 40 rows. There’s a tremendous advantage in pre-aggregating the table sums and counts by second.
How to combine streaming analytics with per-event decisions
Since materialized views are always up to date, transactionally consistent, and accessible via standard SQL, it’s easy to use aggregations maintained in views when running transactions.
Our voter example does exactly this. The example maintains a materialized view that groups incoming records by phone number. The transaction that enforces a per-phone number voting limit uses the view to look up the previous calls registered by the incoming phone number. A more sophisticated example could use the time-series functions discussed above to enforce a max-accesses per-minute for high-speed quota enforcement.
Now let’s use the voter example to show runtime benefits of amortizing the aggregation cost across the scalable ingest workload. The example includes a heat-map that graphs which contestant leads the voting in each state. Querying the data for this heat-map requires counting votes by state, of course.
Using a materialized view, this can be done in sub-millisecond times (it requires reading 50 rows at each partition for each of the six contestants).
Without a materialized view, a scan of all votes is necessary and requires considerable execution time (and requires reading millions rows).
Here are some concrete numbers showing execution times for roughly 10GB of data (~130M rows) per-partition on a recent Intel core i7 CPU (Without the materialized view, all 130M rows are scanned, and execution time is about 8 seconds):
|26> select state, count(*) from votes where state=’MA’ group by state;
(Returned 1 rows in 8.13s)
With the view execution time is sub-millisecond:
31> select state, sum(num_votes) from v_votes_by_contestant_number_state where state
= ‘MA’ group by state;
(Returned 1 rows in 0.00s)
The view implementation has some impact on the ingest workload, but even with the materialized view, the database can process more than 40k requests per second per core. The view overhead can be compensated for with horizontal scalability—the work required can be scaled across cores and across servers in a cluster. This trade-off enables the use of real-time aggregation in per-event transactions at high throughput.
1 partition throughput with v_votes_by_contestant_number_state view:
|Throughput 42252/s, Aborts/Failures 0/0
Throughput 45407/s, Aborts/Failures 0/0
Throughput 46352/s, Aborts/Failures 0/0
Throughput 44341/s, Aborts/Failures 0/0
1 partition throughput without v_votes_by_contestant_number_state_view:
|Throughput 50525/s, Aborts/Failures 0/0
Throughput 52503/s, Aborts/Failures 0/0
Throughput 50961/s, Aborts/Failures 0/0
Throughput 51190/s, Aborts/Failures 0/0
VoltDB is an in-memory, ACID, scale-out, SQL database. It can process tens of thousands to millions of transactions per second. It is also capable of real-time SQL analytics against high-velocity data streams.
An important part of VoltDB’s real-time analytics capabilities is built around its materialized view implementation. VoltDB materialized views support real-time aggregation and summary and support combining real-time analytics with per-event decisions.