I am Bill White and I am part of the team at VoltDB dedicated to implementing the SQL language. Today I’d like to share with you a new SQL feature we have implemented in VoltDB 6.6 and 6.7.
1. What’s the problem?
Imagine you are tasked with deciding who has won a contest of some sort. Perhaps it’s an election or perhaps it’s an experiment. There are scoring rules and each contestant gets a score based on their performance. Your job is to calculate the top three contestants based on their scores. Just to be definite, a larger score is better and a score of 100 is the best. We will also assume that all players are on some team and that the team has a name. A simple minded SQL table for this might be defined this way:
CREATE TABLE players ( id INTEGER NOT NULL, name VARCHAR NOT NULL, team VARCHAR NOT NULL, score INTEGER NOT NULL, CONSTRAINT pk PRIMARY KEY (id, team) ); PARTITION TABLE players ON COLUMN team;
The SQL solution seems easy enough – just order the contestant table on the score column and read the first three answers. Maybe the answer is as simple as:
SELECT name, score FROM players ORDER BY score LIMIT 3;
But is this really right? What do we want to do for ties?
- What if there are 5 contestants whose score is 100? How will we decide among them? One plausible way, in the absence of other deciding factors, would be to list all 5 as number one, awarding 5 first place prizes.
- If the first two scores are both 100, and the next is 99 is the contestant with score 99 the second or the third? In a baseball league the contestant with 99 would be third. In a contest where we care about scores more than contestants who achieve the scores, the 99 would be the second score. This might happen if we were looking at high temperatures in cities, and we care about the temperature numbers more than the cities that attained them. For shorthand in this note, call the first ranking the baseball ranking, and the second the temperature ranking.
To complicate the matter, further imagine that we don’t just want to know the winner over all players, but that we will also want to know the first, second and third best in each team. This complicates the problem quite a bit. We don’t really know how to use SQL to aggregate by teams to sort them. Perhaps we could sort by team and then by score, but how would we limit the output to the top three in each team without multiple queries?
If one was solving these problems without a computer, one might put all player’s names, teams and scores on index cards and sort the cards by score. One would then draw out all the cards with the highest score, then all the cards with the next highest score and so on until three or more cards had been drawn. For the second problem, the top three by team, we would do the same thing, but we would first group the players by team and sort the grouped card piles. This is exactly what we will do but we’ll use a SQL table instead of cards, the SQL windowing specifications PARTITION BY and ORDER BY to group and sort, and the SQL window functions RANK() and DENSE_RANK() to evaluate the top three.
2. How we solve it
In VoltDB 6.6 we have taken a first step toward solving problems like this, by implementing more of the SQL standard. The solution is window aggregate functions, or window functions for short. These are sometimes called over functions, because they use the keyword OVER in their syntax. They are a standard SQL 2003 feature.
The idea is to group the table into one or more disjoint subsets, sort the individual groups, and apply aggregation functions to the groups. We may group the table into one subset containing all rows, or into several disjoint subsets.
2.1. A concrete but simplified version
Let’s take as an example a simplification of the first problem above. We just want to find what order all the players are in. We don’t want to select the top ones right now. The query is:
-- Q.GlobalRank SELECT RANK() OVER ( ORDER BY score DESC ) AS rnk, score, name, team FROM players ORDER BY rnk ASC, score, name, team;
The new bit of syntax is “RANK() OVER ( … )”. This syntax can be broken down into two parts. It will be easier to understand by reading read backward, from right to left, reading the part after the OVER first, and the part before the OVER last.
- The parenthesized piece after the OVER defines a set of windows, one for each row. In a full SQL 2003 window implementation these windows can be very complicated and useful. For our simplified implementation the window is calculated in two steps.
- Group all the rows of the table into disjoint subsets according to a sequence of expressions. The syntax for this is “PARTITION BY E1, …, En”. These work a bit like GROUP BY expressions group rows. A PARTITION BY group is all rows which have the same values for all of the expressions E1, …, En. But we still retain the individual rows. We don’t merge groups into one row, as we might with a GROUP BY, and there will still be one output row for each input row. We call all rows in a single such group ‘partition by peers’. In this example, since there are no PARTITION BY expressions, there is one big group including all rows. That is to say, all rows are partitioned by peers. We will show an example with a non-trivial PARTITION BY expression list later.
- Order all the rows of the group according to a sequence of expressions, just as we would with a statement level ORDER BY clause. We call all rows in a group which have the same ORDER BY expression values order by peers. Note that the ORDER BY can order ascending (ASC) or descending (DESC), just like a statement level ORDER BY clause. In our example, the sole ORDER BY expression is “score”. So, the order by peers are all rows which have the same score. It needs to be descending because larger scores are better, and deserve smaller rank numbers.
- For each row, R, there is a window which includes all rows in R’s partition by peer group which sort less than or equal to R using the ORDER BY expressions. This includes R, all its order by peers, which may be after R in the sort order, and the rows before R in the sort order. The aggregate function is applied to R’s window. The aggregate function in our example, RANK(), calculates the baseball ranking of each row in the row’s window. So, if the scores of three adjacent PARTITION BY peer rows were 100, 100, and 99, the ranks would be 1, 1, and 3. The windows for rows 1 and 2 are both the same and are equal to the first two rows, since we sort on score and the first two rows have the same score.
To be more definite, consider this input table.
The result of query Q.GlobalRank on this input table is:
All players are included in the ranking, since there is just one group. There are 5 players with a score of 100, so they all get rank 1. The next highest score, which is 99, has rank 6. If we wanted to compute the temperature ranking, rather than the baseball ranking, we could execute the same query but use the aggregate function DENSE_RANK() instead. This orders the rows just as rank does, but ranks the scores in order, rather than the individuals with that score. The query in this case would be:
-- Q.DenseGlobalRank SELECT DENSE_RANK() OVER ( ORDER BY score DESC ) AS rnk, score, name, team FROM players ORDER BY rnk ASC, score, name, team;
This is like the previous query, but we replace RANK() with DENSE_RANK(). The output table is this:
The only difference is the rank numbers. Unlike with the RANK() query, they are densely packed.
2.2. Selecting the top N
Our original task was to select the top three. In this particular case, since we have no reason to prefer any among the estimable Binky, Brickle, Stinky, Zerfle and Zingle, we should list all five of them. These all have rank 1. The next highest ranking contestant has rank 6, which is higher than 3, so they aren’t selected. If there were just two players with score 100, so there were two players with rank 1, the next score would have the next highest rank, which would be 3. There might be several of these rank 3 players. We should list them all. In short, listing players with rank no more than 3 should give us all the players we want. How to do this is a puzzle, though.
One might try to do this using a WHERE clause, something like this:
-- Q.NoLuckWithWhere SELECT RANK() OVER ( ORDER BY score DESC ) AS rnk, score, name, team FROM players WHERE rnk <= 3 -- This is problematic. ORDER BY rnk ASC, score, name, team;
If you try this you discover that the SQL compiler does not like this query. The compiler cannot find the variable rnk in the WHERE clause. This is because rnk is a select list alias, and the SQL standard does not allow select list aliases to be used in WHERE clauses. One might try a HAVING clause, since RANK() is a kind of aggregate, and HAVING clauses are for filtering the results of aggregates. But this will fail in a similar way. It’s not legal to put the window functions anywhere but in the select list, even using aliases. This makes perfect sense if one thinks about the notional order in which SQL executes statements. The select list expressions are evaluated after the WHERE expressions are executed, and any HAVING condition is evaluated after the GROUP BY expressions are calculated and groups are assigned, but before select list expressions. So, the values of the window functions are not available when the WHERE or HAVING expressions need to be evaluated.
We can get around this by using subqueries, at the cost, perhaps, of some performance. In the subquery we can calculate the entire table, which is players joined with team names and also the rank. The host query can then filter this temporary table to extract the top-ranked players. The query would look something like this:
-- Q.GlobalTopThree SELECT rnk as rank, score, player_name, team_name FROM ( SELECT RANK() OVER ( ORDER BY score DESC ) AS rnk, score, name as player_name, team as team_name FROM players ) as tbl WHERE tbl.rnk <= 3 ORDER BY rnk, score, player_name, team_name;
The result is then:
This is exactly what we want – all and only the perfect scores.
If we now change the scores of Binky, Brickle and Stinky to something small, say 50 and run the same query, we get this output:
Again, this is what we would expect. The three players with lowered scores have disappeared, but we include both players with slightly less than perfect scores. Our top 3 is really a top 4, but that’s exactly what we want.
2.3. Using PARTITION BY
Recall that we were asked another question, which was to find the top 3 among individual teams, and not over the entire league. We need to do ranking, but within a single team’s players. We do this using the PARTITION BY clause we discussed above. Let’s explore this as we did above, from simpler to more complex, by looking at the ranks of all players, ranked comparing to the player’s teammates. We go back to the original table, with 5 100-point scorers. The query is:
-- Q.ByTeamRank SELECT RANK() OVER ( PARTITION BY team ORDER BY score DESC ) AS rnk, score, name, team FROM players ORDER BY team, rnk ASC, score, name;
This is really like the first query in section 2.1, labeled Q.GlobalRank. The difference is the clause PARTITION BY team in the RANK() expression. This groups the rows by team, and calculates the ordering and windowing within each group. A partition by peer group is the set of all rows with the same team. An order by peer group is the set of all rows in a single group with the same sorting order.
If we execute this query we get this output table:
Again, this is what we expect. All players on a single team are compared with each other, and players are not compared across teams. Players are given the baseball ranking, because we use RANK() and not DENSE_RANK().
We can use the same subquery trick to calculate the top three in a single team as we did globally. We use a subquery to calculate the ranks, and then use conventional techniques to filter out those with high rank. The query would look like this:
-- Q.TopByTeams SELECT rnk as rank, score, player_name, team_name FROM ( SELECT RANK() OVER ( PARTITION BY TEAM ORDER BY score DESC ) AS rnk, score, name as player_name, team as team_name FROM players ) as tbl WHERE tbl.rnk <= 3 ORDER BY team_name, rnk, score, player_name;
The output table looks like this:
Note that the Manglers and the Hoosiers have a tie for third, so their teams have four lines.
2.4. PARTITION BY, table PARTITIONing and Performance
Note as well that the use of the term PARTITION in window functions is entirely different from and not very closely related to the use of the term PARTITION in VoltDB tables. The latter has to do with how the data in tables is distributed over multiple computing nodes. They are not closely related, though they both use the word PARTITION. To try to avoid confusion, when referring to window functions we uniformly use the term PARTITION BY. When referring to table PARTITIONs we never use PARTITION BY, but tend to use the phrase PARTITION on. The former, PARTITION BY, has to do with aggregation in window functions only. The only relation is that queries which use PARTITION BY expressions that respect table PARTITIONs can be more efficient. By ‘respect table PARTITIONs’ we mean the PARTITION BY expressions must contain the table PARTITION columns. Then the PARTITION BY grouping can be done safely in parallel on the partition by groups. Consequently the sorting for ORDER BY can be done in parallel on the groups.
If you look at the schema you will see that we partition the table on the team column, and then in the query, using a PARTITION BY clause, we form groups of rows with equal teams. Since the query planner will know that the rows in each group created by the window query will always be contained in the same table partition, the query grouping can be done in parallel on separate table partitions. The explanation of query Q.TopByTeams is:
RETURN RESULTS TO STORED PROCEDURE RECEIVE FROM ALL PARTITIONS SEND PARTITION RESULTS TO COORDINATOR SEQUENTIAL SCAN of "TBL" filter by (column#0 <= ?0) Windowed AGGREGATION ops: RANK() ORDER BY (SORT) SEQUENTIAL SCAN of "PLAYERS"
If we read this from the end to the top, we can clearly see that the sequential scans and sorts are done in parallel on the partitions, the results gathered together, and then returned to the stored procedure. All of the aggregate calculation can be done in parallel on the partitions. In addition, the filtering is done in parallel on the partitions. So, the server has only to merge sort the results from the partitions. To illustrate what would happen if table partitions are not respected by a PARTITION BY clauses, consider a query where the PARTITION BY column is the contestants’ names. It would look like this:
--Q.TopByTeamNamePartitionBy SELECT rnk as rank, score, player_name, team_name FROM ( SELECT RANK() OVER ( PARTITION BY name ORDER BY score DESC ) AS rnk, score, name as player_name, team as team_name FROM players ) as tbl WHERE tbl.rnk <= 3;
The explain plan string for this query is:
RETURN RESULTS TO STORED PROCEDURE SEQUENTIAL SCAN of "TBL" filter by (column#0 <= ?0) Windowed AGGREGATION ops: RANK() ORDER BY (SORT) RECEIVE FROM ALL PARTITIONS SEND PARTITION RESULTS TO COORDINATOR SEQUENTIAL SCAN of "PLAYERS"
Again, reading from the bottom to the top, we see that the partitions simply scan the “players” table and send the unprocessed, scanned table to the coordinator. The coordinator does all the grouping, sorting, and filtering. This will be much slower.
3. Some details
There are some technical details which restrict our implementation of window functions. Some of these are temporary, and will be lifted in subsequent versions. The SQL standard definition of the window definition is very complicated, and we implement only a subset of it.
3.1. Implemented Aggregate Functions
We currently only implement the functions RANK() and DENSE_RANK(). All of the aggregate functions in the SQL 2003 standard are legal here according to the standard document, but our first implementation has only the easiest two.
3.2. Restrictions on windowing operations
We currently only implement the two windowing operations PARTITION BY and ORDER BY. As we have seen, PARTITION BY is optional. According to the SQL standard ORDER BY is also optional, but it is required for our implementation. Also, there can be only one ORDER BY expression, and its type must be INTEGER or TIMESTAMP.
3.2.1. Why is ORDER BY required?
We only implement the RANK() family of aggregate functions, and for RANK() and DENSE_RANK() the ORDER BY clause is required. So the ORDER BY clause is needed for our current implementation.
3.2.2. Why Can There Only Be One ORDER BY Expression?
In standard SQL 2003 there is the notion of window frame units. One can define a window as a set of rows whose values fall into a range of values, or else whose row number is in some set of row numbers. The former units are called range units and the latter units are called row units. The SQL 2003 default is to use range units. We don’t implement row units. In fact, there is no syntax at the current time to express row units. For range units, according to the SQL standard rules, there can be only one ORDER BY expression. Furthermore, the order by expression must have integer or timestamp type. By integer type we mean any integer type – TINYINT, SMALLINT, INTEGER or BIGINT.
3.2.3. Where can a window function appear?
Window functions can only appear directly in a select list. They can appear in the select list of a subquery, but this is an indirect appearance. This is a SQL 2003 restriction, and can’t be lifted.
4. The Future
This is just the beginning. We expect to implement as much of the SQL 2003 standard as makes sense for our customers. There are also some performance enhancements we expect to be making. For example, we currently implement all ORDER BY and PARTITION BY expressions by sorting. We could use ordered indexes for ORDER BY and ordered indexes or hash indexes for PARTITION BY. This would make partition by grouping and sorting be instantaneous. We’ll be exploring this and other performance issues in future releases. We hope you find this feature useful, and we’d love to hear about your use of window analytics and your suggestions for future functionality.