Monday 8 December 2014

SQL Theory: What is a Window Aggregate Function?

What is a Window Aggregate Function?


As I have been presenting at SQL Saturdays and user group events on the window functions and enhancements with SQL Server 2012, I have learned that many SQL Server professionals have not heard about window aggregate functions that have been around since SQL Server 2005. Window aggregate functions allow you to return summary values like SUM, MIN, MAX, COUNT and AVG along with details. In other words, window aggregate functions let you add aggregate calculations to non-aggregate queries.
So, how does this work? Imagine that all the rows are ready to be returned and there is one last chance to perform calculations on these rows before the data is returned to the client.  We can also define the window or set of rows that the aggregate function will perform the calculation on.
The simplest example is to return a grand total over all the rows along with the details. One non-window function method is to use a sub-query within the SELECT clause as shown in the next example. NOTE: These examples use the AdventureWorks database.
SELECT  SalesOrderID, CustomerID,OrderDate,  TotalDue,
 (SELECT SUM(TotalDue) FROM Sales.SalesOrderHeader) AS GrandTotal
FROM Sales.SalesOrderHeader;
When using the window aggregate method, the OVER clause allows us to define the window that the function will operate on. When using the OVER clause, empty parentheses mean the entire result set. Using the OVER clause to define a window for the SUM function to work on, the query can be written like this to get the same results:
SELECT  SalesOrderID, CustomerID,OrderDate,  TotalDue,
 SUM(TotalDue) OVER()AS GrandTotal
FROM Sales.SalesOrderHeader)

All window functions require the OVER clause. The OVER clause contains the definition of the window for the function to operate on, and empty parentheses mean the window consists of the entire result set. The next thing you can do is to create smaller windows by partitioning on one or more of the columns. For example, you may want to get a sub-total of each customer’s orders. To do this without window functions, you could write a correlated sub-query as shown in the next example:
SELECT  SalesOrderID, CustomerID,OrderDate,  TotalDue,
     (SELECT SUM(TotalDue)
      FROM Sales.SalesOrderHeader InnerQuery
      WHERE  InnerQuery.CustomerID = OuterQuery.CustomerID)AS CustomerTotal
FROM Sales.SalesOrderHeader OuterQuery
ORDER BY CustomerID;
 
To accomplish the same thing with window aggregates, you add the PARTITION BY option to the OVER clause. Now, instead of one window for the calculation, you have multiple, smaller windows defined. It’s not the official terminology, but I like to think of these as panes within the larger window. The following query demonstrates how to use the PARTITION BY clause.

SELECT  SalesOrderID, CustomerID,OrderDate,  TotalDue,
 SUM(TotalDue) OVER(PARTITION BY CustomerID) AS CustomerTotal
FROM Sales.SalesOrderHeader OuterQuery
ORDER BY CustomerID;
Window aggregate functions can be used in calculations like any other function and you can have multiple window aggregate functions that operate on differently defined windows in the same query. For example, you can write a query that contains both the Grand Total and the Customer Total. The following is a more complex example:
SELECT  SalesOrderID, CustomerID,OrderDate,  TotalDue,
 SUM(TotalDue) OVER(PARTITION BY CustomerID) AS CustomerTotal,
 SUM(TotalDue) OVER() AS GrandTotal,
 AVG(TotalDue) OVER(PARTITION BY CustomerID) AS AvgCustSale
FROM Sales.SalesOrderHeader OuterQuery
ORDER BY CustomerID;
 
Another thing you can do is apply a window function to the record set of an aggregate query. For example, say you wanted to get a list of CustomerIDs with the total for each customer but you also need to see the grand total. This is a bit trickier, though. My first attempt at writing this query resulted in an error:
SELECT CustomerID, SUM(TotalDue) AS CustomerTotal,
 SUM(TotalDue) OVER() AS GrandTotal
FROM Sales.SalesOrderHeader
GROUP BY CustomerID;

This seems odd since the query does have the TotalDue column within an aggregate function, but that is not what we have asked the window aggregate function to operate on. We really want a sum of all the sums from each CustomerID. To fix this, change the window aggregate function to calculate a sum of the sums:
SELECT CustomerID, SUM(TotalDue) AS CustomerTotal,
 SUM(SUM(TotalDue)) OVER() AS GrandTotal
FROM Sales.SalesOrderHeader
GROUP BY CustomerID;
 

Window aggregate functions allow you to write queries in a much simpler way. They can only appear in the SELECT and ORDER BY clauses and operate after the FROM, WHERE, GROUP BY and HAVING clauses.  All functionality demonstrated in this post was introduced in 2005. So, what are you waiting for? Start thinking about using window aggregate functions in your queries today!


4.13. Window Functions

Window functions perform calculations across rows of the query result. They run after the HAVING clause but before the ORDER BY clause. Invoking a window function requires special syntax using the OVER clause to specify the window. A window has three components:
  • The partition specification, which separates the input rows into different partitions. This is analogous to how the GROUP BY clause separates rows into different groups for aggregate functions.
  • The ordering specification, which determines the order in which input rows will be processed by the window function.
  • The window frame, which specifies a sliding window of rows to be processed by the function for a given row.
For example, the following query ranks orders for each clerk by price:
SELECT orderid, clerk, totalprice,
       RANK() OVER (PARTITION BY clerk
                    ORDER BY totalprice DESC) AS rnk
FROM orders
ORDER BY clerk, rnk

Ranking Functions

cume_dist() → bigint
Returns the cumulative distribution of a value in a group of values. The result is the number of rows preceding or peer with the row in the window ordering of the window partition divided by the total number of rows in the window partition. Thus, any tie values in the ordering will evaluate to the same distribution value.
dense_rank() → bigint
Returns the rank of a value in a group of values. This is similar to rank(), except that tie values do not produce gaps in the sequence.
ntile(n) → bigint
Divides the rows for each window partition into n buckets ranging from 1 to at most n. Bucket values will differ by at most 1. If the number of rows in the partition does not divide evenly into the number of buckets, then the remainder values are distributed one per bucket, starting with the first bucket.
For example, with 6 rows and 4 buckets, the bucket values would be as follows: 1 1 2 2 3 4
percent_rank() → bigint
Returns the percentage ranking of a value in group of values. The result is (r - 1) / (n - 1) where r is the rank() of the row and nis the total number of rows in the window partition.
rank() → bigint
Returns the rank of a value in a group of values. The rank is one plus the number of rows preceding the row that are not peer with the row. Thus, tie values in the ordering will produce gaps in the sequence. The ranking is performed for each window partition.
row_number() → bigint
Returns a unique, sequential number for each row, starting with one, according to the ordering of rows within the window partition.

Value Functions

Warning
These functions do not respect the window frame (default or specified) and instead operate on the entire partition. This is equivalent to RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. In a future release, these functions will be changed to respect the window frame and to use the standard default window frame RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
first_value(x) → [same as input]
Returns the first value of the window (see warning above).
last_value(x) → [same as input]
Returns the last value of the window (see warning above).
nth_value(xoffset) → [same as input]
Returns the value at the specified offset from beginning the window (see warning above). Offsets start at 1. The offset can be any scalar expression. If the offset is null or greater than the number of values in the window, null is returned. It is an error for the offset to be zero or negative.
lead(x[offset[default_value]]) → [same as input]
Returns the value at offset rows after the current row in the window (see warning above). Offsets start at 0, which is the current row. The offset can be any scalar expression. The default offset is 1. If the offset is null or larger than the window, the default_value is returned, or if it is not specified null is returned.
lag(x[offset[default_value]]) → [same as input]
Returns the value at offset rows before the current row in the window (see warning above). Offsets start at 0, which is the current row. The offset can be any scalar expression. The default offset is 1. If the offset is null or larger than the window, the default_value is returned, or if it is not specified null is returned.

0 Comments:

Post a Comment

Subscribe to Post Comments [Atom]

<< Home