SQL Theory: Window Offset Functions, Window Distribution Functions
Window Offset Functions
Window offset functions let you return a value from a row that's in a certain offset from the current row (LAG and LEAD) or from the first or last row in the window frame (FIRST_VALUE and LAST_VALUE). Let's start with the LAG and LEAD functions.
Window offset functions LAG and LEAD. The LAG function returns a value from a row in the window partition that, based on the window ordering, is the specified number of rows before the current row. Similarly, the LEAD function returns a value from a row in the window partition that, based on the window ordering, is the specified number of rows after the current row. By default, the LAG and LEAD functions assume an offset of one row if an explicit offset wasn't specified.
You indicate the value you want to return from the row in question as the first argument to LAG and LEAD. If you want to indicate an explicit offset, you indicate it as the second argument to the function. If a row isn't found in the requested offset, the functions return a NULL. If you want to return a different value in case a row isn't found, you can indicate such a value as the third argument to the function.
As an example, the following query returns for each customer order the value of the customer's previous order (LAG), as well as the value of the customer's next order (LEAD):
USE TSQL2012;
SELECT custid, orderdate, orderid, val,
LAG(val) OVER(PARTITION BY custid
ORDER BY orderdate, orderid) AS prevval,
LEAD(val) OVER(PARTITION BY custid
ORDER BY orderdate, orderid) AS nextval
FROM Sales.OrderValues;
SELECT custid, orderdate, orderid, val,
LAG(val) OVER(PARTITION BY custid
ORDER BY orderdate, orderid) AS prevval,
LEAD(val) OVER(PARTITION BY custid
ORDER BY orderdate, orderid) AS nextval
FROM Sales.OrderValues;
Figure 1 shows the output of this query.
Figure 1: Output of query using LAG and LEAD with default offset
custid orderdate orderid val prevval nextval
------- ----------- ------- ------- ------- --------
1 2007-08-25 10643 814.50 NULL 878.00
1 2007-10-03 10692 878.00 814.50 330.00
1 2007-10-13 10702 330.00 878.00 845.80
1 2008-01-15 10835 845.80 330.00 471.20
1 2008-03-16 10952 471.20 845.80 933.50
1 2008-04-09 11011 933.50 471.20 NULL
2 2006-09-18 10308 88.80 NULL 479.75
2 2007-08-08 10625 479.75 88.80 320.00
2 2007-11-28 10759 320.00 479.75 514.40
2 2008-03-04 10926 514.40 320.00 NULL
...
------- ----------- ------- ------- ------- --------
1 2007-08-25 10643 814.50 NULL 878.00
1 2007-10-03 10692 878.00 814.50 330.00
1 2007-10-13 10702 330.00 878.00 845.80
1 2008-01-15 10835 845.80 330.00 471.20
1 2008-03-16 10952 471.20 845.80 933.50
1 2008-04-09 11011 933.50 471.20 NULL
2 2006-09-18 10308 88.80 NULL 479.75
2 2007-08-08 10625 479.75 88.80 320.00
2 2007-11-28 10759 320.00 479.75 514.40
2 2008-03-04 10926 514.40 320.00 NULL
...
Because the calculations are supposed to be performed for each customer independently, the functions partition the window by custid. As for window ordering, it's based on orderdate and orderid as a tiebreaker. Observe that the functions rely on the default offset 1 and return NULL when a row isn't found in the applicable offset.
You can freely mix in the same expression detail elements from the row as well as calls to window functions. For example, the following query computes the difference between the customer's current and previous order values, as well as the difference between the customer's current and next order values:
SELECT custid, orderdate, orderid, val,
val - LAG(val) OVER(PARTITION BY custid
ORDER BY orderdate, orderid) AS diffprev,
val - LEAD(val) OVER(PARTITION BY custid
ORDER BY orderdate, orderid) AS diffnext
FROM Sales.OrderValues;
val - LAG(val) OVER(PARTITION BY custid
ORDER BY orderdate, orderid) AS diffprev,
val - LEAD(val) OVER(PARTITION BY custid
ORDER BY orderdate, orderid) AS diffnext
FROM Sales.OrderValues;
Figure 2 shows the output of this query.
Figure 2: Output of query using expressions with LAG and LEAD
custid orderdate orderid val diffprev diffnext
------- ----------- ------- ------ ---------- ---------
1 2007-08-25 10643 814.50 NULL -63.50
1 2007-10-03 10692 878.00 63.50 548.00
1 2007-10-13 10702 330.00 -548.00 -515.80
1 2008-01-15 10835 845.80 515.80 374.60
1 2008-03-16 10952 471.20 -374.60 -462.30
1 2008-04-09 11011 933.50 462.30 NULL
2 2006-09-18 10308 88.80 NULL -390.95
2 2007-08-08 10625 479.75 390.95 159.75
2 2007-11-28 10759 320.00 -159.75 -194.40
2 2008-03-04 10926 514.40 194.40 NULL
...
------- ----------- ------- ------ ---------- ---------
1 2007-08-25 10643 814.50 NULL -63.50
1 2007-10-03 10692 878.00 63.50 548.00
1 2007-10-13 10702 330.00 -548.00 -515.80
1 2008-01-15 10835 845.80 515.80 374.60
1 2008-03-16 10952 471.20 -374.60 -462.30
1 2008-04-09 11011 933.50 462.30 NULL
2 2006-09-18 10308 88.80 NULL -390.95
2 2007-08-08 10625 479.75 390.95 159.75
2 2007-11-28 10759 320.00 -159.75 -194.40
2 2008-03-04 10926 514.40 194.40 NULL
...
As I mentioned, the default when an explicit offset isn't specified is 1 -- but you can indicate your own value as a second argument to the function. You can also indicate as a third argument what to return instead of a NULL when a row isn't found in the requested offset. Here's an example that specifies 2 as the offset and 0 as the value to return when a row isn't found:
SELECT custid, orderdate, orderid, val,
LAG(val, 2, 0) OVER(PARTITION BY custid
ORDER BY orderdate, orderid) AS prev2val,
LEAD(val, 2, 0) OVER(PARTITION BY custid
ORDER BY orderdate, orderid) AS next2val
FROM Sales.OrderValues;
LAG(val, 2, 0) OVER(PARTITION BY custid
ORDER BY orderdate, orderid) AS prev2val,
LEAD(val, 2, 0) OVER(PARTITION BY custid
ORDER BY orderdate, orderid) AS next2val
FROM Sales.OrderValues;
Figure 3 shows the output of this query.
Figure 3: Output of query using LAG and LEAD with nondefault offset
custid orderdate orderid val prev2val next2val
------- ----------- ------- ------ --------- --------
1 2007-08-25 10643 814.50 0.00 330.00
1 2007-10-03 10692 878.00 0.00 845.80
1 2007-10-13 10702 330.00 814.50 471.20
1 2008-01-15 10835 845.80 878.00 933.50
1 2008-03-16 10952 471.20 330.00 0.00
1 2008-04-09 11011 933.50 845.80 0.00
2 2006-09-18 10308 88.80 0.00 320.00
2 2007-08-08 10625 479.75 0.00 514.40
2 2007-11-28 10759 320.00 88.80 0.00
2 2008-03-04 10926 514.40 479.75 0.00
...
------- ----------- ------- ------ --------- --------
1 2007-08-25 10643 814.50 0.00 330.00
1 2007-10-03 10692 878.00 0.00 845.80
1 2007-10-13 10702 330.00 814.50 471.20
1 2008-01-15 10835 845.80 878.00 933.50
1 2008-03-16 10952 471.20 330.00 0.00
1 2008-04-09 11011 933.50 845.80 0.00
2 2006-09-18 10308 88.80 0.00 320.00
2 2007-08-08 10625 479.75 0.00 514.40
2 2007-11-28 10759 320.00 88.80 0.00
2 2008-03-04 10926 514.40 479.75 0.00
...
Window offset functions FIRST_VALUE and LAST_VALUE. The functions FIRST_VALUE and LAST_VALUE return the requested value from the first and last rows, respectively, from the applicable window frame. In "SQL Server 2012's Window Functions, Part 1," I described the concept of a window frame in detail. Quite often, you just want to return the first and last values from the window partition in general and not necessarily from a more restricted window frame. However, this can be a bit tricky to achieve. First, let me provide a query that correctly returns along with each customer's order the values of the customer's first and last orders using the FIRST_VALUE and LAST_VALUE functions, respectively:
SELECT custid, orderdate, orderid, val,
FIRST_VALUE(val) OVER(PARTITION BY custid
ORDER BY orderdate, orderid
ROWS UNBOUNDED PRECEDING)
AS val_firstorder,
LAST_VALUE(val) OVER(PARTITION BY custid
ORDER BY orderdate, orderid
ROWS BETWEEN CURRENT ROW
AND UNBOUNDED FOLLOWING)
AS val_lastorder
FROM Sales.OrderValues;
FIRST_VALUE(val) OVER(PARTITION BY custid
ORDER BY orderdate, orderid
ROWS UNBOUNDED PRECEDING)
AS val_firstorder,
LAST_VALUE(val) OVER(PARTITION BY custid
ORDER BY orderdate, orderid
ROWS BETWEEN CURRENT ROW
AND UNBOUNDED FOLLOWING)
AS val_lastorder
FROM Sales.OrderValues;
Figure 4 shows the output of this query.
Figure 4: Output of query with FIRST_VALUE and LAST_VALUE
custid orderdate orderid val val_firstorder val_lastorder
------- ----------- ------- ------- --------------- --------------
1 2007-08-25 10643 814.50 814.50 933.50
1 2007-10-03 10692 878.00 814.50 933.50
1 2007-10-13 10702 330.00 814.50 933.50
1 2008-01-15 10835 845.80 814.50 933.50
1 2008-03-16 10952 471.20 814.50 933.50
1 2008-04-09 11011 933.50 814.50 933.50
2 2006-09-18 10308 88.80 88.80 514.40
2 2007-08-08 10625 479.75 88.80 514.40
2 2007-11-28 10759 320.00 88.80 514.40
2 2008-03-04 10926 514.40 88.80 514.40
...
------- ----------- ------- ------- --------------- --------------
1 2007-08-25 10643 814.50 814.50 933.50
1 2007-10-03 10692 878.00 814.50 933.50
1 2007-10-13 10702 330.00 814.50 933.50
1 2008-01-15 10835 845.80 814.50 933.50
1 2008-03-16 10952 471.20 814.50 933.50
1 2008-04-09 11011 933.50 814.50 933.50
2 2006-09-18 10308 88.80 88.80 514.40
2 2007-08-08 10625 479.75 88.80 514.40
2 2007-11-28 10759 320.00 88.80 514.40
2 2008-03-04 10926 514.40 88.80 514.40
...
You're probably wondering: Why the lengthy window descriptors? More specifically, why the need for an explicit ROWS clause if the request isn't concerned with a more restricted window frame, but rather the partition in general?
This has to do with the fact that the window order clause is in essence only part of the framing specification, and if an explicit window frame isn't specified, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. For the FIRST_VALUE function, relying on the default frame would still yield the correct result because the first row in the default window frame is the first row in the window partition. Still, last month I gave a recommendation to stick to ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW instead of the default RANGE option when possible.
As for the LAST_VALUE function, try to think what it would mean to rely on the default framing option. It means that you will basically always get the value from the current row, because that's the last row in the default window frame. So for LAST_VALUE, you really need to be explicit about the frame specification and indicate the ROWS unit, CURRENT ROW as the lower bound and UNBOUNDED FOLLOWING as the upper bound (assuming you want the value from the last row in the window partition).
Just as I showed with LAG and LEAD, FIRST_VALUE and LAST_VALUE can likewise be mixed in expressions that also involve detail elements from the row. For example, the following query returns the difference between the customer's current and first order values, as well as the difference between the customer's current and last order values:
SELECT custid, orderdate, orderid, val,
val - FIRST_VALUE(val) OVER(PARTITION BY custid
ORDER BY orderdate, orderid
ROWS UNBOUNDED PRECEDING) AS difffirst,
val - LAST_VALUE(val) OVER(PARTITION BY custid
ORDER BY orderdate, orderid
ROWS BETWEEN CURRENT ROW
AND UNBOUNDED FOLLOWING) AS difflast
FROM Sales.OrderValues;
val - FIRST_VALUE(val) OVER(PARTITION BY custid
ORDER BY orderdate, orderid
ROWS UNBOUNDED PRECEDING) AS difffirst,
val - LAST_VALUE(val) OVER(PARTITION BY custid
ORDER BY orderdate, orderid
ROWS BETWEEN CURRENT ROW
AND UNBOUNDED FOLLOWING) AS difflast
FROM Sales.OrderValues;
Figure 5 shows the output of this query.
Figure 5: Output of query with expressions involving FIRST_VALUE and LAST_VALUE
custid orderdate orderid val difffirst difflast
------- ----------- ------- ------- --------- ---------
1 2007-08-25 10643 814.50 0.00 -119.00
1 2007-10-03 10692 878.00 63.50 -55.50
1 2007-10-13 10702 330.00 -484.50 -603.50
1 2008-01-15 10835 845.80 31.30 -87.70
1 2008-03-16 10952 471.20 -343.30 -462.30
1 2008-04-09 11011 933.50 119.00 0.00
2 2006-09-18 10308 88.80 0.00 -425.60
2 2007-08-08 10625 479.75 390.95 -34.65
2 2007-11-28 10759 320.00 231.20 -194.40
2 2008-03-04 10926 514.40 425.60 0.00
...
------- ----------- ------- ------- --------- ---------
1 2007-08-25 10643 814.50 0.00 -119.00
1 2007-10-03 10692 878.00 63.50 -55.50
1 2007-10-13 10702 330.00 -484.50 -603.50
1 2008-01-15 10835 845.80 31.30 -87.70
1 2008-03-16 10952 471.20 -343.30 -462.30
1 2008-04-09 11011 933.50 119.00 0.00
2 2006-09-18 10308 88.80 0.00 -425.60
2 2007-08-08 10625 479.75 390.95 -34.65
2 2007-11-28 10759 320.00 231.20 -194.40
2 2008-03-04 10926 514.40 425.60 0.00
...
As food for thought, try to think why a grouped version (as opposed to the existing windowed version) of the FIRST_VALUE and LAST_VALUE functions would have made perfect sense, although -- alas -- there's no support for such a version. Also, can you think of a workaround that would achieve the same result?
Window Distribution Functions
Window distribution functions provide statistical computations. SQL Server 2012 implements two window rank distribution functions called PERCENT_RANK and CUME_DIST and two window inverse distribution functions called PERCENTILE_DISC and PERCENTILE_CONT. I'll start with window rank distribution functions.
To explain what the PERCENT_RANK and CUME_DIST functions calculate, I need to first provide a definition of the elements involved in their internal computation. Let rk be the rank of the row (using the same partitioning and ordering as the window function's partitioning and ordering). Let nr be the number of rows in the partition. Let np be the number of rows that precede or peer with the current row.
Then, the PERCENT_RANK computes a percentile rank as (rk - 1) / (nr - 1), and CUME_DIST computes a cumulative distribution as np / nr. As an example, the following query computes the percentile rank and cumulative distribution of student test scores, partitioned by testid and ordered by score:
SELECT testid, studentid, score,
PERCENT_RANK() OVER(PARTITION BY testid
ORDER BY score) AS percentrank,
CUME_DIST() OVER(PARTITION BY testid
ORDER BY score) AS cumedist
FROM Stats.Scores;
PERCENT_RANK() OVER(PARTITION BY testid
ORDER BY score) AS percentrank,
CUME_DIST() OVER(PARTITION BY testid
ORDER BY score) AS cumedist
FROM Stats.Scores;
Figure 6 shows the output of this query.
Figure 6: Output of query with PERCENT_RANK and CUME_DIST
testid studentid score percentrank cumedist
---------- ---------- ----- ------------ ---------
Test ABC Student E 50 0.000 0.111
Test ABC Student C 55 0.125 0.333
Test ABC Student D 55 0.125 0.333
Test ABC Student H 65 0.375 0.444
Test ABC Student I 75 0.500 0.556
Test ABC Student F 80 0.625 0.778
Test ABC Student B 80 0.625 0.778
Test ABC Student A 95 0.875 1.000
Test ABC Student G 95 0.875 1.000
Test XYZ Student E 50 0.000 0.100
Test XYZ Student C 55 0.111 0.300
Test XYZ Student D 55 0.111 0.300
Test XYZ Student H 65 0.333 0.400
Test XYZ Student I 75 0.444 0.500
Test XYZ Student B 80 0.556 0.700
Test XYZ Student F 80 0.556 0.700
Test XYZ Student G 95 0.778 1.000
Test XYZ Student J 95 0.778 1.000
Test XYZ Student A 95 0.778 1.000
---------- ---------- ----- ------------ ---------
Test ABC Student E 50 0.000 0.111
Test ABC Student C 55 0.125 0.333
Test ABC Student D 55 0.125 0.333
Test ABC Student H 65 0.375 0.444
Test ABC Student I 75 0.500 0.556
Test ABC Student F 80 0.625 0.778
Test ABC Student B 80 0.625 0.778
Test ABC Student A 95 0.875 1.000
Test ABC Student G 95 0.875 1.000
Test XYZ Student E 50 0.000 0.100
Test XYZ Student C 55 0.111 0.300
Test XYZ Student D 55 0.111 0.300
Test XYZ Student H 65 0.333 0.400
Test XYZ Student I 75 0.444 0.500
Test XYZ Student B 80 0.556 0.700
Test XYZ Student F 80 0.556 0.700
Test XYZ Student G 95 0.778 1.000
Test XYZ Student J 95 0.778 1.000
Test XYZ Student A 95 0.778 1.000
As an exercise, try to write SQL Server 2008-compatible solutions that compute percentile rank and cumulative distribution.
SQL Server 2012 also implements two window inverse distribution functions called PERCENTILE_DISC and PERCENTILE_CONT. A percentile p, loosely speaking, is the value v from the population, such that p percent of the values are less than v. For example, if 50 percent of the values in the population are less than some value v, then v is the 50th percentile, also known as the median.
The two function variations implement two different distribution models. The PERCENTILE_DISC function implements a discrete distribution model in which the returned value must be one of the values from the population. The PERCENTILE_CONT function implements a continuous distribution model in which the returned value is interpolated from the existing values, assuming continuous distribution.
As a simple example, in case of an even number of values, PERCENTILE_CONT will compute the median as the average of the two middle points. It gets far trickier to understand the interpolation when the requested percentile isn't the median, but fortunately the function has this complexity embedded into it.
For example, the following code computes the median of student test scores per test using both PERCENTILE_DISC and PERCENTILE_CONT:
DECLARE @pct AS FLOAT = 0.5;
SELECT testid, score,
PERCENTILE_DISC(@pct) WITHIN GROUP(ORDER BY score)
OVER(PARTITION BY testid) AS percentiledisc,
PERCENTILE_CONT(@pct) WITHIN GROUP(ORDER BY score)
OVER(PARTITION BY testid) AS percentilecont
FROM Stats.Scores;
SELECT testid, score,
PERCENTILE_DISC(@pct) WITHIN GROUP(ORDER BY score)
OVER(PARTITION BY testid) AS percentiledisc,
PERCENTILE_CONT(@pct) WITHIN GROUP(ORDER BY score)
OVER(PARTITION BY testid) AS percentilecont
FROM Stats.Scores;
Figure 7 shows the output of this query.
Figure 7: Output of query with PERCENTILE_DISC and PERCENTILE_CONT
testid score percentiledisc percentilecont
---------- ----- -------------- ----------------------
Test ABC 50 75 75
Test ABC 55 75 75
Test ABC 55 75 75
Test ABC 65 75 75
Test ABC 75 75 75
Test ABC 80 75 75
Test ABC 80 75 75
Test ABC 95 75 75
Test ABC 95 75 75
Test XYZ 50 75 77.5
Test XYZ 55 75 77.5
Test XYZ 55 75 77.5
Test XYZ 65 75 77.5
Test XYZ 75 75 77.5
Test XYZ 80 75 77.5
Test XYZ 80 75 77.5
Test XYZ 95 75 77.5
Test XYZ 95 75 77.5
Test XYZ 95 75 77.5
---------- ----- -------------- ----------------------
Test ABC 50 75 75
Test ABC 55 75 75
Test ABC 55 75 75
Test ABC 65 75 75
Test ABC 75 75 75
Test ABC 80 75 75
Test ABC 80 75 75
Test ABC 95 75 75
Test ABC 95 75 75
Test XYZ 50 75 77.5
Test XYZ 55 75 77.5
Test XYZ 55 75 77.5
Test XYZ 65 75 77.5
Test XYZ 75 75 77.5
Test XYZ 80 75 77.5
Test XYZ 80 75 77.5
Test XYZ 95 75 77.5
Test XYZ 95 75 77.5
Test XYZ 95 75 77.5
As an exercise, see if you can figure out how to implement a SQL Server 2008-compatible solution for both functions, given any percent as input.
Still More to Come
This article is the second in a series of articles about window functions. Last month, I introduced the concept of windowing. This month, I covered window offset functions and window distribution functions. Next month, I'll cover optimization of window functions.
0 Comments:
Post a Comment
Subscribe to Post Comments [Atom]
<< Home