Unfortunately, these functions are not ubiquitously available in SQL dialects. As of jOOQ 3.11, they are known to work in these dialects:

Dialect | As aggregate function | As window function |
---|---|---|

MariaDB 10.3.3 | No | Yes |

Oracle 18c | Yes | Yes |

PostgreSQL 11 | Yes | No |

SQL Server 2017 | No | Yes |

Teradata 16 | Yes | No |

Oracle has the most sophisticated implementation, which supports both the ordered set aggregate function, and the window function version:

- Aggregate function:
`PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY x)`

- Window function:
`PERCENTILE_DISC (0.5) WITHIN GROUP (ORDER BY x) OVER (PARTITION BY y)`

### Workarounds if the feature is unavailable

Luckily, as soon as an RDBMS supports window functions, we can easily emulate `PERCENTILE_DISC`

using `PERCENT_RANK`

and `FIRST_VALUE`

as follows. We’re using the Sakila database in this example.

**Emulating window functions**

Let’s emulate these first, as it requires a bit less SQL transformations. This query works out of the box in Oracle:

SELECT DISTINCT rating, percentile_disc(0.5) WITHIN GROUP (ORDER BY length) OVER() x1, percentile_disc(0.5) WITHIN GROUP (ORDER BY length) OVER (PARTITION BY rating) x2 FROM film ORDER BY rating;

Yielding

RATING X1 X2 ------------------- G 114 107 NC-17 114 112 PG 114 113 PG-13 114 125 R 114 115

What we can read from this is that the median length of all films is 114 minutes, and the median lengths of films *per rating* range from 107 minutes to 125 minutes. I’ve used `DISTINCT`

because we don’t care about visualising these values on a per-row basis in this case. This also works in SQL Server.

Now, let’s assume we’re using PostgreSQL, which doesn’t support inverse distribution window functions, or MySQL, which doesn’t support inverse distribution functions at all, but both support `PERCENT_RANK`

and `FIRST_VALUE`

. Here’s the complete query:

SELECT DISTINCT rating, first_value(length) OVER ( ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1, first_value(length) OVER ( PARTITION BY rating ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2 FROM ( SELECT rating, length, percent_rank() OVER (ORDER BY length) p1, percent_rank() OVER (PARTITION BY rating ORDER BY length) p2 FROM film ) t ORDER BY rating;

So, we’re doing this in two steps (visual example further down):

`PERCENT_RANK`

: In a derived table, we’re calculating the`PERCENT_RANK`

value, which attributes a rank to each row ordered by length, going from 0 to 1. This makes sense. When looking for the median value, we’re really looking for the value whose`PERCENT_RANK`

is 0.5 or less. When looking for the 90% percentile, we’re looking for the value whose`PERCENT_RANK`

is 0.9 or less`FIRST_VALUE`

: Once we’ve found the`PERCENT_RANK`

, we’re not quite done yet. We need to find the*last*row whose`PERCENT_RANK`

is less or equal to the percentile we’re interested in. I could have used`LAST_VALUE`

, but then I would have needed to resort to using the quite verbose range clause of window functions. Instead, I when ordering the rows by`PERCENT_RANK`

(p1 or p2), I translated all ranks higher than the percentile I’m looking for into`NULL`

using a`CASE`

expression, and then I made sure using`NULLS LAST`

that the percentile I’m looking for will be the first row in the`FIRST_VALUE`

function’s window specification. Easy!

To visualise this, let’s run these queries, which also project the p1 and p2 values respectively:

SELECT length, CASE WHEN p1 <= 0.5 THEN p1 END::numeric(3,2) p1, first_value(length) OVER ( ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1 FROM ( SELECT length, percent_rank() OVER (ORDER BY length) p1 FROM film ) t ORDER BY length;

The result is

length |p1 |x1 | -------|-----|----| 46 |0.00 |114 | 46 |0.00 |114 | 46 |0.00 |114 | 46 |0.00 |114 | 46 |0.00 |114 | 47 |0.01 |114 | ... 113 |0.49 |114 | 114 |0.49 |114 | 114 |0.49 |114 | 114 |0.49 |114 | 114 |0.49 |114 | 114 |0.49 |114 | 114 |0.49 |114 | 114 |0.49 |114 | 114 |0.49 |114 | 114 |0.49 |114 | 114 |0.49 |114 | <-- Last row whose PERCENT_RANK is <= 0.5 115 | |114 | 115 | |114 | 115 | |114 | 115 | |114 | 115 | |114 | 115 | |114 | ... 185 | |114 | 185 | |114 | 185 | |114 |

So the `FIRST_VALUE`

function just searches for that first row (descendingly, i.e. bottom up) whose p1 value is non-null.

The same for p2:

SELECT length, rating, CASE WHEN p2 <= 0.5 THEN p2 END::numeric(3,2) p2, first_value(length) OVER ( PARTITION BY rating ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2 FROM ( SELECT rating, length, percent_rank() OVER (PARTITION BY rating ORDER BY length) p2 FROM film ) t ORDER BY rating, length;

Yielding:

length |rating |p2 |x2 | -------|-------|-----|----| 47 |G |0.00 |107 | 47 |G |0.00 |107 | 48 |G |0.01 |107 | 48 |G |0.01 |107 | ... 105 |G |0.47 |107 | 106 |G |0.49 |107 | 107 |G |0.49 |107 | 107 |G |0.49 |107 | <-- Last row in G partition whose 108 |G | |107 | PERCENT_RANK is <= 0.5 108 |G | |107 | 109 |G | |107 | ... 185 |G | |107 | 185 |G | |107 | 46 |PG |0.00 |113 | 47 |PG |0.01 |113 | 47 |PG |0.01 |113 | ... 111 |PG |0.49 |113 | 113 |PG |0.49 |113 | 113 |PG |0.49 |113 | <-- Last row in PG partition whose 114 |PG | |113 | PERCENT_RANK is <= 0.5 114 |PG | |113 | ...

Perfect! Notice if your RDBMS doesn’t support the `NULLS LAST`

clause in your `ORDER BY`

clause (e.g. MySQL), you might either hope that it defaults to sorting `NULLS LAST`

(MySQL does), or you can emulate it as such:

-- This ORDER BY x NULLS LAST -- Is the same as this ORDER BY CASE WHEN x IS NULL THEN 1 ELSE 0 END, x

**Emulating aggregate functions**

If you’re using SQL Server and want aggregate function behaviour, I recommend using the window function instead and emulate aggregation using `DISTINCT`

. It will probably be easier than the emulation below. Do check for performance though!

When you’re using e.g. MySQL, which doesn’t have inverse distribution function support at all, then this chapter is for you.

Here’s how to use the aggregate function version in Oracle:

-- Without GROUP BY SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY length) x1 FROM film; -- With GROUP BY SELECT rating, percentile_disc(0.5) WITHIN GROUP (ORDER BY length) x2 FROM film GROUP BY rating ORDER BY rating;

Trivial! The result is the same as before:

X1 --- 114 RATING X2 ----------- G 107 NC-17 112 PG 113 PG-13 125 R 115

Now, let’s emulate these on e.g. MySQL, using window functions.

-- Without GROUP BY SELECT MAX(x1) x1 FROM ( SELECT first_value(length) OVER ( ORDER BY CASE WHEN p1 <= 0.5 THEN p1 END DESC NULLS LAST) x1 FROM ( SELECT length, percent_rank() OVER (ORDER BY length) p1 FROM film ) t ) t;

It’s exactly the same technique as before, except we now have to turn the window function behaviour (don’t group, preserve rows, repeat aggregation value on each row) back into aggregate function behaviour (group, collapse rows) by using an aggregate function, such as `MAX()`

. This is the same as what I did before with `DISTINCT`

, for illustration purposes.

-- With GROUP BY SELECT rating, MAX(x2) x2 FROM ( SELECT rating, first_value(length) OVER ( PARTITION BY rating ORDER BY CASE WHEN p2 <= 0.5 THEN p2 END DESC NULLS LAST) x2 FROM ( SELECT rating, length, percent_rank() OVER ( PARTITION BY rating ORDER BY length) p2 FROM film ) t ) t GROUP BY rating ORDER BY rating;

All we’re really doing (again) is translate the `GROUP BY`

expression to a `PARTITION BY`

expression in the window function, and then redo the previous exercise.

### Conclusion

Window functions are extremely powerful. They can be used and combined to calculate a variety of other aggregations. With the above approach, we can calculate the `PERCENTILE_DISC`

inverse distribution function, which is not readily available in most RDBMS using a more verbose but equally powerful approach that uses `PERCENT_RANK`

and `FIRST_VALUE`

in all RDBMS that support window functions. A similar exercise could be made with `PERCENTILE_CONT`

with a slightly more tricky approach to finding that `FIRST_VALUE`

, which I’ll leave as an exercise to the reader.

A future jOOQ version might emulate this for you, automatically.

Liked this article? You may also like 10 SQL Tricks That You Didn’t Think Were Possible.

Source link https://blog.jooq.org/2019/01/28/how-to-emulate-percentile_disc-in-mysql-and-other-rdbms/