I stumbled upon a very interesting jOOQ question on Stack Overflow that required the calculation of a weighted average. Why is that.

### Problem description

Assuming you have this database (using PostgreSQL syntax):

create table transactions ( id bigint not null primary key, lines bigint not null, price numeric(18, 2) not null, profit numeric(18, 2) not null ); create table lines ( id bigint not null primary key, transaction_id bigint not null references transactions, total bigint not null, quantity bigint not null, profit numeric(18, 2) not null );

As can be seen, this schema is slightly denormalised as the number of lines per transaction are precalculated in the `transactions.lines`

column. This will turn out to be quite useful for this calculation, but it isn’t strictly necessary.

Now, in the previously linked Stack Overflow question, a report was desired that would calculate:

- An aggregation of sums as provided by the line items
- An aggregation of averages as provided by the transactions

This would be straightforward with two separate queries:

**Sums provided by the line items**

SELECT sum(profit) AS total_profit, sum(total) AS total_sales_amount, sum(quantity) AS total_items_sold FROM lines

**Averages provided by the transactions**

SELECT avg(lines) AS avg_items_p_trx, avg(price) AS avg_price_p_trx, avg(profit) AS avg_profit_p_trx FROM transactions

So far so good.

### Doing it in one query

Now, these queries are simplified from the original, which needed to join the two tables in order to add additional predicates. Also, let’s assume that these tables are quite large, so running two queries might lead to the report being too slow. A single query would be much better.

We might be attempted to simply combined the two:

-- Wrong query SELECT sum(l.profit) AS total_profit, sum(l.total) AS total_sales_amount, sum(l.quantity) AS total_items_sold, avg(t.lines) AS avg_items_p_trx, avg(t.price) AS avg_price_p_trx, avg(t.profit) AS avg_profit_p_trx FROM lines AS l JOIN transactions AS t ON t.id = l.transaction_id

But this query is wrong. While the sums are still correct, the averages are not, simply because the join produces duplicate transaction rows per lines. Imagine a transaction having 3 or 5 lines:

SELECT l.id AS line_id, t.id AS transaction_id, t.lines, t.price FROM lines AS l JOIN transactions AS t ON t.id = l.transaction_id

The output would be:

LINE_ID TRANSACTION_ID LINES PRICE ------------------------------------------- 1 1 3 20.00 2 1 3 20.00 3 1 3 20.00 4 2 5 100.00 4 2 5 100.00 4 2 5 100.00 4 2 5 100.00 4 2 5 100.00

- The average number of lines “avg_items_p_trx” should be 4 = (3 lines + 5 lines) / 2 transactions. But if we calculate
`avg(t.lines)`

over the entire data set, we get 4.25 (3×3 lines + 5×5 lines) / 8 items. - The average price “avg_price_p_trx” should be 60.00 = (20.00 + 100.00) / 2 transactions. But if we calculate
`avg(t.price)`

over the entire data set, we get 80.00 (3×20.00 + 5×100.00) / 8 items.

### How can this be fixed?

Given that each transaction is duplicated because of the join with lines, we have to calculate a weighted average, not an ordinary average. The idea is that instead of using the `AVG()`

aggregate function, we now have to divide the value we want to get an average of by the number of items (i.e. the number of times the value is repeated because of the join), and then divide the sum of that division by the number of transactions.

Prose never describes logic well, so let’s use code. The correct query is:

SELECT sum(l.profit) AS total_profit, sum(l.total) AS total_sales_amount, sum(l.quantity) AS total_items_sold, sum(t.lines / t.lines) / count(DISTINCT t.id) avg_items_p_trx, sum(t.price / t.lines) / count(DISTINCT t.id) avg_price_p_trx, sum(t.profit / t.lines) / count(DISTINCT t.id) avg_profit_p_trx FROM lines AS l JOIN transactions AS t ON t.id = l.transaction_id

With the above data set:

LINE_ID TRANSACTION_ID LINES LINES/LINES PRICE PRICE/LINES ---------------------------------------------------------------- 1 1 3 1 20.00 6.66 2 1 3 1 20.00 6.66 3 1 3 1 20.00 6.66 4 2 5 1 100.00 20.00 4 2 5 1 100.00 20.00 4 2 5 1 100.00 20.00 4 2 5 1 100.00 20.00 4 2 5 1 100.00 20.00

We now get the correct weighted averages:

- The average number of lines “avg_items_p_trx” is now 4 =

(3/3 + 3/3 + 3/3 + 5/5 + 5/5 + 5/5 + 5/5 + 5/5) / distinct transactions - The average price “avg_price_p_trx” is now 60.00 =

(20.00/3 + 20.00/3 + 20.00/3 + 100.00/5 + 100.00/5 + 100.00/5 + 100.00/5 + 100.00/5) / 2 distinct transactions

Note that “avg_items_p_trx” can be simplified:

SELECT sum(l.profit) AS total_profit, sum(l.total) AS total_sales_amount, sum(l.quantity) AS total_items_sold, count(*) / count(DISTINCT t.id) avg_items_p_trx, sum(t.price / t.lines) / count(DISTINCT t.id) avg_price_p_trx, sum(t.profit / t.lines) / count(DISTINCT t.id) avg_profit_p_trx FROM lines AS l JOIN transactions AS t ON t.id = l.transaction_id

Done!

### Normalised version

Notice that this solution profited from the fact that the number of lines per transaction was pre-calculated. We can of course also calculate it on the fly, e.g. using window functions. If it weren’t available, we could do it like this:

SELECT sum(l.profit) AS total_profit, sum(l.total) AS total_sales_amount, sum(l.quantity) AS total_items_sold, count(*) / count(DISTINCT t.id) avg_items_p_trx, sum(t.price / l.lines) / count(DISTINCT t.id) avg_price_p_trx, sum(t.profit / l.lines) / count(DISTINCT t.id) avg_profit_p_trx FROM ( SELECT l.*, count(*) OVER (PARTITION BY l.transaction_id) lines FROM lines AS l ) AS l JOIN transactions AS t ON t.id = l.transaction_id

Or, we turn the entire join into a 1:1 relationship by pre-aggregating all the data from lines into one row per transaction. This works because we only calculate sums from the lines table:

SELECT sum(l.profit_per_transaction) AS total_profit, sum(l.total_per_transaction) AS total_sales_amount, sum(l.quantity_per_transaction) AS total_items_sold, avg(l.lines_per_transaction) AS avg_items_p_trx, avg(t.price) AS avg_price_p_trx, avg(t.profit) AS avg_profit_p_trx FROM ( SELECT l.transaction_id sum(l.profit) AS profit_per_transaction, sum(l.total) AS total_per_transaction, sum(l.quantity) AS quantity_per_transaction, count(*) AS lines_per_transaction FROM lines AS l GROUP BY l.transaction_id ) AS l JOIN transactions AS t ON t.id = l.transaction_id

Source link https://blog.jooq.org/2019/03/15/calculating-weighted-averages-when-joining-tables-in-sql/