Tips for writing more efficient SQL
07 Feb 2020Learned from colleagues some points to pay attention to when writing SQL queries. (This post will be updated from time to time.)
Partitioned tables
Always specify the partition in the where
clause. If you have to retrieve data from several partitions, loop through it one-by-one.
Distinct elements
Note that the two queries
SELECT DISTINCT(item) FROM table_X
and
SELECT item FROM table_X
GROUP BY item
give the same result. However, the second SQL query will be executed faster. There is no difference calling distinct
and group by
via (py)spark though.
JOIN v.s. WHERE
Always use where
to filter the table to be joined to the smallest, e.g.
SELECT c.* FROM credit c
INNER JOIN (
SELECT date, item FROM purchase
WHERE date > 20190207
) p
ON c.date = p.date
WHERE c.eligibility = True
Note that the line WHERE c.eligibility = True
is executed to filter the table credit
before the JOIN
. This shrinks the table credit
to the smallest before joining.