Fast inequality joins4 minute read
There is a new cool feature in Presto that will speed up some inequality joins.
Inequality join conditions are all conjunctions that are not equality comparisons. For example in query like this:
SELECT * FROM t1, t2 WHERE t1.bucket = t2.bucket AND t1.val1 < t2.val2;
This t1.val1 < t2.val2 is an inequality join condition that we will be talking about. In Presto such joins previously would be executed as normal inner joins with only equality conditions (using hash tables) and result of such joins would be filtered out using inequality join conditions. Lets take a look at an extreme example where this is clearly an issue and define tables t1 and t2 using Presto Memory connector as follows:
CREATE TABLE memory.default.t1 (bucket bigint, val1 bigint); CREATE TABLE memory.default.t2 (bucket bigint, val2 bigint);
and fill those tables with some random data:
INSERT INTO t1 SELECT orderkey % 10000, (orderkey * 13) % 1000 FROM tpch.sf1.lineitem; INSERT INTO t2 SELECT orderkey % 10000, (orderkey * 379) % 10 FROM tpch.sf1.lineitem;
Basically bucket will be in range from 0 to 10000, val1 in range 0 to 1000, val2 in range 0 to 10 and both tables will have 6,001,215 rows. Now what is going on is that following query:
SELECT count(*) FROM t1, t2 WHERE t1.bucket = t2.bucket;
Has to produce 720 million of rows. After adding t1.val1 < t2.val2 most of them must be filtered out so query:
SELECT count(*) FROM t1, t2 WHERE t1.bucket = t2.bucket AND t1.val1 < t2.val2;
returns only 29,120,090. The issue here is that for each row from the t1 table, there are around 600 rows in t2 that have same bucket value. Before Presto 0.174, all of those 600 rows had to be traversed and each row had to be filtered one by one. This was a huge inefficiency.
Since version 0.174 Presto has an option that addresses this particular issue. There is a session property called fast_inequality_join (enabled by default). When enabled, Presto will use t2.val2 value to sort all rows from t2 table that fall into the same bucket. Thanks to that, when there is an incoming row from t1 table, Presto can reach into matching bucket and instead of iterating over all 600 rows, perform a binary search for t1.val1 value.
For example if we have following rows in t2 table:
bucket | val2 --------+------ 42 | 0 42 | 5 42 | 2 42 | 9 42 | 7 42 | 8 42 | 1 42 | 4 42 | 3 42 | 6
and we want to match row from table t1:
bucket | val1 --------+------ 42 | 7
Previously Presto had to iterate over t2.val2 values:
0, 5, 2, 9, 7, 8, 1, 4, 3, 6
and filter out each on a case by case basis for matching to the predicate t1.val1 < t2.val2. Now Presto will sort them:
0, 1, 2, 3, 4, 5, 6, 7, 8, 9
quickly perform binary search of value 7 and iterate only over remaining matching values (7, 8, 9). This reduces time complexity of such joins from:
O(size(t1) * bucket_size)
O(size(t1) * (log(bucket_size) + number_of_matching_rows_in_bucket))
Where bucket_size is average number of rows from t2 table that have same t2.bucket value. When number_of_matching_rows_in_bucket is much smaller compared to bucket_size, we can simplify this to:
O(size(t1) * log(bucket_size))
This of course has tremendous effect on query times. Optimization enabled:
presto:default> set session fast_inequality_join=true; SET SESSION presto:default> SELECT count(*) FROM t1, t2 WHERE t1.bucket = t2.bucket AND t1.val1 < t2.val2; _col0 ---------- 29120090 (1 row) Query 20170530_085758_00130_asrts, FINISHED, 8 nodes Splits: 849 total, 849 done (100.00%) 0:01 [12M rows, 206MB] [13.9M rows/s, 238MB/s]
presto:default> set session fast_inequality_join=false; SET SESSION presto:default> SELECT count(*) FROM t1, t2 WHERE t1.bucket = t2.bucket AND t1.val1 < t2.val2; _col0 ---------- 29120090 (1 row) Query 20170530_072421_00115_asrts, FINISHED, 7 nodes Splits: 801 total, 801 done (100.00%) 0:08 [12M rows, 206MB] [1.51M rows/s, 26MB/s]
As we can see query completed 8 times faster with this optimization enabled! Of course speed up largely depends on your query shape and your data. The more selective is your inequality join condition the better case for this optimization.
Currently code in Presto that looks for the expression on which right table should be sorted is somehow limited and can resolve only simple inequality join conditions. Since Presto 0.177 you can check whether this optimization was triggered in the explain output:
presto:tiny> explain SELECT o.orderkey, o.orderdate, l.shipdate FROM orders o JOIN lineitem l ON l.orderkey = o.orderkey AND l.shipdate < o.orderdate + INTERVAL '10' DAY; - Output[orderkey, orderdate, shipdate] => [orderkey:bigint, orderdate:date, shipdate:date] - RemoteExchange[GATHER] => orderkey:bigint, orderdate:date, shipdate:date - InnerJoin[("orderkey" = "orderkey_0") AND ("expr_3" > "shipdate")] => [orderkey:bigint, orderdate:date, shipdate:date] SortExpression["shipdate"] - ScanProject[table = tpch:tpch:orders:sf0.01, originalConstraint = true] => [expr_3:date, orderkey:bigint, orderdate:date] expr_3 := ("orderdate" + "$literal$interval day to second"(BIGINT '864000000')) orderkey := tpch:orderkey orderdate := tpch:orderdate - LocalExchange[HASH] ("orderkey_0") => orderkey_0:bigint, shipdate:date - RemoteExchange[REPARTITION] => orderkey_0:bigint, shipdate:date - TableScan[tpch:tpch:lineitem:sf0.01, originalConstraint = true] => [orderkey_0:bigint, shipdate:date] orderkey_0 := tpch:orderkey shipdate := tpch:shipdate
As you can see, in the above example shipdate was chosen as a sorting expression. If Presto fail to determine sorting expression it will be also missing in explain output. Sometimes you might have to change your condition a bit. Although expressions l.shipdate - INTERVAL '5' DAY< o.orderdate - INTERVAL '5' DAY and l.shipdate < o.orderdate + INTERVAL '10' DAY are equivalent, the first one will currently not work with this optimization.
Next time when you stumble across query with inequality join condition keep in mind about this new feature.