SQL Server : Nested Loop vs Merge Join vs Hash Match

This article is half-done without your Comment! *** Please share your thoughts via Comment ***

When we request a new query the SQL Server Optimizer choose which logical Join implement, it can choose a different algorithm based on statistics, indexes, number of estimated rows, etc…

There are three differents Join operator: Nested Loop, Merge Join and Hash Match.

Nested Loop

This Join compares each row from the Outer table to each row from the Inner table looking for the rows which satisfy the Join predicate.

SELECT A.*, B.*
FROM Authors AS A
JOIN Books AS B
ON A.Id = B.Id
WHERE A.ReleaseDate Between '2015-01-01' and '2016-01-01'

The total number of rows compared and the cost of this algorithm is proportional to the size of the outer table multiplied by the size of the inner table.

The SQL Server Optimizer will prefer this operator when the outer input is small and the inner input has an index on the column used in the Join key.

The benefits of this operator compared to all others operator will be bigger when the difference in number of rows between the outer and the inner will be much bigger.

Merge Join

The Merge algorithm is the most efficient way to join between two very large sets of data which are both sorted on the join key.

SELECT A.*, B.*
FROM Authors as A
JOIN Books as B
ON A.Id = B.Id

The merge join requires that both inputs be sorted on the merge columns, for this reason the SQL Server Optimizer scan the index on the column of the key Join if exist or place a sort operator.

Merge Join can be an expensive choice if sort operations are required, but if the data volume is large and the desired data can be obtained presorted from existing indexes, merge join is often the fastest available Join algorithm.

The mechanism of this operator is very simple, it reads a row from each inputs and compares them using the Join key. If there’s a match, they are returned. Otherwise, the row with the smaller value can be discarded because both inputs are sorted, this repeats until one of the tables is completed.

In most cases the SQL Server Optimizer choose the Hash Join instead of Merge Join when the inputs are not both sorted on the Join key.

Hash Match

The SQL Server Optimizer chooses this operator when all others operator cannot be used for example when there are no indexes or when the tables are not properly sorted, etc…

In this query we haven’t index on Id column so the SQL Server Optimizer chooses the Hash Join:

Select A.*, B.*
From Authors as A
Join Books as B
On A.Id = B.Id
Where A.Name Like '%Fabio%'

For understanding this operator is important to know what is Hashing and Hash Table.
The first is a function which takes values and converts them to a single symbolic value. You can’t convert the symbolic value to its original value, but you have the garantee that the same value will always get the same symbolic value (different inputs values can have the same symbolic value).
The second element define a data structure that divides all rows in buckets of equal size and each buckets store a hash value, in most cases the performance of this type of structure are very fast.

The Hash Join has two inputs: the build input and probe input. The query optimizer assigns these roles so that the smaller of the two inputs is the build input because this operator can have a very costly execution plan when the tables are very large.

SQL Server Optimizer uses available statistics to figure out the smaller input. The selected input will be used to build the hash table in memory, but it there isn’t enough memory the hash table will be stored in TempDB on physical disk. After that, SQL Server get the data from the larger table, using the probe input, compare it to the hash table with hash match function and return the matched rows.

Conclusion

SQL Server Optimizer determines the best, optimized plan for executing the query.
The optimizer evaluate the indexes, number of rows, type of joins, etc…

Nested loop is generally chosen when table is significantly small and the larger table has an index on the join key. The complexity is O(NlogM).

Merge Join is considered to be more efficient for large tables with the keys columns sorted on the Join key and an equality operator is used. The complexity is O(N+M).

Hash Match is also better suited for large tables, it uses a hash table and hash match function to match rows. This requires less I/O, but need more CPU and requires lot of memory. The complexity is O(N*hc+M*hm+J).

Tips

In some case you can choose manually the Join operator by using the OPTION hint.
A good thing to do is update statistics to fix your indexes, this helps the optimizer to choose the right Join.

Thursday, September 17, 2020