Sunday, June 7, 2009

The Cache Didn't Do It (the real reasons Quicksort usually outperforms Mergesort in practice)

I keep reading statements like this: "When Quicksort is well implemented it will run faster on a modern desktop than Mergesort because of the processor's cache." But it isn't that simple.

When I posted a comment on the Wikipedia Heapsort discussion page back in 2005, saying that Heapsort had serious problems for large N on modern desktops because it didn't work well with modern cache architectures, I never thought that line of argument (which is valid for heapsort) would be extended to the comparison between Quicksort and Mergesort, where cache effects aren't usually so important.

There's an easy way to tell whether caching is an important factor: run the routines, time them, and plot a line graph with X = log2(N) and Y = N * log2(N) / RunningTime.
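
For concreteness, here's a rough sketch of that diagnostic in C. It's my own illustration, not the code behind the graph below: the sort_fn typedef, the libc qsort stand-in, the 2^10..2^24 size range and the POSIX clock_gettime timer are all assumptions you'd swap for your own routines and timer.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    typedef void (*sort_fn)(int *a, size_t n);   /* routine under test */

    static double seconds_now(void)              /* POSIX monotonic clock */
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    static void profile(sort_fn sort, const char *name)
    {
        for (int k = 10; k <= 24; k++) {         /* N = 2^10 .. 2^24 */
            size_t n = (size_t)1 << k;
            int *a = malloc(n * sizeof *a);
            for (size_t i = 0; i < n; i++) a[i] = rand();
            double t0 = seconds_now();
            sort(a, n);
            double secs = seconds_now() - t0;
            /* Y value for the graph: N * log2(N) / seconds / 1000000 */
            printf("%-12s log2(N)=%2d  Y=%8.1f\n", name, k, n * (double)k / secs / 1e6);
            free(a);
        }
    }

    static int cmp_int(const void *p, const void *q)
    {
        int a = *(const int *)p, b = *(const int *)q;
        return (a > b) - (a < b);
    }
    static void libc_sort(int *a, size_t n) { qsort(a, n, sizeof *a, cmp_int); }

    int main(void)
    {
        profile(libc_sort, "libc qsort");        /* substitute your own routines */
        return 0;
    }

Each data point here is a single run; averaging several runs per N would smooth out the "wiggles" mentioned below.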

If cache effects are important, you will be able to see it in the graph. The line plotted for a routine with a cache problem will visibly "dip" sharply when N*R (the total size of the data being sorted) crosses a cache-size boundary. Heapsort does. Mergesort doesn't. Here's a graph plotting N * log2(N) / RunningTimeInSeconds / 1000000 against N. Each data point is the result of a single execution of the sorting routine in question (which is why there are "wiggles" in the lines plotted here). Radix 2, 3 and 4 heapsorts are shown, as are a stock-standard best-of-3 quicksort and a stock-standard 2-way mergesort.


There are other factors that give quicksort an advantage over mergesort in practice.

  • A two-way mergesort requires considerably more record movement than quicksort (record movement takes time, even if every memory access is a cache hit). A mergesort requires ~1*N*log2(N) record copies, whereas a best-of-3 quicksort requires only ~0.24*N*log2(N) record exchanges (see the counting sketch after this list).

  • Typical implementations of mergesort require more boundary checks. A well-tuned quicksort needs approximately the same number of boundary checks as record exchanges (~0.24*N*log2(N)); mergesorts often require ~N*log2(N).
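
For anyone who wants to check those constants rather than take them on faith, here is the counting sketch referred to above. It's my own code, not the implementation behind the graph: it instruments a two-buffer ("ping-pong") 2-way mergesort with a copy counter and a best-of-3 quicksort with an exchange counter, then prints each count as a multiple of N*log2(N). On uniformly random keys the mergesort ratio comes out at about 1.0; the quicksort ratio should land near the figure above, though a little higher here because the exchanges made while selecting the median-of-3 pivot are counted too.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    static long long copies, exchanges;

    /* Merge src[lo..mid) and src[mid..hi) into dst[lo..hi), counting copies. */
    static void merge_runs(const int *src, int *dst, int lo, int mid, int hi)
    {
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi) dst[k++] = (src[i] <= src[j]) ? src[i++] : src[j++];
        while (i < mid) dst[k++] = src[i++];
        while (j < hi)  dst[k++] = src[j++];
        copies += hi - lo;
    }

    /* 2-way mergesort that ping-pongs between two buffers, so each level of
       merging copies every record once: ~N*log2(N) copies in total.
       Precondition: a[lo..hi) and b[lo..hi) hold the same records.
       Postcondition: a[lo..hi) is sorted. */
    static void merge_sort(int *b, int *a, int lo, int hi)
    {
        if (hi - lo < 2) return;
        int mid = lo + (hi - lo) / 2;
        merge_sort(a, b, lo, mid);        /* the sorted halves end up in b[] */
        merge_sort(a, b, mid, hi);
        merge_runs(b, a, lo, mid, hi);    /* merge them back into a[]        */
    }

    static void exchange(int *x, int *y) { int t = *x; *x = *y; *y = t; exchanges++; }

    /* Best-of-3 quicksort; sorts a[lo..hi] inclusive. */
    static void quick_sort(int *a, int lo, int hi)
    {
        while (hi > lo) {
            int mid = lo + (hi - lo) / 2;
            if (a[mid] < a[lo]) exchange(&a[mid], &a[lo]);   /* median-of-3 pivot */
            if (a[hi]  < a[lo]) exchange(&a[hi],  &a[lo]);
            if (a[hi]  < a[mid]) exchange(&a[hi], &a[mid]);
            int pivot = a[mid], i = lo, j = hi;
            while (i <= j) {                                 /* Hoare partition */
                while (a[i] < pivot) i++;
                while (a[j] > pivot) j--;
                if (i <= j) exchange(&a[i++], &a[j--]);
            }
            if (j - lo < hi - i) { quick_sort(a, lo, j); lo = i; }  /* recurse on the */
            else                 { quick_sort(a, i, hi); hi = j; }  /* smaller side   */
        }
    }

    int main(void)
    {
        const int n = 1 << 20;
        int *a = malloc(n * sizeof *a), *b = malloc(n * sizeof *b), *c = malloc(n * sizeof *c);
        for (int i = 0; i < n; i++) a[i] = b[i] = c[i] = rand();
        merge_sort(b, a, 0, n);
        quick_sort(c, 0, n - 1);
        double nlog2n = n * log2((double)n);
        printf("mergesort copies:    %lld = %.2f * N*log2(N)\n", copies, copies / nlog2n);
        printf("quicksort exchanges: %lld = %.2f * N*log2(N)\n", exchanges, exchanges / nlog2n);
        free(a); free(b); free(c);
        return 0;
    }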


But both of these factors can be addressed.

  • Use three-way or four-way merging. In practice, 3-way merging will do: it reduces the average number of moves by a bit less than 37%, though the worst-case number of comparisons is about 26% worse than for a two-way mergesort. If the worst case is important, use 4-way merging, where the number of moves is half that of a two-way mergesort but the number of comparisons is approximately the same.

  • Use forecasting: for each merge, determine which of the inputs being merged will run out first. If you know that, boundary checks need only be performed when moving records from the input that will run out first (a sketch follows this list).
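
As a concrete illustration of the forecasting idea, here is a sketch of a forecast 2-way merge (again my own code, not the author's): it compares the last keys of the two runs to predict which run will be exhausted first, and only that run's boundary is tested inside the merging loop.

    #include <stdio.h>
    #include <stddef.h>

    /* Merge the sorted, non-empty runs a[0..na) and b[0..nb) into out[].
       Ties are broken arbitrarily here; stability is not addressed. */
    static void merge_with_forecast(const int *a, size_t na,
                                    const int *b, size_t nb, int *out)
    {
        /* Forecast: the run with the smaller last key runs out first.
           Swap the roles so that run is always `a`. */
        if (a[na - 1] > b[nb - 1]) {
            const int *tp = a;  a = b;   b = tp;
            size_t     tn = na; na = nb; nb = tn;
        }
        const int *aend = a + na, *bend = b + nb;

        /* Only a's boundary is tested: b cannot run dry before a does,
           because b's last key is >= every remaining key of a. */
        while (a < aend) {
            if (*b < *a) *out++ = *b++;   /* move from b: no end-of-run check */
            else         *out++ = *a++;
        }
        while (b < bend)                  /* a is done; copy the tail of b */
            *out++ = *b++;
    }

    int main(void)
    {
        int left[]  = { 1, 4, 7 };
        int right[] = { 2, 3, 9, 11 };
        int out[7];
        merge_with_forecast(left, 3, right, 4, out);
        for (int i = 0; i < 7; i++) printf("%d ", out[i]);   /* 1 2 3 4 7 9 11 */
        printf("\n");
        return 0;
    }

The same trick carries over to 3-way and 4-way merging, provided ties are resolved in favour of the run forecast to finish first.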


Yes, the cache does matter. And yes, it does give Quicksort an edge over most flavours of Mergesort. On my four-core machine, for example, quicksort's cache advantage starts to become significant once the sort is being carried out on two or more processors. But it's just one factor, and it is less important than either record movement or boundary-checking overhead.
