Thursday, June 25, 2009

Heapsort - Floyd's Trick - part 2 - Sorting Strings

An aside: I would have posted this one sooner, but I was running into an unpleasant problem due to a bug in a do-it-myself string class. Yes, the world has too many string classes - it probably already had one too many when the second one was written, perhaps even when the first was written - and mine was as buggy, or more so, than most. I know that writing my own string classes is stupid but I loathe std::string. It's soooo bloated, and with methods, not free functions. But still, that doesn't justify my writing my own string classes. Thinking it does, even temporarily, is a good example of the tu qoque phallacy. I have repented now! But the string class is already written. Someday, I'll compound my sin by posting some string classes too.

First, some preliminary comments about sorting strings. In later posts I shall be writing some posts about sort algorithms tailored specifically for sorting strings (e.g. Burstsort, in its various flavours, and B-Tree sorts), where prefix-extraction, tries, and the like, will come into play. But for now, the focus will be on Heapsorts.

Two-way mergesort is generally considered to be the reference record-comparison array sort for sorting strings. That's because, when sorting pointers to strings (which is often what happens in C++ and also, under the hood, in languages like Java), comparisons tend to be relatively expensive compared to record copies or exchanges. And fair enough too. String pointers may also be a good proxy for pointers to some more complex types and classes (when multiple members of those types/classes are being used as the sort key).

Asymptotically, a two-way top-down Mergesort with a randomly ordered input where all the values are distinct, requires N*log(N)/log(2) comparisons, on average (and the best case is closer to half that).

When sorting string pointers, with a single-threaded record comparison sort, the second most important consideration is minimizing the number of comparisons. Minimizing the number of cache misses is actually rather more important, and later on that consideration will lead to some bizarre sorting algorithms. But not yet.

I covered how Floyd's trick is actually implemented in an earlier post. As I remarked there, it reduces the asymptotic average number of comparisons required for a radix M heapsort from M.log(N)/log(M) to (M-1).log(N)/log(M). Using those formulae, and assuming that comparisons are what count, and using the performance, X, of mergesort as the reference, we can try to predict sort routine performance:

  • For Radix-2 Heapsort with Floyd's trick: 1.0*X

  • For classic Quicksort: 1/2/ln(2) ~= 0.72*X

  • For Radix-2 Heapsort without Floyd's trick: 0.5*X

  • For Radix-4 Heapsort with Floyd's trick: 2/3 ~= 0.67*X

  • For Radix-4 Heapsort without Floyd's trick: 0.5*X



But it ain't so!





As you can see, from the two graphs, merely counting comparisons doesn't lead to a good prediction for the running time. Counting comparisons and cache misses works better. Even rough approximations of the cache miss count lead to reasonably good predictions for the running time.

  • For most of the comparisons that take place during a mergesort, one of the records being compared was compared immediately before-hand. The access pattern in the array itself is predictable, and, asymptotically, half of the string look-ups will be cache hits.

  • Quicksort likewise (though Quicksort does even better than I would have
    expected, and I don't have any explanation for that yet)

  • String look-ups near enough to the top of the heap are cache hits. When the element being sifted down into the heap is being compared "on the way down" both string lookups are cache hits. Given that cache hits are what count, these comparisons are much cheaper than the others, low in the heap. That's the reason that Floyd's trick doesn't improve the performance of radix-2 heapsort much. On the other hand, the access pattern in the array itself is not predictable, so array element look-ups also result in cache misses.

No comments:

Post a Comment