Friday, June 26, 2009

Daily "It Ain't So" - Wikipedia's Mergesort article

For some reason blogger has decided this was published 29-Jun-2009. Which ain't so. It's actually 03-Jan-2011 as I type this!

Today's It-Ain't-So is for Wikipedia's Mergesort article (as it is today, not as it might be down the track).
  • A "mergesort" using lists is not an in-place sort. Either the pointer is part of the record, in which case the list mergesort isn't sorting the elements, it's modifying them; or the pointer is not part of the record, in which case the mergesort isn't in-place.
    [that's how the ancient Greeks would have argued it, and they would have been right]

  • It is not quite true that "mergesorting" lists is, in performance terms, no better than heapsort. It is true if swapping values between nodes is not allowed: for a randomly-ordered input, the order of the node addresses becomes increasingly jumbled as the sort proceeds, leading to cache misses. But if swapping values between nodes is allowed, and the input initially consists of in-address-order nodes adjacent in memory, it's possible (and not very difficult) to rewrite the nodes in the sorted sublists after every, say, 8 levels of merging, avoiding the cache-miss problem and beating heapsort comfortably for large n. I've got some code for doing that buried somewhere (it's about ten years old and I haven't used it in about eight years, and I don't remember whether it "cheated" and used additional storage for the rewrite; maybe it did).

  • If you can use "link fields" for mergesort (and it's still a mergesort?) then you can also use them for quicksort (and hence, quicksort can be stable).

  • Mergesorting linked lists does not require only O(1) space overhead: it requires O(log n) space to record (at each of the O(log n) merging levels) which node is the head of the first sorted sublist while the second sublist at the same level is being sorted, unless swapping values between nodes is allowed, or adjacent sorted sublists are "taped together" and you're happy to spend O(n log n) time finding the start of the right-hand sublist before each merge begins. I've never seen either of those techniques used; all the linked-list mergesorts I've ever seen had a p·log2(n) space overhead (on top of the n·p space overhead for the link fields), where p is the size of a link.

  • While it's true that heapsort won't work on linked lists, there is a close analogue of heapsort that works on priority trees. As I understand it, heapsort was originally developed from a priority-tree algorithm.

  • While it's technically true to say that a self-balancing binary search tree is better than an on-line mergesort when "the received pieces are small compared to the sorted list", there's an invalid assumption wired into that statement. Why would there be one sorted list, rather than ~O(log m) sorted sublists, where m is the number of elements that have been "fed" to the on-line sort routine so far, with the (i+1)th sublist at most half the length of the ith? That's the explicit stack used in a straight mergesort co-routine.
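The rewrite trick in the second bullet can be sketched roughly as follows (a hypothetical illustration, not the ten-year-old code mentioned above, which may have worked differently; here an array stands in for the node pool, so "address order" means slot-index order). After some levels of merging, each sorted run's values are copied back into the run's own slots in ascending address order and relinked address-forward, so subsequent merges traverse memory sequentially. This version openly uses O(run length) temporary storage for the rewrite.

```python
def repack(pool, head):
    """Rewrite one sorted run in-place within its own slots.

    'pool' is a list of (value, next_index) pairs simulating a node pool;
    'head' is the slot index of the run's first node. After the call, the
    run's values occupy the same set of slots, but in ascending slot
    (i.e. address) order, with links running address-forward - so a later
    merge pass walks memory sequentially instead of chasing jumbled
    pointers. Uses O(run length) temporary storage.
    """
    slots, values = [], []
    i = head
    while i is not None:               # values arrive in sorted order
        slots.append(i)
        values.append(pool[i][0])
        i = pool[i][1]
    slots.sort()                       # ascending addresses
    for k, s in enumerate(slots):
        nxt = slots[k + 1] if k + 1 < len(slots) else None
        pool[s] = (values[k], nxt)     # k-th smallest value -> k-th lowest slot
    return slots[0]                    # new head: the lowest address
```

A list mergesort using this would call repack on every sorted run after each batch of merge levels, paying an O(n) pass to restore locality for the next batch.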
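On the link-fields point above: here's a hypothetical sketch (mine, not from the article) of a quicksort on a singly linked list that uses the link fields for partitioning. Because nodes are appended to the less/equal/greater sublists in traversal order, relative order is preserved within each partition, so the sort is stable; the Node class and first-element pivot choice are illustrative assumptions.

```python
class Node:
    """Minimal singly-linked-list node; 'nxt' is the link field."""
    def __init__(self, key, nxt=None):
        self.key = key
        self.nxt = nxt

def list_quicksort(head):
    """Stable quicksort on a singly linked list.

    Three-way partition around the first node's key, appending nodes to
    the less/equal/greater sublists in traversal order; this preserves
    the relative order of equal keys, which array quicksort does not.
    """
    if head is None or head.nxt is None:
        return head
    pivot = head.key
    less_h = less_t = eq_h = eq_t = gt_h = gt_t = None
    node = head
    while node is not None:
        nxt = node.nxt
        node.nxt = None
        if node.key < pivot:
            if less_t is None:
                less_h = less_t = node
            else:
                less_t.nxt = node
                less_t = node
        elif node.key > pivot:
            if gt_t is None:
                gt_h = gt_t = node
            else:
                gt_t.nxt = node
                gt_t = node
        else:
            if eq_t is None:
                eq_h = eq_t = node
            else:
                eq_t.nxt = node
                eq_t = node
        node = nxt
    less_h = list_quicksort(less_h)
    gt_h = list_quicksort(gt_h)
    # Concatenate less + equal + greater.
    dummy = Node(None)
    tail = dummy
    for part in (less_h, eq_h, gt_h):
        while part is not None:
            tail.nxt = part
            tail = part
            part = part.nxt
    return dummy.nxt
```

The usual caveat applies: a first-element pivot degrades to O(n^2) on already-sorted input, so a production version would pick the pivot more carefully.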
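The "explicit stack" in the last bullet can be sketched like this (a hypothetical illustration, not code from the article): elements are fed in one at a time, and the structure maintains a stack of sorted runs whose lengths are distinct powers of two, merging equal-length runs exactly the way a binary counter carries, so after m elements there are at most ~log2(m)+1 runs, each at most half the length of the one below it.

```python
def merge(a, b):
    """Stable merge of two sorted lists (ties prefer elements of a)."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:
            out.append(a[i])
            i += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out

class OnlineMergesort:
    """On-line mergesort keeping a stack of sorted runs.

    Run lengths on the stack are distinct powers of two (largest at the
    bottom), mirroring the binary representation of the element count m,
    so the stack never holds more than log2(m)+1 runs.
    """
    def __init__(self):
        self.runs = []

    def feed(self, x):
        run = [x]
        # Carry step: merge equal-length runs, like incrementing a
        # binary counter one bit at a time.
        while self.runs and len(self.runs[-1]) == len(run):
            run = merge(self.runs.pop(), run)
        self.runs.append(run)

    def result(self):
        """Merge all pending runs (oldest first, keeping stability)."""
        out = []
        for run in self.runs:
            out = merge(out, run)
        return out
```

Against the article's claim: a query between feeds only has to merge the pending runs, and most of the work has already been amortised into the carries.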
And now for the discussion page that hangs off the article...
  • Where I said that Quicksort's real advantage over Mergesort comes down to the amount of record movement (July 2005), I was, well, wrong. I should have qualified that statement a lot more: a Quicksort with a two-handed partitioning subroutine will beat a (binary) Mergesort (even one tuned to do fewer boundary checks than Quicksort) if and only if element moves are significantly more expensive than element comparisons.
  • Asturneresquire (August 2007) is wrong: Quicksort does not make significantly better use of the cache than Mergesort. It might at first seem that Mergesort will put more of a load on the cache because it is accessing more memory, but it isn't that simple: Mergesort accesses each address fewer times on average. The cache (on, say, an i7) easily keeps up with both Mergesort and Quicksort. It isn't even "cracking a sweat".
