Saturday, July 25, 2009

Zonesort - Yet another variant of mergesort

I've been on holiday in Thailand, so I haven't been posting. While I was in Thailand, in between looking at orchids, eating fantastic seafood, and getting short-changed everywhere I went, I worked a little on a variant of mergesort ("Zoned Mergesort") that requires a storage overhead of approximately

2 * sqrt( N.R.I.M.P )

where N is the number of records, R is the size of each record, I is the size of an integer, M is the order of merge, and P is the number of concurrent threads (mergesort overhead is usually something like N.R / 2 which is a lot more, for large enough N). It's an old idea of mine: I first wrote an implementation in 2004 (though only for M=2, P=1) but shelved it because it was considerably (typically 20%) slower than a standard mergesort. But it turned out that some minor rewrites to better exploit forecasting let it compete on even terms with the performance of standard mergesort, with P=1 and M=2. So I extended it to M=3, where it did even better. But I still haven't parallelized it properly (and that will take time, because it's a fairly complicated sorting routine).

Zoned Mergesort was derived, in a round-about way, from an idea of
Alexander Kronrod (see Knuth's Art of Programming, volume 3): merging dividing arrays of N records into approximately sqrt(N) "zones" of approximately sqrt(N) records. It works like ths:
  1. Divide the input array into X zones of Y records each (where X is approximately sqrt(N*I/R/M/P))

  2. Allocate storage for M*P*Y additional records.

  3. Allocate storage for (M*P+X) integers. These will be used to maintain linked lists of sorted zones.

  4. Sort each zone, using the additional zones as a scratch area, and put each zone's index into a one element-list

  5. Start P separate lists, each of M zones, chosen from the M*P additional zones (these are "available zone" lists

  6. Assemble linked lists of sorted zones by merging and relinking (I'll explain this step in more detail later)

  7. When you get to one linked list of zones, permute the order of the zones so that the entire array is in order

The time overhead for steps 6 and 7 turns out not to be a serious problem. The linked list manipulation in step 6 has O(X) overhead, and can be pretty much ignored when N is large. The permutation in step 7 is rather more expensive but is still O(N) on average. In practice, when P=1 at least, for large N, the permutation cost is outweighed by the improved cache hit rate on writes (almost all writes end up being cache hits). The cache hit rate is better than that of a standard mergesort due to how the zone list merging works (which I'll outline for M=2):
  1. Remove a destination zone from the current thread's list of available zones

  2. Merge records from the zone at the head of the left list, and the zone at the head of the right list, into the destination zone, until either the destination zone is exhausted, or the left zone is exhausted, or the right zone is exhausted

  3. If the destination zone is exhausted, add it to the tail of the output list, and remove a zone from the available zone list, to be the current destination zone

  4. If the left hand zone is exhausted, move on the left zone list, so that the next zone in the list becomes the left input zone, and put the just-exhausted zone into the available zone list

  5. As for step 4, but replacing "left" with "right"

  6. If neither the left zone list or the right zone list is exhausted, go back to step 2

  7. Copy records from the remaining zones in the input that hasn't been exhausted, until it runs out (allocating new destination zones and releasing exhausted input zones as above).

The cache benefit comes from the fact that, during most of the zone list merging phase, the current destination zone was almost always in use, recently, as an input zone. It turns out that you may have to read X.M-X+1 records before the first input zone is exhausted, which is why there need to be M "extra" zones in the initial available zone list for each thread.

There's some extra fiddling around to take account of the fact that the last zone will usually be smaller than the others (since N is unlikely to zero modulo Y), but nothing serious.

The big problem, though, is that parallelizing the last few zone list merges is painfully difficult. I've only got part of the way through it and I think it'll take me about six hours solid work to write even a clunky fully-parallel version. No fun at all!