Thursday, August 20, 2009

Zonesort mostly parallelized, but... it's a pyrrhic victory

Well... I have finally parallelized Zonesort (just about, I still need to do some extra work to deal with degenerate - that is, almost sorted or almost reverse-sorted input - cases properly). But there's a problem. It needs better load balancing. And I can't see a good way to do that. A single gantt chart illustrates the symptoms neatly.

This is a graph of CPU utilization (color, red is 100%, white is 0%) by time (across) by worker thread number (down), for a four thread Zonesort using 3-way merging (except for merging the last four zone lists), while it is sorting 9,395,322 distinct integers in about 0.32 seconds.

I've finally parallelized the last few merge operations (though I had to increase the space requirements by a factor of sqrt((M+1)/M) to achieve that), as well as the zone shuffling operation, but there's a load balancing problem earlier on in the sort, making - in this case, which is typical - the overall running time about 6% worse than it should be. If I could fix that it would still be considerably slower (by at least 10%) than a stock-standard 3-way mergesort on the same input. So I don't think I'll bother.

The other sorting routines I've ever parallelized never had significant per thread storage requirements, the way this one seems to. I can't share available zone lists between threads easily (and, no, serializing access to a global available zone list doesn't strike me as a good idea at all; that would mean altogether too much synchronization), so for a P-threaded sort I'm dividing the input into P pieces and sorting each piece independently. But some cores - and some threads - run faster than others, and my heroic (for which, read: stupid) assumption, that those P independent sorts would each run in about the same time, is clearly wrong.

I'm already using "barriers" in the last few merges and for the final zone shuffle (the time wasted by those barriers is visible in the gantt chart, and a barrier always means an idle thread), but at least I'm not doing any explicit thread control in the sort, and I'm not throwing away much extra memory (an important consideration when the whole point of this sorting routine is to save space compared to a standard merge sort) to co-ordinate threads running in parallel (so far I've been using up only about 200*P bytes on per-thread storage).

I'll post the parallel zoneSort once I've ironed out the bugs that stop it working for nearly-in-order or nearly-reverse-order inputs.