Thursday, January 6, 2011

Tuning Mergesorts for better branch prediction

Yesterday, I decided to play around with a forecasting binary mergesort.

Some background: way back in about 1999, when I was first experimenting with ternary mergesorts, I had noticed that the best "division" of the list into sublists was not a 1:1:1 ratio. On the Pentium II that I had then, the best "division" (for sorting 32-bit integers, anyway) seemed to be 3:4:5. I filed that fact away but didn't think about it again until yesterday.

It occurred to me that, during a merge, it's possible to rig things so that, in the main loop, the smaller of the two lists always runs out first. (Simple enough: if the smaller list would otherwise run out last, find where the last element of the larger list would go in the smaller list; everything in the smaller list beyond that point can be set aside, because it belongs after the end of the larger list.) The point is that, if the smaller of the lists makes up a proportion p, 0<p<0.5, of the n elements being merged, then the number of boundary checks will be about np (there's a sketch of the merge loop below). Furthermore, the element comparisons themselves become more predictable, because, on average, the element from the smaller list will be the lesser one with probability p. There would, I reasoned, be some value of p below 0.5 for which performance would be better, because, asymptotically, the sort would require

moves (m):        n.log2(n) / -(p.log2(p) + (1-p).log2(1-p))
comparisons:      O(n) fewer than the moves
boundary checks:  p.m
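
Here's a rough sketch of the kind of merge loop I have in mind, for 32-bit ints, just to make the np figure concrete. (It's illustrative only, not the tuned code I'll be posting, and it assumes both runs are non-empty with the first run the shorter one.) The outer loop is the only place the shorter run gets a boundary check, and the inner loop needs no check at all, because the peeled-off tail guarantees the longer run can't be exhausted early:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Merge the shorter (left) run a[0..na) with the longer run b[0..nb), both
 * already sorted, into out[0..na+nb).  The "forecast" step peels off the
 * tail of a that sorts after all of b, so that the main loop bounds-checks
 * only the shorter run - and only once per element taken from it. */
static void forecast_merge(const int32_t *a, size_t na,
                           const int32_t *b, size_t nb,
                           int32_t *out)
{
    /* Find k so that a[0..k) interleaves with b and a[k..na) comes after b. */
    size_t k = na;
    while (k > 0 && a[k - 1] > b[nb - 1])
        k--;
    memcpy(out + nb + k, a + k, (na - k) * sizeof *a);

    size_t i = 0, j = 0, o = 0;
    while (i < k) {                  /* roughly p*n boundary checks in total */
        int32_t cur = a[i++];
        while (b[j] < cur)           /* no check on j: b[nb-1] >= cur */
            out[o++] = b[j++];
        out[o++] = cur;              /* the shorter run "wins" ~p of the time */
    }
    memcpy(out + o, b + j, (nb - j) * sizeof *b);   /* the rest of the longer run */
}

A real implementation would find k with a binary search rather than a linear scan, but the scan keeps the sketch short.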

My guess was that the best value of p (for sorting integers, anyway) would be about 0.4, with about a 2% performance improvement (because I thought that boundary checks in a non-forecasting "stock" mergesort contribute about 24% of the total running time). But it wasn't so. The best value of p appears to be about 0.1875, with a 5% performance improvement. That shocked me, because plugging p=0.1875 into the formulae above yields...

moves: 1.436 n.log2(n) (contrast: the coefficient is 1, for p=0.5)
checks: 0.287 n.log2(n) (the coefficient is 0.5, for p=0.5)
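
For what it's worth, the 1.436 is just the reciprocal of the entropy of the split. Handwaving a little: the moves obey the recurrence M(n) = M(p.n) + M((1-p).n) + n, which solves asymptotically to M(n) = n.log2(n) / H(p), with H(p) = -(p.log2(p) + (1-p).log2(1-p)); H(0.1875) is about 0.696, and 1/0.696 is about 1.436.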

The only plausible idea I could come up with to explain that was: the lower p, the more predictable the results of the comparisons (asymptotically, p is the branch misprediction rate).

Unfortunately, the best value for p depends on the element type.

I also looked at asymmetric quicksorts (e.g. taking the 2nd of 4 elements from a sample, deliberately choosing a pivot value that won't divide the input evenly); if there's any advantage at all - which I doubt - it is very small. I'll post the (single-threaded) source code and some performance plots, for the asymmetric mergesort, in my next post. I also want to have another go at asymmetric ternary mergesort (it seems to me that the ratio 1:2:4 ought to do well on average).
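
In case that parenthesis is cryptic, this is the sort of pivot selection I mean - a throwaway sketch, not the code I'll post, and the sample positions are arbitrary. Taking the 2nd smallest of a 4-element sample puts the pivot below the median on average, so the expected split on random input is roughly 2:3 rather than 1:1:

#include <stddef.h>
#include <stdint.h>

/* Deliberately off-centre pivot: the 2nd smallest of a 4-element sample.
 * Assumes hi is one past the end and hi - lo >= 4. */
static int32_t asymmetric_pivot(const int32_t *v, size_t lo, size_t hi)
{
    size_t n = hi - lo;
    int32_t s[4] = { v[lo], v[lo + n / 3], v[lo + (2 * n) / 3], v[hi - 1] };

    /* Partial selection sort: put the two smallest sample values in s[0], s[1]. */
    for (int i = 0; i < 2; i++)
        for (int j = i + 1; j < 4; j++)
            if (s[j] < s[i]) { int32_t t = s[i]; s[i] = s[j]; s[j] = t; }

    return s[1];
}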

I haven't given any thought yet to whether there are similar tweaks available for shellsort and combsort (trading some extra "work" for better branch prediction).
