Monday, June 8, 2009

Tuning mergesorts for speed, part 2

But, back to tuning mergesort... In a three-way mergesort, the input is divided into three parts, each of those parts are sorted separately, by recursively calling the mergesort, and then the three sorted parts are merged.

The big advantage is that there is less record movement. Just as in a standard 2-way mergesort, each record is copied once at each level of recursion, but there are fewer levels of recursion. Instead of log(n)/log(2) there are log(n)/log(3) levels of recursion, and since log(2)/log(3) is (roughly) 0.63, this means that there is (roughly) 37% less record movement.

But there's a price.

During the merging phase, in the worst case, it may take two comparisons to decide from which part the next record will come, so the number of comparisons is 2*n*log(n)/log(3), rather than n*log(n)/log(2): in the worst case a three-way mergesort requires approximately 26% more record comparisons than a two-way mergesort.

The average case is better, though. If we know the ordering of the first element in each of the three parts being merged, and we move the one that comes first, we "still know" the ordering of the elements from the other two parts, and, assuming the parts have elements drawn from the same distribution, we need slightly fewer (5/6) comparisons on average. There is a 1/3 chance that the next element to go to the output will come from the same place the *last* one did, and in that case we only need one comparison. 1/3 * 1 + 2/3 * 2 = 5/3. So, if we keep track of the ordering of the "next" element to be fetched from each part, we need, on average, 5/3*n*log(n)/log(3) comparisons: about 5% more than a two-way mergesort.

In a language that has GOTO statements (as C++ does!) the easiest place to store the information (about relative order of the next element from each partition) is in, of all places, the program counter.

Warning! From here on the code is ugly, and gets uglier.

template <class T>void mergeRadix3External(T *a,T *aStop,T *b,T *bStop,T *c,T *cStop
                         , T *w)
{
  if (*a<=*b)      
    if (*b<=*c)    
      goto xABC;
    else if (*a<=*c) 
      goto ACB;
    else 
      goto CAB;  
  else
    if (*a<=*c)
      goto BAC;
    else if (*b<=*c)
      goto BCA;

CBA:
  do
  {
    *w++=*c++; if (c==cStop) { mergeRadix2External(a, aStop, b, bStop, w); return; }
  } while (*c<*b);
  if (*c<*a) goto BCA;

BAC:
  do
  {
    *w++=*b++; if (b==bStop) return mergeRadix2External(a, aStop, c, cStop, w);
  } while (*b<*a);
  if (*b<=*c) goto xABC;
ACB:
  do
  {
    *w++=*a++; if (a==aStop) { mergeRadix2External(b, bStop, c, cStop, w); return; }
  } while (*a<=*c);
  if (*a<*b) goto CAB; else goto CBA;

BCA:
  do
  {
    *w++=*b++; if (b==bStop) { mergeRadix2External(a, aStop, c, cStop, w); return; }
  } while (*b<=*c);
  if (*b<*a) goto CBA;
CAB:
  do
  {
    *w++=*c++; if (c==cStop) { mergeRadix2External(a, aStop, b, bStop, w); return; }
  } while (*c<*a);
  if (*c<*b) goto ACB;
xABC:
  do
  {
    *w++=*a++; if (a==aStop) { mergeRadix2External(b, bStop, c, cStop, w); return; }
  } while (*a<=*b);  //B A?C
  if (*a<=*c) goto BAC; else goto BCA;
}

template <class T>void mergesortRadix3External(T *src, int count, T *dest,void (*delegatedSort)(T *,int)
                            ,int cutOver, bool bSrcIsInput)
{
  if (count<cutOver) { if (bSrcIsInput) jcopy(src,count,dest); return delegatedSort(dest, count); }
  int listSize = count/3;
  T *a=src;  T *aStop=src+listSize; 
  T *b=aStop; T *bStop=b+listSize;
  T *c=bStop; T *cStop=src+count;
  T *w=dest;

  bSrcIsInput = !bSrcIsInput;
  mergesortRadix3External( dest+(listSize<<1), count-(listSize<<1), src+(listSize<<1), delegatedSort, cutOver, bSrcIsInput);  
  mergesortRadix3External( dest+listSize,      listSize,        src+listSize,    delegatedSort, cutOver, bSrcIsInput);
  mergesortRadix3External( dest,         listSize,        src,        delegatedSort, cutOver, bSrcIsInput);
  mergeRadix3External(a,aStop,b,bStop,c,cStop,w);
}

template <class T> void mergesortRadix3Delegate(T *a, int count,void (*delegatedSort)(T *,int)=NULL
                        ,int cutOver=32)
{
  if (delegatedSort==NULL) delegatedSort=insertionSort<T>;
  T* a2 = new T[count];  //allocate a copy
  mergesortRadix3External(a2, count, a, delegatedSort, cutOver, false);
  jfree(a2);
}

template <class T> void mergesortRadix3(T *src, int count)
{
  mergesortRadix3Delegate(src, count, insertionSort<T>, 32);
}


Using classic quicksort's running time as the reference, we have (for sorting 20 million integers on my home PC):
Classic quicksort's running time = 1
Standard Mergesort's running time = 1.30
Forecasting two-way Mergesort's running time = 1.13
Three-way Mergesort's running time = 1.06

Strangely enough, you can get three-way mergesort to run a little faster if you split the current sub-array into sub-arrays that are of unequal size. The simplest way to demonstrate that, with the routine shown above, is to change the way listSize is calculated. Just setting listSize=count/4 results in a slight speed up (lowering the running time to about 1.03). In theory, tweaking the the merge routine slightly, to "prefer" comparing A and B with each other, rather than with C, should help too, but in practice it doesn't seem to make much, if any, difference.

Already Quicksort's advantage is starting to look decidely shaky, and we haven't used forecasting or examined four-way merging yet. The catch, though, is introducing forecasting to three-way mergesort is going to triple the code size, switching direction will double its size again, on top of that, and going over to four-way mergesort will increase the code size by another factor of (about) four. This post is too long already, and presenting routines like that will make it obscenely long (the forecasting, direction-switching four-way mergesort is almost 3000 - three thousand lines long). In the next post, (part 2A) I'll present the code generator I actually wrote that puppy with. And, surely, nobody in their right mind would code a tuned 5-way mergesort by hand!

What surprised me was that, with forecasting, a 3-way mergesort outperformed a 4-way mergesort. I won't bore you with a graph showing... just about straight lines. The timings are (if Quicksort's time is the reference time of 1), for sorting integers, with N between 250,000 and 50,000,000:

RoutineTime
2-way mergesort1.28
2-way forecast1.11
3-way mergesort1.04
3-way forecast+switch0.99
4-way forecast+switch1.01
5-way forecast+switch1.16


So... for sorting integers, on my Pentium Core Quad machine at least, a single-threaded Mergesort can give a single-threaded Quicksort a run for its money. If both routines are single-threaded, a well-tuned 3-way mergesort can even beat Quicksort sorting integers (but it will lose out to the version of Introsort I mentioned in my last post).

In my next post (part 2A) on tuning mergesort for speed I'll provide the source code for a mergesort code generator. Two hundred lines of code generator is awful, but it beats showing the 2800-odd lines of code for the 4-way merge routine, let alone the 22 thousand for a 5-way merge routine!

No comments:

Post a Comment