Monday, June 1, 2009

Tuning Array Mergesorts for speed, part 1

I'll be writing three posts on tuning Mergesort for speed. This one, the first one, will focus on tuning two-way merging. The second post will focus on tuning three-way and four-way merging. The third post will focus on multithreading mergesort.

The received wisdom is that mergesort is usually significantly slower than quicksort on average, but faster than heapsort. That is true, comparing a single-threaded two-way mergesort with a standard quicksort. But the reason for the difference is that, although a two-way mergesort does fewer comparisons, on average, than a quicksort, a two-way mergesort requires far more record movement. Cache utilization has little to do with it. In later posts I'll provide example implementations of three- and four-way mergesorts, which require less record movement than a two-way mergesort and compete with quicksort on almost even terms, but this post covers the "low hanging fruit": techniques for tuning two-way mergesort (that also apply to three-way and four-way mergesort). These techniques will be reused in the three- and four-way mergesort routines I shall be discussing in later posts.

Technique#1: Store records in large contiguous blocks of memory, and use "Naked" pointers


If you want a mergesort to run fast, you will probably have to use naked pointers. Array indexing overhead is expensive, if the types you are sorting are otherwise cheap to compare and cheap to move. Using generalized container templates (like for example std::vector) is even more expensive than using arrays. If and when C++ compilers get a lot better, this won't matter, because the compilers will use "Operator Strength Reduction" to convert your array or std::vector code into naked pointer code. But don't hold your breath. All the sample routines shown here use naked pointers, because the compilers do not do a good enough job (yet!).

Technique#2: Delegating to other sorting routines for small sub-arrays.


For small sub-arrays, merging is not very efficient. Insertion sorting is faster. Many textbooks suggest using insertion sorting for arrays of size 10 or less (some even say 7). But don't take that number on faith. Do your own tests to see what the cut-off should be. For many types it is considerably higher. For sorting integers, for example, it is typically somewhere between 32 and 64. Other sorting routines can be used as delegates too. I find that using a heapsort as the delegate sort routine (with a cut-off of several hundred to several thousand, depending on the type and the compiler) often works at least as well. That said, a pointer + index version of Insertion sort might look like this (note, a two pointer version would probably be a little faster). The catch - if the delegated sort routine is not stable, the mergesort will not be stable either.

template <class T>void insertionSort(T *src, int count)
{
  //
  //Notes:  1.insertionSort is a "workhorse" sorting routine.  Many
  //      other sorting routines use it.  quickSort and Combsort
  //      may use it to "finish off the job".  Mergesorts may use it
  //      to create initial runs.
  //      2.insertionSort has a very low "set-up" cost, and for
  //      small enough N, is more efficient than quickSort.
  //      3.The following assumes that ((void*)src-sizeof(src[0]) >= 0),
  //          which may not be true if sizeof(src[0]) is large enough!
  //
  T v;
  T *stopAt  = src+count;  //pointer to location after last element to sort
  T *sorting;          //pointer to location after last element known
                //to be in sort order w.r.t. elements to its left
  T *scanning;        //used for finding where the element at *sorting
                //should be placed.

  for (sorting=src+1 ; sorting<stopAt; sorting++)
  {
    v=*sorting;
    for (scanning=sorting-1; src<=scanning && v<*scanning; scanning--)
    {
        scanning[1]=scanning[0];
    }
    scanning[1]=v;
  }
}

Technique#2: Alternating input and output arrays.


Some implementations of mergesort merge from the input array into the auxiliary array, and then copy the merged sub-arrays back to the input array: each element moves twice at each level of merging. That's a mistake. Elements should only move once at each level of merging. At each level of merging the input arrays and output arrays switch roles. Like so:

template<class T> void mergeRadix2External(T *a, T *aStop, T *b, T *bStop, T *dest)
{
  while (a<aStop && b<bStop)
  {
    if (*a<=*b)
    {
      *dest++ = *a++;
    }
    else
    {
      *dest++ = *b++;
    }
  }
  while (a<aStop) *dest++=*a++;
  while (b<bStop) *dest++=*b++;
}


template <class T> void mergesortRadix2External(T *src, int count, T *dest,
  bool bSrcIsInput, int cutOff=32)
{
  //
  //  Notes:  1.  implements a two-way merge on arrays in memory
  //          3.  At successive stack call depth levels  mergesortRadix2External()
  //              exchanges source and destination.  Imagine source on the left,
  //              destination on the right.
  //            
  //              [0,3]--> [4,7]-->      [8,11]--> [12,15]-->
  //          
  //                   <--[0,7]               <--[8,15]
  //
  //                             [0..15] -->
  //  On Entry:
  //              assumes that the memory addresses src[0..count-1] and 
  //              dest[0..count-1] do *not* overlap.
  //
  if (count<cutOff) 
  { 
    if (bSrcIsInput) 
    {
      for (int i=0; i<count; ++i)
        dest[i] = src[i];
    }
    insertionSort(dest, count); 
  }
  else
  {
    int middleIndex=count/2;  
    mergesortRadix2External(dest+middleIndex, count-middleIndex, src+middleIndex, !bSrcIsInput, cutOff);
    mergesortRadix2External(dest, middleIndex, src, !bSrcIsInput, cutOff);
    mergeRadix2External(src, src+middleIndex, src+middleIndex, src+count, dest);
  }
}

template<class T> void mergesortRadix2(T *a, int count,int cutOff)
{
  //
  //  Notes:  1.Sorts an array a[0..count=1]
  //          2.Requires additional storage in a2[0..count-1] the same
  //             size as the input array to be sorted.
  //
  T* a2 = new T[count];  
  mergesortRadix2External(a2, count, a, false, cutOff);
  delete [] a2;
}

Technique#3: Alternating between left-to-right and right-to-left merging.


If, at each level of recursion, the merging pattern switches between left-to-right and right-to-left merging, that will increase the cache hit rate slightly, because the first elements to be merged, from at least one of the inputs to each merging pass (above the lowest level of merging), should be in the cache. But does it really make a difference?

template<class T> void mergeBackwardsRadix2External(T *aStart, T* aStop, T *bStart, T *bStop, T *dest)
{
  T *a = aStop - 1;
  T *b = bStop - 1;
  dest += (aStop - aStart) + (bStop - bStart) - 1;
  while (aStart<=a && bStart<=b)
  {
    if (*a<=*b)
    {
      *dest-- = *b--;
    }
    else
    {
      *dest-- = *a--;
    }
  }
  while (aStart<=a) *dest-- = *a--;
  while (bStart<=b) *dest-- = *b--;
}

template <class T> void mergesortSwitchbackRadix2External(T *src, int count, T *dest,
  bool bSrcIsInput, int cutOff=32)
{
  //as per mergesortRadix2External, but the direction of merging is switched 
  //(between left-to-right and right-to-left) at each level of recursion,
  //to improve the cache hit rate.
  if (count<cutOff) 
  { 
    if (bSrcIsInput) 
    {
      for (int i=0; i<count; ++i)
        dest[i] = src[i];
    }
    insertionSort(dest, count); 
  }
  else
  {
    int middleIndex=count/2;  
    mergesortSwitchbackRadix2External(
      dest+middleIndex, count-middleIndex, 
      src+middleIndex, !bSrcIsInput, cutOff);
    mergesortSwitchbackRadix2External(
      dest, middleIndex, 
      src, !bSrcIsInput, cutOff);

    if (bSrcIsInput)
    {
      //left to right
      mergeRadix2External(src, src+middleIndex, src+middleIndex, src+count, dest);
    }
    else
    {
      //right to left
      mergeBackwardsRadix2External(src, src+middleIndex, src+middleIndex, src+count, dest);
    }
  }
}

template<class T> void mergesortSwitchbackRadix2(T *a, int count,int cutOff=32)
{
  //
  //  Notes:  1.Sorts an array a[0..count=1]
  //          2.Requires additional storage in a2[0..count-1] the same
  //             size as the input array to be sorted.
  //
  T* a2 = new T[count];  
  mergesortSwitchbackRadix2External(a2, count, a, false, cutOff);
  delete [] a2;
}

Yes (at least, it does on my home PC). The "switchback" version is about 1.5% faster. But that was a lot of code for a 1.5% performance gain!

Technique#4: Forecasting


When merging two sub-arrays, begin by comparing the *last* elements in each of the two sub-arrays, so that you can tell which of the inputs will be exhausted first. Once you know which input will be exhausted first, you don't ever have to check if the other input is exhausted at all. This saves half of the boundary checks. Like so...

template<class T> void mergeForecastRadix2External(T* a, T* aStop, T* b, T* bStop, T* dest)
{
  if (aStop[-1]<=bStop[-1])
  {
    //the "a"s will run out before the "b"s.
    for (;a<aStop;*dest++=*a++)
    {
      for (;*b<*a;*dest++=*b++);
    }
    for (;b<bStop;*dest++=*b++);
  }
  else
  {
    //the "b"s will run out before the "a"s.
    for (;b<bStop;*dest++=*b++)
    {
      for (;*a<=*b;*dest++=*a++);
    }
    for (;a<aStop;*dest++=*a++);
  }
}
template <class T> void mergesortForecastRadix2External(T *src, int count, T *dest,
  bool bSrcIsInput, int cutOff=32)
{
  if (count<cutOff) 
  { 
    if (bSrcIsInput) 
    {
      for (int i=0; i<count; ++i)
        dest[i] = src[i];
    }
    insertionSort(dest, count); 
  }
  else
  {
    int middleIndex=count/2;  
    mergesortForecastRadix2External(dest+middleIndex, count-middleIndex, src+middleIndex, !bSrcIsInput, cutOff);
    mergesortForecastRadix2External(dest, middleIndex, src, !bSrcIsInput, cutOff);
    mergeForecastRadix2External(src, src+middleIndex, src+middleIndex, src+count, dest);
  }
}

template<class T> void mergesortForecastRadix2(T *a, int count,int cutOff=32)
{
  //
  //  Notes:  1.Sorts an array a[0..count=1]
  //          2.Requires additional storage in a2[0..count-1] the same
  //             size as the input array to be sorted.
  //
  T* a2 = new T[count];  
  mergesortForecastRadix2External(a2, count, a, false, cutOff);
  delete [] a2;
}

For sorting integers, the forecasting mergesort is approximately 14% faster than a standard mergesort. For other types it provides less of a benefit. But... in this post you have already seen how to take mergesort from 27% slower than Quicksort to... 13% slower. In later posts I'll be showing how to speed it up further.

No comments:

Post a Comment