Monday, January 17, 2011

C++ mergesortAsymmetric source code

Here's my current draft of mergesortAsymmetric(). I suspect there's something wrong with it somewhere, because when I tuned it for space it became suspiciously fast. So far, it seems a little faster, on average, than quickSortClassic(), for sorting integers. I don't see how that can be right! (It does more element moves and more comparisons than quicksort or a standard binary mergesort - and I haven't fixed up either of the low-level sort routines to use pre-increment instead of post-increment.) There must be a mistake buried in there somewhere, but I can't see what it is.

template <class T> void insertionSort(T *src, int count)
{
  //
  //Notes:  1.insertionSort is a "workhorse" sorting routine.  Many
  //      other sorting routines use it.  quickSort and Combsort
  //      may use it to "finish off the job".  Mergesorts may use it
  //      to create initial runs.
  //      2.insertionSort has a very low "set-up" cost, and for
  //      small enough N, is more efficient than quickSort.
  //      3.The inner loop may form the pointer src-1 (one element before
  //          the array), i.e. it assumes ((char*)src - sizeof(src[0])) doesn't
  //          wrap below address zero - which may not be true if src is low
  //          enough in memory and sizeof(src[0]) is large enough!
  //
  T v;
  T *stopAt  = src+count;  //pointer to location after last element to sort
  T *sorting;          //pointer to location after last element known
                //to be in sort order w.r.t. elements to its left
  T *scanning;        //used for finding where the element at *sorting
                //should be placed.

  for (sorting=src+1 ; sorting<stopAt; sorting++)
  {
    v=*sorting;
    for (scanning=sorting-1; src<=scanning && v<*scanning; scanning--)
    {
        scanning[1]=scanning[0];
    }
    scanning[1]=v;
  }
}

template <class T> int insertionSortExternal(T *input, int count, T *output)
{
  //note: it is assumed that there is no overlap between output and input
  //    (or in C notation, that this...
  //      !(input<=output && output<input+count) &&
  //      !(output<=input && input<output+count)
  //    ...holds true).
  //
  T *endOutput = output+count;
  T *sortedBoundary;
  T *scanBack;
  *output = *input++;        //copy over first element
  for (sortedBoundary=output+1;sortedBoundary<endOutput;sortedBoundary++)
  {
    for (scanBack=sortedBoundary-1;scanBack>=output;scanBack--)
    {
      if (*scanBack<=*input) break;  //insertion point found
      scanBack[1]=*scanBack;         //shift right to make room
    }
    scanBack[1]=*input++;
  }
  return 0;
}

template <class T> inline int indexOfFirstElementInBlockGreaterThan(T *block, int count, T &v)
{
  int loIndex = 0;
  int hiIndex = count;
  while (loIndex<hiIndex)
  {
    int tryIndex = loIndex + ( hiIndex - loIndex ) / 2; //written this way to avoid overflow in loIndex+hiIndex
    if ( v < block[tryIndex]  )
      hiIndex  = tryIndex;
    else
      loIndex = tryIndex + 1;
  }
  return hiIndex;
}

//Date        By Change
//=========== == ======
//04-Jan-2011 JB Draft
//11-Jan-2011 JB Revisions to AsymmetricMergesorter, to reduce the space requirement 
//               from NR  (N=record count, R=record size) 
//               to   NPR (where P=ratio of smaller
//                         sublist to list size during each merge).
//
//               There are two sort routines, sortRight and sortLeft.
//               sortRight: 1. calls itself to sort the right-hand (n-np) records,
//                             back to the same location
//                          2. calls sortLeft to sort the left-hand np records into
//                             the np-record workarea
//                          3. merges the left-hand data from the workarea, with
//                             the right-hand data.
//               sortLeft:  1. sorts left hand of destination to source
//                          2. sorts right hand of destination to source
//                          3. merges left and right source to destination                          
//
//               There's an unexpected benefit: performance is considerably
//               *better* than it was before the space revision.
//

template<class T> void mergeAsymmetricRadix2External(T* a, T* aStop, T* b, T* bStop, T* dest)
{
  //Assumes:  aStop-a is considerably less than bStop-b
  //Note:     We want to merge until we run out of elements in a.
  //          But, if elements in b will run out first, we can't do that safely.
  //

  if (aStop[-1]<=bStop[-1])//the easy case.  a runs out first
  {
    for (;a<aStop;++dest,++a)
    {
      for (;*b<*a;++dest,++b)
      {
        *dest=*b;
      }
      *dest=*a;
    }
    for (;b<bStop;++dest,++b)
    {
      *dest=*b;
    }
  }
  else //the harder case: b's would run out first, so we need to arrange for (a portion of) the a's to run out *first*
  {
    //
    //find a portion of the a's, left(a) that *will* run out first,
    //and merge left(a) and b until left(a) runs out.
    //
    T* aChop = a + indexOfFirstElementInBlockGreaterThan(a, aStop-a, bStop[-1]);
    for (;a<aChop;++dest,++a)
    {
      for (;*b<*a;++dest,++b)
      {
        *dest=*b;
      }
      *dest=*a;
    }
    for (;b<bStop;++dest,++b)
    {
      for (;*a<=*b;++dest,++a)
      {
        *dest=*a;
      }
      *dest = *b;
    }
    for (;a<aStop;++dest,++a)
    {
      *dest=*a;
    }
  }
}

template <class T> class AsymmetricMergesorter
{
private:
  T*  m_base;
  T*  m_workArea;
  int m_workCount;
  int m_count;
  int m_cutOff;
  int m_numerator;
  int m_denominator;
public:
  AsymmetricMergesorter(T* base, int count, int cutOff, int numerator, int denominator)
    : m_base(base)
    , m_count(count)
    , m_cutOff(cutOff)
    , m_numerator(numerator)
    , m_denominator(denominator)
  {
    m_workCount = (int)( (long long)m_count * m_numerator / m_denominator );
    m_workArea  = new T [ m_workCount ]; //NPR space
  }
  ~AsymmetricMergesorter()
  {
    delete [] m_workArea;
  }
  void sortLeft(T *source, int count, T* dest, bool bSourceIsInput)
  {
    if (count<m_cutOff)
    {
      if (bSourceIsInput)
        insertionSortExternal(source, count, dest);
      else
        insertionSort(dest, count);
    }
    else
    {
      int leftCount = (int)( (long long)count * m_numerator / m_denominator );
      sortLeft( dest,             leftCount,       source,         !bSourceIsInput );
      sortLeft( source+leftCount, count-leftCount, dest+leftCount,  bSourceIsInput);
      mergeAsymmetricRadix2External( source, source+leftCount, dest+leftCount, dest+count, dest);
    }
  }
  void sortRight(T* source, int count)
  {
    if (count<m_cutOff)
    {  
      insertionSort(source, count);
    }
    else
    {
      int leftCount = (int)( (long long)count * m_numerator / m_denominator );
      sortRight(source+leftCount, count-leftCount);
      sortLeft (source,           leftCount,        m_workArea,  true);
      mergeAsymmetricRadix2External(m_workArea, m_workArea+leftCount, source+leftCount, source+count, source);
    }
  }
  void sort()
  {
    sortRight(m_base, m_count);
  }
};

template <class T> void mergesortAsymmetricRadix2(T *a, int count, int cutOff=32, int numerator=1, int denominator=4)
{
  AsymmetricMergesorter<T> s(a, count, cutOff, numerator, denominator);
  s.sort();
}
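
Something like the following quick harness exercises it (a sketch; the data size and rand() test data are arbitrary, and it assumes the templates above are in scope):

#include <cstdio>
#include <cstdlib>
#include <vector>

int main()
{
  std::vector<int> v(100000);
  for (int i = 0; i < (int)v.size(); ++i)
    v[i] = std::rand();                             //arbitrary test data
  mergesortAsymmetricRadix2(&v[0], (int)v.size());  //cutOff=32, 1:4 split by default
  bool ok = true;
  for (int i = 1; i < (int)v.size(); ++i)
    if (v[i] < v[i-1]) ok = false;
  std::puts(ok ? "sorted" : "NOT sorted");
  return 0;
}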

Thursday, January 6, 2011

Tuning Mergesorts for better branch prediction

Yesterday, I decided to play around with a forecasting binary mergesort.

Some background: way back in about 1999, when I was first experimenting with ternary mergesorts, I noticed that the best "division" of the list into sublists was not a 1:1:1 ratio. On the Pentium II I had then, the best division (for sorting 32-bit integers, anyway) seemed to be 3:4:5. I filed that fact away and didn't think about it again until yesterday.

It occurred to me that, during a merge, it's possible to rig things so that, in the main loop, the smaller of the two lists runs out first. That's simple enough: if the smaller list would run out last, binary-search the smaller list for the spot where the last element of the larger list would go; everything to the left of that spot is guaranteed to run out before the larger list does (there's a sketch of this step after the formulas below). The point is that, if the smaller list makes up a proportion p, 0<p<0.5, of the n elements being merged, then the number of boundary checks falls to np. Furthermore, the actual element comparisons become more predictable, because, on average, the element from the smaller list will be the lesser with probability p. There would be, I reasoned, some value of p below 0.5 for which performance would be better, because, asymptotically, the sort would require

moves (m):       m = n.log2(n) / H(p), where H(p) = -(p.log2(p) + (1-p).log2(1-p))
comparisons:     O(n) fewer than the moves
boundary checks: p.m
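
In code, the forecasting step is just an upper-bound binary search over the smaller run - a quick sketch (indexOfFirstElementInBlockGreaterThan, in the post above, does the same job):

//Sketch: find the chop point in the smaller run [a, aStop): the first
//element greater than the largest element of the larger run.  Everything
//before the returned pointer is guaranteed to be consumed before the
//larger run is, so merging that portion needs no boundary checks.
template <class T>
T* chopToRunOutFirst(T* a, T* aStop, const T& largestOfB)
{
  T* lo = a;
  T* hi = aStop;
  while (lo < hi)
  {
    T* mid = lo + (hi - lo) / 2;
    if (largestOfB < *mid)
      hi = mid;          //chop point is at or before mid
    else
      lo = mid + 1;      //chop point is after mid
  }
  return hi;
}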

My guess was that the best value of p (for sorting integers, anyway) would be about 0.4, with about a 2% performance improvement (because I thought that boundary checks in a non-forecasting "stock" mergesort contribute about 24% of the total running time). But it wasn't so. The best value of p appears to be about 0.1875, with a 5% performance improvement. That shocked me, because plugging p=0.1875 into the formulae above yields...

moves:  1.436 n.log2(n) (contrast: the coefficient is 1, for p=0.5)
checks: 0.269 n.log2(n) (the coefficient is 0.5, for p=0.5)
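
Those coefficients are easy to double-check with a throwaway snippet (mine, not part of the sort):

#include <cmath>
#include <cstdio>

//Evaluate moves ~ n.log2(n)/H(p) and checks ~ p.moves for a few split ratios.
int main()
{
  const double ps[] = { 0.5, 0.4, 0.25, 0.1875 };
  for (int i = 0; i < 4; ++i)
  {
    double p = ps[i];
    double H = -(p*std::log(p) + (1.0-p)*std::log(1.0-p)) / std::log(2.0);
    std::printf("p=%.4f  move coeff=%.3f  check coeff=%.3f\n", p, 1.0/H, p/H);
  }
  return 0;
}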

The only plausible idea I could come up with to explain that was: the lower p, the more predictable the results of the comparisons (asymptotically, p is the branch misprediction rate).

Unfortunately, the best value for p depends on the element type.

I also looked at asymmetric quicksorts (e.g. taking the 2nd of 4 elements from a sample, deliberately choosing a pivot value that won't divide the input evenly); if there's any advantage at all - which I doubt - it is very small. I'll post the (single-threaded) source code and some performance plots for the asymmetric mergesort in my next post. I also want to have another go at asymmetric ternary mergesort (it seems to me that the ratio 1:2:4 ought to do well on average).

I haven't given any thought yet to whether there are similar tweaks available for shellsort and combsort (trading some extra "work" for better branch prediction).

Sunday, January 2, 2011

Sorting with less data movement than selection sort

Until recently, the Wikipedia article on selection sort claimed it required less data movement than any other sorting algorithm. Luckily, that seems to have been just about corrected; there's now an article (http://en.wikipedia.org/wiki/Cycle_sort) outlining a cycle sort algorithm that does "far fewer" array writes ("far fewer" is what Wikipedia says; it's about half as many, on average). I say "just about corrected" because the cycle sort outlined there is still doing too many element moves (about as many as selection sort). The lines that go...

array[pos], item = item, array[pos]

still require three moves (because that's an exchange). If, however, you've got two item variables, item1 and item2, you can instead go...

array[pos], item2 = item1, array[pos]

half the time, and

array[pos], item1 = item2, array[pos]

the other half of the time, switching the sense of item1 and item2 each time (there's a very similar hack available in heapsort that reduces the number of index assignments). That change will reduce the average number of element moves by 1/3, to 2(n-f), where n is the number of elements in the array, and f is the number of elements already in their final position.
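
In C++, the two-temporary version looks something like this (a sketch reconstructing the idea, not code from the Wikipedia article; it mirrors the algorithm there, with each three-move exchange replaced by a two-move put-down/pick-up):

//Sketch: cycle sort with two "in hand" temporaries.  Each placement costs
//two element moves (pick up the displaced element with one hand, put down
//the element in the other) where an exchange would cost three.
template <class T>
void cycleSortTwoItems(T *a, int n)
{
  for (int cycleStart = 0; cycleStart < n-1; ++cycleStart)
  {
    T item[2];
    int hand = 0;
    item[hand] = a[cycleStart];

    int pos = cycleStart;                     //count the elements less than
    for (int i = cycleStart+1; i < n; ++i)    //the item to find where it goes
      if (a[i] < item[hand]) ++pos;
    if (pos == cycleStart) continue;          //already in place: no writes

    while (item[hand] == a[pos]) ++pos;       //put it after any duplicates
    item[1-hand] = a[pos];                    //pick up the displaced element...
    a[pos] = item[hand];                      //...put down the item: 2 moves
    hand = 1-hand;

    while (pos != cycleStart)                 //rotate the rest of the cycle
    {
      pos = cycleStart;
      for (int i = cycleStart+1; i < n; ++i)
        if (a[i] < item[hand]) ++pos;
      while (item[hand] == a[pos]) ++pos;
      item[1-hand] = a[pos];
      a[pos] = item[hand];
      hand = 1-hand;
    }
  }
}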

But that's still too many element moves. Applying a random permutation to n elements in place should require n+c-2f element moves, where c is the number of cycles (including one-element cycles) - H(n), the nth harmonic number, on average - and f is the number of elements already placed correctly (the number of one-element cycles). The reason: a cycle of length k>1 can be applied in k+1 moves (save one element, chain the other k-1 into the hole, then drop the saved element into the last hole), and summing k+1 over the non-trivial cycles gives (n-f)+(c-f). That yields: minimum 0 moves, average n+H(n)-2 (since f averages 1), maximum 3n/2 (when every cycle has two elements). Even the "two item" version of cycle sort still requires, on average, almost twice as much data movement as that.
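
To make the arithmetic concrete, here's a sketch of applying a *known* permutation cycle by cycle in exactly that many moves (the catch, of course, is that a sort has to discover the permutation as it goes):

#include <vector>

//Apply a permutation to a[0..n), where from[i] is the index of the element
//that belongs at position i.  A k-cycle costs k+1 moves: save one element,
//chain k-1 elements into the hole, drop the saved element into the last
//hole.  One-element cycles cost nothing.  Total: n + c - 2f moves.
template <class T>
long applyPermutation(T *a, const int *from, int n)
{
  std::vector<bool> done(n, false);
  long moves = 0;
  for (int i = 0; i < n; ++i)
  {
    if (done[i] || from[i] == i) continue;  //already placed, or a 1-cycle
    T saved = a[i]; ++moves;                //open the cycle
    int hole = i;
    while (from[hole] != i)
    {
      a[hole] = a[from[hole]]; ++moves;     //chain the next element in
      hole = from[hole];
      done[hole] = true;
    }
    a[hole] = saved; ++moves;               //close the cycle
  }
  return moves;
}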

I haven't been able to figure out an algorithm with a constant space overhead that achieves that. Maybe there isn't one. But I can see how to get closer to it, by maintaining more index variables. If you've got i index variables, the length of the current cycle can be reduced by i-1, at the cost of i element moves. The problem with that is handling duplicate values (it's bad enough in cycle sort but if you've got more index variables it gets much worse, because you have to skip over duplicates that will be moved into place when the cycle is "unraveled", but haven't actually been moved into place yet).

If the output isn't in the same location as the input, then n element moves will be sufficient (using a comparison counting sort, which will also be stable).
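
For instance (a sketch of the comparison counting idea; O(n^2) comparisons, but exactly one move per element, and no overlap between input and output is assumed):

#include <vector>

//Comparison counting sort: each unordered pair (i,j) increments exactly one
//rank, so the ranks form a permutation of 0..n-1; ties increment the rank of
//the later element, which is what makes the sort stable.
template <class T>
void countingSortExternal(const T *input, int n, T *output)
{
  std::vector<int> rank(n, 0);
  for (int i = 0; i < n; ++i)
    for (int j = i+1; j < n; ++j)
      if (input[j] < input[i]) ++rank[i];
      else                     ++rank[j];
  for (int i = 0; i < n; ++i)
    output[rank[i]] = input[i];             //the only n element moves
}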