Wednesday, June 10, 2009

Heapsort - Floyd's Trick - part 1 - Sorting Integers

This is the first of two posts on Floyd's variant of heapsort. This one covers sorting integers (and other "cache-friendly" types). The second will cover the sorting of pointers to strings.

There's a trick for writing versions of Heapsort (due to Floyd), that require fewer comparisons. The idea is to skip comparisons of the value being reinserted into the heap "on the way down" - to assume it will go to the bottom of the heap - and then to shuffle it back towards the top of the heap. First you find the path to the bottom and only then decide where in the path the value being reinserted should go.

The trade-off sounds good: you save a touch less than n.log(n)/log(r) comparisons (where r is the radix of the heap, 2 for a standard heap, 3 for a ternary heap and so on) for the price of O(n) extra moves. Once upon a time, things were that simple, and it was worth using Floyd's version of sift-down (during the selection phase, but not during the heap creation phase).

But it isn't that simple any more. Floyd's version results in more

  • cache misses (since you always go to the bottom of the heap even when you don't really have to)

  • branch mispredictions (if you're sorting a simple type, that is easy and cheap to compare, and your compiler's optimisation is good, the standard version only cops about one branch mis-prediction per sift-down, but Floyd's version cops two)



The net result is that, depending on the type you are sorting, and how many records you are sorting, using Floyd's version of the sift-down routine during the selection phase might actually slow down the sort slightly, instead of speeding it up. Yeah, wikipedia says that it generally gives a 25% improvement in performance, but I suspect that's a misquote of something I said. If I remember what I said correctly, and chances are I don't, I said "...a 25% improvement in performance, for a radix-2 heapsort, if, on average, a move take the same amount of time as a comparison...".

Speaking of comments of mine that have been distorted on that page, the one about ternary heapsort being about 12% faster... has also "lost it's context". I probably said that something more like this:

  • if Floyd's version of the sift-down routine is not being used, a ternary heapsort requires 3/2/log(3)*log(2) ~ about 94.6% of the comparisons needed by a binary heapsort, and log(2)/log(3) ~ about 63% of the record movement.

  • if Floyd's version of the sift-down routine is being used, a ternary heapsort requires 2/1/log(3)*log(2) - about 126% of the comparisons needed by a binary heapsort, and log(2)/log(3) - about 63% of the record movement

  • in practice, sorting integers, on a Pentium 2, Floyd's version of ternary heapsort yields about an 11% improvement in performance compared to Floyd's version of binary heapsort



In the second paragraph of this post I said that Floyd's trick should - usually - not be used during the heap creation phase (try it out and see! It usually slows things down!). The reason it is often a bad idea to use Floyd's trick during the heap creation phase is that using it makes a "hidden" assumption about the distribution of the value that is being sifted down into the subheap: it is assumed that the distribution of the value being sifted down is approximately the same as that of the values already at the bottom of the heap. But during the heap construction phase, that is not true, because the values already at the bottom of the heap are there by virtue of the results of more comparisons than have yet been made for the value being sifted down. So: the value being sifted down is rather less likely to belong at the bottom of the heap than is the case during the selection phase.

Here are the results of running four different heap implementations running on randomly-permuted arrays of distinct integers (cheap to compare, and cheap to move).



It's worth pointing out that a radix-4 heapsort should not be significantly slower for integers than a radix-2 heapsort, even for small N. The compiler is letting the source code down. I say this with confidence because of what I've learned hand-crafting assembler heapsort implementations. I suck at assembler (I suck bad) but I can still beat the compiler for radix-2 heapsort and I can slaughter it for radix-4. For sorting integers, a radix-4 version should be only slightly slower, even for small N.

Asymptotically (ignoring the small linear and sub-linear terms), a radix P heapsort requires N*log(N)/log(P)*(P-F) comparisons and N*log(N)/log(P) moves, where F is 0 without Floyd's trick and 1 with it. Using those formulae, to compare a radix-2 and radix-4 heapsort, suggests that (without Floyd's trick), radix-4 heapsort should require approximately the same number of comparisons and about half as many moves as radix-2 heapsort. With Floyd's trick, radix-4 heapsort should, compared to radix-2 heapsort, require approximately 50% more comparisons, but 50% fewer mvoes.

For large enough N, on my Core 2 Quad machine, a radix-4 heapsort sorts integers about twice as fast as a radix-2 heapsort, because it suffers... half as many cache misses.

When sorting pointers to strings (and other cache-unfriendly types) it is a very different story, and Floyd's trick really does improve Heapsort performance markedly. But that's for a later post.

Here's the code for the routines being compared in the graph:

template<class T> void heapSortTopless(T *a, int count)
{
  int i; 
  int j; 
  T v;
  for (int h=count/2; h>=0; h--)  
  {
    v=a[i=h];
    j = i + i + 2;  //standard heapsort has +1 instead of + 2
    while (j<count)
    {
      if (j<count-1) 
        j+= (a[j]<a[j+1]) ? 1 : 0;
      if (a[j]<=v) 
        break;
      a[i]=a[j];
      i = j;
      j = i + i + 2;  //standard heapsort has +1 instead of +2
    }
    a[i]=v;
  }
  for (--count;count>=0;count--)
  {
    v=a[count];  //standard heapsort must also set a[count]=a[0]  
    i = count;  //standard heaposrt sets i=0
    j = 0;      //standard heapsort sets j=1

    while (j<count)
    {
      if (j<count-1) 
        j+= (a[j]<a[j+1]) ? 1 : 0;
      if (a[j]<=v) 
        break;
      a[i]=a[j];
      i = j;
      j = i + i + 2;  //standard heapsort has +1 instead of +2
    }
    a[i] = v;
  }
}

template<class T> void heapSort2Floyd(T* a, int count)
{
  int i; 
  int j; 
  T v;
  for (int h=count/2; h>=0; h--)  
  {
    v=a[i=h];
    j = i + i + 2;  
    while (j<count)
    {
      if (j<count-1) 
        j+= (a[j]<a[j+1]) ? 1 : 0;
      if (a[j]<=v) 
        break;
      a[i]=a[j];
      i = j;
      j = i + i + 2;  
    }
    a[i]=v;
  }
  for (--count;count>=0;--count)
  {
    v=a[count];
    i = count;
    j = 0;

    while (j<count)
    {
      if (j<count-1) 
        j+= (a[j]<a[j+1]) ? 1 : 0;
      //we do *not* check if (a[j]<=v) and allow an early exit here
      //so we save N.log2(N) comparisons.  But see below.
      a[i]=a[j];
      i = j;
      j = i + i + 2;  //standard heapsort has +1 instead of +2
    }

    //
    //because we didn't check a[j]<=v on the way down, we need to
    //go back up, checking if v<=a[j].  On average, over the entire sort
    //this loop runs O(N) times, adding O(N) comparisons and O(N) moves.
    //

    //BEGIN extra chunk of code due to Floyd
    while (i>1)
    {
      j = (i-2)>>1;
      if (v<=a[j])
        break;
      a[i]=a[j];
      i = j;
    }
    //FINISH extra chunk of code due to Floyd

    a [ i ] = v;
  }
}


template <class T> void heapSort4(T *base, int count)
{
  int i;
  T   v;
  int j;
  int j0;

  for (int i0=count/4;i0>=0;--i0)
  {
    i = i0;
    v = base[i];
    j = i + i + i + i + 4;
    while (j<count)
    {
      j0 = j;
      if (j+3<count) //almost always
      {
        j = (base[j]<base[j0+1]) ? (j0+1) : j;
        j = (base[j]<base[j0+2]) ? (j0+2) : j;
        j = (base[j]<base[j0+3]) ? (j0+3) : j;
      }
      else //very rarely
      {
        j = ((j0+1<count) && (base[j]<base[j0+1])) ? (j0+1) : j;
        j = ((j0+2<count) && (base[j]<base[j0+2])) ? (j0+2) : j;
        j = ((j0+3<count) && (base[j]<base[j0+3])) ? (j0+3) : j;
      }
      if (base[j]<v)
        break;
      base[i] = base[j];
      i = j;
      j = j+j+j+j+4;
    }
    base[i] = v;      
  }

  for (--count;count>=1;--count)
  {
    int i=count;
    T   v = base[count];
    int j=0;
    while (j<count)
    {
      int j0 = j;
      if (j+3<count) //almost always
      {
        j = (base[j]<base[j0+1]) ? (j0+1) : j;
        j = (base[j]<base[j0+2]) ? (j0+2) : j;
        j = (base[j]<base[j0+3]) ? (j0+3) : j;
      }
      else //very rarely
      {
        j = ((j0+1<count) && (base[j]<base[j0+1])) ? (j0+1) : j;
        j = ((j0+2<count) && (base[j]<base[j0+2])) ? (j0+2) : j;
        j = ((j0+3<count) && (base[j]<base[j0+3])) ? (j0+3) : j;
      }
      if (base[j]<v)
        break;
      base[i] = base[j];
      i = j;
      j = j+j+j+j+4;
    }
    base[i] = v;      
  }
}


template <class T> void heapSort4Floyd(T *base, int count)
{
  int i;
  T   v;
  int j;
  int j0;

  for (int i0=count/4;i0>=0;--i0)
  {
    i = i0;
    v = base[i];
    j = i + i + i + i + 4;
    while (j<count)
    {
      j0 = j;
      if (j+3<count) //almost always
      {
        j = (base[j]<base[j0+1]) ? (j0+1) : j;
        j = (base[j]<base[j0+2]) ? (j0+2) : j;
        j = (base[j]<base[j0+3]) ? (j0+3) : j;
      }
      else //very rarely
      {
        j = ((j0+1<count) && (base[j]<base[j0+1])) ? (j0+1) : j;
        j = ((j0+2<count) && (base[j]<base[j0+2])) ? (j0+2) : j;
        j = ((j0+3<count) && (base[j]<base[j0+3])) ? (j0+3) : j;
      }
      if (base[j]<v)
        break;
      base[i] = base[j];
      i = j;
      j = j+j+j+j+4;
    }
    base[i] = v;      
  }

  for (--count;count>0;--count)
  {
    int i=count;
    T   v = base[count];
    int j=0;
    while (j<count)
    {
      int j0 = j;
      if (j+3<count)
      {
        j = (base[j]<base[j0+1]) ? (j0+1) : j;
        j = (base[j]<base[j0+2]) ? (j0+2) : j;
        j = (base[j]<base[j0+3]) ? (j0+3) : j;
      }
      else
      {
        j = ((j0+1<count) && (base[j]<base[j0+1])) ? (j0+1) : j;
        j = ((j0+2<count) && (base[j]<base[j0+2])) ? (j0+2) : j;
        j = ((j0+3<count) && (base[j]<base[j0+3])) ? (j0+3) : j;
      }
      //We do *not* check if (base[j]<v) and allow an early exit here
      base[i] = base[j];
      i = j;
      j = j+j+j+j+4;
    }

    //BEGIN extra chunk of code due to Floyd
    while (i>3)
    {
      j = (i-4)>>2;
      if (v<=base[j])
        break;
      base[i]=base[j];
      i = j;
    }
    //FINISH extra chunk of code due to Floyd

    base[i] = v;      
  }
}

1 comment: