Sunday, June 21, 2009

Ramblings on Multithreaded Quicksorts

Hey, there's competition! See:
http://broadcast.oreilly.com/2009/06/may-column-multithreaded-algor.html

The examples there are written around pthreads. A couple of years back I had all my multithreading classes working with pthreads - not boost::threads - on Linux, but I haven't had to work with Linux since then. Multiple-monitor and graphics-card support - I've got two HD3870x2 cards on my desktop - was still awful the last time I tried to set up Ubuntu on it, so I'm not fooling around in Linux until and unless I have to.

The bit there about generalizing to more threads leads to an approach roughly equivalent to a variant of my example #3, in which the number of concurrent worker threads is limited. When I tried that out last Wednesday (17-June-2009), I used explicit InterlockedIncrementAcquire and InterlockedDecrementRelease calls, which are considerably cheaper than a mutex-protected integer, but the idea was much the same. Using a gateway mutex to serialize access to an integer works too, and will be fine, so long as you don't expect it to scale down: see the purple line in the following graph. A version using critical sections would have been slower, and a version using mutexes slower yet (had I set it up that way, I think the purple line would have plummeted at about the same place the light-blue line did).

[Graph: run times for the sort variants discussed above; the purple line is the gateway-mutex version.]
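For what it's worth, here's a rough sketch of that kind of interlocked gate (the names kMaxWorkers and g_activeWorkers are made up for illustration; this isn't the actual code from the runs above):

    #include <windows.h>

    // Illustrative names - not from the code in the earlier posts.
    static const LONG kMaxWorkers = 4;        // match the core count
    static volatile LONG g_activeWorkers = 0;

    // Returns true if the caller may run as a parallel worker; callers
    // that get false should just sort their range inline instead.
    bool tryEnterWorker()
    {
      if (InterlockedIncrementAcquire(&g_activeWorkers) <= kMaxWorkers)
        return true;
      // Over the limit: undo the increment and fall back to inline work.
      InterlockedDecrementRelease(&g_activeWorkers);
      return false;
    }

    // Called when a parallel worker finishes its range.
    void leaveWorker()
    {
      InterlockedDecrementRelease(&g_activeWorkers);
    }

A sorting task calls tryEnterWorker() before handing a sub-range off for parallel execution, and sorts the range inline when it gets false; leaveWorker() releases the slot when the parallel work finishes.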
Actually, looking at the code I put in the last post, the version of example #3 I posted was already limiting the number of concurrent workers, because I derived it from NotSoDumbThread, which had (oops!) a hard-coded limit on the number of not-so-dumb threads actually executing concurrently. I had the limit set to 16 but should have set it to 4. I think the runs that generated the graph shown in this post used 4 rather than 16, but I forget.

I wonder what the limit is on the speed-up you can get by multithreading (integer) Quicksort routines (on my four-core machine, anyway). I think the limit is higher than 3.5:1, but I don't see how to get there. It seems pretty clear that synchronization overhead needn't be a serious problem (for large N), and so long as the number of threads running at a time equals the number of cores, contention and starvation aren't problems either. But 3.5:1 still seems lower than it ought to be.
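Then again, a back-of-envelope Amdahl's law estimate (my arithmetic, assuming N of around eight million elements and treating only the top-level partition as serial) suggests 3.5:1 may be close to the ceiling for this structure. The first partition pass walks all N elements on one thread, which is roughly 1/log2(N) of the total work, so:

    \[
      S = \frac{1}{s + (1 - s)/p},
      \qquad s \approx \frac{1}{\log_2 N} \approx \frac{1}{23},\quad p = 4
      \quad\Longrightarrow\quad S \approx \frac{1}{0.0435 + 0.2391} \approx 3.5
    \]

And that estimate is generous, since at the second level there are only two partitions available to keep four cores busy. So getting past 3.5:1 probably means parallelizing the partitioning pass itself, not just the recursion.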

The dispatcher model I presented in my last post has advantages (and disadvantages) that I didn't talk about there. Apart from the obvious benefit that it gives significantly better performance than (most) approaches that create new threads as the sort proceeds, it has other advantages over creating threads on the fly:

  • It is relatively easy to "free up" some cores, simply by allocating fewer threads and constraining which cores those threads run on. I'll provide fuller examples showing how to do that, on Windows at least, down the track (there's a rough sketch just after this list).

  • You can run several sorts at the same time, using the same dispatcher and worker threads. Contention and starvation are still a bother, but less so.
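About that core-constraining bit: on Windows it mostly comes down to one SetThreadAffinityMask call per worker thread. A rough sketch (pinWorkersToCores and its parameters are made-up names for illustration, not part of the dispatcher code from the last post):

    #include <windows.h>

    // Pin each dispatcher worker thread to its own core, leaving the
    // remaining cores free for other work. "workers" holds the worker
    // thread handles.
    void pinWorkersToCores(HANDLE *workers, int workerCount, int firstCore)
    {
      for (int i = 0; i < workerCount; ++i)
      {
        // One bit per logical processor: worker i may run only on core
        // firstCore + i.
        DWORD_PTR mask = static_cast<DWORD_PTR>(1) << (firstCore + i);
        SetThreadAffinityMask(workers[i], mask);
      }
    }

Calling pinWorkersToCores(workers, 2, 2), say, confines two workers to cores 2 and 3, leaving cores 0 and 1 free for everything else.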


There are disadvantages, though:

  • It's harder to control the space overhead of the sort. Approaches that create new threads can control the space overhead by limiting the number of new threads (which in practice they have to do anyway, so you get that for "free"). But when there's a shared action queue, limiting the number of actions put into it for a specific sort run is a bit more painful. It is doable, though: the limit can be imposed in the latch implementation, e.g. by subclassing DivideAndConquer and implementing a version of queueOrDo like the one below (code to set the m_actionLimit member in the class constructor has been omitted)...

    void DivideAndConquerSubClass::queueOrDo(IAction *a)
    {
      // Atomically bump the count of actions this sort has in flight;
      // increment() returns the new value.
      if (increment()<m_actionLimit)
      {
        // Under the limit: queue the action for the dispatcher's worker
        // threads. The NotifyingAction wrapper calls notify() when the
        // action completes.
        gpDispatcher->queueOrDo(NewNotifyingAction(a, this));
      }
      else
      {
        // At the limit: run the action inline on the calling thread
        // instead of growing the shared queue.
        a->run();
        notify();
      }
      decrement();
    }
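Note, too, that when the action budget is exhausted the recursion simply degrades into an ordinary depth-first quicksort on whichever thread hit the limit, so the limit caps the queue's growth without ever making a thread block waiting for queue space.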


