Poor performance in multi-threaded C++ program

If you have plenty of CPU cores and have plenty of work to do, it should not take longer to run in multi-threaded than single threaded mode – the actual CPU time may be a fraction long, but the “wall-clock time” should be shorter. I’m pretty sure that your code has some sort of bottleneck where one thread is blocking the other.

This is because of one or more of these things – I’ll list them first, then go into detail below:

  • Some lock in a thread is blocking the second thread from running.
  • Sharing of data between threads (either true or “false” sharing).
  • Cache thrashing.
  • Competition for some external resource causing thrashing and/or blocking.
  • The Badly designed code in general.

Some lock in a thread is blocking the second thread from running.

If there is a thread that takes a lock, and another thread wants to use the resource that is locked by this thread, it will have to wait. This obviously means the thread isn’t doing anything useful. Locks should be kept to a minimum by only taking the lock for a short period. Using some code to identify if locks are holding your code, such as:

Printing some stats of the locks would help to identify where the locking is contentious – or you can try the old trick of “Press break in the debugger and see where you are” – if a thread is constantly waiting for some lock, then that’s what’s preventing progress…
Sharing of data between threads (either true or “false” sharing)

If two threads use [and update the value of it frequently] the same variable, then the two threads will have to swap “I’ve updated this” messages, and the CPU’s have to fetch the data from the other CPU before it can continue with it’s use of the variable. Since “data” is shared on a “per cache-line” level, and a cache-line is typically 32-bytes, something like:

would classify as something called “false sharing” – the ACTUAL data updated is unique per CPU, but since the data is within the same 32-byte region, the cores will still have updated the same are of memory.

Cache thrashing.

If two threads do a lot of memory reading and writing, the cache of the CPU may be constantly throwing away good data to fill it with data for the other thread. There are some techniques available to ensure that two threads don’t run in “lockstep” on which part of cache the CPU uses. If the data is 2^n (power of two) and fairly large (a multiple of the cache-size), it’s a good idea to “add an offset” for each thread – for example 1KB or 2KB. That way, when the second thread reads the same distance into the data region, it will not overwrite exactly the same area of cache that the first thread is currently using.

Competition for some external resource causing thrashing and/or blocking.

If two threads are reading or writing from/to the hard-disk, network card, or some other shared resource, this can lead to one thread blocking another thread, which in turn means lower performance. It is also possible that the code detects different threads and does some extra flushing to ensure that data is written in the correct order or similar, before starting work with the other thread.

It is also possible that there are locks internally in the code that deals with the resource (user-mode library or kernel mode drivers) that block when more than one thread is using the same resource.

Generally bad design

This is a “catchall” for “lots of other things that can be wrong”. If the result from one calculation in one thread is needed to progress the other, obviously, not a lot of work can be done in that thread.

Too small a work-unit, so all the time is spent starting and stopping the thread, and not enough work is being done. Say for example that you dole out small numbers to be “calculate if this is a prime” to each thread, one number at a time, it will probably take a lot longer to give the number to the thread than the calculation of “is this actually a prime-number” – the solution is to give a set of numbers (perhaps 10, 20, 32, 64 or such) to each thread, and then report back the result for the whole lot in one go.

There are plenty of other “bad design”. Without understanding your code it’s quite hard to say for sure.

It is entirely possible that your problem is none of the ones I’ve mentioned here, but most likely it is one of these. Hopefully this asnwer is helpful to identify the cause.