Overhead of Ice mutexes.

I have a method

void SenderImpl::consume(Widget w)
{
    iceMutex.lock();

    // Computationally expensive operation in critical section.

    iceMutex.unlock();
}

When invoking consume() from a client application a single time with a large Widget, CPU utilization is around 25% (dual-processor system) on both the client and the server, which is what I would expect in this case. I have a producer-consumer system that is computationally symmetric and serialized: it takes as long for the client to produce the Widget as it takes for the server to consume it. On a single-processor system I would expect CPU utilization to be around 50%. Here is a code snippet.

void ClientImpl::produce()
{
    // Computationally expensive operation to produce the Widget.
    Widget *w = new Widget();

    senderProxy->consume(w);
}

I tried subdividing the production of the Widget into several smaller Widgets (around thirty) and invoking consume() for each smaller Widget as it is produced. This was an attempt to remove some of the serialization. Unfortunately, CPU utilization triples on the server and performance stays about the same. Here is a code snippet.

void ClientImpl::produce()
{
    for(int i = 0; i < numWidgets; i++)
    {
        // Computationally expensive operation to produce each Widget.
        Widget *w = new Widget(size);

        senderProxy->consume(w);
    }
}


What I'm trying to do is get as much of the producing and consuming as possible running in parallel. I was expecting performance and CPU utilization to almost double.

I haven't spent a whole lot of time looking into this yet, so I'm wondering what the overhead of Ice::Mutex is when waiting. In this case I have potentially thirty consume() dispatch threads waiting on a single mutex.

This comes from a bigger system, so the example listed above has been simplified. I hope I didn't miss anything; I'll probably write a small demo program later. The OS is Windows.

Comments

  • marc (Florida)
    In general, threads waiting on a mutex do not consume any CPU resources while they wait.

    I need to better understand what your application is doing:
    • Are the client and server collocated, i.e., in the same process?
    • If not collocated, do you run them on the same host (a dual-processor machine)?
    • If not collocated, how many threads do you have in the server-side thread pool?
    • Are the calls to consume() oneway or twoway?

    As an aside, I don't recommend calling lock() and unlock() directly. You should use one of the Lock classes instead, which lock in the constructor and unlock in the destructor. This is safer, especially with respect to exceptions.
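
    For example, with IceUtil::Mutex::Lock your consume() method would look roughly like this (the WidgetPtr parameter type is only illustrative; the point is the scoped lock):

    void SenderImpl::consume(const WidgetPtr& w)
    {
        // The Lock constructor acquires iceMutex.
        IceUtil::Mutex::Lock lock(iceMutex);

        // Computationally expensive operation in critical section.

        // The Lock destructor releases iceMutex, even if an exception is thrown.
    }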
  • Hi Marc, Thanks for the response. I suspected that the mutexes shouldn't use much CPU. Could be an issue in my implementation. Time to write a small demo program.

    The client and server in this case are on two separate dual-CPU systems. The client produces the Widgets and the server consumes them.

    The server side thread pool is set to around 40.

    The invocations to consume are oneway and I've also tried tagging the consume method as ["ami"], but this did not change the results.

    I've also tried running the code without the mutexes, and CPU utilization and performance go up, but the results are incorrect. I can't be sure what is going on in this case.

    --Roland
  • marc (Florida)
    The server serializes all requests, because of the mutex. So there is no way to bring up the CPU utilization by adding more threads.

    All the 40 threads do is create lots of overhead for thread context switching. That is, 40 threads receive requests in parallel (meaning lots of thread context switching), just to wait on the mutex in consume() for serialization.

    The waiting on the mutex does not consume any resources, but what a thread does before it reaches the mutex, i.e., receiving a request, does consume CPU resources.

    In order to improve performance, you must somehow de-serialize the consume function, for example, by using more fine-grained mutex protection. If this is not possible, then there is simply no way to utilize more than one CPU.
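
    As a rough sketch (the Result type and the helper functions are just placeholders, not from your code), the idea is to do the expensive work outside the critical section and hold the mutex only while updating shared state:

    void SenderImpl::consume(Widget w)
    {
        // Expensive computation on request-local data; no lock is held,
        // so several dispatch threads can run this part in parallel.
        Result r = computeExpensiveResult(w);

        // Short critical section: only the update of shared state is serialized.
        IceUtil::Mutex::Lock lock(iceMutex);
        mergeIntoSharedState(r);
    }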

    Also, you must reduce the number of threads, otherwise your overhead for thread context switching is too high. You can do this by lowering the number of threads in the thread pool, or by making sure that your client does not send a huge number of oneway requests in a burst.
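
    For example, the server-side thread pool size can be lowered through the Ice.ThreadPool.Server.Size property in the server's configuration:

    # Use a small server-side thread pool instead of 40 threads
    Ice.ThreadPool.Server.Size=2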

    Tagging a method with AMI is not necessary for oneway requests. Oneway requests are implicitly asynchronous, because there is no response at all.
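
    For reference, the client obtains a oneway proxy roughly like this (assuming SenderPrx is the generated proxy type for your Sender interface):

    // Invocations on the oneway proxy return as soon as the request is written,
    // without waiting for the server to dispatch it.
    SenderPrx senderOneway = SenderPrx::uncheckedCast(senderProxy->ice_oneway());
    senderOneway->consume(w);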
  • Hi Marc, The first way I'm attempting to improve the CPU utilization is to get as much as possible of the Widget production on the client side occurring in parallel with the consumption on the server side.

    If the client produces one large Widget and then invokes senderProxy->consume(widget), then the most I can get is 50% CPU utilization on a uniprocessor system and 25% on a dual-processor system. This assumes that the producing and consuming are computationally symmetric and that the client and server systems have the same performance. This is exactly what I'm seeing in this case: the client does not continue producing more large Widgets until the server has completed, and CPU utilization is around 25%.

    So I have subdivided the production of Widgets on the client side. In this case the client produces several smaller Widgets; let's call them SubWidgets, even though they are the same as Widgets. The client produces the SubWidgets as fast as it can, and each SubWidget is sent to the server using consume(subWidget). The client has to wait until the last SubWidget is processed by the server before it can start to produce the next Widget or set of SubWidgets. In this case I would expect to see close to 100% CPU utilization on a uniprocessor system.

    On a dual-processor system I would expect 50% CPU utilization and twice the performance compared to the single-Widget case. I agree, the mutex serializes all consume() processing on the server side, so there should be only one active thread at any point in time. Unfortunately, what I'm seeing is 75% CPU utilization and the same performance. Which gets back to my main question: in this case there should just be a bunch of consume() threads waiting on the mutex to free up.

    I agree with you that de-serializing the consume() method should improve performance even further by enabling more than one active thread and more than one CPU to do work. This is what I'm planning on doing next, but for now I want to understand the root cause of why I can't get 2X the performance with 50% CPU utilization.

    I don't quite understand your comments that 40 threads create lots of overhead for thread context switching and that I should reduce the number of threads. I was thinking that all the threads would be sleeping, or waiting to be signalled when the mutex is unlocked, so there would be little if any CPU utilization by waiting threads.

    Perhaps my original question should be re-stated: what is the overhead of a dispatch thread waiting on an Ice mutex? If there is overhead for threads waiting on a mutex, then this would easily explain my issue.

    I could try lowering the server thread pool size to 1, and get rid of the mutex to validate this.
  • marc (Florida)
    Again, there is *zero* overhead for a thread that waits on a mutex. Only locking and unlocking a mutex consumes CPU cycles, both because of the inherent thread context switches, and because of the locking/unlocking overhead in the mutex.

    However, the thread is doing something *before* it reaches this mutex. It is receiving the requests, unmarshaling parameters, preparing to dispatch the request, etc. All this will happen in parallel when multiple requests arrive at the same time (or nearly the same time).

    So 40 threads fight for CPU cycles to do the unmarshaling and prepare to dispatch the request. But once they are finished with this, they hit the mutex, and all they can do is wait until it's their turn.

    So while the 40 threads (or some of them) are receiving requests, there is a lot of overhead due to thread context switching and locking internal shared resources. Later, when the requests are serialized, there is additional thread context switching, because each of the 40 threads has to get its turn.

    I would suggest 2-3 threads on the server. Anything more will not lead to additional performance gain, but rather to a slowdown, because there are only 2 CPUs.

    For the client, sending oneways as fast as possible doesn't help either. Oneway requests may block when the TCP send buffer is full. So let's say you send 80 SubWidgets. The server will stop reading requests from the client after 40 of these, because all 40 threads are eventually waiting on the mutex. This means that the oneways will block, because the send buffer is full, and the server doesn't read until at least one of the consume functions returns.

    Even if the server didn't stop reading Widgets (if the server were very fast), the client might still block, simply because the TCP/IP stack can't send the SubWidgets quickly enough, so the TCP/IP buffer fills up and sending blocks.

    So to summarize:
    • Waiting on mutexes does not consume any CPU cycles. Locking and unlocking mutexes does.
    • Receiving a request, unmarshaling parameters, and preparing a request does consume CPU cycles. If there is a lot of thread context switching in this phase, then performance goes down.
    • The speed with which the client can send oneways is limited by (1) how fast the server can process these requests, and (2) how fast your TCP/IP stack can send out requests.

    Hope this helps to clarify what's going on!
  • Hi Marc, Thanks for all your comments. The approach we were using above turned out to work just fine. We managed to improve performance 2X to 3.5X, and CPU utilization is scaling as we expect.

    We had a twoway invocation in another part of the system, not in the example listed above, that was severely limiting performance. We're not sure we completely understand the interaction ourselves yet, but converting it to a oneway resolved the problem.