bug in IceInternal.Outgoing in Ice 3.2.1

This morning we experienced a situation in which ~50 threads in one of our IceBox-hosted servants were hung calling another servant hosted in IceBox in a grid deployment (no shared Communicator, so calls from one servant to the other go over the network). All of the hung threads had the same bottom of the stack trace, as follows:

java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
IceInternal.Outgoing.invoke(Outgoing.java:137)
com.hp.halo.HaloICE._DatabaseServicesDelM.acquireLock(Unknown Source)
com.hp.halo.HaloICE.DatabaseServicesPrxHelper.acquireLock(Unknown Source)
com.hp.halo.HaloICE.DatabaseServicesPrxHelper.acquireLock(Unknown Source)
com.hp.halo.eventservice.EventServicesImpl2.acquireLock(EventServicesImpl2.java:1197)
com.hp.halo.eventservice.EventServicesImpl2.terminateEvent(EventServicesImpl2.java:623)
com.hp.halo.HaloICE._EventServicesDisp.___terminateEvent(Unknown Source)
com.hp.halo.HaloICE._EventServicesDisp.__dispatch(Unknown Source)
IceInternal.Incoming.invoke(Incoming.java:147)
Ice.ConnectionI.invokeAll(ConnectionI.java:2249)
Ice.ConnectionI.message(ConnectionI.java:1362)
IceInternal.ThreadPool.run(ThreadPool.java:782)
IceInternal.ThreadPool.access$100(ThreadPool.java:12)
IceInternal.ThreadPool$EventHandlerThread.run(ThreadPool.java:1242)

A second representative stack trace:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
IceInternal.Outgoing.invoke(Outgoing.java:137)
com.hp.halo.HaloICE._DatabaseServicesDelM.getEventById(Unknown Source)
com.hp.halo.HaloICE.DatabaseServicesPrxHelper.getEventById(Unknown Source)
com.hp.halo.HaloICE.DatabaseServicesPrxHelper.getEventById(Unknown Source)
com.hp.halo.eventservice.EventConfiguration.initializeEventConfiguration(EventConfiguration.java:200)
com.hp.halo.eventservice.EventConfiguration.<init>(EventConfiguration.java:176)
com.hp.halo.eventservice.EventServicesImpl2.updateEvent(EventServicesImpl2.java:975)
com.hp.halo.HaloICE._EventServicesDisp.___updateEvent(Unknown Source)
com.hp.halo.HaloICE._EventServicesDisp.__dispatch(Unknown Source)
IceInternal.Incoming.invoke(Incoming.java:147)
Ice.ConnectionI.invokeAll(ConnectionI.java:2249)
Ice.ConnectionI.message(ConnectionI.java:1362)
IceInternal.ThreadPool.run(ThreadPool.java:782)
IceInternal.ThreadPool.access$100(ThreadPool.java:12)
IceInternal.ThreadPool$EventHandlerThread.run(ThreadPool.java:1242)

What was common to all ~50 of the hung threads was the following line:
IceInternal.Outgoing.invoke(Outgoing.java:137)

The offending block of Outgoing.java is shown immediately below:
if(timedOut)
{
    //
    // Must be called outside the synchronization of
    // this object
    //
    _connection.exception(new Ice.TimeoutException());

    //
    // We must wait until the exception set above has
    // propagated to this Outgoing object.
    //
    synchronized(this)
    {
        while(_state == StateInProgress)
        {
            try
            {
                wait();
            }
            catch(InterruptedException ex)
            {
            }
        }
    }
}


Certain Bug #1: the _state variable is not declared volatile. Thus, when another thread calls Outgoing.finished(...) to change _state and call notify(), the thread in the block above does not have to go to main memory to get the updated value; it can simply use a value cached in a register. The result is a thread hung forever in the wait loop above. Consult the Java memory model for confirmation (Chapter 17 of the Java Language Specification, or Brian Goetz's Java Concurrency in Practice).

Possible Bug #2: It appears to me that the call to
_connection.exception(new Ice.TimeoutException());
will ultimately result in a call to Outgoing.finished(). If that is the case, nothing prevents the call to Outgoing.finished(...) from completing before the thread enters the synchronized(this) block in the snippet above, again resulting in a hung thread, because the notify() occurs before the wait().

Comments

  • mes (California)
    Hi Dirk,

    It's not necessary to mark a variable as volatile if you only access it within blocks synchronized on the same object, as the JVM guarantees that the thread acquiring the monitor sees the most recent value written under that monitor.
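
    To illustrate, here is a minimal stand-alone sketch of that pattern (this is not the Ice source): a field that is read and written only while holding the same monitor needs no volatile, because releasing the lock in one thread happens-before acquiring it in another.

    // Minimal stand-alone sketch (not the Ice source): every access to _count
    // happens while holding the same monitor, so no volatile is required --
    // releasing the lock in increment() happens-before acquiring it in get(),
    // which guarantees get() sees the latest value.
    public class SynchronizedCounter
    {
        private int _count = 0; // not volatile, and doesn't need to be

        public synchronized void increment()
        {
            ++_count;
        }

        public synchronized int get()
        {
            return _count;
        }
    }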

    I notice from your stack traces that the servants are performing nested synchronous invocations. This is a common cause of deadlocks, and we usually recommend the use of asynchronous operations to avoid such issues. It would help us to see the stack traces for all of the threads in this application, as well as its thread pool configuration settings. If you send those to us we'll be happy to take a closer look.

    Regards,
    Mark
  • Good point - I overlooked that detail. Because you are both reading from and writing to _state within blocks synchronized on the same object, the updated value is always visible to the waiting thread.

    I am still not convinced that the other issue is not a bug. I see no guarantee that the call to wait() in the if (timedOut) block occurs prior to the call to finished(LocalException). Please specify how this is guaranteed.

    Unfortunately, I did not capture all of the stack traces, but the two I have are very representative - there were tens of instances of each. If you are hunting for a shared monitor across these threads, you will search in vain: ALL of the traces were exactly the same, every one of them in the wait() call on line 137 of Outgoing.java - clearly not a deadlock, but rather a whole bunch of threads in a wait() with no corresponding notify(). acquireLock simply acquires a token in the database that allows the first caller to mutate a teleconference shared among multiple participants/mutators; there is no Java synchronization involved. Furthermore, I got these stack traces by running jconsole, which can test for deadlock - I did, and the result was negative.

    As far as nested synchronous invocations are concerned, are you talking about the fact that EventServices calls DatabaseServices? Chained synchronous RPC calls are certainly very common, and, as I mentioned above, there is no Java synchronization involved (and certainly none involving multiple locks obtained in differing orders).

    As far as the threading configuration is concerned, the services are running in an IceBox from which they inherit their properties, and each has Ice.ThreadPool.SizeMax=100. No other thread-related configuration is present.

    Dirk
  • mes (California)
    Hi Dirk,

    Regarding the potential bug you mentioned: if Outgoing.finished() is invoked before Outgoing.invoke() enters its synchronized block, then by the time invoke() tests it, _state will no longer be StateInProgress, and therefore the wait loop never begins.
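
    Here is a stand-alone sketch of that guarded-wait pattern (again, not the Ice source): because the waiter re-checks the state under the same lock that finished() uses to change it, a notify() that happens first simply leaves the loop condition false, and wait() is never called - no notification can be lost.

    // Stand-alone sketch (not the Ice source) of the guarded-wait pattern.
    // Even if finished() runs and calls notifyAll() before the waiter reaches
    // its synchronized block, the waiter re-checks _state under the same lock,
    // finds it already changed, and never calls wait().
    public class GuardedWait
    {
        private static final int StateInProgress = 0;
        private static final int StateFinished = 1;

        private int _state = StateInProgress;

        public synchronized void finished()
        {
            _state = StateFinished;
            notifyAll();
        }

        public synchronized void waitForCompletion() throws InterruptedException
        {
            while(_state == StateInProgress)
            {
                wait();
            }
        }
    }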

    I asked to see all of your stack traces because I was interested to know whether your thread pools had been exhausted. Nested synchronous invocations are dangerous because they can consume all of the threads in a thread pool and cause a situation similar to yours. Note that this is not a "deadlock" in the typical sense (such as incorrect ordering of locks), but rather a "distributed deadlock".

    Holding a lock while making a distributed invocation is one obvious example. Another example is when a complex chain of invocations and callbacks is involved. For instance, suppose all of the server-side thread pool threads in Peer A are blocked on nested invocations to Peer B. In this situation, Peer A is essentially dead in the water until one of the nested invocations completes. If the completion of one of those operations also happens to involve a callback from Peer B to Peer A, you've got a deadlock because no server-side thread pool threads are available to dispatch the callback.
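
    As a plain-Java analogy (nothing Ice-specific, just a sketch), the following program shows the same starvation pattern with a fixed-size executor: every pool thread blocks waiting for work that can only run on that same pool, so nothing ever completes.

    import java.util.concurrent.Callable;
    import java.util.concurrent.CyclicBarrier;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Plain-Java analogy of thread-pool starvation (nothing Ice-specific):
    // each "outer" task blocks on a nested task submitted to the same pool.
    // Once both pool threads are occupied by outer tasks, the nested tasks can
    // never be scheduled and the pool threads wait forever. There is no
    // lock-ordering deadlock here - just threads parked with nothing to wake them.
    public class PoolStarvationDemo
    {
        public static void main(String[] args)
        {
            final ExecutorService pool = Executors.newFixedThreadPool(2);

            // The barrier guarantees both pool threads are busy with "outer"
            // work before either of them submits its nested task.
            final CyclicBarrier bothDispatched = new CyclicBarrier(2);

            for(int i = 0; i < 2; i++)
            {
                pool.submit(new Callable<Void>()
                {
                    public Void call() throws Exception
                    {
                        bothDispatched.await();

                        // "Nested invocation": block until the inner task completes...
                        Future<String> nested = pool.submit(new Callable<String>()
                        {
                            public String call()
                            {
                                return "done";
                            }
                        });

                        // ...but the inner task needs a pool thread, and none
                        // will ever be free again.
                        nested.get(); // blocks forever
                        return null;
                    }
                });
            }
            // main() returns here, but the two pool threads stay blocked in
            // get(), so the JVM never exits.
        }
    }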

    Ice uses separate client and server thread pools to try to alleviate some of these situations, but the possibility still exists when more complex applications make use of bidirectional connections and asynchronous invocations.

    For example, even though all of the server-side thread pool threads are blocked on nested invocations, the client-side thread pool normally has at least one thread available to process replies and eventually unblock those server threads. However, if the client-side thread pool threads are also blocked (e.g., while dispatching a bidirectional callback or invoking an AMI callback object), then the program is in trouble again because now no threads are available to process replies.

    In any event, this is what I suspect is happening in your program, but without more information it's difficult to be certain.

    By the way, Bernard wrote a great two-part series of articles on avoiding deadlocks that might be helpful, starting in issue 4.

    Regards,
    Mark
  • OK - I'm convinced that there is no bug. Thank you.

    I also see a thread-pool size warning in the log from around the time we encountered this problem, which lends credence to the thread-starvation theory. The ConnectionTimeout exceptions would also be consistent with thread starvation.

    The two services in question, EventServices and DatabaseServices, are deployed across 3 boxes by the grid. I would like to enable collocation optimizations between these two services, but I am hesitant to do so because a shared Communicator means a shared server thread pool, and there is a disparity in load between the two services: DatabaseServices has more clients than EventServices. Can I get differing adapter thread-pool SizeMax values despite using a shared Communicator by defining adapter-specific thread-pool sizes? Do I even need to specify a SizeMax value, and thus scale 'infinitely' with demand? (We use SSL with authenticated clients, so I am less worried about denial-of-service attacks.)

    And finally, just to be sure: since we do not use AMI, the Ice.ThreadPool.Client size can simply be left unconfigured at its default of 1? In other words, we get no scaling benefit from configuring this value to a larger size (again, in the absence of AMI)?

    Thanks

    Dirk
  • mes (California)
    "Can I get differing adapter thread-pool SizeMax values despite using a shared Communicator by defining adapter-specific thread-pool sizes?"
    Yes, your collocated IceBox services could use adapter-specific thread pools to segregate requests. This is also a convenient way of avoiding thread starvation when an application's request flow has been analyzed carefully.
    "Do I even need to specify a SizeMax value, and thus scale 'infinitely' with demand?"
    Ice doesn't support the concept of an unlimited thread pool size; you would have to set a large value for SizeMax and hope that you never hit the limit.
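
    For example, the per-adapter properties would look something like the sketch below in the services' configuration (the adapter names and sizes are placeholders, not your actual configuration):

    # Hypothetical sketch - adapter names and sizes are placeholders.
    # Shared, communicator-wide server thread pool (used by any adapter
    # without its own pool):
    Ice.ThreadPool.Server.Size=5
    Ice.ThreadPool.Server.SizeMax=100

    # Give the DatabaseServices adapter its own, larger pool so that its
    # heavier client load cannot starve the EventServices adapter:
    DatabaseServicesAdapter.ThreadPool.Size=5
    DatabaseServicesAdapter.ThreadPool.SizeMax=60

    EventServicesAdapter.ThreadPool.Size=5
    EventServicesAdapter.ThreadPool.SizeMax=40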

    As I mentioned earlier, using asynchronous invocations would improve scalability without the risk of thread starvation, at the cost of additional complexity in your code. Issue 4 contains an article that Matthew wrote on the asynchronous API. The last topic in his article discusses AMI/AMD chaining, which is a very efficient way of handling nested invocations.
    "And finally, just to be sure: since we do not use AMI, the Ice.ThreadPool.Client size can simply be left unconfigured at its default of 1?"
    That's correct. Normally you would only need to increase the size of the client-side thread pool if you are receiving callbacks over bidirectional connections, or if you have AMI callbacks performing time-consuming activities.

    Regards,
    Mark