python: shutdown of communicator hangs

Hi,

In our project, we have the scenario of changing IP addresses for participants. To adapt Ice to the new situation, an extra thread keeps track of the current IP and causes Ice to shut down (before the IP is changed) and to restart (after the IP has changed). We have a test system of more than 20 PCs (Linux), which keep proxies to each other and call time-consuming functions (about 2 s each).

When the address of a computer changes, the shutdown function sometimes blocks indefinitely. I have read in another thread that this could happen due to a deadlock when destroy() is called from within a thread of the communicator (?). But this is not the case, since the extra thread was created within Python. Are there any other reasons why shutdown() blocks indefinitely? I suppose this happens when a proxy call is made via the communicator and the remote servant is no longer available, since the remote computer's IP has changed, too...

def shutdown(...):
    # shut down and destroy all communicators
    self._ftComS.shutdown()
    self._ftComS.destroy()
    self._ftComC.shutdown()
    self._ftComC.destroy()
    self._psCom.shutdown()
    self._psCom.destroy()

def start(...):
    # create communicators
    self._ftComS = Ice.initialize()
    self._ftComC = Ice.initialize()
    self._psCom = Ice.initialize()

    # register object factories
    for iceId in self._ftWrapperObjectFactories:
        objF = self._ftWrapperObjectFactories[iceId]
        self._ftComS.addObjectFactory(objF, iceId)
        self._ftComC.addObjectFactory(objF, iceId)
    for iceId in self._psWrapperObjectFactories:
        objF = self._psWrapperObjectFactories[iceId]
        self._psCom.addObjectFactory(objF, iceId)

    # create adapters
    self._ftAdapterS = self._ftComS.createObjectAdapterWithEndpoints(
        self._nRep.ID,
        self._ftProt +
        ' -t ' + str(int(self._timeoutMSec)) +
        ' -h ' + ip +
        ' -p ' + self._ftPortS
    )
    self._ftAdapterC = self._ftComC.createObjectAdapterWithEndpoints(
        self._nRep.ID,
        self._ftProt +
        ' -t ' + str(self._timeoutMSec) +
        ' -h ' + ip +
        ' -p ' + self._ftPortC
    )
    self._psAdapterS = self._psCom.createObjectAdapterWithEndpoints(
        self._nRep.ID,
        self._psProt +
        ' -t ' + str(self._timeoutMSec) +
        ' -h ' + ip +
        ' -p ' + self._psPort
    )
    self._psAdapterC = self._psAdapterS
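
A rough sketch of what the extra thread does (placeholder names; the manager object stands for the class that owns shutdown() and start(), and changeIp is whatever actually switches the interface):

import threading

class IpChangeWorker(threading.Thread):
    """Extra (pure Python) thread that restarts the Ice runtime around an
    address change: shut down before the change, start again afterwards."""

    def __init__(self, manager, newIp, changeIp):
        threading.Thread.__init__(self)
        self._manager = manager    # object providing shutdown() and start()
        self._newIp = newIp
        self._changeIp = changeIp  # callable that actually switches the interface

    def run(self):
        self._manager.shutdown()     # destroy all communicators first
        self._changeIp(self._newIp)  # the IP address changes here
        self._manager.start()        # recreate communicators and adapters on the new IP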

Comments

  • mes
    Hi,

    If you could use gdb to attach to the Python interpreter process that is hung and obtain a stack trace for each of its threads, we could probably determine what is causing the problem.

    It is possible that a network issue is causing the hang. Are you setting timeouts on your proxies?
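
    For example, a timeout can be set for all proxies of a communicator, or per proxy (a sketch; the property value, object identity and endpoint below are only placeholders):

    import sys
    import Ice

    # Communicator-wide default: any invocation that waits longer than 5 s
    # raises Ice.TimeoutException instead of blocking forever.
    init = Ice.InitializationData()
    init.properties = Ice.createProperties()
    init.properties.setProperty('Ice.Override.Timeout', '5000')
    communicator = Ice.initialize(sys.argv, init)

    # Or per proxy (timeout in milliseconds):
    base = communicator.stringToProxy('someObject:tcp -h 192.168.0.10 -p 10000')
    prx = base.ice_timeout(5000)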

    Take care,
    - Mark
  • Hi Mark,

    I have attached the backtrace from gdb and the code for dealing with Ice, since it may answer some further questions that may arise.

    line 99: configuration of Ice
    line 62: The oo-block

    As already mentioned, the methods "shutdown" and "start" are called from an extra thread.

    Thanks for your help!
    Matthias
  • mes
    Hi,

    I looked at the stack traces.

    In Thread 183, Ice is dispatching an operation to a servant. I don't know the name of the operation, but you could determine this by going to the stack frame for the dispatch function (#19 in this case) and examining the value of "current". This thread appears to be blocked while attempting to acquire a lock.
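
    If attaching gdb is inconvenient, the same information is also available on the Python side: every servant method receives an Ice.Current as its last argument, and current.operation / current.id identify the dispatch. A small sketch (the decorator below is made up, not part of Ice):

    import functools
    import logging

    def log_dispatch(method):
        # Wrap a servant method so the operation name and object identity are
        # logged before the real work starts; a hung dispatch then shows up as
        # the last "dispatching ..." entry without a matching "done" entry.
        @functools.wraps(method)
        def wrapper(self, *args):
            current = args[-1]  # Ice passes the Current object as the last argument
            logging.info('dispatching %s on %s', current.operation, current.id.name)
            try:
                return method(self, *args)
            finally:
                logging.info('done with %s', current.operation)
        return wrapper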

    In Thread 42, the program has called destroy on a communicator. Assuming that this thread is referring to the same communicator instance as thread 183, destroy cannot complete until the operation being dispatched in thread 183 completes.

    My best guess is that you've got a deadlock somewhere.

    Hope that helps.

    - Mark
  • Hi Mark,

    Thanks for your hint to take a look at "current". It helped me find at least one reason why adapter.waitForDeactivate() or communicator.destroy() blocks forever:

    It seems that Python may not perform any blocking operation within an incoming Ice call. In my code I do:

    # start the external mp3 player without waiting for it
    # (the player's argument list is abbreviated here)
    self._trackPid = os.spawnvp(os.P_NOWAIT, 'madplay', ['madplay'])
    # block the dispatch thread until the player exits
    os.waitpid(self._trackPid, 0)
    self._trackPid = 0

    This starts a shell mp3 player and waits for it to finish. Calling destroy() while waiting on the player invites me to a huge can of coffee, since os.waitpid no longer returns. The player finishes and turns into a zombie process.
    I tried replacing these lines with a time.sleep(10), with the same result. I'm not sure how sleep is implemented, but I suppose it leads to a blocking system call, too.

    => It seems not to be a deadlock!

    Is this a bug in Ice that may be fixed in version 3.1.1? At the moment we use version 3.0.1.

    Bye,
    Matthias
  • mes
    Hi Matthias,

    I can't reproduce the hang you've described with IcePy 3.1.1. In an Ice invocation, my servant spawns a thread and then sleeps for 10s. The thread sleeps for 5s and calls destroy on the communicator. As soon as the servant's call to sleep expires and the invocation completes, the spawned thread's call to destroy returns and everything works as expected.

    Instead of calling time.sleep(), I also tried using spawnvp to run the sleep command. This worked fine too.
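
    For reference, the test is roughly equivalent to the sketch below (the Slice definition, names, port and timings are made up; two communicators are used so that the invocation is not collocated):

    import os
    import sys
    import tempfile
    import threading
    import time

    import Ice

    # Write a throw-away Slice definition and load it at runtime.
    sliceDefs = """
    module Test {
        interface Worker {
            void doWork();
        };
    };
    """
    sliceFile = os.path.join(tempfile.mkdtemp(), 'Worker.ice')
    with open(sliceFile, 'w') as f:
        f.write(sliceDefs)
    Ice.loadSlice(sliceFile)
    import Test

    def destroyServer():
        # Sleep 5 s, then call destroy on the server-side communicator while
        # the dispatch in doWork() is still sleeping; destroy() blocks until
        # that dispatch completes and then returns.
        time.sleep(5)
        server.destroy()
        print('server communicator destroyed')

    class WorkerI(Test.Worker):
        def doWork(self, current=None):
            # the servant spawns the destroying thread, then blocks for 10 s
            threading.Thread(target=destroyServer).start()
            time.sleep(10)

    server = Ice.initialize(sys.argv)  # server-side communicator
    client = Ice.initialize(sys.argv)  # separate client-side communicator

    adapter = server.createObjectAdapterWithEndpoints(
        'TestAdapter', 'tcp -h 127.0.0.1 -p 10001')
    adapter.add(WorkerI(), Ice.Identity('worker', ''))
    adapter.activate()

    proxy = Test.WorkerPrx.uncheckedCast(
        client.stringToProxy('worker:tcp -h 127.0.0.1 -p 10001'))

    proxy.doWork()  # returns after about 10 s, when the dispatch completes
    print('invocation completed')
    client.destroy()

    The invocation completes normally and destroy() returns as soon as the dispatch does, which is the behaviour described above.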

    If you can provide a small, self-contained example that reproduces the problem, we can take a look at it.

    Take care,
    - Mark
  • Hi Mark,

    I have updated Ice to v3.1.1. Furthermore, I have wrapped all incoming and outgoing calls with a multiple-reader/writer lock in such a manner that the adapter is deactivated and the communicator is destroyed only if NO incoming or outgoing call is pending. OK, since it is not possible to block an incoming call completely, I lock before calling the function and unlock just before returning the result.
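
    Just to make the scheme concrete, the bookkeeping amounts to something like this (a sketch with made-up names; the real code wraps the generated proxy calls and servant methods):

    import threading

    class CallGuard(object):
        """Counts pending Ice calls so shutdown only proceeds when none are active."""

        def __init__(self):
            self._cond = threading.Condition()
            self._active = 0

        def enter(self):
            # called at the start of every incoming dispatch / outgoing invocation
            with self._cond:
                self._active += 1

        def leave(self):
            # called just before returning the result / after the invocation returns
            with self._cond:
                self._active -= 1
                if self._active == 0:
                    self._cond.notify_all()

        def waitUntilIdle(self):
            # called before adapter.deactivate() and communicator.destroy()
            with self._cond:
                while self._active > 0:
                    self._cond.wait()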

    Again the adapter and the communicator block. It's a bit confusing now, because no sleep or os.spawn is active. I suspect this to be a problem in Ice. The only established connection I can imagine is the one for transmitting the result.

    This brings me to one question: how does Ice behave when the result has to be transmitted but the client's IP has changed, or, even worse, a second PC got the IP address of the original client? Does the Ice.Override.Timeout property also apply to outgoing replies?

    Before I forget: we have 21 PCs running the same test program. The failure usually hits 0 or 1 applications per IP change, so this error is anything but deterministic :(

    Unfortunately I can't reduce the application to a simple, self-contained example, and it's too complex to run anywhere else...

    I would appreciate any hint.

    Thanks & Bye
    Matthias