
IceGrid lock up/unresponsive

I'm evaluating IceGrid in 3.2.1 for use as our location service and seem to be running into some unexpected behaviour.

I have a master registry and two replicas set up with some sample application deployments. Things work quite smoothly most of the time, but occasionally the master registry becomes completely unresponsive and needs to be kill -9'd. Whenever this happens I shut down all Ice-related processes and attempt to restart the master registry. It starts but is unresponsive (won't accept connections) for 5-10 minutes, then seems to work itself out and everything is OK again for a while.

I'm still trying to determine a series of events that reliably reproduces this problem, but in the meantime, can anyone suggest something I can check or tweak, or offer a guess as to what might be going wrong?

Comments

  • benoit
    benoit Rennes, France
    Hi,

    Could you attach with the debugger and get a thread dump of the master registry when it hangs (preferably with a debug version)? Which OS/platform are you using?

    Cheers,
    Benoit.
  • Sorry for my ignorance, but could you let me know how to get the output you're looking for? My master registry is running on a Debian box and it's actually in the hung state right now.

    I renamed one of the replicas and both the master and the other replica stopped accepting connections. Restarting the master doesn't help; I can start it but it doesn't respond to anything.
  • benoit
    benoit Rennes, France
    You need to attach to the icegridregistry process with the gdb debugger. In the following example, we run the "script" command to capture the terminal output into a file and then run gdb to attach to the icegridregistry process with the PID 16264:
    $ script
    $ ps ax | grep icegridregistry 
    16264 s000  S      0:00.46 icegridregistry --Ice.Config=config.grid
    16268 s000  R+     0:00.00 grep icegridregistry
    $ gdb icegridregistry 16264
    ...
    (gdb) thread apply all bt
    ... <thread dump> ...
    (gdb) quit
    $ exit
    Script done, output file is typescript
    

    This should create a text file named "typescript" that contains the stack traces of the registry. I'll be happy to take a look at these stack traces to see if I can figure out why the registry on your machine hangs.

    Let us know if you need more information about these instructions!

    Could you also post the configuration files of your registries? For instance, do you specify the "-h" option for the registry endpoints? If not, the registry listens on all of the machine's local interfaces, and you have to make sure those interfaces are reachable by the other registries; otherwise, this can cause hangs. If you don't want the registry to listen on all local interfaces, you can specify the -h option in the registry endpoints (e.g.: IceGrid.Registry.Internal.Endpoints="tcp -h 192.168.5.14").
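
    Concretely, pinning all three registry endpoint sets to a single interface might look like the fragment below (the IP address is a placeholder for whichever interface the other registries can actually reach):

    ```
    IceGrid.Registry.Client.Endpoints=tcp -h 192.168.5.14 -p 4061
    IceGrid.Registry.Server.Endpoints=tcp -h 192.168.5.14
    IceGrid.Registry.Internal.Endpoints=tcp -h 192.168.5.14
    ```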

    Cheers,
    Benoit.
  • As requested, 3 config files and a thread dump.
  • benoit
    benoit Rennes, France
    Hi,

    The stack trace shows that the registry is blocked trying to connect to a slave or a node. I suspect this is because you don't explicitly specify which network interface the registry endpoints should listen on. You're also not using timeouts, which causes the connect() calls to hang.
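
    The effect of a missing connect timeout can be seen with a plain-socket sketch (ordinary Python socket code for illustration, not the Ice API; the address is a deliberately unroutable placeholder):

    ```python
    import socket

    # Without a timeout, connect() to an unreachable (non-refusing) address
    # blocks until the OS gives up, which can take minutes. With an explicit
    # timeout (analogous to Ice's -t option or Ice.Override.ConnectTimeout),
    # the call fails fast instead.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5.0)
    try:
        s.connect(("10.255.255.1", 4061))  # placeholder unroutable address
    except socket.timeout:
        print("connect timed out after 5 seconds")
    except OSError as e:
        print("connect failed immediately:", e)
    finally:
        s.close()
    ```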

    To investigate these hangs, you should set the following properties in the IceGrid registry to obtain some tracing information:
    IceGrid.Registry.Trace.Replica=2
    Ice.Trace.Network=2
    Ice.Trace.Retry=1
    

    This should trace the connection attempts to the slave by the master and the hostname/IP addresses used.

    In any case, if your machine has multiple interfaces and some of them are not reachable by other hosts, I recommend adding the -h option to the registry endpoint configuration. You should also use timeouts. For example:
    IceGrid.Registry.Client.Endpoints = tcp -p 4061 -h madbox -t 10000
    IceGrid.Registry.Server.Endpoints = tcp -h madbox -t 10000
    IceGrid.Registry.Internal.Endpoints = tcp -h madbox -t 10000
    

    See also the configuration of the demo/IceGrid/replication demo included in your Ice distribution for an example. If this still doesn't solve your problem, could you please post the traces generated by the registry with the tracing properties enabled? I'd be happy to look at them.

    Cheers,
    Benoit.
  • Thanks for the quick response. Your suspicions were correct. It seems that one of the replica machines (wirelock) has a network adapter configured for a subnet that isn't accessible to the master registry. It looks like the master registry was trying to establish a connection with the replica on both addresses and blocking on the one that it couldn't reach.

    I can solve the whole thing by configuring the troublesome replica so that it doesn't advertise an endpoint on the inaccessible subnet, but there are still some things that aren't behaving the way I'd expect. I shut down both replicas and changed my master registry endpoints to read:
    IceGrid.Registry.Client.Endpoints = tcp -p 4061 -h maddox -t 5000 : tcp -p 4061 -h localhost -t 1000
    IceGrid.Registry.Server.Endpoints = tcp -t 5000 -h maddox : tcp -t 5000 -h localhost
    IceGrid.Registry.Internal.Endpoints = tcp -t 5000 -h maddox : tcp -t 5000 -h localhost
    

    but it still doesn't always recover without a bit of luck and a lot of waiting. If the master registry knew about the inaccessible endpoint at some point in the recent past, it still tries to connect to it on startup and never seems to give up (or takes several minutes). For instance, here is a trace of this occurring:
    [ 01/18/08 13:08:41.129 Network: trying to establish tcp connection to 127.0.0.1:4061 ]
    [ 01/18/08 13:08:41.130 Network: trying to establish tcp connection to 10.68.229.67:4061 ]
    [ 01/18/08 13:08:41.131 Retry: retrying operation call because of exception
      Network.cpp:679: Ice::ConnectionRefusedException:
      connection refused: Connection refused ]
    [ 01/18/08 13:08:41.133 Network: trying to establish tcp connection to 10.68.229.67:4061 ]
    [ 01/18/08 13:08:41.134 Network: trying to establish tcp connection to 127.0.0.1:4061 ]
    [ 01/18/08 13:08:41.135 Retry: cannot retry operation call because retry limit has been exceeded
      Network.cpp:679: Ice::ConnectionRefusedException:
      connection refused: Connection refused ]
    [ 01/18/08 13:08:41.136 Network: attempting to bind to tcp socket 10.68.229.67:0 ]
    [ 01/18/08 13:08:41.136 Network: accepting tcp connections at 10.68.229.67:45220 ]
    [ 01/18/08 13:08:41.136 Network: attempting to bind to tcp socket 127.0.0.1:0 ]
    [ 01/18/08 13:08:41.136 Network: accepting tcp connections at 127.0.0.1:50929 ]
    [ 01/18/08 13:08:41.155 Network: trying to establish tcp connection to 127.0.0.1:50929 ]
    [ 01/18/08 13:08:41.155 Network: tcp connection established
      local address = 127.0.0.1:33780
      remote address = 127.0.0.1:50929 ]
    [ 01/18/08 13:08:41.156 Network: accepted tcp connection
      local address = 127.0.0.1:50929
      remote address = 127.0.0.1:33780 ]
    [ 01/18/08 13:08:41.174 Replica: creating replica `Wonder Boy' session ]
    [ 01/18/08 13:08:41.174 Network: trying to establish tcp connection to 172.16.62.1:44968 ]
    [ 01/18/08 13:10:41.128 Network: shutting down tcp connection for writing
      local address = 127.0.0.1:50929
      remote address = 127.0.0.1:33780 ]
    [ 01/18/08 13:10:41.128 Network: closing tcp connection
      local address = 127.0.0.1:50929
      remote address = 127.0.0.1:33780 ]
    [ 01/18/08 13:10:41.128 Network: shutting down tcp connection for writing
      local address = 127.0.0.1:33780
      remote address = 127.0.0.1:50929 ]
    [ 01/18/08 13:10:41.129 Network: closing tcp connection
      local address = 127.0.0.1:33780
      remote address = 127.0.0.1:50929 ]
    [ 01/18/08 13:11:50.172 Network: trying to establish tcp connection to 10.68.192.6:48429 ]
    [ 01/18/08 13:11:50.173 Retry: retrying operation call because of exception
      Network.cpp:679: Ice::ConnectionRefusedException:
      connection refused: Connection refused ]
    [ 01/18/08 13:11:50.173 Network: trying to establish tcp connection to 172.16.62.1:44968 ]
    

    The 172.16.62.1 address is the one that's given me trouble. It seems to wait a lot longer than 5 seconds, and the whole master registry is unresponsive to connect attempts for several minutes.
  • benoit
    benoit Rennes, France
    Did you also update the configuration of the replicas to add the -h and -t options to the endpoints?

    It's possible that the master registry tries once to connect to the replica using the old endpoints; if the attempt fails, it removes them from its database, and if it succeeds, it updates them. So the next time the master registry is restarted, it should use the new endpoints (the ones defined by the replica in its configuration file).

    To prevent the master registry from hanging when it connects to invalid endpoints, you can set the Ice.Override.ConnectTimeout property (a timeout value in milliseconds; see the Ice manual for more information). However, the best solution is to configure the replicas correctly and ensure they don't publish unreachable endpoints.
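
    As a sketch, the override could be added to the master registry's configuration file like this (5000 ms is just an illustrative value):

    ```
    # Fail any connection attempt that hasn't completed within 5 seconds
    Ice.Override.ConnectTimeout=5000
    ```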

    Cheers,
    Benoit.
  • Yes, specifically adding -h to the troublesome replica is what solved the problem permanently, but I was also trying to find a reliable way to recover the master registry from the hung state if this should happen again in the future.

    The Override.ConnectTimeout setting did the trick. I induced the problem, added that option to the master registry's config file, and started it up. As before it tried to connect to the inaccessible address but this time gave up after 5 seconds.

    Thanks a lot for your prompt assistance. I think that pretty much wraps up this issue.