
15-minute IceGrid timeout after the server host is shut down

Hi,
We run a distributed system through an IceGrid node/registry on a central machine, with several nodes on server hosts plus a few client hosts. The server hosts are shut down abruptly by turning off their power. If a process on a client host tries to contact a server on a recently shut-down host, it experiences an extremely long delay (around 15 minutes) before receiving an Ice::NoEndpointException. This happens reliably every time.

We've tried a few variations; here's when the client does NOT hang:
- the server is executed manually, i.e. not through IceGrid.
- the client uses a direct proxy, i.e. does not have to go through the registry to connect.
- the server host is down when the node/registry starts, or the client tries to contact the server a "long" time after the server host is shut down.

So it seems to be a problem when trying to connect right after the shutdown, through IceGrid, using indirect proxies. We run Ice-3.4.2 on Linux.
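For clarity, here is a rough sketch of the two ways the client can obtain its proxy (this is not our actual code; it assumes the Ice.Default.Locator setting from the config below, and "serverhost"/10001 are made-up placeholders):

#include <Ice/Ice.h>
#include <iostream>

int main(int argc, char* argv[])
{
    Ice::CommunicatorPtr ic = Ice::initialize(argc, argv);
    try
    {
        // indirect proxy: the adapter id "MyServer" is resolved through the
        // IceGrid locator, so the registry (and the node) take part in the call
        Ice::ObjectPrx viaRegistry = ic->stringToProxy("test @ MyServer");
        viaRegistry->ice_ping(); // this is the call that hangs ~15 min for us

        // direct proxy: endpoints are embedded, no registry lookup happens
        Ice::ObjectPrx direct = ic->stringToProxy("test:tcp -h serverhost -p 10001");
        direct->ice_ping(); // fails quickly instead of hanging
    }
    catch(const Ice::Exception& ex)
    {
        std::cerr << ex << std::endl;
    }
    ic->destroy();
    return 0;
}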

Any hints on how to avoid this would be greatly appreciated.
Alen

=== configuration of the central node/registry
Ice.Default.Locator=IceGrid/Locator:tcp -h central -p 12000
IceGrid.Registry.Client.Endpoints=tcp -h central -p 12000
# adding a timeout has no effect on the hanging problem
#IceGrid.Registry.Client.Endpoints=tcp -h central -p 12000 -t 45000
IceGrid.Registry.Server.Endpoints=tcp -h central
IceGrid.Registry.Internal.Endpoints=tcp -h central
IceGrid.Registry.DynamicRegistration=1
IceGrid.Node.Name=central
IceGrid.Node.Endpoints=tcp -h central
IceGrid.Node.CollocateRegistry=1
Ice.ThreadPool.Client.SizeMax=10
Ice.ThreadPool.Server.SizeMax=10

== output of the client with Ice.Trace.Protocol=1
-- 08/22/12 11:41:11.855 ./hanger: Protocol: sending asynchronous request
message type = 0 (request)
compression status = 0 (not compressed; do not compress response, if any)
message size = 77
request id = 2
identity = IceGrid/Locator
facet =
operation = findAdapterById
mode = 1 (nonmutating)
context =

== 1 minute later

-- 08/22/12 11:42:17.846 ./hanger: Protocol: sending close connection
message type = 4 (close connection)
compression status = 1 (not compressed; compress response, if any)
message size = 14

== 14 minutes later

-- 08/22/12 11:56:54.687 ./hanger: Protocol: received reply
message type = 2 (reply)
compression status = 0 (not compressed; do not compress response, if any)
message size = 27
request id = 2
reply status = 0 (ok)
-- 08/22/12 11:56:54.690 ./hanger: Retry: retrying operation call because of exception
Reference.cpp:1566: Ice::NoEndpointException:
no suitable endpoint available for proxy `test -t @ MyServer'
-- 08/22/12 11:56:54.690 ./hanger: Protocol: sending asynchronous request
message type = 0 (request)
compression status = 0 (not compressed; do not compress response, if any)
message size = 77
request id = 3
identity = IceGrid/Locator
facet =
operation = findAdapterById
mode = 1 (nonmutating)
context =
-- 08/22/12 11:56:54.702 ./hanger: Protocol: received reply
message type = 2 (reply)
compression status = 0 (not compressed; do not compress response, if any)
message size = 27
request id = 3
reply status = 0 (ok)
-- 08/22/12 11:56:54.704 ./hanger: Retry: cannot retry operation call because retry limit has been exceeded
Reference.cpp:1566: Ice::NoEndpointException:
no suitable endpoint available for proxy `test -t @ MyServer'

=====================

Comments

  • bernard (Jupiter, FL)
    Hi Alen,

    Welcome to our forums!

    As background info, when a client contacts the IceGrid registry to resolve an adapter ID, the registry in turn contacts the node where the corresponding adapter ID / server is running to get the endpoints. The registry does not cache this information: it always contacts the node(s).

    Here, it would be interesting to turn on Network tracing in your IceGrid registry.
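    For example, you could add something like the following to the registry configuration (the value 2 is only a suggestion; higher values trace more detail):

    # trace connection establishment and closure in the registry
    Ice.Trace.Network=2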

    If you see that the registry is attempting to connect to the node on the powered-off computer and this connection attempt hangs for a long time, the solution would be to put a timeout on the IceGrid.Node.Endpoints in the config file of the node on the powered-off machine.
    We've tried a few variations; here's when the client does NOT hang:
    [...]
    - the server host is down when the node/registry starts, or the client tries to contact the server a "long" time after the server host is shut down.

    By default, IceGrid detects that a node is down after 30 seconds and will then no longer attempt to contact it. See IceGrid.Registry.NodeSessionTimeout.
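    For illustration, in the registry configuration (10 is just an example value, not a recommendation; the default is 30 seconds):

    # consider a node down if it has not refreshed its session for 10 seconds
    IceGrid.Registry.NodeSessionTimeout=10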

    You could also use icegridadmin or the IceGrid GUI to check if the registry sees your nodes as running or not.
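    For example, something along these lines (the locator endpoint comes from your config; depending on your permissions verifier you may be prompted for a user name and password):

    icegridadmin --Ice.Default.Locator="IceGrid/Locator:tcp -h central -p 12000" -e "node list"
    icegridadmin --Ice.Default.Locator="IceGrid/Locator:tcp -h central -p 12000" -e "node ping central"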
    IceGrid.Registry.DynamicRegistration=1
    You probably set DynamicRegistration while experimenting with manually started servers; if you no longer need it, please remove it.
    IceGrid.Node.CollocateRegistry=1
    For a large multi-computer deployment, it would be clearer to separate registry and node.
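    A minimal sketch of such a split, assuming the registry stays on central and each server host gets its own node config file (node1 and node1host are placeholder names):

    # registry-only config, run with icegridregistry on central
    Ice.Default.Locator=IceGrid/Locator:tcp -h central -p 12000
    IceGrid.Registry.Client.Endpoints=tcp -h central -p 12000
    IceGrid.Registry.Server.Endpoints=tcp -h central
    IceGrid.Registry.Internal.Endpoints=tcp -h central

    # node-only config, run with icegridnode on each server host
    Ice.Default.Locator=IceGrid/Locator:tcp -h central -p 12000
    IceGrid.Node.Name=node1
    IceGrid.Node.Endpoints=tcp -h node1host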

    Best regards,
    Bernard
  • Thanks Bernard, your reply has clarified many things.
    bernard wrote:
    The registry does not cache this information: it always contacts the node(s).

    I was not aware that the Registry retrieves adapter endpoints from the Nodes every time there's a locator request. I guess the "dynamic object adapters" are cached by the Registry but the ones administered by IceGrid are not.
    bernard wrote:
    Here, it would be interesting to turn on Network tracing in your IceGrid registry.

    I've looked at the Registry traces right after the server host shutdown and saw some calls to getDirectProxy() on the now shut-down Node. I guess this is it.
    bernard wrote:
    the solution would be to put a timeout in the IceGrid.Node.Endpoints in the config file of the node on the powered-off machine.

    I've added a 45 s timeout to the Node endpoints (longer than IceGrid.Node.WaitTime=30) and it has resolved the hanging problem: the remote call to the server now returns after approximately 50 seconds, retries, and then fails immediately.
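    For reference, the relevant line in the node's config now looks roughly like this (the host name is a placeholder; the -t value is in milliseconds):

    IceGrid.Node.Endpoints=tcp -h serverhost -t 45000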

    Thanks for your help, I'll now look at your other suggestions.
    Alen