IceGridNode daemon stalls Linux boot process

jharriot · August 2010

Hi,

I'm running IceGridNode (Ice 3.3.1) as a daemon on RHEL 5.2. The IceGridRegistry is running on a separate Windows platform.
If the registry is not running, the Linux boot process stalls when it attempts to start the icegridnode daemon, for anywhere from 30 seconds to a couple of minutes depending on the PC. N.B. I have not set any timeouts on connections to the Locator endpoint.

This situation is quite likely, since the PCs can be started in any order.

The important thing is for the Linux icegridnode to eventually connect once the registry is alive.

Are there IceGrid properties that control the timeouts and retry intervals for the IceGridNode connection to the registry?

Are there recommended settings for IceGridNode when it runs across a network of PCs that start up in any order?

Cheers,
John

benoit · August 2010

Hi John,

It should be possible to start the IceGrid node before the registry. Order shouldn't matter. It sounds like in your case the IceGrid node doesn't detect in a timely manner that the registry isn't running.

Can you try running the node with --nowarn to see if it makes a difference? By default, the node tries to ping the IceGrid locator with a 15s timeout. With the retry, this check can last up to 30s if the locator is unreachable. Passing --nowarn when starting the node disables this check.

You should also use timeouts for the endpoints of the registry, node and locator proxy. I recommend checking out the configuration files of the demo/IceGrid/replication demo from your Ice distribution for an example where timeouts are configured on the node and registry endpoints.
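As a rough illustration of the timeout advice above (a sketch, not the demo's actual files): the `-t` endpoint option sets a timeout in milliseconds. The host name `registry-host`, port 4061, the 5000 ms values, and the default `IceGrid` instance name in the locator identity are assumptions made for this example, so adjust them to your deployment.

```
# Node configuration (fragment). Host, port and timeout values are placeholders.
Ice.Default.Locator=IceGrid/Locator:tcp -h registry-host -p 4061 -t 5000
IceGrid.Node.Endpoints=tcp -t 5000

# Registry configuration (fragment). Same placeholder values.
IceGrid.Registry.Client.Endpoints=tcp -p 4061 -t 5000
IceGrid.Registry.Server.Endpoints=tcp -t 5000
IceGrid.Registry.Internal.Endpoints=tcp -t 5000
```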

Cheers,
Benoit.

jharriot · August 2010

Hi,

Further investigations (with the registry inactive):
a. I have run IceGridNode as a console application and as a daemon (service icegridnode start).
b. I redirected the Ice.Stderr to a file so I could examine Ice trace statements.
c. I set the Locator endpoint timeout to 5 secs.

In both cases the trace log showed two initial attempts to connect to the registry, spaced about 15 sec apart. Subsequent attempts were 5 sec apart (as I expected with timeout=5000 ms).

While IceGridNode was attempting to connect, "ps -ae | grep icegrid" showed 4 processes, one of which was defunct. After 70 secs the service responded with the [OK] msg to indicate that it had started, and "ps -ae | grep icegrid" then showed only one active process.

Playing around with the endpoint timeout suggests that a delay of (30 + 8*timeout) seconds, with the timeout in seconds, occurs before the service is reported as started: with the 5 sec timeout the delay was 70 sec, and changing the timeout to 10 sec increased it to 110 sec. If I include the --nowarn option, the 30 sec initial delay is removed.
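To make the arithmetic concrete, here is a small sketch of that empirical formula. The function name `startup_delay_s` is made up for illustration, and the `nowarn` case assumes that only the 30 sec initial check disappears while the 8*timeout part stays, which is inferred from the observations above rather than measured separately.

```python
def startup_delay_s(timeout_s, nowarn=False):
    """Estimated delay (seconds) before the init script reports [OK].

    Empirical fit from the observations above: 30 s for the initial
    locator check (skipped with --nowarn) plus 8 connection attempts
    that each wait for the endpoint timeout.
    """
    initial_check_s = 0 if nowarn else 30
    return initial_check_s + 8 * timeout_s


if __name__ == "__main__":
    for timeout_s in (5, 10):
        print(timeout_s,
              startup_delay_s(timeout_s),
              startup_delay_s(timeout_s, nowarn=True))
```

With timeouts of 5 and 10 sec this reproduces the measured 70 and 110 sec delays.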

FYI, the file /etc/init.d/icegridnode was taken from the examples in the Ice distribution.

benoit · August 2010

Hi John,

You're right, the daemonized node gives control back to the caller only after it has tried to connect to the registry several times. We will look into changing this behavior for the next release. Thanks for bringing this to our attention.

Cheers,
Benoit.
