
IceGrid Node process experiences strange 20-minute outages at random times

All our IceGrid nodes run smoothly except for one node that is experiencing mysterious 20-minute outages, during which the node is half alive and half dead. Everything else on the machine works fine. I suspect a rare race condition in the IceGrid Node code that happens to trigger under the specific conditions on this machine.

IceGrid servers on the machine continue to function, but requests for new connections fail with NodeUnreachableException. The IceGrid Node disappears from the IceGrid Admin GUI and ceases all logging. Interestingly, the registry keeps receiving keepalive messages from the node, which indicates that the node is still alive in some way. The node just stops responding to outside requests. An excerpt from the node log is quoted below.

After exactly 20 minutes, the node spontaneously recovers, immediately errors out on the registry connection (which seems healthy from the registry's point of view) and reconnects to the registry. Interestingly, the outages seem to be aligned to one of three time slots in the current hour: xx:00 – xx:20, xx:20 – xx:40, xx:40 – (xx+1):00, where xx is the current hour. These slots drift by a few minutes from week to week, but when I take a few recent failure periods, they are always fairly accurately slotted.

Now let's talk about conditions on the machine that could be related to the issue. The issue started when we migrated the server from raw hardware to a virtual machine (Oracle VirtualBox). Perhaps the virtual machine has timing properties that trigger some race condition in the IceGrid Node. Lots of software runs on the machine. One software package unique to the machine is a permanently running TeamViewer. The system timer resolution is increased to 1,000 Hz on all our machines.
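
For reference, the resolution is raised via the standard Windows multimedia timer call; a minimal sketch of what the applications involved do, assuming the usual timeBeginPeriod/timeEndPeriod pairing:

#include <windows.h>
#pragma comment(lib, "winmm.lib")

int main()
{
    // Request ~1 ms (1,000 Hz) system timer resolution; it stays in effect
    // until the matching timeEndPeriod call (or process exit).
    timeBeginPeriod(1);

    Sleep(10 * 1000); // ... application work ...

    timeEndPeriod(1);
    return 0;
}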

I don't have much time to investigate the issue further. Perhaps one of you will be able to make an educated guess at possible causes of these intermittent failures. Perhaps the exact length and slotting of the outages correspond to some specific timing code in the IceGrid Node.

-- 07/01/13 12:55:35.763 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: sending keep alive message to replica `SlaveVNET'
-- 07/01/13 12:55:41.384 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: sending keep alive message to replica `SlaveUK'
-- 07/01/13 12:55:45.661 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: sending keep alive message to replica `SlaveUS'
-- 07/01/13 12:55:50.779 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: sending keep alive message to replica `Master'
-- 07/01/13 12:55:50.853 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: sending keep alive message to replica `SlaveVNET'
-- 07/01/13 12:55:56.458 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: sending keep alive message to replica `SlaveUK'
-- 07/01/13 12:56:00.810 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: sending keep alive message to replica `SlaveUS'
-- 07/01/13 13:16:06.008 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: sending keep alive message to replica `Master'
-- 07/01/13 13:16:06.078 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: sending keep alive message to replica `SlaveVNET'
-- 07/01/13 13:16:06.160 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: lost session with replica `Master':
Outgoing.cpp:443: Ice::ObjectNotExistException:
object does not exist:
identity: `776a23c7-4e0a-422e-8c23-f32b0f63aa27'
facet:
operation: keepAlive
-- 07/01/13 13:16:06.160 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: trying to establish session with replica `Master'
-- 07/01/13 13:16:06.219 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: lost session with replica `SlaveVNET':
Outgoing.cpp:443: Ice::ObjectNotExistException:
object does not exist:
identity: `eba8d943-0173-46f7-946e-da7f7fb75065'
facet:
operation: keepAlive
-- 07/01/13 13:16:06.219 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: trying to establish session with replica `SlaveVNET'
-- 07/01/13 13:16:06.414 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: established session with replica `Master'
-- 07/01/13 13:16:06.500 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: established session with replica `SlaveVNET'
-- 07/01/13 13:16:06.570 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: sending keep alive message to replica `SlaveUK'
-- 07/01/13 13:16:06.570 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: sending keep alive message to replica `SlaveVNET'
-- 07/01/13 13:16:06.570 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: sending keep alive message to replica `SlaveUS'
-- 07/01/13 13:16:06.797 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: lost session with replica `SlaveUK':
Outgoing.cpp:443: Ice::ObjectNotExistException:
object does not exist:
identity: `12a85d1d-49b0-4891-a9bf-667ff609e3e4'
facet:
operation: keepAlive
-- 07/01/13 13:16:06.797 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: trying to establish session with replica `SlaveUK'
-- 07/01/13 13:16:07.012 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: lost session with replica `SlaveUS':
Outgoing.cpp:443: Ice::ObjectNotExistException:
object does not exist:
identity: `8d86efa0-8fb9-4fec-ac15-0b701ea7558f'
facet:
operation: keepAlive
-- 07/01/13 13:16:07.012 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: trying to establish session with replica `SlaveUS'
-- 07/01/13 13:16:07.195 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: established session with replica `SlaveUK'
-- 07/01/13 13:16:07.778 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: established session with replica `SlaveUS'
-- 07/01/13 13:16:21.528 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: sending keep alive message to replica `Master'
-- 07/01/13 13:16:21.618 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: sending keep alive message to replica `SlaveVNET'
-- 07/01/13 13:16:22.200 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: sending keep alive message to replica `SlaveUK'
-- 07/01/13 13:16:22.782 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: sending keep alive message to replica `SlaveUS'
-- 07/01/13 13:16:36.579 c:/Program Files (x86)/ZeroC/Ice-3.5.0/bin/icegridnode.exe: Replica: sending keep alive message to replica `Master'

Comments

  • benoit
    benoit Rennes, France
    Hi,

    Where do you see the NodeUnreachableException? Does it contain more information indicating the reason for the failure?

    When this occurs, you could try to execute "node load <node-id>" with the icegridadmin command-line client; it should show a message indicating that the node can't be reached, and it should also provide the reason.

    We recently had a customer with a similar half-dead/half-alive IceGrid node. The node's TCP/IP connection to the registry was working correctly, but the registry was unable to establish a connection to the node because a bogus IP address was advertised for the node's DNS name (and this DNS name was used to define the node's IceGrid.Node.Endpoints property). Could there be a similar connectivity issue between the registry and the node during those 20 minutes?

    Cheers,
    Benoit.
  • The node is configured with a fixed IP address:

    IceGrid.Node.Endpoints=tcp -h 172.16.10.10 -t 10000

    When the IceGrid Node stops responding, I can still connect to the machine at this IP address. Network connectivity seems to be all right. All our servers are configured with static IP addresses.

    NodeUnreachableException is generated somewhere in this code (custom nagios check):

    IceGrid::AdminSessionPrx session = registry->createAdminSession("nagios", "");
    IceGrid::AdminPrx gridAdmin = session->getAdmin();
    Ice::ObjectPrx serverAdmin = gridAdmin->getServerAdmin(serverId); // "MYNODE.MyServer"
    CommonIce::ProbesPrx probe = CommonIce::ProbesPrx::uncheckedCast(serverAdmin, "Probes");
    probe->ice_ping();

    I think the call that throws is AdminPrx::getServerAdmin(). It throws at the same time for all IceGrid servers running on the machine.

    I will try to find out more details about the exception. Since the issue happens randomly, sometimes several times per day and sometimes only once per week, it can take some time to capture the exception details.
  • benoit
    benoit Rennes, France
    Hi,

    Yes, it's likely the getServerAdmin call that throws the NodeUnreachableException. Could you improve your nagios check to also print out the "reason" member of the IceGrid::NodeUnreachableException?

    Cheers,
    Benoit.
  • Sorry for the delay. It seems that my previous post didn't make it through.

    NodeUnreachableException reason field: the node is not active

    I don't understand how the registry can possibly report this error when it receives periodic keepalive messages from the node. How can it tolerate the inconsistency without logging errors about it?
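
    For reference, the check now wraps the calls from the earlier snippet roughly like this (a sketch; CommonIce::ProbesPrx and the exact output handling are specific to our code):

    try
    {
        Ice::ObjectPrx serverAdmin = gridAdmin->getServerAdmin(serverId); // "MYNODE.MyServer"
        CommonIce::ProbesPrx probe = CommonIce::ProbesPrx::uncheckedCast(serverAdmin, "Probes");
        probe->ice_ping();
    }
    catch(const IceGrid::NodeUnreachableException& ex)
    {
        // ex.name is the node name, ex.reason is the registry's explanation
        // ("the node is not active" in this case).
        std::cerr << "node " << ex.name << " unreachable: " << ex.reason << std::endl;
    }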
  • This morning something more serious happened. For the first time, the node went down for longer than 20 minutes. Around midnight, these errors were logged by the node:

    ... first four lines repeat ...
    -- 07/04/13 01:02:10.556 ...: Replica: sending keep alive message to replica `SlaveUK'
    -- 07/04/13 01:02:11.612 ...: Replica: sending keep alive message to replica `Master'
    -- 07/04/13 01:02:15.333 ...: Replica: sending keep alive message to replica `SlaveUS'
    -- 07/04/13 01:02:18.693 ...: Replica: sending keep alive message to replica `SlaveVNET'
    -- 07/04/13 01:02:28.225 ...: Replica: lost session with replica `SlaveUK':
    TcpTransceiver.cpp:401: Ice::SocketException:
    socket exception: The semaphore timeout period has expired.
    -- 07/04/13 01:02:28.225 ...: Replica: trying to establish session with replica `SlaveUK'
    -- 07/04/13 01:02:33.120 ...: Replica: lost session with replica `SlaveUS':
    TcpTransceiver.cpp:401: Ice::SocketException:
    socket exception: The semaphore timeout period has expired.
    -- 07/04/13 01:02:33.120 ...: Replica: trying to establish session with replica `SlaveUS'
    !! 07/04/13 01:02:34.288 ...: error: a node with the same name is already active with the replica `SlaveUS'
    -- 07/04/13 01:02:34.288 ...: Replica: failed to establish session with replica `SlaveUS':
    IceGrid::NodeActiveException
    -- 07/04/13 01:02:36.382 ...: Replica: lost session with replica `SlaveVNET':
    TcpTransceiver.cpp:401: Ice::SocketException:
    socket exception: The semaphore timeout period has expired.
    -- 07/04/13 01:02:36.382 ...: Replica: trying to establish session with replica `SlaveVNET'
    !! 07/04/13 01:02:36.572 ...: error: a node with the same name is already active with the replica `SlaveVNET'
    -- 07/04/13 01:02:36.572 ...: Replica: failed to establish session with replica `SlaveVNET':
    IceGrid::NodeActiveException
    -- 07/04/13 01:02:37.639 ...: Replica: established session with replica `SlaveUK'
    -- 07/04/13 01:02:50.024 ...: Replica: trying to establish session with replica `SlaveUS'
    -- 07/04/13 01:02:51.022 ...: Replica: established session with replica `SlaveUS'
    -- 07/04/13 01:02:52.344 ...: Replica: trying to establish session with replica `SlaveVNET'
    -- 07/04/13 01:02:52.599 ...: Replica: established session with replica `SlaveVNET'
    -- 07/04/13 01:02:53.334 ...: Replica: sending keep alive message to replica `SlaveUK'
    -- 07/04/13 01:03:06.023 ...: Replica: sending keep alive message to replica `SlaveUS'
    -- 07/04/13 01:03:07.599 ...: Replica: sending keep alive message to replica `SlaveVNET'
    ... last three lines repeat ...

    As you can see, the connection to the Master registry has been lost. The first error logged is "The semaphore timeout period has expired.", which again points to timing issues triggering some race condition. Notice that the node never logs any attempt to reconnect to the Master registry. There is no mention of Master in the log until the node restart performed a few hours later, which produced the log below. I am getting NodeActiveException here, as if the old and the new node were fighting for presence in the registry. Also notice that the Master registry doesn't throw NodeActiveException, which means it has long since forgotten about the node.

    ... pattern from previous log repeats ...
    -- 07/04/13 12:40:51.455 ...: Replica: trying to establish session with replica `Master'
    -- 07/04/13 12:40:52.205 ...: Replica: established session with replica `Master'
    -- 07/04/13 12:40:52.429 ...: Replica: trying to establish session with replica `SlaveUK'
    -- 07/04/13 12:40:52.430 ...: Replica: trying to establish session with replica `SlaveVNET'
    -- 07/04/13 12:40:52.431 ...: Replica: trying to establish session with replica `SlaveUS'
    -- 07/04/13 12:40:52.557 ...: Server: changed server `UK.ShA' state to `Loading'
    -- 07/04/13 12:40:52.666 ...: Server: changed server `UK.ShB' state to `Loading'
    -- 07/04/13 12:40:53.100 ...: Server: changed server `UK.ShA' state to `Activating'
    -- 07/04/13 12:40:53.155 ...: Activator: activating server `UK.ShA'
    path = c:\ice\ServiceHost\WBP.Grid.ServiceHost.exe
    pwd = c:\ice
    args = c:\ice\ServiceHost\WBP.Grid.ServiceHost.exe --Ice.Config=c:\ice/db/node/servers/UK.ShA/config/config
    -- 07/04/13 12:40:53.155 ...: Server: changed server `UK.ShA' state to `WaitForActivation'
    -- 07/04/13 12:40:53.214 ...: Server: changed server `UK.ShB' state to `Activating'
    -- 07/04/13 12:40:53.264 ...: Activator: activating server `UK.ShB'
    path = c:\ice\ServiceHost\WBP.Grid.ServiceHost.exe
    pwd = c:\ice
    args = c:\ice\ServiceHost\WBP.Grid.ServiceHost.exe --Ice.Config=c:\ice/db/node/servers/UK.ShB/config/config
    -- 07/04/13 12:40:53.264 ...: Server: changed server `UK.ShB' state to `WaitForActivation'
    -- 07/04/13 12:41:07.758 ...: Replica: sending keep alive message to replica `Master'
    -- 07/04/13 12:41:13.315 ...: Activator: detected termination of server `UK.ShB'
    -- 07/04/13 12:41:13.316 ...: Server: changed server `UK.ShB' state to `Deactivating'
    -- 07/04/13 12:41:13.316 ...: Server: changed server `UK.ShB' state to `Inactive'
    -- 07/04/13 12:41:13.331 ...: Activator: detected termination of server `UK.ShA'
    -- 07/04/13 12:41:13.331 ...: Server: changed server `UK.ShA' state to `Deactivating'
    -- 07/04/13 12:41:13.331 ...: Adapter: server `UK.ShA' adapter `UK.ShA.Adapter' deactivated
    -- 07/04/13 12:41:13.331 ...: Server: changed server `UK.ShA' state to `Inactive'
    -- 07/04/13 12:41:13.358 ...: Adapter: waiting for activation of server `UK.ShA' adapter `UK.ShA.Adapter'
    -- 07/04/13 12:41:13.358 ...: Server: changed server `UK.ShA' state to `Activating'
    -- 07/04/13 12:41:13.358 ...: Activator: activating server `UK.ShA'
    path = c:\ice\ServiceHost\WBP.Grid.ServiceHost.exe
    pwd = c:\ice
    args = c:\ice\ServiceHost\WBP.Grid.ServiceHost.exe --Ice.Config=c:\ice/db/node/servers/UK.ShA/config/config
    -- 07/04/13 12:41:13.360 ...: Server: changed server `UK.ShA' state to `WaitForActivation'
    -- 07/04/13 12:41:13.816 ...: Server: changed server `UK.ShB' state to `Activating'
    -- 07/04/13 12:41:13.816 ...: Activator: activating server `UK.ShB'
    path = c:\ice\ServiceHost\WBP.Grid.ServiceHost.exe
    pwd = c:\ice
    args = c:\ice\ServiceHost\WBP.Grid.ServiceHost.exe --Ice.Config=c:\ice/db/node/servers/UK.ShB/config/config
    -- 07/04/13 12:41:13.817 ...: Server: changed server `UK.ShB' state to `WaitForActivation'
    -- 07/04/13 12:41:15.941 ...: Server: changed server `UK.ShB' state to `Active'
    -- 07/04/13 12:41:16.029 ...: Server: changed server `UK.ShA' state to `Active'
    -- 07/04/13 12:41:16.029 ...: Adapter: server `UK.ShA' adapter `UK.ShA.Adapter' activated: dummy -t -e 1.1:tcp -h 172.16.10.10 -p 63894
    -- 07/04/13 12:41:22.672 ...: Replica: sending keep alive message to replica `Master'
    !! 07/04/13 12:41:23.014 ...: error: a node with the same name is already active with the replica `SlaveVNET'
    -- 07/04/13 12:41:23.014 ...: Replica: failed to establish session with replica `SlaveVNET':
    IceGrid::NodeActiveException
    !! 07/04/13 12:41:23.333 ...: error: a node with the same name is already active with the replica `SlaveUK'
    -- 07/04/13 12:41:23.333 ...: Replica: failed to establish session with replica `SlaveUK':
    IceGrid::NodeActiveException
    !! 07/04/13 12:41:23.786 ...: error: a node with the same name is already active with the replica `SlaveUS'
    -- 07/04/13 12:41:23.786 ...: Replica: failed to establish session with replica `SlaveUS':
    IceGrid::NodeActiveException
    -- 07/04/13 12:41:33.014 ...: Replica: trying to establish session with replica `SlaveVNET'
    !! 07/04/13 12:41:33.225 ...: error: a node with the same name is already active with the replica `SlaveVNET'
    -- 07/04/13 12:41:33.225 ...: Replica: failed to establish session with replica `SlaveVNET':
    IceGrid::NodeActiveException
    -- 07/04/13 12:41:33.333 ...: Replica: trying to establish session with replica `SlaveUK'
    -- 07/04/13 12:41:33.748 ...: Replica: established session with replica `SlaveUK'
    -- 07/04/13 12:41:33.787 ...: Replica: trying to establish session with replica `SlaveUS'
    !! 07/04/13 12:41:34.389 ...: error: a node with the same name is already active with the replica `SlaveUS'
    -- 07/04/13 12:41:34.390 ...: Replica: failed to establish session with replica `SlaveUS':
    IceGrid::NodeActiveException
    -- 07/04/13 12:41:37.717 ...: Replica: sending keep alive message to replica `Master'
    -- 07/04/13 12:41:43.228 ...: Replica: trying to establish session with replica `SlaveVNET'
    !! 07/04/13 12:41:43.321 ...: error: a node with the same name is already active with the replica `SlaveVNET'
    -- 07/04/13 12:41:43.321 ...: Replica: failed to establish session with replica `SlaveVNET':
    IceGrid::NodeActiveException
    -- 07/04/13 12:41:44.390 ...: Replica: trying to establish session with replica `SlaveUS'
    !! 07/04/13 12:41:44.683 ...: error: a node with the same name is already active with the replica `SlaveUS'
    -- 07/04/13 12:41:44.683 ...: Replica: failed to establish session with replica `SlaveUS':
    IceGrid::NodeActiveException
    -- 07/04/13 12:41:48.748 ...: Replica: sending keep alive message to replica `SlaveUK'
    -- 07/04/13 12:41:52.763 ...: Replica: sending keep alive message to replica `Master'
    -- 07/04/13 12:41:53.326 ...: Replica: trying to establish session with replica `SlaveVNET'
    -- 07/04/13 12:41:53.668 ...: Replica: established session with replica `SlaveVNET'
    -- 07/04/13 12:41:54.687 ...: Replica: trying to establish session with replica `SlaveUS'
    -- 07/04/13 12:41:55.224 ...: Replica: established session with replica `SlaveUS'
    -- 07/04/13 12:42:03.828 ...: Replica: sending keep alive message to replica `SlaveUK'
    -- 07/04/13 12:42:07.823 ...: Replica: sending keep alive message to replica `Master'
    -- 07/04/13 12:42:08.668 ...: Replica: sending keep alive message to replica `SlaveVNET'
    -- 07/04/13 12:42:10.224 ...: Replica: sending keep alive message to replica `SlaveUS'
    ... last four lines repeat ...
  • benoit
    benoit Rennes, France
    Hi,

    Are the node and registries on the same physical network? Can you describe your network configuration?

    I suspect a network failure occurs and causes all the network connections to go down or hang (for Master). I don't know what would cause those sporadic network outages (a hardware or router issue? a cable problem?). Some network outages can cause very long hangs if the TCP/IP socket is left in a "half-open" state. To prevent those hangs from lasting too long, it is important to configure Ice connection timeouts.

    Do you specify Ice connection timeouts for the IceGrid registry and node endpoints? If not, you should; I believe this will fix those 20-minute outages when a network issue occurs. The node might still hang for a short time (depending on the timeout value), but it should recover quickly after that.

    To configure the timeouts, I recommend checking out the configuration files from the IceGrid replication demo in the demo/IceGrid/replication directory of the Ice demo source distribution. You will see that all the endpoints specify a "-t <timeout in ms>" option. The timeout value to use depends on your network; you could, for example, use 30s as a start.

    If the problem still occurs after configuring timeouts, could you enable some additional tracing on the nodes and registries to trace network connections with Ice.Trace.Network=2? I would also like to see the traces for the registries in addition to the node (make sure to enable node session tracing in the registries' configuration files with IceGrid.Registry.Trace.Node=1).

    The node and the registry use two TCP/IP connections to communicate: the registry establishes a connection to the node, and the node also establishes its own connection for calling on the registry. One of the connections being in a half-open state doesn't necessarily imply that the other will be in the same state... it's possible the connection failed and was re-established, for example. This could explain why the registry fails to talk to the node while the node succeeds.

    Cheers,
    Benoit.
  • I've configured timeouts, but it doesn't seem to help. Here's my node timeout configuration:

    IceGrid.Node.Endpoints=tcp -h 172.16.10.10 -t 10000
    Ice.Default.Locator=WBPGrid/Locator:tcp -h 172.16.0.1 -p 4061 -t 10000:tcp -h 172.16.3.10 -p 4061 -t 10000

    And this is registry timeout configuration:

    IceGrid.Registry.Client.Endpoints=tcp -h 172.16.0.1 -p 4061 -t 10000
    IceGrid.Registry.Server.Endpoints=tcp -h 172.16.0.1 -t 10000
    IceGrid.Registry.Internal.Endpoints=tcp -h 172.16.0.1 -p 4063 -t 10000

    We have an OpenVPN network connecting servers in multiple locations. All servers connect via UDP to a central OpenVPN server, which mediates all communication on the network. This pretty much rules out VPN-related issues, since those would bring down the whole network rather than a single node. The server with the failing node is unique in that it is virtual and its connection to the OpenVPN server goes through VirtualBox's built-in NAT. Nevertheless, as mentioned before, the IceGrid servers on the machine continue to function even when the node process is down, which rules out issues with the VPN connection.

    I think that half-open TCP connections are an unlikely cause. TCP times out after roughly 2 minutes if it doesn't receive an ACK for sent data. Since keepalive messages flow constantly from the node to the registry, there must be a corresponding flow of ACKs from the registry to the node. That pretty much rules out all half-open connection scenarios. Connection issues also cannot explain why the node doesn't log anything for 20 minutes.

    As for the secondary connection from the registry to the node, it seems to me that this connection is completely down during outages, which is why the registry reports the node as being down. Since it is down, the registry must be regularly retrying this connection, which again rules out half-open connection issues. The retrying is apparent from the fact that the node is rediscovered rapidly once it recovers after 20 minutes.

    I've enabled network tracing on the failing node and primary registry and I will be posting traces once I observe another outage.
  • benoit
    benoit Rennes, France
    I think that half-open TCP connections are an unlikely cause. TCP times out after roughly 2 minutes if it doesn't receive an ACK for sent data. Since keepalive messages flow constantly from the node to the registry, there must be a corresponding flow of ACKs from the registry to the node. That pretty much rules out all half-open connection scenarios. Connection issues also cannot explain why the node doesn't log anything for 20 minutes.

    Keep in mind that there are two TCP/IP connections, one in each direction. We can't completely rule out that one connection could be in such a state even if the other is fine.

    The node not logging anything for 20 minutes would imply that the thread responsible for the session with the Master registry is hanging. I don't see where it could be hanging other than hanging on a two-way synchronous call to the master registry. Such a hang shouldn't occur if timeouts are set.

    Are you still seeing such a hang after setting the timeouts? If you do, the next step will be to attach the debugger and get a thread dump of all the threads running in the IceGrid node when the hang occurs.
    As for the secondary connection from the registry to the node, it seems to me that this connection is completely down during outages, which is why the registry reports the node as being down. Since it is down, the registry must be regularly retrying this connection, which again rules out half-open connection issues. The retrying is apparent from the fact that the node is rediscovered rapidly once it recovers after 20 minutes.

    The registry doesn't regularly retry connecting to the node on its own; it might, however, if you perform regular checks using the IceGrid administrative interface (a "node ping" will cause a connection establishment attempt if the "registry to node" connection isn't already established; see the sketch at the end of this post).
    I've enabled network tracing on the failing node and primary registry and I will be posting traces once I observe another outage.

    Ok, hopefully this will provide additional information that will help us to better understand the cause of this issue.

    Cheers,
    Benoit.
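
    For reference, a periodic "node ping" check of this kind could look roughly like this (a sketch only; the registry identity `WBPGrid/Registry' and the "monitor" credentials are assumptions based on the configuration shown earlier in the thread):

    #include <Ice/Ice.h>
    #include <IceGrid/Admin.h>
    #include <IceGrid/Registry.h>
    #include <iostream>

    int main(int argc, char* argv[])
    {
        // Ice.Default.Locator is read from the regular Ice configuration.
        Ice::CommunicatorPtr ic = Ice::initialize(argc, argv);
        int status = 0;
        try
        {
            IceGrid::RegistryPrx registry = IceGrid::RegistryPrx::checkedCast(
                ic->stringToProxy("WBPGrid/Registry"));
            IceGrid::AdminSessionPrx session = registry->createAdminSession("monitor", "");
            IceGrid::AdminPrx admin = session->getAdmin();

            // pingNode makes the registry contact the node, which (re)establishes
            // the registry-to-node connection if it is not already open.
            bool up = admin->pingNode("UK");
            std::cout << "node UK is " << (up ? "up" : "down") << std::endl;

            session->destroy();
        }
        catch(const Ice::Exception& ex)
        {
            std::cerr << ex << std::endl;
            status = 1;
        }
        ic->destroy();
        return status;
    }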
  • I've enabled network tracing in the node and the registry, and I've captured logs during an outage. I cannot see any additional evidence in the logs, but perhaps you can spot a hint in there.

    Below is an abbreviated node log during the outage. Apparently the node ceases to send keepalive messages for one minute, then closes its connections to the registries and remains quiet for 20 minutes. Twenty minutes later it closes its connections to local servers and then recovers by renewing all connections.

    ... keepalive messages here ...
    -- 07/09/13 12:00:04.318 ...: Replica: sending keep alive message to replica `Master'
    -- 07/09/13 12:00:04.905 ...: Replica: sending keep alive message to replica `SlaveVNET'
    -- 07/09/13 12:00:05.722 ...: Replica: sending keep alive message to replica `SlaveUS'
    -- 07/09/13 12:00:07.065 ...: Replica: sending keep alive message to replica `SlaveUK'
    -- 07/09/13 12:01:02.134 ...: Network: closing tcp connection
    local address = 172.16.10.10:49998
    remote address = 172.16.0.1:4063
    -- 07/09/13 12:01:03.446 ...: Network: closing tcp connection
    local address = 172.16.10.10:49992
    remote address = 172.16.9.26:4063
    -- 07/09/13 12:01:05.978 ...: Network: closing tcp connection
    local address = 172.16.10.10:49994
    remote address = 172.16.8.22:4063
    -- 07/09/13 12:01:06.764 ...: Network: closing tcp connection
    local address = 172.16.10.10:49996
    remote address = 172.16.3.10:4063
    -- 07/09/13 12:01:43.873 ...: Network: closing tcp connection
    local address = 172.16.10.10:49350
    remote address = 172.16.0.1:50158
    -- 07/09/13 12:20:13.754 ...: Network: closing tcp connection
    local address = 127.0.0.1:50001
    remote address = 127.0.0.1:49363
    -- 07/09/13 12:20:13.755 ...: Network: closing tcp connection
    local address = 127.0.0.1:50000
    remote address = 127.0.0.1:49359
    -- 07/09/13 12:20:18.441 ...: Replica: sending keep alive message to replica `Master'
    -- 07/09/13 12:20:18.441 ...: Network: trying to establish tcp connection to 172.16.0.1:4063
    -- 07/09/13 12:20:18.486 ...: Network: tcp connection established
    local address = 172.16.10.10:50024
    remote address = 172.16.0.1:4063
    -- 07/09/13 12:20:18.583 ...: Replica: lost session with replica `Master':
    Outgoing.cpp:443: Ice::ObjectNotExistException:
    object does not exist:
    identity: `1a755a87-28b4-4285-96f2-72167012c395'
    facet:
    operation: keepAlive
    -- 07/09/13 12:20:18.583 ...: Replica: trying to establish session with replica `Master'
    -- 07/09/13 12:20:18.587 ...: Network: trying to establish tcp connection to 172.16.0.1:4063
    -- 07/09/13 12:20:18.640 ...: Network: tcp connection established
    local address = 172.16.10.10:50025
    remote address = 172.16.0.1:4063
    -- 07/09/13 12:20:18.876 ...: Replica: established session with replica `Master'
    -- 07/09/13 12:20:18.993 ...: Network: accepted tcp connection
    local address = 172.16.10.10:49350
    remote address = 172.16.0.1:51307
    ......
    -- 07/09/13 12:20:19.066 ...: Network: accepted tcp connection
    local address = 172.16.10.10:49350
    remote address = 172.16.0.1:51308
    ... similar messages for slave registries ...
    -- 07/09/13 12:20:29.919 ...: Network: trying to establish tcp connection to 127.0.0.1:49359
    -- 07/09/13 12:20:29.923 ...: Network: tcp connection established
    local address = 127.0.0.1:50032
    remote address = 127.0.0.1:49359
    ... keepalive messages ...
    -- 07/09/13 12:20:47.296 ...: Network: trying to establish tcp connection to 127.0.0.1:49363
    -- 07/09/13 12:20:47.296 ...: Network: tcp connection established
    local address = 127.0.0.1:50033
    remote address = 127.0.0.1:49363
    ... keepalive messages ...
    -- 07/09/13 12:21:22.758 ...: Network: closing tcp connection
    local address = 172.16.10.10:50025
    remote address = 172.16.0.1:4063
    -- 07/09/13 12:21:22.764 ...: Network: closing tcp connection
    local address = 172.16.10.10:49350
    remote address = 172.16.0.1:51307
    ... similar messages for slave registries ...
    ... keepalive messages ...

    Registry log messages related to the failing node are below. Time is synchronized via NTP to at least one-second precision. Interestingly, there are often delays of several seconds between corresponding events on the node and the registry. Apparently the registry judges the node to be down about 50 seconds after receiving the last keepalive message. It then proceeds to close two connections to the node. From the timing, it looks like one connection is closed by the registry and one by the node. The registry then considers the node alive again once it receives two connection requests from the node.

    -- 07/09/13 12:00:56.575 Registry: Node: node `UK' down
    ...
    -- 07/09/13 12:01:07.792 Registry: Network: closing tcp connection
    local address = 172.16.0.1:4063
    remote address = 172.16.10.10:49998
    ...
    -- 07/09/13 12:01:37.793 Registry: Network: closing tcp connection
    local address = 172.16.0.1:50158
    remote address = 172.16.10.10:49350
    ...
    -- 07/09/13 12:20:22.084 Registry: Network: accepted tcp connection
    local address = 172.16.0.1:4063
    remote address = 172.16.10.10:50024
    -- 07/09/13 12:20:22.227 Registry: Network: accepted tcp connection
    local address = 172.16.0.1:4063
    remote address = 172.16.10.10:50025
    ...
    -- 07/09/13 12:20:22.276 Registry: Node: node `UK' up
    ...
    -- 07/09/13 12:20:22.459 Registry: Network: trying to establish tcp connection to 172.16.10.10:49350
    -- 07/09/13 12:20:22.521 Registry: Network: tcp connection established
    local address = 172.16.0.1:51307
    remote address = 172.16.10.10:49350
    ...
    -- 07/09/13 12:20:22.541 Registry: Server: loading `UK.ShA' on node `UK'
    -- 07/09/13 12:20:22.541 Registry: Network: trying to establish tcp connection to 172.16.10.10:49350
    -- 07/09/13 12:20:22.594 Registry: Network: tcp connection established
    local address = 172.16.0.1:51308
    remote address = 172.16.10.10:49350
    ...
    -- 07/09/13 12:20:22.698 Registry: Server: loaded `UK.ShA' on node `UK'
    -- 07/09/13 12:20:22.698 Registry: Server: loading `UK.ShB' on node `UK'
    -- 07/09/13 12:20:22.747 Registry: Server: loaded `UK.ShB' on node `UK'
    ...
    -- 07/09/13 12:21:22.935 Registry: Network: closing tcp connection
    local address = 172.16.0.1:4063
    remote address = 172.16.10.10:50025
    -- 07/09/13 12:21:22.946 Registry: Network: closing tcp connection
    local address = 172.16.0.1:51307
    remote address = 172.16.10.10:49350
  • Any hint on how to capture a thread dump from the node? AFAIK it's all C++ code, which means no symbol data unless I compile from sources.
  • benoit
    benoit Rennes, France
    Hi,

    Those traces are quite puzzling. Can you confirm that you didn't remove any traces between these two node traces?
    -- 07/09/13 12:00:04.318 ...: Replica: sending keep alive message to replica `Master'
    ...
    -- 07/09/13 12:20:18.441 ...: Replica: sending keep alive message to replica `Master'
    

    The fact that there's no warning between these two traces indicates that the first keep alive send succeeded. The network connection to the registry got closed by ACM (active connection management) around one minute after the first keep alive message was sent -- which is expected when the connection is inactive.

    So this would imply that the thread slept for 20 minutes instead of the default 15s before sending the next keep alive.

    Which Windows operating system version do you use? We have seen timing issues in the past with performance counters and virtual machines (Ice internally uses performance counters to provide a monotonic clock). However, I couldn't find any open VirtualBox bugs related to this. Could you explain how you increased the system timer resolution?

    To narrow down this problem and allow getting stack traces, I will send you a debug icegridnode executable by email. This executable will contain some additional tracing.

    Cheers,
    Benoit.
  • There are indeed no messages other than those listed. No warning or error about a failed keepalive.

    The guest OS is Win7 Pro 64-bit.

    The timer resolution is increased via the timeBeginPeriod API by multiple applications.

    This is the only VirtualBox VM we are running. We have several ESXi VMs that run IceGrid nodes without any issues. Switching this VM to ESXi is not an option. I don't want to jump to conclusions about VirtualBox though. There are many other differences that affect this machine that could cause the problem.

    IMO the use of QueryPerformanceCounter is somewhat risky due to bugs in both real hardware and virtual machines.

    I've installed the debug version you sent me. I've also enabled additional tracing on the node as follows:

    Ice.Trace.Retry=1
    Ice.Trace.Protocol=2
    Ice.Warn.Connections=1

    Besides tons of zero-length waits due to Windows under-sleeping by a few milliseconds each time, I am getting 1,200-second waits shortly before the node outage.

    -- 07/10/13 06:40:10.904 ...: Replica: waiting 1199 (s) (wakeTime = 608327, now = 607127)
    -- 07/10/13 06:40:14.591 ...: Replica: waiting 1199 (s) (wakeTime = 608331, now = 607131)
    -- 07/10/13 06:40:18.435 ...: Replica: waiting 1199 (s) (wakeTime = 608335, now = 607135)
    -- 07/10/13 06:40:19.248 ...: Replica: waiting 1199 (s) (wakeTime = 608335, now = 607136)

    The log messages don't say which registry is managed by the waiting thread, but I assume these four messages correspond to our one master registry and three slave registries. So it's now clear that the node decides on its own that it is time to sleep for 20 minutes. Looking at the code, I don't see any obvious way this could happen. Quoting from SessionManager.h for Ice 3.5:

    IceUtil::Time timeout = IceUtil::Time::seconds(10);
    ...
    IceUtil::Time now = IceUtil::Time::now(IceUtil::Time::Monotonic);
    IceUtil::Time wakeTime = now + timeout;

    Apparently you are working with a slightly different version of the code, since the logs show 15-second delays instead of 10-second ones, but I assume the differences aren't too big. It looks like the timeout variable gets modified. It is passed as a non-const reference to the createSession method and appears to be modified in NodeSessionKeepAliveThread::createSessionImpl:

    session = _node->registerWithRegistry(registry);
    int t = session->getTimeout();
    if(t > 0)
    {
        timeout = IceUtil::Time::seconds(t / 2);
    }

    I am not sure whether I am reading the code right. Nor am I sure what this code is supposed to do, or why it triggers on only one of our nodes. Please check it. I think we are very close to resolving this whole issue.
  • benoit
    benoit Rennes, France
    Hi,

    Glad to hear we're getting to the bottom of this issue :).

    Do you have a few more "waiting" traces from shortly before the traces showing the 1199s wait? The tracing clearly points to a problem with the monotonic time returned by IceUtil::Time::now(). I guess the next step is to write a small test case to demonstrate it and eventually identify the source of the bug. I've attached the source code of a small test which is very similar to the wait implemented in the IceGrid node. You should be able to compile it by executing "cl.exe test.cpp" (rename test.cpp.txt to test.cpp). You can then run test.exe and capture its output to see whether you also get 1199s waits (a sketch along the same lines is included at the end of this post).

    The Ice 3.5 source code you're looking at is the same as the code used to build the IceGrid node. The 10-second timeout is only used for session establishment; once the session is established, the timeout is set to IceGrid.Registry.NodeSessionTimeout / 2 (which defaults to 30s / 2 = 15s).

    To my knowledge, performance counters are the only high-resolution monotonic timers available on Windows. GetTickCount/GetTickCount64 resolution is not very good (10-16 ms according to the documentation) compared to performance counters. While there used to be some issues with performance counters on Windows XP or some old VMware versions (with some specific hardware), those issues appear to have been fixed long ago. As of today, we're not aware of any problems with the use of performance counters on Windows.

    If the problem is related to performance counters on VirtualBox, I guess you could re-compile the IceUtil library to not use them (this can be done by commenting out line 61 of cpp/src/IceUtil/Time.cpp); ftime will be used instead, but note that it's not a configuration we officially support.

    Cheers,
    Benoit.
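
    For reference, a rough sketch along the same lines as the attached test (not the exact file), using raw QueryPerformanceCounter rather than IceUtil:

    // test.cpp -- sample the performance-counter clock in a loop and report
    // any backward jumps, which is what would turn a 15 s wait into ~1199 s.
    #include <windows.h>
    #include <stdio.h>

    int main()
    {
        LARGE_INTEGER freq, prev, now;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&prev);
        for(;;)
        {
            Sleep(100);
            QueryPerformanceCounter(&now);
            if(now.QuadPart < prev.QuadPart)
            {
                double back = double(prev.QuadPart - now.QuadPart) / double(freq.QuadPart);
                printf("clock went back %.3f seconds\n", back);
            }
            prev = now;
        }
        return 0;
    }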
  • Ok, I jumped to conclusions too quickly. Here's a more complete log:

    -- 07/10/13 06:39:33.322 ...: Replica: waiting 15 (s) (wakeTime = 608305, now = 608290)
    -- 07/10/13 06:39:34.018 ...: Replica: waiting 0 (s) (wakeTime = 608290, now = 608290)
    -- 07/10/13 06:39:34.059 ...: Replica: waiting 15 (s) (wakeTime = 608305, now = 608290)
    -- 07/10/13 06:39:40.440 ...: Replica: waiting 15 (s) (wakeTime = 608312, now = 608297)
    -- 07/10/13 06:39:43.977 ...: Replica: waiting 15 (s) (wakeTime = 608316, now = 608301)
    -- 07/10/13 06:39:48.108 ...: Replica: waiting 15 (s) (wakeTime = 608320, now = 608305)
    -- 07/10/13 06:39:48.984 ...: Replica: waiting 15 (s) (wakeTime = 608320, now = 608305)
    -- 07/10/13 06:39:56.121 ...: Replica: waiting 0 (s) (wakeTime = 608312, now = 608312)
    -- 07/10/13 06:39:56.197 ...: Replica: waiting 15 (s) (wakeTime = 608327, now = 608312)
    -- 07/10/13 06:39:59.623 ...: Replica: waiting 0 (s) (wakeTime = 608316, now = 608316)
    -- 07/10/13 06:39:59.775 ...: Replica: waiting 15 (s) (wakeTime = 608331, now = 608316)
    -- 07/10/13 06:40:03.459 ...: Replica: waiting 0 (s) (wakeTime = 608320, now = 608320)
    -- 07/10/13 06:40:03.505 ...: Replica: waiting 15 (s) (wakeTime = 608335, now = 608320)
    -- 07/10/13 06:40:04.247 ...: Replica: waiting 0 (s) (wakeTime = 608320, now = 608320)
    -- 07/10/13 06:40:04.293 ...: Replica: waiting 15 (s) (wakeTime = 608335, now = 608320)
    -- 07/10/13 06:40:10.904 ...: Replica: waiting 1199 (s) (wakeTime = 608327, now = 607127)
    -- 07/10/13 06:40:14.591 ...: Replica: waiting 1199 (s) (wakeTime = 608331, now = 607131)
    -- 07/10/13 06:40:18.435 ...: Replica: waiting 1199 (s) (wakeTime = 608335, now = 607135)
    -- 07/10/13 06:40:19.248 ...: Replica: waiting 1199 (s) (wakeTime = 608335, now = 607136)
    -- 07/10/13 07:00:15.316 ...: Replica: waiting 15 (s) (wakeTime = 608347, now = 608332)
    -- 07/10/13 07:00:16.602 ...: Replica: waiting 15 (s) (wakeTime = 608348, now = 608333)
    -- 07/10/13 07:00:19.941 ...: Replica: waiting 15 (s) (wakeTime = 608351, now = 608336)
    -- 07/10/13 07:00:20.059 ...: Replica: waiting 15 (s) (wakeTime = 608352, now = 608337)
    -- 07/10/13 07:00:20.126 ...: Replica: waiting 15 (s) (wakeTime = 608352, now = 608337)
    -- 07/10/13 07:00:35.253 ...: Replica: waiting 0 (s) (wakeTime = 608351, now = 608351)
    -- 07/10/13 07:00:35.299 ...: Replica: waiting 15 (s) (wakeTime = 608367, now = 608352)
    -- 07/10/13 07:00:35.360 ...: Replica: waiting 0 (s) (wakeTime = 608352, now = 608352)
    -- 07/10/13 07:00:35.420 ...: Replica: waiting 0 (s) (wakeTime = 608352, now = 608352)
    -- 07/10/13 07:00:35.435 ...: Replica: waiting 15 (s) (wakeTime = 608367, now = 608352)
    -- 07/10/13 07:00:35.572 ...: Replica: waiting 15 (s) (wakeTime = 608367, now = 608352)
    -- 07/10/13 07:00:35.663 ...: Replica: waiting 0 (s) (wakeTime = 608352, now = 608352)

    Apparently the now value jumps back 20 minutes. I've checked, and it looks like VirtualBox uses the standard HPET frequency of 3.579 MHz, which means that the lower 32 bits of the counter overflow every 20 minutes (2^32 / 3,579,545 Hz ≈ 1,200 s, which matches the 1199 s waits above). I've found further evidence that VirtualBox doesn't always increment the upper 32 bits of the counter:

    The Awesome Factor: Monotonic timers (search for VirtualBox keyword)

    Case closed. I will submit a bug report to VirtualBox. The issue seems to be already known to the VirtualBox developers, but it's worth a try. We will be testing VMware Player, because it has some features that ESXi doesn't provide and it could serve as a replacement for VirtualBox. Running a modified Ice doesn't make sense, since it takes ages just to discover the root cause of the issue and even more time to maintain a custom build/configuration.

    IMO you should just use GetTickCount64, as it is reliable and accurate enough for Ice/IceGrid's needs. QPC just cuts too many corners: consider the effects of long hibernation, moving a hibernated disk to another machine, or migrating a VM to a different host. You could also keep QPC and at least implement the workaround from the above-mentioned blog post to cover the VirtualBox case.
  • The VirtualBox bug report is linked below. So far there has been no response from the VirtualBox developers.

    https://www.virtualbox.org/ticket/11951

    VMware Player is not an option for us. We are stuck with VirtualBox. Either VirtualBox fixes the problem, or ZeroC implements a workaround, or we buy hardware capable of replacing the fancy VirtualBox setup we are using.

    Anyway, thanks for helping me track down the issue.
  • benoit
    benoit Rennes, France
    Hi Robert,

    As mentioned in my previous post, you could disable the use of QPC by commenting out line 61 of cpp/src/IceUtil/Time.cpp and re-compiling the IceUtil library. It will rely on ftime() instead. Since ftime() is not monotonic, it might, however, cause similar issues if you move the time backward... Since we provide the source, you could also look into implementing a better workaround based on GetTickCount64; it shouldn't be very difficult.

    Cheers,
    Benoit.
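
    For reference, a GetTickCount64-based monotonic clock could look roughly like this (a sketch only, not official ZeroC code; it assumes IceUtil::Time's millisecond factory and a Windows Vista or later target):

    #include <windows.h>
    #include <IceUtil/Time.h>

    // GetTickCount64 has only ~10-16 ms resolution, but it never jumps
    // backward and does not wrap in practice, unlike the 32-bit counter
    // behind the VirtualBox HPET issue discussed above.
    IceUtil::Time monotonicNow()
    {
        return IceUtil::Time::milliSeconds(static_cast<IceUtil::Int64>(GetTickCount64()));
    }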