Help Determine the Cause of Spontaneous Disconnections between a ZeroC Client and Server

pheiffpheiff Member Stephen PheifferOrganization: NISTProject: NICE
edited August 2016 in Help Center

On a regular basis a client disconnects spontaneously from a server running ZeroC ICE. Our current program/setup is used in many locations and only has problems in one place. My hope is by viewing the logged exceptions/stacktraces emitted by ZeroC ICE, you might be able to give us some insights into the problem, which may well be specific to a certain computer or the local network.

I have provided many details below to frame the problem. Let me know if you need anything else. There is already so much information to send that I didn't want to include anything extra unless asked.

Thanks!

**Our Facility**
At our facility we have a number of identical servers, which run on RHEL7 and are written in Java. We also have a number of identical clients, in fixed locations, written in Java, which connect to the servers. Each client runs on either RHEL7 or Windows and connects only to a specific server.

Connections between client and server happen through ZeroC 3.6 using Glacier2. We also use IceStorm to allow servers to execute "callbacks" on the connected clients (this is initiated through IceBox). Our Clients and Servers use the 3.6 ACM system.

**Configuration**
Each client and server is configured identically.

_Here is the client code which configures ACM in the client:_
int ACMTimeout = sessionPrx.getACMTimeout();
router.ice_getConnection().setACM(new IntOptional(ACMTimeout), new Optional<>(ACMClose.CloseOff),
new Optional<>(ACMHeartbeat.HeartbeatAlways));

See attached server config files.

**Problem**
The problem only occurs for one specific client-server pair. The client spontaneously disconnects from the server (see log messages below).

Problem Server located at IPX and runs under RHEL7.
Problem Client located at IPY and runs under Windows.

**Client-side log messages**
The messages written client-side, to the log file configured by calling properties.setProperty("Ice.LogFile", ...):

**Server Logs**
The messages which the icebox and glacier processes wrote to stdout/stderr at the time of the problem.

Best Answer

  • pheiffpheiff Stephen PheifferOrganization: NISTProject: NICE
    Accepted Answer

    I'm on here for other issues, but I realize this is still open. This was solved by replacing the network adapter on the computer. It was likely a hardware problem.

Answers

  • benoitbenoit Rennes, FranceAdministrators, ZeroC Staff Benoit FoucherOrganization: ZeroC, Inc.Project: Ice ZeroC Staff

    Hi,

    Your client and server logs show that the ACM system from the Glacier2 router closed the connection to the client (the Ice::ConnectionTimeoutException is raised in such circumstances). This usually occurs because the router didn't receive any heartbeats from the client and therefore assumes that the connection between the two peers has been interrupted for some reason.

    It's hard to say why this disconnection occurred; it can be either a hardware or a software issue on one of the two systems or somewhere between them. Did you check the system logs to see if there was additional tracing at the time of the disconnection? Are the two machines on the same physical network?

    The timing from the server and client logs doesn't appear to match. It would be interesting to see the traces of both Glacier2 and the client when the problem occurs. If there isn't too much activity between the two peers, you could also enable protocol tracing on both sides to see the heartbeats (with Ice.Trace.Protocol=1). I would also enable Ice.Warn.Connections=1 for the Glacier2 router.

    Cheers,
    Benoit.
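
The settings named in the reply above can be supplied either as lines in an Ice configuration file or as `--Name=Value` command-line options. The following is a minimal sketch that renders them in both forms. The class `TraceSettings` and its methods are hypothetical helpers written for illustration and are not part of Ice; only the property names and values come from the thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not an Ice API): collects the diagnostic properties
// suggested in the thread and renders them as config lines or CLI options.
final class TraceSettings {
    // Glacier2 router: protocol tracing plus connection warnings.
    static Map<String, String> routerProperties() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("Ice.Trace.Protocol", "1");
        props.put("Ice.Warn.Connections", "1");
        return props;
    }

    // Client: protocol tracing only, so heartbeats show up in its log.
    static Map<String, String> clientProperties() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("Ice.Trace.Protocol", "1");
        return props;
    }

    // One "Name=Value" line per property, as used in an Ice config file.
    static List<String> toConfigLines(Map<String, String> props) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, String> e : props.entrySet()) {
            lines.add(e.getKey() + "=" + e.getValue());
        }
        return lines;
    }

    // The same properties as "--Name=Value" command-line options.
    static String[] toCommandLineArgs(Map<String, String> props) {
        List<String> args = new ArrayList<>();
        for (String line : toConfigLines(props)) {
            args.add("--" + line);
        }
        return args.toArray(new String[0]);
    }

    public static void main(String[] args) {
        toConfigLines(routerProperties()).forEach(System.out::println);
        System.out.println(String.join(" ", toCommandLineArgs(clientProperties())));
    }
}
```

Keeping the router and client sets separate makes it easy to turn the more verbose protocol tracing off on one side if the log volume becomes a problem.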

  • pheiffpheiff Member Stephen PheifferOrganization: NISTProject: NICE

    Thanks for looking into this. I sort of expected it had something to do with ACM, given the error messages, but hopefully we can figure out why.

    As far as timing goes, it seems like the timing matched. The root issue happens at 9:32. On the server there is an Ice.ConnectionTimeoutException and on the client there is an Ice.ConnectionLostException.

    As for tracing we actually did have that enabled in the past and it appears to have been turned off. I'll turn this back on.

    Other thoughts: I believe the two machines are on the same switch (they are physically located near each other). I can try to get lower-level logs; is there anything specific you would want to see?

  • benoitbenoit Rennes, FranceAdministrators, ZeroC Staff Benoit FoucherOrganization: ZeroC, Inc.Project: Ice ZeroC Staff

    Hi,

    It's a bit surprising that both the client and server show the exceptions at the same time. Is that really the case, or is there a small delay between the two traces showing the exceptions? For instance, is the Ice::ConnectionLostException showing up before or after the Ice::ConnectionTimeoutException?

    If the Ice::ConnectionLostException is showing up shortly after the Ice::ConnectionTimeoutException, this would indicate that the connection was probably not dead but that, for some reason, the heartbeats stopped being sent.

    It would be good to see the protocol tracing (with Ice.Trace.Protocol=1) to further investigate this.

    Cheers,
    Benoit.
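
The ordering check described in the reply above can be sketched as a small comparison of two log timestamps. This is only an illustration under stated assumptions: timestamps are in the `02/28/17 14:02:31.186` form seen in Glacier2 log output, the clocks on both machines are synchronized closely enough to compare, and "shortly after" is a caller-chosen threshold. The class `DisconnectOrdering` and its names are hypothetical, not part of Ice.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper (not an Ice API): compares when Glacier2 logged
// ConnectionTimeoutException with when the client logged ConnectionLostException.
final class DisconnectOrdering {
    enum Verdict { HEARTBEATS_LIKELY_STOPPED, INCONCLUSIVE }

    // Assumed timestamp layout, e.g. "02/28/17 14:02:31.186".
    private static final DateTimeFormatter LOG_TIME =
        DateTimeFormatter.ofPattern("MM/dd/yy HH:mm:ss.SSS");

    static Verdict classify(String serverTimeoutAt, String clientLostAt, Duration maxDelay) {
        LocalDateTime timeout = LocalDateTime.parse(serverTimeoutAt, LOG_TIME);
        LocalDateTime lost = LocalDateTime.parse(clientLostAt, LOG_TIME);
        Duration delay = Duration.between(timeout, lost);

        // Client noticed the loss at or shortly after the router timed out:
        // the link was probably still alive and the heartbeats had stopped.
        boolean shortlyAfter = !delay.isNegative() && delay.compareTo(maxDelay) <= 0;
        return shortlyAfter ? Verdict.HEARTBEATS_LIKELY_STOPPED : Verdict.INCONCLUSIVE;
    }

    public static void main(String[] args) {
        // The second timestamp is made up for illustration.
        System.out.println(classify("02/28/17 14:02:31.186", "02/28/17 14:02:31.400",
                                    Duration.ofSeconds(2)));
    }
}
```

Any other ordering is reported as inconclusive, because the thread only draws a conclusion for the "shortly after" case.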

  • pheiffpheiff Member Stephen PheifferOrganization: NISTProject: NICE

    Hi, just picking this thread back up. At this point we continue to get regular problems (about once every couple of days), but only on a single client/server pair (all other setups in our building work fine).

    We have tried logging a number of times, but there are so many messages it cripples the system. We have also tried setting the timeout period to 2 minutes on the server and then lying to the client and setting timeout to 20 seconds. Even sending the heartbeat 6 times faster than required is not enough to prevent the problem.

    What is the best route forward to debug this?

    Here is a typical message:
    glacier2router: warning: dispatch exception: ConnectionI.cpp:598: Ice::ConnectionTimeoutException: connection has timed out identity: pR_W[_yJ4Dj!;*9LSeqg/deviceMonitor facet: operation: changed remote host: 127.0.0.1 remote port: 33984
    -! 02/28/17 14:02:31.186 glacier2router: warning: dispatch exception: ConnectionI.cpp:598: Ice::ConnectionTimeoutException: connection has timed out identity: pR_W[_yJ4Dj!;*9LSeqg/dataMonitor facet: operation: emit remote host: 127.0.0.1 remote port: 33984
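
When many such warnings accumulate, it can help to pull the fields out of each one so that entries can be compared by identity, operation, or port. The following is a minimal sketch that assumes the flattened layout of the message quoted above (`identity:`, `facet:`, `operation:`, `remote host:`, `remote port:` in that order). The class `DispatchWarning` is a hypothetical helper, not part of Ice, and the pattern has only been checked against the quoted message.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not an Ice API): extracts the fields of a Glacier2
// "dispatch exception" warning laid out like the message quoted above.
final class DispatchWarning {
    final String exception;
    final String identity;
    final String facet;      // empty when the warning shows no facet
    final String operation;
    final String remoteHost;
    final int remotePort;

    private DispatchWarning(String exception, String identity, String facet,
                            String operation, String remoteHost, int remotePort) {
        this.exception = exception;
        this.identity = identity;
        this.facet = facet;
        this.operation = operation;
        this.remoteHost = remoteHost;
        this.remotePort = remotePort;
    }

    // DOTALL lets the same pattern match if the fields are on separate lines.
    private static final Pattern FIELDS = Pattern.compile(
        "(Ice::\\w+):.*?identity:\\s*(\\S+)\\s+facet:\\s*(\\S*?)\\s*operation:\\s*(\\S+)"
        + "\\s+remote host:\\s*(\\S+)\\s+remote port:\\s*(\\d+)",
        Pattern.DOTALL);

    // Returns the parsed fields, or empty if the text is not such a warning.
    static Optional<DispatchWarning> parse(String message) {
        Matcher m = FIELDS.matcher(message);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new DispatchWarning(
            m.group(1), m.group(2), m.group(3), m.group(4), m.group(5),
            Integer.parseInt(m.group(6))));
    }
}
```

Applied to the two entries quoted above, this yields the operations `changed` and `emit`, both with remote port 33984.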

  • benoitbenoit Rennes, FranceAdministrators, ZeroC Staff Benoit FoucherOrganization: ZeroC, Inc.Project: Ice ZeroC Staff

    Hi,

    If protocol tracing produces too much output, could you at least try setting Ice.Trace.Network=2 for the server, IceBox, the Glacier2 router and the client, and post the traces here?

    Did you upgrade to Ice 3.6.3 as well?

    Cheers,
    Benoit.

  • pheiffpheiff Member Stephen PheifferOrganization: NISTProject: NICE

    We are using version 3.6.3 currently. I turned on tracing level 2 as you suggested. I guess we should just wait until the problem happens again and then report what shows up in the log?

  • benoitbenoit Rennes, FranceAdministrators, ZeroC Staff Benoit FoucherOrganization: ZeroC, Inc.Project: Ice ZeroC Staff

    Right, this additional logging should hopefully help to better understand the cause of the disconnection between the client and Glacier2.

    Cheers,
    Benoit.

