Archived

This forum has been archived. Please start a new discussion on GitHub.

IceGrid on-demand takes a while to restart

I've got a client written in Java talking to a set of servers written in C++. I'm managing the whole thing through IceGrid, and all of the C++ servers are set to start on-demand.

The first time that the Java client accesses the servers, they start up as required and communication goes fine. Similarly, once everything is connected, things go great. However, if I stop one of the C++ servers, the next time that the Java client needs to communicate with that server, it waits a long time -- 5-10 seconds or so, as far as I can tell -- before the stopped server is restarted. Then, once the communication is established again, messages flow properly.

To me, this feels like the Java server has some sort of stale handle to the server, and the delay that I'm seeing is the time that it takes for that stale handle to time out and the system to create a new connection. Is this plausible? Could it be that the C++ servers aren't shutting down cleanly in some way, and that's what's triggering this?

Basically, what sort of logging should I enable to track in more detail what's going on here. It's not a show-stopper (as once the connection is made, everything is fine), but it is pretty annoying.

Thanks,

MEF

Comments

  • More details: old port numbers hanging around

    Okay, I did a bit more investigation. It looks like the server shuts down properly, but the Java client still tries to connect to the "old" port the next time it tries to use it. That takes 20 seconds or so to time out, and then it realises that it won't work and asks the IceGrid locator for a "new" port instead. Both computers involved are running Windows, if that matters, and the Java on the client side is version 1.5.0_08.

    Here's an excerpt from the output of the server (Ice.Trace.Network=3). Note the 23-second gap between the server shutting down and restarting.
    [ 10/23/06 15:02:20.000 Blinker: Network: stopping to accept tcp connections at 131.159.60.138:2368 ]
    [ 10/23/06 15:02:20.000 Blinker: Network: stopping to accept tcp connections at 127.0.0.1:2369 ]
    [ 10/23/06 15:02:20.000 Blinker: Network: shutting down tcp connection for writing
    local address = 131.159.60.138:2373
    remote address = 131.159.60.165:1330 ]
    [ 10/23/06 15:02:20.000 Blinker: Network: closing tcp connection
    local address = 131.159.60.138:2373
    remote address = 131.159.60.165:1330 ]
    [ 10/23/06 15:02:20.000 Blinker: Network: shutting down tcp connection for writing
    local address = 131.159.60.138:2370
    remote address = 131.159.60.165:12000 ]
    [ 10/23/06 15:02:20.000 Blinker: Network: closing tcp connection
    local address = 131.159.60.138:2370
    remote address = 131.159.60.165:12000 ]
    [ 10/23/06 15:02:20.000 Blinker: Network: shutting down tcp connection for writing
    local address = 131.159.60.138:2368
    remote address = 131.159.60.165:3164 ]
    [ 10/23/06 15:02:20.000 Blinker: Network: closing tcp connection
    local address = 131.159.60.138:2368
    remote address = 131.159.60.165:3164 ]
    [ 10/23/06 15:02:20.062 c:/Ice-3.1.1/bin/icegridnode.exe: Activator: detected termination of server `Blinker' ]
    [ 10/23/06 15:02:43.562 c:/Ice-3.1.1/bin/icegridnode.exe: Activator: activating server `Blinker' ]
    [ 10/23/06 15:02:43.937 Blinker: Network: attempting to bind to tcp socket 131.159.60.138:0 ]
    [ 10/23/06 15:02:43.937 Blinker: Network: accepting tcp connections at 131.159.60.138:2385 ]
    [ 10/23/06 15:02:43.937 Blinker: Network: attempting to bind to tcp socket 127.0.0.1:0 ]
    [ 10/23/06 15:02:43.937 Blinker: Network: accepting tcp connections at 127.0.0.1:2386 ]
    # ... and normal start-up from this point

    Here's the corresponding parts of the client log. I'm using a custom logger in the client, so the format is a bit different, but you can still see what's going on, I hope. I've added comments at relevant places. Note that the clocks on the two computers are about 6 seconds out synch; hopefully that's not the issue.
    <!-- Old server shutting down -->
    <record>
    <date>2006-10-23T15:02:26</date>
    <millis>1161608546859</millis>
    <message>Protocol: received close connection
    message type = 4 (close connection)
    compression status = 1 (not compressed; compress response, if any)
    message size = 14</message>
    </record>
    <record>
    <date>2006-10-23T15:02:26</date>
    <millis>1161608546875</millis>
    <message>Network: shutting down tcp connection for writing
    local address = 131.159.60.165:3164
    remote address = 131.159.60.138:2368</message>
    </record>
    <record>
    <date>2006-10-23T15:02:26</date>
    <millis>1161608546875</millis>
    <message>Network: closing tcp connection
    local address = 131.159.60.165:3164
    remote address = 131.159.60.138:2368</message>
    </record>
    <!-- ... various other things -->
    <!-- Try to connect to old address! -->
    <record>
    <date>2006-10-23T15:02:29</date>
    <millis>1161608549390</millis>
    <message>Network: trying to establish tcp connection to 131.159.60.138:2368</message>
    </record>
    <!-- Timed out (after 21 seconds!); ask the IceGrid locator to find a new endpoint -->
    <record>
    <date>2006-10-23T15:02:50</date>
    <millis>1161608570421</millis>
    <message>Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 71
    request id = 9
    identity = JastIceGrid/Locator
    facet =
    operation = findObjectById
    mode = 1 (nonmutating)
    context = </message>
    </record>
    <record>
    <date>2006-10-23T15:02:50</date>
    <millis>1161608570437</millis>
    <message>Network: sent 71 of 71 bytes via tcp
    local address = 131.159.60.165:3145
    remote address = 131.159.60.165:12000</message>
    </record>
    <record>
    <date>2006-10-23T15:02:50</date>
    <millis>1161608570828</millis>
    <message>Network: received 14 of 14 bytes via tcp
    local address = 131.159.60.165:3145
    remote address = 131.159.60.165:12000</message>
    </record>
    <record>
    <date>2006-10-23T15:02:50</date>
    <millis>1161608570828</millis>
    <message>Network: received 56 of 56 bytes via tcp
    local address = 131.159.60.165:3145
    remote address = 131.159.60.165:12000</message>
    </record>
    <record>
    <date>2006-10-23T15:02:50</date>
    <millis>1161608570828</millis>
    <message>Protocol: received reply
    message type = 2 (reply)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 70
    request id = 9
    reply status = 0 (ok)</message>
    </record>
    <!-- Connect to new address -->
    <record>
    <date>2006-10-23T15:02:50</date>
    <millis>1161608570875</millis>
    <message>Network: trying to establish tcp connection to 131.159.60.138:2385</message>
    </record>
    <record>
    <date>2006-10-23T15:02:50</date>
    <millis>1161608570984</millis>
    <message>Network: tcp connection established
    local address = 131.159.60.165:3170
    remote address = 131.159.60.138:2385</message>
    </record>
    <record>
    <date>2006-10-23T15:02:51</date>
    <millis>1161608571000</millis>
    <message>Network: received 14 of 14 bytes via tcp
    local address = 131.159.60.165:3170
    remote address = 131.159.60.138:2385</message>
    </record>
    <record>
    <date>2006-10-23T15:02:51</date>
    <millis>1161608571000</millis>
    <message>Protocol: received validate connection
    message type = 3 (validate connection)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 14</message>
    </record>
    <!-- Successful communication! -->
    <record>
    <date>2006-10-23T15:02:51</date>
    <millis>1161608571000</millis>
    <message>Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 46
    request id = 0 (oneway)
    identity = Blinker
    facet =
    operation = setBlink
    mode = 0 (normal)
    context = </message>
    </record>
    <record>
    <date>2006-10-23T15:02:51</date>
    <millis>1161608571000</millis>
    <message>Network: sent 46 of 46 bytes via tcp
    local address = 131.159.60.165:3170
    remote address = 131.159.60.138:2385</message>
    </record>
  • benoit
    benoit Rennes, France
    Hi,

    The Ice runtime in the client caches the endpoints of the server to avoid re-contacting the locator when it needs to resolve the endpoints of an object adapter. That's why it tries to connect to the old address first and then if the connection attempt fails, the client asks the locator the up-to-date endpoints of the object adapter.

    You can disable this cache with the Ice.Default.LocatorCacheTimeout property, see the Ice manual for more information on this property.

    I don't know why the connection establishment attempt takes so long to fail on your machine. Is it only taking time in this particular scenario? Or is it always taking that long to fail when the server isn't listening on a given port?

    Cheers,
    Benoit.
  • Thanks!

    Setting Ice.Default.LocatorCacheTimeout does seem to have fixed my problem. This stale-handle issue seems only to happen between this client and these servers; I haven't yet had to set that property on any other communicator in the system. It could be something to do with Windows networking/firewalls, or Java, or the fact that the servers in question are also using an external event loop and maybe are shutting down strangely ...

    Anyway, the problem is solved now, and unfortunately I don't have time to go into any more investigation as to the underlying cause.

    Thanks for your help!

    MEF
  • Returning to this issue nearly a year later: I finally figured out what was going on! It was the Windows firewall ... when I gave my servers on the Windows machine specific endpoints and opened those ports to the local network through the firewall, everything works just as it should.

    Just in case anyone else ever sees this. :)

    MEF