IceGrid on-demand takes a while to restart

mefoster · October 2006

I've got a client written in Java talking to a set of servers written in C++. I'm managing the whole thing through IceGrid, and all of the C++ servers are set to start on-demand.

The first time that the Java client accesses the servers, they start up as required and communication goes fine. Similarly, once everything is connected, things go great. However, if I stop one of the C++ servers, the next time that the Java client needs to communicate with that server, it waits a long time -- 5-10 seconds or so, as far as I can tell -- before the stopped server is restarted. Then, once the communication is established again, messages flow properly.

To me, this feels like the Java server has some sort of stale handle to the server, and the delay that I'm seeing is the time that it takes for that stale handle to time out and the system to create a new connection. Is this plausible? Could it be that the C++ servers aren't shutting down cleanly in some way, and that's what's triggering this?

Basically, what sort of logging should I enable to track in more detail what's going on here. It's not a show-stopper (as once the connection is made, everything is fine), but it is pretty annoying.

Thanks,

MEF

mefoster · October 2006

More details: old port numbers hanging around

Okay, I did a bit more investigation. It looks like the server shuts down properly, but the Java client still tries to connect to the "old" port the next time it tries to use it. That takes 20 seconds or so to time out, and then it realises that it won't work and asks the IceGrid locator for a "new" port instead. Both computers involved are running Windows, if that matters, and the Java on the client side is version 1.5.0_08.

Here's an excerpt from the output of the server (Ice.Trace.Network=3). Note the 23-second gap between the server shutting down and restarting.

[ 10/23/06 15:02:20.000 Blinker: Network: stopping to accept tcp connections at 131.159.60.138:2368 ]
[ 10/23/06 15:02:20.000 Blinker: Network: stopping to accept tcp connections at 127.0.0.1:2369 ]
[ 10/23/06 15:02:20.000 Blinker: Network: shutting down tcp connection for writing
local address = 131.159.60.138:2373
remote address = 131.159.60.165:1330 ]
[ 10/23/06 15:02:20.000 Blinker: Network: closing tcp connection
local address = 131.159.60.138:2373
remote address = 131.159.60.165:1330 ]
[ 10/23/06 15:02:20.000 Blinker: Network: shutting down tcp connection for writing
local address = 131.159.60.138:2370
remote address = 131.159.60.165:12000 ]
[ 10/23/06 15:02:20.000 Blinker: Network: closing tcp connection
local address = 131.159.60.138:2370
remote address = 131.159.60.165:12000 ]
[ 10/23/06 15:02:20.000 Blinker: Network: shutting down tcp connection for writing
local address = 131.159.60.138:2368
remote address = 131.159.60.165:3164 ]
[ 10/23/06 15:02:20.000 Blinker: Network: closing tcp connection
local address = 131.159.60.138:2368
remote address = 131.159.60.165:3164 ]
[ 10/23/06 15:02:20.062 c:/Ice-3.1.1/bin/icegridnode.exe: Activator: detected termination of server `Blinker' ]
[ 10/23/06 15:02:43.562 c:/Ice-3.1.1/bin/icegridnode.exe: Activator: activating server `Blinker' ]
[ 10/23/06 15:02:43.937 Blinker: Network: attempting to bind to tcp socket 131.159.60.138:0 ]
[ 10/23/06 15:02:43.937 Blinker: Network: accepting tcp connections at 131.159.60.138:2385 ]
[ 10/23/06 15:02:43.937 Blinker: Network: attempting to bind to tcp socket 127.0.0.1:0 ]
[ 10/23/06 15:02:43.937 Blinker: Network: accepting tcp connections at 127.0.0.1:2386 ]
# ... and normal start-up from this point

Here's the corresponding parts of the client log. I'm using a custom logger in the client, so the format is a bit different, but you can still see what's going on, I hope. I've added comments at relevant places. Note that the clocks on the two computers are about 6 seconds out synch; hopefully that's not the issue.

<record>
<date>2006-10-23T15:02:26</date>
<millis>1161608546859</millis>
<message>Protocol: received close connection
message type = 4 (close connection)
compression status = 1 (not compressed; compress response, if any)
message size = 14</message>
</record>
<record>
<date>2006-10-23T15:02:26</date>
<millis>1161608546875</millis>
<message>Network: shutting down tcp connection for writing
local address = 131.159.60.165:3164
remote address = 131.159.60.138:2368</message>
</record>
<record>
<date>2006-10-23T15:02:26</date>
<millis>1161608546875</millis>
<message>Network: closing tcp connection
local address = 131.159.60.165:3164
remote address = 131.159.60.138:2368</message>
</record>


<record>
<date>2006-10-23T15:02:29</date>
<millis>1161608549390</millis>
<message>Network: trying to establish tcp connection to 131.159.60.138:2368</message>
</record>

<record>
<date>2006-10-23T15:02:50</date>
<millis>1161608570421</millis>
<message>Protocol: sending request
message type = 0 (request)
compression status = 0 (not compressed; do not compress response, if any)
message size = 71
request id = 9
identity = JastIceGrid/Locator
facet =
operation = findObjectById
mode = 1 (nonmutating)
context = </message>
</record>
<record>
<date>2006-10-23T15:02:50</date>
<millis>1161608570437</millis>
<message>Network: sent 71 of 71 bytes via tcp
local address = 131.159.60.165:3145
remote address = 131.159.60.165:12000</message>
</record>
<record>
<date>2006-10-23T15:02:50</date>
<millis>1161608570828</millis>
<message>Network: received 14 of 14 bytes via tcp
local address = 131.159.60.165:3145
remote address = 131.159.60.165:12000</message>
</record>
<record>
<date>2006-10-23T15:02:50</date>
<millis>1161608570828</millis>
<message>Network: received 56 of 56 bytes via tcp
local address = 131.159.60.165:3145
remote address = 131.159.60.165:12000</message>
</record>
<record>
<date>2006-10-23T15:02:50</date>
<millis>1161608570828</millis>
<message>Protocol: received reply
message type = 2 (reply)
compression status = 0 (not compressed; do not compress response, if any)
message size = 70
request id = 9
reply status = 0 (ok)</message>
</record>

<record>
<date>2006-10-23T15:02:50</date>
<millis>1161608570875</millis>
<message>Network: trying to establish tcp connection to 131.159.60.138:2385</message>
</record>
<record>
<date>2006-10-23T15:02:50</date>
<millis>1161608570984</millis>
<message>Network: tcp connection established
local address = 131.159.60.165:3170
remote address = 131.159.60.138:2385</message>
</record>
<record>
<date>2006-10-23T15:02:51</date>
<millis>1161608571000</millis>
<message>Network: received 14 of 14 bytes via tcp
local address = 131.159.60.165:3170
remote address = 131.159.60.138:2385</message>
</record>
<record>
<date>2006-10-23T15:02:51</date>
<millis>1161608571000</millis>
<message>Protocol: received validate connection
message type = 3 (validate connection)
compression status = 0 (not compressed; do not compress response, if any)
message size = 14</message>
</record>

<record>
<date>2006-10-23T15:02:51</date>
<millis>1161608571000</millis>
<message>Protocol: sending request
message type = 0 (request)
compression status = 0 (not compressed; do not compress response, if any)
message size = 46
request id = 0 (oneway)
identity = Blinker
facet =
operation = setBlink
mode = 0 (normal)
context = </message>
</record>
<record>
<date>2006-10-23T15:02:51</date>
<millis>1161608571000</millis>
<message>Network: sent 46 of 46 bytes via tcp
local address = 131.159.60.165:3170
remote address = 131.159.60.138:2385</message>
</record>

benoit · October 2006

Hi,

The Ice runtime in the client caches the endpoints of the server to avoid re-contacting the locator when it needs to resolve the endpoints of an object adapter. That's why it tries to connect to the old address first and then if the connection attempt fails, the client asks the locator the up-to-date endpoints of the object adapter.

You can disable this cache with the Ice.Default.LocatorCacheTimeout property, see the Ice manual for more information on this property.

I don't know why the connection establishment attempt takes so long to fail on your machine. Is it only taking time in this particular scenario? Or is it always taking that long to fail when the server isn't listening on a given port?

Cheers,
Benoit.

mefoster · October 2006

Thanks!

Setting Ice.Default.LocatorCacheTimeout does seem to have fixed my problem. This stale-handle issue seems only to happen between this client and these servers; I haven't yet had to set that property on any other communicator in the system. It could be something to do with Windows networking/firewalls, or Java, or the fact that the servers in question are also using an external event loop and maybe are shutting down strangely ...

Anyway, the problem is solved now, and unfortunately I don't have time to go into any more investigation as to the underlying cause.

Thanks for your help!

MEF

mefoster · August 2007

Returning to this issue nearly a year later: I finally figured out what was going on! It was the Windows firewall ... when I gave my servers on the Windows machine specific endpoints and opened those ports to the local network through the firewall, everything works just as it should.

Just in case anyone else ever sees this.

MEF

Archived

IceGrid on-demand takes a while to restart

Comments

Categories