IceGrid on-demand takes a while to restart

I've got a client written in Java talking to a set of servers written in C++. I'm managing the whole thing through IceGrid, and all of the C++ servers are set to start on-demand.
The first time that the Java client accesses the servers, they start up as required and communication goes fine. Similarly, once everything is connected, things go great. However, if I stop one of the C++ servers, the next time that the Java client needs to communicate with that server, it waits a long time -- 5-10 seconds or so, as far as I can tell -- before the stopped server is restarted. Then, once the communication is established again, messages flow properly.
To me, this feels like the Java client has some sort of stale handle to the server, and the delay that I'm seeing is the time that it takes for that stale handle to time out and the system to create a new connection. Is this plausible? Could it be that the C++ servers aren't shutting down cleanly in some way, and that's what's triggering this?
Basically, what sort of logging should I enable to track in more detail what's going on here? It's not a show-stopper (as once the connection is made, everything is fine), but it is pretty annoying.
Thanks,
MEF
Okay, I did a bit more investigation. It looks like the server shuts down properly, but the Java client still tries to connect to the "old" port the next time it tries to use it. That takes 20 seconds or so to time out, and then it realises that it won't work and asks the IceGrid locator for a "new" port instead. Both computers involved are running Windows, if that matters, and the Java on the client side is version 1.5.0_08.
Here's an excerpt from the output of the server (Ice.Trace.Network=3). Note the 23-second gap between the server shutting down and restarting.
Here are the corresponding parts of the client log. I'm using a custom logger in the client, so the format is a bit different, but you can still see what's going on, I hope. I've added comments at relevant places. Note that the clocks on the two computers are about 6 seconds out of sync; hopefully that's not the issue.
The Ice runtime in the client caches the endpoints of the server so that it doesn't have to contact the locator again each time it needs to resolve the endpoints of an object adapter. That's why it tries to connect to the old address first; only if that connection attempt fails does the client ask the locator for the up-to-date endpoints of the object adapter.
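To make the sequence concrete, here is a small toy model of that behaviour in plain Java. It is only a sketch and does not use the Ice API: the class LocatorCacheSketch, its fields, and the endpoint strings are all made up for illustration, and a failed connection attempt is modelled as an instant set lookup rather than a real timeout. It also leaves out on-demand activation; the "restart" is done by hand.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative model only: none of these names belong to the Ice API.
class LocatorCacheSketch {
    // Stand-in for the IceGrid registry: adapter id -> current endpoint.
    final Map<String, String> locator = new HashMap<>();
    // Endpoints that currently have a server listening on them.
    final Set<String> listening = new HashSet<>();
    // Client-side cache of endpoints resolved earlier.
    final Map<String, String> cache = new HashMap<>();
    // Ordered record of the steps the client took.
    final List<String> log = new ArrayList<>();

    // (Re)start a server: the old endpoint stops listening, a new one is registered.
    void startServer(String adapterId, String endpoint) {
        String old = locator.put(adapterId, endpoint);
        if (old != null) {
            listening.remove(old);
        }
        listening.add(endpoint);
    }

    // Resolve an adapter as described above: cache first, locator on failure.
    String invoke(String adapterId) {
        String cached = cache.get(adapterId);
        if (cached != null) {
            log.add("connect " + cached + " (cached)");
            if (listening.contains(cached)) {
                return cached;
            }
            // In the real system this is where the connect timeout is spent.
            log.add("connect failed, dropping cache entry");
            cache.remove(adapterId);
        }
        String fresh = locator.get(adapterId);
        if (fresh == null || !listening.contains(fresh)) {
            throw new IllegalStateException("no reachable endpoint for " + adapterId);
        }
        log.add("locator lookup -> " + fresh);
        cache.put(adapterId, fresh);
        return fresh;
    }

    public static void main(String[] args) {
        LocatorCacheSketch s = new LocatorCacheSketch();
        s.startServer("ServerAdapter", "tcp -p 10001");
        s.invoke("ServerAdapter");                      // first call: locator lookup
        s.startServer("ServerAdapter", "tcp -p 10002"); // restart on a new port
        s.invoke("ServerAdapter");                      // stale cache entry, then locator
        s.log.forEach(System.out::println);
    }
}
```

The second invoke is the interesting one: the client goes to the cached port first, and only the failure of that attempt sends it back to the locator. In the toy model the failure is immediate; in the situation described in this thread, that step is where the delay shows up.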
You can disable this cache with the Ice.Default.LocatorCacheTimeout property. See the Ice manual for more information on this property.
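As a rough sketch, the client's configuration file could contain something like the fragment below. The value 0 is the setting that turns the locator cache off; a positive value would instead make cache entries older than that many seconds be ignored, and -1 means entries never expire. The trace line is an optional extra that is not from the reply above and is only an assumption about what might help while debugging, so check both against the manual for your Ice version.

```
# Client-side Ice configuration (sketch; verify against your Ice version)

# 0 = do not use the locator cache, so every resolution asks the locator
Ice.Default.LocatorCacheTimeout=0

# Optional: trace locator requests and cache removals while investigating
Ice.Trace.Locator=2
```

With the cache off, each resolution of an indirect proxy costs a round trip to the locator, so this trades a little latency for never trying a stale endpoint.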
I don't know why the connection establishment attempt takes so long to fail on your machine. Is it only taking time in this particular scenario? Or is it always taking that long to fail when the server isn't listening on a given port?
Cheers,
Benoit.
Setting Ice.Default.LocatorCacheTimeout does seem to have fixed my problem. This stale-handle issue seems only to happen between this client and these servers; I haven't yet had to set that property on any other communicator in the system. It could be something to do with Windows networking/firewalls, or Java, or the fact that the servers in question are also using an external event loop and maybe are shutting down strangely ...
Anyway, the problem is solved now, and unfortunately I don't have time to go into any more investigation as to the underlying cause.
Thanks for your help!
MEF
Just in case anyone else ever sees this.
MEF