Timeouts for unusual events

ZeroC, I've noticed that when the server has a catastrophic failure, such as a BSOD or a network failure, an Ice invocation can appear to "hang" for a long time, possibly indefinitely. One way I've re-created this type of scenario is by running the hello client and server example under a debugger, breaking execution of the hello server, and then issuing the oneway and twoway invocations from the client. If the server is paused in the debugger, the client hangs indefinitely when making the invocation. I would have liked the client to time out after 2000 ms, the default in the hello client, but I don't see this occurring. In our real application I actually create a thread that issues ice_pings to test the connection, but this test can "hang" too.
I'm wondering whether this is a valid issue and a valid test, and whether you know of a way to resolve it.
Regards --Roland
Comments
Are you using a proxy upon which the timeout has been set?
Regards, Matthew
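For reference, a minimal sketch of the per-proxy timeout Matthew is referring to: timeouts in Ice only take effect on invocations made through a proxy that carries the timeout, e.g. one created with ice_timeout(). The proxy string and helper function below are assumptions, not code from the demo.

#include <Ice/Ice.h>

// A sketch only: create a proxy bounded by a 2000 ms timeout and ping it.
void
pingWithTimeout(const Ice::CommunicatorPtr& communicator)
{
    Ice::ObjectPrx prx =
        communicator->stringToProxy("hello:tcp -p 10000")->ice_timeout(2000);
    prx->ice_ping(); // throws Ice::TimeoutException if the server does not respond in time
}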
Another test that I occasionally perform is pulling the network cable on one end. Recently I did this on both my client and server systems. When I pulled the cable on the client side, the connection closed in the expected 5 secs. However, when I pulled the cable on the server side, the connection took much longer to close, around 30 secs. So I've started trying to re-create this sort of problem with a much simpler program, such as the hello world demo, and using the debugger was my first attempt. I'm not sure at the moment whether my test is valid. One thing to note: in the helloc/hellos example I was running both on the same system.
Regards --Roland
I'll try to reproduce this. You could also add more tracing to see if it gives you any clues; try adding the following properties to your client configuration:
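The trace settings typically used for this kind of diagnosis are Ice.Trace.Network, Ice.Trace.Protocol, and Ice.Trace.Retry (an assumption about which settings were meant). A minimal sketch of enabling them programmatically in a C++ client, assuming the Ice::InitializationData API; the same values can go straight into the client's configuration file:

#include <Ice/Ice.h>

int
main(int argc, char* argv[])
{
    Ice::InitializationData initData;
    initData.properties = Ice::createProperties(argc, argv);
    initData.properties->setProperty("Ice.Trace.Network", "2");  // connection establishment and closure
    initData.properties->setProperty("Ice.Trace.Protocol", "1"); // protocol messages
    initData.properties->setProperty("Ice.Trace.Retry", "2");    // retry attempts by the Ice core

    Ice::CommunicatorPtr communicator = Ice::initialize(argc, argv, initData);
    // ... run the client as usual ...
    communicator->destroy();
    return 0;
}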
Benoit.
Thanks!
Benoit.
Thanks again --Roland
Here is what I learned. Ice will retry a proxy invocation or ice_ping depending on whether Ice.RetryIntervals=-1 or some other value. The default is 0, which means retry once immediately. This is good, but it doesn't appear to apply to SSL connections. In the SSL case, IceSSL.Client.Handshake.Retries applies, and its default is 10. So if the timeout is 5000, then by default an ice_ping on an SSL connection will take 50 secs (5 x 10) to return. What I've done is set Ice.RetryIntervals=-1 and IceSSL.Client.Handshake.Retries=2. This helps out a lot, but failures still take a long time to detect, since I've set timeouts to 5 secs. In my case the initial invocation can fail, which takes 5 secs, and then Ice internally appears to retry the connection twice, which accumulates to around 15-16 secs, compared with the 50 secs I saw with the default values. I set timeouts to 5 secs because we run our tests under extremely heavy loads, and even at 5 secs invocations sometimes time out, so I'm not sure reducing this is an option. I'm also very concerned about setting IceSSL.Client.Handshake.Retries=2, since the default is 10 and there was probably a good reason why that value was chosen.
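A sketch of the configuration described above, applied programmatically (the proxy string is an assumption, and the IceSSL plug-in and certificate properties are assumed to be configured elsewhere):

#include <Ice/Ice.h>

// A sketch only: the properties object could be the one from the earlier
// tracing example.
void
configureFastFailure(const Ice::PropertiesPtr& properties)
{
    properties->setProperty("Ice.RetryIntervals", "-1");             // no retries by the Ice core
    properties->setProperty("IceSSL.Client.Handshake.Retries", "2"); // fewer SSL handshake retries
}

// A 5-second per-proxy timeout; with the settings above a dead server is
// reported after roughly 15 secs (one 5-sec timeout plus two handshake
// retries) instead of the 50 secs seen with the defaults.
Ice::ObjectPrx
makeSslProxy(const Ice::CommunicatorPtr& communicator)
{
    return communicator->stringToProxy("hello:ssl -p 10001")->ice_timeout(5000);
}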
Depending on the mode of failure, whether the cable is pulled on the client or server side, and whether I'm running a debug or release build, things can take different amounts of time. For example, if I pull the cable on the client side, the ice_ping returns very quickly. However, if I pull the cable on the server side in release mode (non-debug), the return from the ice_ping can take 30-40 secs. In debug mode the ice_ping returns much quicker, around 10-15 secs.
I've also tried using ice_pings on oneway proxies, and sometimes this results in immediate detection of a network failure, but more often I see many successful returns from the oneway ice_ping, and then after around twenty invocations an exception is returned. For example, if I pull the cable on the server side, I see many successful ice_pings until an exception is reported. Pulling on the client side almost always results in an immediate response.
I would have liked to set IceSSL.Client.Handshake.Retries even lower, but if I set it to anything < 2 I don't get a connection at all, so that doesn't appear to be an option. It would have been nice if the handshake retries applied only to the initial connection to the server, so that I could get a more immediate response afterwards.
Other things about my environment: I've disabled active connection management, and I'm using bidirectional connections.
I'm wondering what insight you might have to share on this. I'm not even sure this is an Ice issue; I guess it could be in the Windows network stack or SSL too. In general I'm looking for a very quick and reliable way to detect a connection failure in catastrophic cases, such as a network failure or a blue screen of death. One of the other tests I'm doing is stopping the server in the debugger to simulate a hung application. Perhaps I should just create a separate non-SSL proxy to test with, but I really wanted to test the actual connections. I'm also considering using a raw socket for this, but I'm not sure it would show better results, and it has some major disadvantages. Even if a raw socket worked better, I would still have a potential issue with other Ice invocations taking a while to return.
One other strange thing I noticed in my program is that Ice doesn't appear to close bidirectional connections when the client proxy and object adapter are deactivated and the server-side proxy (callback) is still around. I have to close the connection explicitly in this case. So if I establish a bidirectional connection and then deactivate the client object adapter and delete the client proxy, the connection remains listed in TCPView; I need to invoke Ice::Connection::close() on the client too. Is this the expected behaviour? ACM is not enabled. I haven't created a program that demonstrates this outside of my application yet, but I'm pretty sure this is what I'm seeing.
Regards --Roland
A connection is only closed as a result of one of the following:
- Shutdown of the communicator with Communicator::destroy()
- Active Connection Management
- The server closing the connection
- An explicit call to Connection::close()
It doesn't matter whether there are still proxies that use the connection. Since proxies can be recreated at any time, the connection cannot be closed based on a proxy count. The other questions will take a bit longer to respond to...
Regards --Roland
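For reference, a minimal sketch of the explicit Connection::close() call discussed above. ice_connection() is the proxy accessor as it appears later in this thread; newer Ice releases call it ice_getConnection() and pass an enum instead of a bool.

#include <Ice/Ice.h>

// A sketch only: close the connection behind a proxy explicitly.
void
closeConnection(const Ice::ObjectPrx& prx, bool force)
{
    // ice_connection() returns the connection used by the proxy,
    // establishing it first if necessary (which may throw).
    // force == true aborts the connection instead of closing it gracefully.
    prx->ice_connection()->close(force);
}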
Regarding your SSL questions: Ice essentially treats OpenSSL as a black box, but we will investigate the retry and timeout issues you've mentioned. However, we're all a bit consumed with release testing at the moment, so your patience is appreciated.
Take care,
- Mark
Sorry for the late reply.
The retry logic in the Ice run time is handled in a transport-independent manner and applies equally well to TCP and SSL connections. The SSL plug-in has its own retry behavior, as you've seen, which is governed by the IceSSL.Client.Handshake.Retries property. If a timeout occurs while establishing an SSL connection, the SSL plug-in retries the specified number of times, and then the Ice core will retry if Ice.RetryIntervals allows it.
To be honest, I think the default value of 10 was an arbitrary choice, and I'm not convinced that the SSL plug-in even needs its own retry loop. We will investigate this further, and may change this behavior in the next major release.
This is a bug in the SSL plug-in. Let me know if you'd like a patch.
Also be aware that the SSL plug-in currently enforces a minimum connection timeout of five seconds. We may change this as well in the next release.
Take care,
- Mark
One other interesting thing I've uncovered is that if I pull the network cable on the server, then deactivating the object adapter and waiting for deactivation can take a very long time, well beyond the timeout it appears. If I break into the debugger after pulling the cable, I'll often see waitForDeactivate in the call stack. I guess this isn't entirely unexpected, as the Ice run time tries to clean up gracefully in this case, and I probably had in-flight data that was interrupted midstream when I pulled the cable.
To get around this issue, I deactivate the object adapter and then run waitForDeactivate in a separate thread, which performs either just waitForDeactivate or both deactivate and waitForDeactivate, as follows:
#include <Ice/Ice.h>
#include <IceUtil/Thread.h>

class AdapterDeactivationThread : public IceUtil::Thread
{
public:

    AdapterDeactivationThread(const Ice::ObjectAdapterPtr& adapter) :
        m_adapter(adapter)
    {
    }

    virtual void run()
    {
        // Deactivate the adapter and wait for all dispatched requests to
        // finish, without blocking the calling thread.
        m_adapter->deactivate();
        m_adapter->waitForDeactivate();
    }

private:

    const Ice::ObjectAdapterPtr m_adapter;
};
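A usage sketch for the class above (the adapter variable is assumed; whether to detach or keep the ThreadControl and join it later is an application choice):

// Start the deactivation in the background; 'adapter' is whatever object
// adapter the application created.
void
deactivateInBackground(const Ice::ObjectAdapterPtr& adapter)
{
    IceUtil::ThreadPtr deactivator = new AdapterDeactivationThread(adapter);
    IceUtil::ThreadControl control = deactivator->start();
    control.detach(); // or keep 'control' and join() it before the application exits
}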
The reason for this (probably craziness) is that I noticed that if only deactivate was invoked, not all local resources/sockets were cleaned up. The call to waitForDeactivate was necessary to get all the resources completely cleaned up. I had disabled ACM due to the issue in 2.1.0 with oneways, ACM, and closed connections. Perhaps enabling ACM would also clean this up, but that wasn't an option until 2.1.1.
This "AdapterDeactivationThread" approach seems to be working unless after pulling the cable I attempt to immediately exit the application. In this case the application will wait for all threads to stop, which can still take a while on wairForDeactivate. So the app may sit around for a while looking like it is hung trying to exit. This is a pretty rare occurence so I'm not too worried about it.
I guess there is one other issue worth mentioning: if a connection has a tragic failure and you attempt to use prx->ice_connection()->close(), the close will often take a little while, even if force was set to true.
The other issues I had were more related to my app. When a bad failure occurred, I failed to track or propagate it in many cases and would still attempt graceful closure. In this process I might do something silly (although it wasn't obvious to me when I originally wrote the code), like invoke a proxy; since the network had already failed, this invocation would wait for an Ice timeout. All these timeouts were cumulative and added up to quite a bit, especially with retries and SSL's default settings. I've since made some changes to propagate these failures more quickly.
I've also implemented the application-level heartbeat mechanism I mentioned earlier in this thread, and it is working out well, but I still wonder whether it was the best approach.
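A minimal sketch of one way such a heartbeat could look (the 5-second interval, the 5000 ms per-proxy timeout, and the onConnectionLost() hook are assumptions, not the poster's actual design):

#include <Ice/Ice.h>
#include <IceUtil/Thread.h>
#include <IceUtil/Time.h>

// A sketch only: ping a twoway proxy, bounded by a short per-proxy timeout,
// at a fixed interval and treat any local exception as a lost connection.
class HeartbeatThread : public IceUtil::Thread
{
public:

    HeartbeatThread(const Ice::ObjectPrx& prx) :
        m_proxy(prx->ice_timeout(5000)),
        m_stopped(false)
    {
    }

    virtual void run()
    {
        while(!m_stopped)
        {
            try
            {
                m_proxy->ice_ping(); // twoway ping forces a full round trip
            }
            catch(const Ice::LocalException&)
            {
                onConnectionLost();
                return;
            }
            IceUtil::ThreadControl::sleep(IceUtil::Time::seconds(5));
        }
    }

    void stop()
    {
        // Simplistic: a real implementation would use a monitor so the
        // thread can be woken immediately instead of polling a flag.
        m_stopped = true;
    }

private:

    void onConnectionLost()
    {
        // Application-specific: tear down the session, reconnect, notify the user, ...
    }

    const Ice::ObjectPrx m_proxy;
    volatile bool m_stopped;
};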
Regards --Roland