Timeouts for unusual events

ZeroC, I've noticed that when the server has a catastrophic failure, such as a BSOD or network failure, an Ice invocation can appear to "hang" for a long time, possibly indefinitely. One way I've re-created this type of scenario is by running the hello client and server demo under a debugger, breaking execution of the hello server, and then issuing the oneway and twoway invocations in the client. If the server is stopped at a breakpoint, the client hangs indefinitely when making the invocation. I would have liked the client to time out after 2000ms, the default in helloc, but I don't see this occurring. In our real application I actually create a thread that issues ice_pings to test the connection, but this test can "hang" too.

I'm wondering if this is a valid issue and a valid test, and if you know of a way to resolve it.

Regards --Roland

Comments

  • matthew
    matthew NL, Canada
    rhochmuth wrote:
    [...]

    Are you using a proxy upon which the timeout has been set?

    Regards, Matthew
  • Hi Matthew, Yes, I am using a proxy with timeouts set. I usually set timeouts to 5 secs by default. If I shut down the server system via the Windows task manager, all is well, and the connection times out after 5 secs. So in a graceful shutdown all seems to be well. However, if the server blue-screens, things usually take much longer, sometimes a minute or two.

    Another test that I occasionally perform is pulling the network cable on one end. Recently I did this on both my client and server systems. When I pulled the cable on the client side, the connection closed in the expected 5 secs. However, when pulling the cable on the server side, the connection took much longer, around 30 secs. So I've started trying to re-create this sort of problem with a much simpler program, such as the hello world demo, and using the debugger was my first attempt. I'm not sure at the moment whether my test is valid. One thing to note: in the helloc/hellos example I was running both on the same system.

    Regards --Roland
  • benoit
    benoit Rennes, France
    Hi Roland,

    I'll try to reproduce this. You could also add more tracing to see if it gives you some clues; try adding the following properties to your client configuration:
    Ice.Trace.Retry=2
    Ice.Trace.Network=2
    

    Benoit.
  • benoit
    benoit Rennes, France
    I'm afraid I can't reproduce what you're observing. It's actually quite possible that it takes some time for Ice to retry and eventually give up on the connection establishment (if, for example, your proxy contains several endpoints for the same host). The best way to figure out what Ice is doing under the hood is to enable tracing. Could you post the client output with tracing enabled when such a hang occurs?

    Thanks!

    Benoit.
  • Thanks Benoit, I'll work on this over the next few days, and then post something if/when I'm more certain of what I'm doing and have taken this a bit further.

    Thanks again --Roland
  • Long post. I've started to look into this further. I've seen some odd behaviours and learned a few things in the process, so I thought I would try to give a brain dump.

    Here is what I learned. Ice will retry a proxy invocation or ice_ping depending on whether Ice.RetryIntervals=-1 or some other value. The default is 0, which means retry once immediately. This is good, but it doesn't appear to apply to SSL connections. In the SSL case IceSSL.Client.Handshake.Retries applies, and the default is 10. So if the timeout is 5000 ms, then by default an ice_ping on an SSL connection will take 50 secs (5*10) to return. What I've done is set Ice.RetryIntervals=-1 and IceSSL.Client.Handshake.Retries=2. This helps a lot, but failures still take a long time to detect, since I've set timeouts to 5 secs. In my case the initial invocation can fail, which takes 5 secs, and then Ice internally appears to retry the connection twice, which accumulates to around 15-16 secs, versus the 50 secs I saw with the default values. I set timeouts to 5 secs because we run our tests under extremely heavy loads, and even at 5 secs invocations sometimes time out, so I'm not sure reducing this is an option. I'm also very concerned about setting IceSSL.Client.Handshake.Retries=2 since the default is 10 and there was probably a good reason why this value was chosen.
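
    For reference, the client settings described above boil down to something like this (a sketch; the 5-sec timeouts themselves are set on my proxies rather than through a property):
    Ice.RetryIntervals=-1
    IceSSL.Client.Handshake.Retries=2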

    Depending on the mode of failure (pulling the network cable on the client or on the server) and whether I'm running a debug or release build, things take different amounts of time. For example, if I pull the cable on the client side, the ice_ping returns very quickly. However, if I pull the cable on the server side in a release (non-debug) build, the return from the ice_ping can take 30-40 secs; in a debug build the ice_ping returns much more quickly, around 10-15 secs.

    I've also tried using ice_pings on oneway proxies, and sometimes this results in immediate detection of a network failure, but more often I see many successful returns from the oneway ice_ping, and then after around twenty invocations an exception is returned. For example, if I pull the cable on the server, I see many successful ice_pings until an exception is reported. Pulling the cable on the client side almost always results in an immediate response.
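
    A minimal sketch of the kind of oneway ping I'm issuing (the proxy names are placeholders; I assume the many "successful" returns happen because a oneway completes as soon as the request is written to the local TCP buffer):
    Ice::ObjectPrx oneway = twoway->ice_oneway(); // oneway version of the same proxy
    try
    {
        oneway->ice_ping(); // queued on the connection, no reply is awaited
    }
    catch(const Ice::LocalException& ex)
    {
        std::cerr << "connection failure detected: " << ex << std::endl;
    }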

    I would have liked to set IceSSL.Client.Handshake.Retries=-1, but if I set this value to < 2, then I don't get any connection, so this doesn't appear to be an option. It would have been nice if the client handshake retry applied only to the initial connection to the server, so I could get a more immediate response.

    A couple of other things about my environment: I've disabled active connection management, and I'm using bi-dir connections.

    Wondering what insight you might have to share on this. I'm not even sure that this is an Ice issue; I guess it could be the Windows network stack or SSL too. In general I'm looking for a very quick and reliable way of detecting a connection failure in catastrophic cases, such as a network failure or blue screen of death. One of the other tests I'm doing is stopping the server in the debugger to simulate a hung application. Perhaps I should just create a separate non-SSL proxy to test with, but I really wanted to test the actual connections. I'm also considering using a raw socket in this case, but I'm not sure it would show better results, and it has some major disadvantages. Even if a raw socket worked better, I would still have a potential issue with other Ice invocations taking a while to return.

    One other strange thing that I noticed in my program is that Ice doesn't appear to close down bi-dir connections if the client proxy is deleted and the object adapter is deactivated while the server proxy (callback) is still around. I have to do an explicit close of the connection in this case. So if I establish a bi-dir connection and then deactivate the client object adapter and delete the client proxy, the connection remains listed in tcpview; I need to invoke Ice::Connection::close() on the client too. Is this the expected behaviour? ACM is not enabled. I haven't created a program to demonstrate this outside of my application yet, but I'm pretty sure this is what I'm seeing.

    Regards --Roland
  • marc
    marc Florida
    rhochmuth wrote:
    [...]
    One other strange thing that I noticed in my program is that Ice doesn't appear to close down bi-dir connections if the client proxy is deleted and the object adapter is deactivated while the server proxy (callback) is still around. I have to do an explicit close of the connection in this case. So if I establish a bi-dir connection and then deactivate the client object adapter and delete the client proxy, the connection remains listed in tcpview; I need to invoke Ice::Connection::close() on the client too. Is this the expected behaviour? ACM is not enabled. I haven't created a program to demonstrate this outside of my application yet, but I'm pretty sure this is what I'm seeing.
    Yes, this is expected behavior. The connection is an "outgoing" connection, i.e. a connection from a client to a server. Such connections are only closed by:
    • Shutdown of the communicator with Communicator::destroy()
    • Active Connection Management
    • The server closing the connection
    • An explicit call to Connection::close()
    It doesn't matter whether there are still proxies that use the connection. Since proxies can be recreated at any time, the connection cannot be closed based on a proxy count.
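
    For illustration, explicitly closing such an outgoing connection looks roughly like this (a sketch against the Ice 2.x C++ API; proxy and adapter stand in for your client-side callback proxy and object adapter):
    // Close the connection the proxy is using, then shut down the adapter.
    // close(false) closes gracefully; close(true) forces the close.
    Ice::ConnectionPtr con = proxy->ice_connection();
    con->close(false);
    adapter->deactivate();
    adapter->waitForDeactivate();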

    The other questions will take a bit longer to respond to...
  • Hi Marc, Thanks for the response regarding connections. I've been thinking about my use of ice_ping for determining whether a connection is alive or not. One idea I had for dealing with the catastrophic failure modes, if it really does come down to network issues, is to create my own heartbeat/sync mechanism and put it under my application's control. It probably wouldn't be very difficult to create a method, called sync(), in the server and likewise in the client, that would be invoked at regular intervals, e.g. every second. In the invocation the current time would be recorded.

    Instead of the thread I currently have running that invokes ice_ping at regular intervals, it could just check the delta between the last invocation of sync() and the current time. In other words, there would be one thread running on the client side invoking sync() and another thread on the server side checking the delta at regular intervals. If delta > threshold, then the connection could be considered dead and proper clean-up could start. I'm still a little concerned about clean-up being able to make progress if there is an outstanding invocation that is basically waiting for an Ice invocation to return elsewhere in my system. I'll probably end up giving this a try tomorrow, but I thought I would throw this out as another alternative.
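
    Roughly what I have in mind (SyncPrx and sync() come from a hypothetical Slice interface I'd add; the one-second interval and five-second threshold are just examples):
    // Client side: invoke sync() on the server once a second.
    class SyncThread : public IceUtil::Thread
    {
    public:
        SyncThread(const SyncPrx& sync) : m_sync(sync)
        {
        }

        virtual void run()
        {
            for(;;) // a real version also needs a stop flag checked here
            {
                m_sync->sync(); // the server records the current time in this call
                IceUtil::ThreadControl::sleep(IceUtil::Time::seconds(1));
            }
        }

    private:
        const SyncPrx m_sync;
    };

    On the server side, a watchdog thread would periodically compare the last recorded time (lastSync, set by each sync() dispatch) against a threshold:
    if(IceUtil::Time::now() - lastSync > IceUtil::Time::seconds(5))
    {
        // no sync() within the threshold: consider the connection dead
        // and start the clean-up.
    }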

    Regards --Roland
  • mes
    mes California
    Hi Roland,

    Regarding your SSL questions: Ice essentially treats OpenSSL as a black box, but we will investigate the retry and timeout issues you've mentioned. However, we're all a bit consumed with release testing at the moment, so your patience is appreciated.

    Take care,
    - Mark
  • mes
    mes California
    Hi Roland,

    Sorry for the late reply.
    Ice will retry a proxy invocation or ice_ping depending on whether Ice.RetryIntervals=-1 or some other value. The default is 0, which means retry once immediately. This is good, but it doesn't appear to apply to SSL connections.
    The retry logic in the Ice run time is handled in a transport-independent manner and applies equally well to TCP and SSL connections. The SSL plug-in has its own retry behavior, as you've seen, which is governed by the IceSSL.Client.Handshake.Retries property. If a timeout occurs while establishing an SSL connection, the SSL plug-in retries the specified number of times, and then the Ice core will retry if Ice.RetryIntervals allows it.
    I'm also very concerned about setting IceSSL.Client.Handshake.Retries=2 since the default is 10 and there was probably a good reason why this value was chosen.
    To be honest, I think the default value of 10 was an arbitrary choice, and I'm not convinced that the SSL plug-in even needs its own retry loop. We will investigate this further, and may change this behavior in the next major release.
    I would have liked to set IceSSL.Client.Handshake.Retries=-1, but if I set this value to < 2, then I don't get any connection, so this doesn't appear to be an option.
    This is a bug in the SSL plug-in. Let me know if you'd like a patch.

    Also be aware that the SSL plug-in currently enforces a minimum connection timeout of five seconds. We may change this as well in the next release.

    Take care,
    - Mark
  • Hi Mark, Thanks again for the response. This possibly explains some of the issues I'm seeing. I'm not sure I'll need a patch. Right now I'm setting IceSSL.Client.Handshake.Retries=2; anything < 2 and I don't get connected at all. If there aren't any issues with using this setting, then I probably don't need a patch. I've also set Ice.RetryIntervals=-1. My timeouts are all set at 5 secs or greater.

    One of the other interesting things I've uncovered is that if I pull the network cable on the server, then when I deactivate the object adapter and wait for deactivation, this can take a very long time, well beyond the timeout it appears. If I break into the debugger after pulling the cable, I'll often see waitForDeactivate in the call stack. I guess this isn't entirely unexpected, as the Ice runtime tries to clean up gracefully in this case, and I probably had in-flight data that was interrupted midstream when I pulled the cable.

    To get around this issue, what I do is deactivate the object adapter and then call waitForDeactivate in a separate thread, which either does just waitForDeactivate, or both deactivate and waitForDeactivate, as follows:

    // Performs deactivate() and waitForDeactivate() off the calling thread,
    // so the caller isn't blocked while in-flight dispatches drain or time out.
    class AdapterDeactivationThread : public IceUtil::Thread
    {
    public:
        AdapterDeactivationThread(const Ice::ObjectAdapterPtr& adapter) :
            m_adapter(adapter)
        {
        }

        virtual void run()
        {
            m_adapter->deactivate();        // stop dispatching new requests
            m_adapter->waitForDeactivate(); // block until dispatches complete
        }

    private:
        const Ice::ObjectAdapterPtr m_adapter;
    };

    The reason for this (probable craziness) is that I noticed that if only deactivate was invoked, not all local resources/sockets were cleaned up; the call to waitForDeactivate was necessary to get everything completely cleaned up. I had disabled ACM due to the issue in 2.1.0 with oneways, ACM and closed connections. Perhaps enabling ACM would also take care of this clean-up, but that wasn't an option until 2.1.1.
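
    For completeness, the usage is just a couple of lines (adapter stands for the client's object adapter; joining the thread is what can make the exit path below appear hung):
    // Run deactivation in the background; keep the control handle so the
    // thread can be joined before the application exits.
    IceUtil::ThreadPtr t = new AdapterDeactivationThread(adapter);
    IceUtil::ThreadControl tc = t->start();
    // ... later, during shutdown:
    tc.join();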

    This "AdapterDeactivationThread" approach seems to be working unless after pulling the cable I attempt to immediately exit the application. In this case the application will wait for all threads to stop, which can still take a while on wairForDeactivate. So the app may sit around for a while looking like it is hung trying to exit. This is a pretty rare occurence so I'm not too worried about it.

    I guess there is one other issue worth mentioning: if a connection has a tragic failure and you attempt to use prx->ice_connection()->close(), the close will often take a little while, even if force is set to true.

    The other issues I had were more related to my app. When a bad failure occurred, I failed to track or propagate this in many cases and would still attempt a graceful closure. In the process I might do something silly (although it wasn't obvious to me when I originally wrote the code), like invoking a proxy; since the network had already failed, this invocation would wait for an Ice timeout. All these timeouts were cumulative and added up to quite a bit, especially with retries and SSL's default settings. I've since made some changes to propagate these failures more quickly.

    I've also implemented the application-level heartbeat mechanism that I mentioned earlier in this thread, and it is working out well, though I still wonder whether it was the best approach.

    Regards --Roland