Invocation timeout not triggered when server host unexpectedly shutdown

jcbellon · November 2019

Hello,

I have an issue when it comes to invocation timeouts using Ice with C++ AMI TypeSafe :
=> Note : the request is sent and currently processing by the server
1) If the server is loo long to respond, I get the timeout triggered, this is all fine
2) If the server unexpectedly shutdown (power off, crash) no timeout is triggered !

I use Ice within the IceGrid framework in C++98 version 3.7.2.
I couldn't find topics on the forum that matches my case, therefore I must do something wrong.
But after some days trying to solve this issue, I need your help since I am out of ideas.

What do you need as additional data to get clues where this can come from ?
Best regards,

JC

benoit · November 2019

Hi,

So the invocation hangs indefinitely instead of raising an Ice::InvocationTimeout once the invocation timeout delay expired?

Can you tell us a bit more about the configuration of your client and your server? Are you using the default timeout settings (no "-t " option in the endpoints, no settings of the Ice timeout properties)? Are you modifying the ACM configuration?

Did you enable network/retry tracing to see if it provided some clues? You can enable this tracing with Ice.Trace.Network=2 and Ice.Trace.Retry=1.

Cheers,
Benoit

jcbellon · November 2019

Hello Benoit,

Thanks for your fast reply.

Yes I use the most simple setup with no specific configuration on timeouts. The endpoints are defined without specifying the timeouts. I do explicitly configure my proxy via the method ice_invocationTimeout().

I can try to generate the tracing but I am always confuse where I should setup theproperties : In my service configuration file or on the node configuration file ? (or on the registry's ?)

Regards,

JC

jcbellon · November 2019

Hi Benoit,

I did generate the following log by enabling the tracing : So something has been caught by the Ice RunTime but it does not invoke my failure callback as I would expect it with the Type Safe AMI implementation.

Regards

JC

c:\users\vagrant\workspace\ice-dist\3.7\dist-utils\build\ice\builds\ice-v141\cpp\src\ice\streamsocket.cpp:474: ::Ice::ConnectionLostException:
connection lost: The specified network name is no longer available.
-- 11/05/19 14:56:52.555 JSSGrid.Dispatcher-JSSGrid.Dispatcher: Network: closed tcp connection
local address = 160.219.72.194:1471
remote address = 160.219.72.204:55073
c:\users\vagrant\workspace\ice-dist\3.7\dist-utils\build\ice\builds\ice-v141\cpp\src\ice\streamsocket.cpp:474: ::Ice::ConnectionLostException:
connection lost: The specified network name is no longer available.
-- 11/05/19 14:56:52.555 JSSGrid.Dispatcher-JSSGrid.Dispatcher: Network: closed tcp connection
local address = 160.219.72.194:63787
remote address = 160.219.72.204:1480
c:\users\vagrant\workspace\ice-dist\3.7\dist-utils\build\ice\builds\ice-v141\cpp\src\ice\streamsocket.cpp:474: ::Ice::ConnectionLostException:
connection lost: The specified network name is no longer available.
-! 11/05/19 14:56:52.555 JSSGrid.Dispatcher-JSSGrid.Dispatcher: warning: Ice::Exception raised by AMI callback:
c:\users\vagrant\workspace\ice-dist\3.7\dist-utils\build\ice\builds\ice-v141\cpp\src\ice\streamsocket.cpp:474: ::Ice::ConnectionLostException:
connection lost: The specified network name is no longer available.
-- 11/05/19 14:57:03.615 JSSGrid.Dispatcher-JSSGrid.Dispatcher: Network: closed tcp connection
local address = 160.219.72.194:63778
remote address = 160.219.72.194:59137
-- 11/05/19 14:57:14.398 JSSGrid.Dispatcher-JSSGrid.Dispatcher: Network: closed tcp connection
local address = 160.219.72.194:63777
remote address = 160.219.72.194:4062
-- 11/05/19 14:57:14.398 JSSGrid.Dispatcher: Network: closed tcp connection
local address = 160.219.72.194:63781
remote address = 160.219.72.194:4062
-- 11/05/19 14:57:18.617 JSSGrid.Dispatcher: Network: closed tcp connection
local address = 160.219.72.194:63782
remote address = 160.219.72.194:59137
-- 11/05/19 14:58:13.483 JSSGrid.Dispatcher: Network: closed tcp connection
local address = ::1:63780
remote address = ::1:63783

benoit · November 2019

Did you notice this warning from your tracing?

-! 11/05/19 14:56:52.555 JSSGrid.Dispatcher-JSSGrid.Dispatcher: warning: Ice::Exception raised by AMI callback:
c:\users\vagrant\workspace\ice-dist\3.7\dist-utils\build\ice\builds\ice-v141\cpp\src\ice\streamsocket.cpp:474: ::Ice::ConnectionLostException:
connection lost: The specified network name is no longer available.

It indicates that an AMI callback was called but it raised Ice::ConnectionLostException. Why isn't the callback catching this exception?

jcbellon · November 2019

Hi Benoit,

This sounds crazy but I do catch this exception via my callback and this callback works very well when introducing a delay server-side to trigger Ice::InvocationTimeoutException (my mystake, it looks like it does fails either in this case, currently investigating....). And I do log on standard output when entering the callback, so I can tell in this specific case, it is not called !

Here is more or less the code I use :

// Error Callback
void RequestCallback::FailureTypeSafe(const Ice::Exception& ex, const ClientRequestCookiePtr& cookie) {

    cout << "FAILURE" << endl;

    // We do rethrow the error to process it according its type
    try {
        ex.ice_throw();
    }
    // Request took too long
    catch (Ice::InvocationTimeoutException& ex) {
        /* Do something */
    }
    // General Network issue
    catch (Ice::ConnectionLostException& ex) {
        /* Do something */
    }
    // Unknown problem
   catch (Ice::Exception& ex) {
        /* Do something */
    }
}

jcbellon · November 2019

By the way the callback "sent" is invoked with success and I have information from server side the request has been fully received and processing started when abruptly shut down.

jcbellon · November 2019

I do not get why all my callbacks works but not the "failure" one

class RequestCallback : public IceUtil::Shared {
public:
    // ctor
    RequestCallback(int traceLevel, unsigned long requestId);
    // dtor
    ~RequestCallback();

    void RequestResultTypeSafe(const RequestResult& RequestResult, const ClientRequestCookiePtr& cookie);
    void RequestFailureTypeSafe(const Ice::Exception& ex, const ClientRequestCookiePtr& cookie);
    void RequestSentTypeSafe(bool sentSynchronously, const ClientRequestCookiePtr& cookie);
private:
    // debug
    int            _traceLevel;
    unsigned long  _callBackId;
    unsigned long  _requestId;
};
typedef IceUtil::Handle<RequestCallback> RequestCallbackPtr;

then in my code

// instanciate a call back
RequestCallbackPtr cb = new RequestCallback(_traceLevel,
                                            clientRequest->requestId);

// create an Ice call with providing our implementation
// We did chose a Type Safe API that is more safe and straightforward
InterfaceGrid::Callback_RequestCompute_RequestComputePtr  requestCb = InterfaceGrid::newCallback_RequestCompute_RequestCompute(cb,
                                                                                            &RequestCallback::RequestResultTypeSafe,
                                                                                            &RequestCallback::RequestFailureTypeSafe,
                                                                                            &RequestCallback::RequestSentTypeSafe);



// instanciate new cookie
ClientRequestCookiePtr cookie = new ClientRequestCookie(clientRequest,
                            _traceLevel);


// forward grid Request to server
RequestComputePrx->begin_RequestCompute(clientRequest->gridRequest, requestCb, cookie);

benoit · November 2019

It's not clear why your callback is not called, it should be called. I think you should concentrate on first figure out the reason of the warning which indicates that an exception was raised by an AMI callback. Did you try to add Ice.Trace.Protocol=2 to your client's configuration to trace the protocol request/replies? This should help figuring out why reply triggers the AMI callback warning.

jcbellon · November 2019

The problem is the following: I have to upgrade a 3rd party library in the server and this version has a problem that can raise a non recoverable exception. There were never problems so far and that is why I never noticed that the failure callback didn't worked. The logic behind was the server was restarted in case of issue, the request got the timeout response and was resent to an another available computation node. So we could workaround the issue we currently experience with this 3rd party library (that we have to deal with waiting for a potential fix from the editor).

jcbellon · November 2019

In conclusion, being not able to avoid this crash from time to time, I need to enhance robustness on the client side to ensure continuity of service. Anyway I need to be sure that in case of problem, the failure callback can cope with the issues in general.

Since I cannot test before tomorrow your suggestion to enable tracing the protocol, I will send you only then the output. But I am really short of ideas, it has already been some days testing and checking how to make it work without success before I decide myself to seek help on the forums. I must have done something wrong somewhere. But I did my best to stick to the examples and documentation as much as I could and I do not see what else to do to get more clues in order to solve this for good.

Thanks again for your kind help, it is always a pleasure to post here as you're very reactive and knowledgeable,
Regards,

Jean-Christophe

benoit · November 2019

Hi Jean-Christophe,

An Ice server crash, a cable disconnection, etc is something that an Ice client can handle. The AMI callbacks should definitely be called in this case.

The warning indicates that you have an AMI callback that doesn't handle an exception. I don't know if it's related but it would be good to figure that one out first so that we can make sure it's not related.

Does the problem also occurs with a debug client? If it's easy to reproduce, you could also try adding a breakpoint where the warning is printed out here: https://github.com/zeroc-ice/ice/blob/e9e8a21c572af0761b69f7c30e7fc3592280bf46/cpp/src/Ice/OutgoingAsync.cpp#L568 The data members of the OutgoingAsync object should provide information on which Ice operation's callback raised this exception.

Also, a possible reason for a callback not being called is that there are no more threads available in the client thread pool to call the callback. So when this occurs, you could check the state of all the threads from your client and see if the client thread pool threads are busy or not.

Cheers,
Benoit

jcbellon · November 2019

Hi Benoit,

Actually the Ice.Trace.Protocol=2 didn't print any new information or I did it wrong ....

but I did check the generated code and added logging (with flush to file) to know what is going on and this shoed a a random behavior : Sometimes it goes through, and that is why I said the Ice Invocation Timeout, because sometime it gets called sometimes not. This is was I found out after many test runs.

Now in the callback generated code we have:

virtual void completed(const ::Ice::AsyncResultPtr& result) const
    {
        try
        {
            ret = proxy->end_ComputeRequest(result);
        }
        catch(const ::Ice::Exception& ex)
        {
            ::IceInternal::Callback<T, CT>::exception(result, ex);
            return;
        }
        if(_response)
        {
            (::IceInternal::Callback<T, CT>::_callback.get()->*_response)(ret, CT::dynamicCast(result->getCookie()));
        }
    }

adding logging shows that sometimes it stops at end_ComputeRequest() and not processing further !!!

jcbellon · November 2019

I did the check with an empty callback so not exception can be raised from there.

Unfortunately I couldn't put the breakpoint where you suggested because the server having the necessary resource to run the program does not have Visual Studio installed.

About the busy threads it is unlikely to happen since there is only one client doing only one request triggering the problem. For the reproducibility, I do shut down the server when starting processing the request to generate this connection lost.

benoit · November 2019

Sorry, I'm confused now. I was suggesting to add the breakpoint to the client not to the server. Isn't the warning printed out by your client? Reading again your post, it sounds like it's the server "JSSGrid.Dispatcher" which is printing the warning? So this server also makes AMI client invocations?

Can you please clarify the architecture of your application? Is it composed of just a client and server or is it more like client <-> middle tier <-> server where the middle-tier acts both as a client and server?

The end_ComputeRequest operation blocking from the completed method isn't un-expected. Can you reproduce and provide a thread dump of all the threads of the program?

jcbellon · November 2019

You spotted right ! This dispatcher that I told you was a client is intermediate server acting as a client to compute nodes. Sorry having added this to the confusion.

Client (Workstation) -> Dispatcher (server that I called client previously) -> computation node (server)

The end_ComputeRequest operation blocking from the completed method isn't un-expected
why do you say it isn't unexpected ? I will do a thread dump but how can I upload the minidump to you ?

benoit · November 2019

Sorry, I meant it's not expected: end_xxx should not block when it's called from the completed callback. Can you upload the thread stacks as as text file rather than a mini dump?

jcbellon · November 2019

Sorry but actually I do not know how to generate the thread stack as text.
Do you mean I should generate a dump and extract the thread stack with visual studio ?

benoit · November 2019

Hi,

The easiest would be to reproduce this in a development environment where you can easily debug the middle-tier application by attaching the Visual Studio debugger when it hangs.

You could also look into generating and reading dump files otherwise: https://docs.microsoft.com/en-us/visualstudio/debugger/using-dump-files?view=vs-2019

Cheers,
Benoit

jcbellon · November 2019

Hi Benoit,

This is not I do ignore your suggestion, it is just that I struggle a lot to make it. This is kind of difficult because I miss a lot of symbols (ice and ntdll for example) and I miss all outer frames in my dump. Furthermore very invasive firewall rules at my workplace makes development a nightmare. I try to install Visual Studio on the machine running the program I want to debug, but there is not enough disk space, and so on and so on... really frustrating.

I still wonder why with almost the same client program, when I kill the server, sometimes I get the notification, sometimes not:

I did made a callback on failure do only (cout << "Failure << endl;)
So if it works, then it works reliability but if I recompile changing almost nothing, adding the same line for example, then it may reliably fail to call failure callback because It will get stuck systematically in the XXX_end() in the complete() method. On the functional side nothing has changed, It looks it is dependent of something at compilation time. I cannot guarantee this is the real problem, but it showed me this behavior so far.

Is there some special compilation setting we should be aware ?
Best regards,

JC

benoit · November 2019

Hi,

I don't think there are special compilation settings which could affect this.

So even with this other client program where you can reproduce the problem, you don't have the possibility to attach a debugger?

Can you provide the definition of your cookie class? Could you perhaps try to to simplify your code to eliminate the cookie? Rather than providing it to the AMI operation, you could provide it to the RequestCallback constructor and cache it as a data member of the callback. I wonder if a memory management issue related to the cookie object could be the cause of this random failure.

Cheers,
Benoit.

jcbellon · November 2019

Hello Benoit,

I will try to install Visual Studio on the server once the disk capacity will be increased because I can't attach the debugger on another machine due to internal issues.

I did what you suggested, using the cookie as a member of the RequestCallback: Since the cookie is necessary to my algo, I cannot get rid of it. Unfortunately it didn't helped and still show the same behavior.

What puzzle me the most is that the Error Log states "warning: Ice::Exception raised by AMI callback:" (for both ConnectionLostException and InvocationTimeoutException) but it does not even enter in the callback. As said before I inserted some trace in the complete() method and it shows clearly that it get stuck in the XXXX_end(). So why does it log such error, i.e. the callback raise an exception, without even having tried to invoke it ?

I keep you posted as soon as I get some results with the debugger once ready.

Regards,

JC

benoit · November 2019

I also don't see how the thread could be waiting on the end_xxx call and still print this message given that this message is printed only if the completed method raises an exception (and it can't be raising an exception if it's waiting on end_xxx...). To me, the message is printed by a different run of the completed method.

How do you know actually that end_xxx doesn't complete if you can't attach the process with a debugger? Did you decorate all the calls from the completed method implementation with some tracing such as the following?

cout << this << " : calling end_ComputeRequest" << endl;
try
{
    ret = proxy->end_ComputeRequest(result);
}
catch(const ::Ice::Exception& ex)
{
    cout << this << " : end_ComputeRequest completed with exception:\n" << ex << endl;
    try
    {
         ::IceInternal::Callback<T, CT>::exception(result, ex);
    }
    catch(...)
    {
         cout << this << " : ComputeRequest completed with exception in the exception callback" << endl;
         throw;
    }
    return;
}
catch(...)
{
     cout << this << " : end_ComputeRequest completed with unexpected exception" << endl;
     throw;
}
cout << this << " : end_ComputeRequest completed" << endl;

if(_response)
{
    try
    {
         (::IceInternal::Callback<T, CT>::_callback.get()->*_response)(ret, CT::dynamicCast(result->getCookie()));
    }
    catch(...)
    {
         cout << this << " : ComputeRequest completed with exception in the response callback" << endl;
         throw;
    }
}
cout << this << " : ComputeRequest completed" << endl;

Best would be to be able to attach the debugger to be able to check the state of the process's threads, hopefully you can get this working soon

jcbellon · November 2019

Hello Benoit,

Sorry having gone silent for a week, I have been pretty busy once back on track: Because yes at last, I solved the issue: That was a compiling issue in the end, so I wanted to share my findings and my mistake to make you also aware of this dependency.

Actually the production servers runs Windows Server 2016 whereas the client workstations and development environment is Windows 7. So all was compiled with Windows SDK 8.1 and was running fine when I did run my tests. But when I installed Visual Studio on the server to attach to process in order to debug my issue, I did recompiled on the target machine and the problem didn't appear. Looking at the difference of compiling options, I realized that Visual Studio had automatically upgraded the Windows SDK to 10.0.XXXXX.X. Then I did installed the package on my workstation, recompiled and deployed, problem is gone and failure callback is properly invoked.

So having the proper Windows SDK seems mandatory for the failure callback. Actually it seems necessary for the XXXX_end() method to fire exception in case of ConnectionLost or InvokationTimeout errors. Otherwise it gets stuck and raise the AMI callback exception before invoking it perhaps it fails itself to invoke it ?

Now everything runs flawlessly. Thanks a lot for helping me digging into that issue and giving me good tips that led me on the solution.

Have a great week, regards,

JC

benoit · November 2019

Hi,

I'm glad you finally found out the cause of this issue. I don't know why it failed with different Windows SDK versions, to me it sounds like an unexpected exception occurred on the AMI callback thread causing the callback to not be called (memory violation issue caused by some binary incompatibility between the different Windows SDKs?). You could have tried to run the executable under the Visual Studio debugger without recompiling it to see if it provided additional information.

Cheers,
Benoit

jcbellon · November 2019

Hi Benoit,

I tried to do what you did suggested but for some reason that is not clear to me, Visual Studio claimed that he couldn't match the code with the executable (I copied my project folder from the workstation to the server), this is why I did recompiled on the target machine. But after this step the problem didn't reappeared and therefore I cannot provide you more detail.
Yes I was surprised to this Windows SDK dependency, especially when the callbacks for Success and Sent were working absolutely fine.

Regards,

JC

Archived

Invocation timeout not triggered when server host unexpectedly shutdown

Comments

Categories