Invocation timeout not triggered when server host unexpectedly shutdown

jcbellonjcbellon Member BellonOrganization: J. Safra Sarasin BankProject: Grid Computing

Hello,

I have an issue when it comes to invocation timeouts using Ice with C++ AMI TypeSafe :
=> Note : the request is sent and currently processing by the server
1) If the server is loo long to respond, I get the timeout triggered, this is all fine
2) If the server unexpectedly shutdown (power off, crash) no timeout is triggered !

I use Ice within the IceGrid framework in C++98 version 3.7.2.
I couldn't find topics on the forum that matches my case, therefore I must do something wrong.
But after some days trying to solve this issue, I need your help since I am out of ideas.

What do you need as additional data to get clues where this can come from ?
Best regards,

JC

Answers

  • benoitbenoit Rennes, FranceAdministrators, ZeroC Staff Benoit FoucherOrganization: ZeroC, Inc.Project: Ice ZeroC Staff

    Hi,

    So the invocation hangs indefinitely instead of raising an Ice::InvocationTimeout once the invocation timeout delay expired?

    Can you tell us a bit more about the configuration of your client and your server? Are you using the default timeout settings (no "-t " option in the endpoints, no settings of the Ice timeout properties)? Are you modifying the ACM configuration?

    Did you enable network/retry tracing to see if it provided some clues? You can enable this tracing with Ice.Trace.Network=2 and Ice.Trace.Retry=1.

    Cheers,
    Benoit

  • jcbellonjcbellon Member BellonOrganization: J. Safra Sarasin BankProject: Grid Computing

    Hello Benoit,

    Thanks for your fast reply.

    Yes I use the most simple setup with no specific configuration on timeouts. The endpoints are defined without specifying the timeouts. I do explicitly configure my proxy via the method ice_invocationTimeout().

    I can try to generate the tracing but I am always confuse where I should setup theproperties : In my service configuration file or on the node configuration file ? (or on the registry's ?)

    Regards,

    JC

  • jcbellonjcbellon Member BellonOrganization: J. Safra Sarasin BankProject: Grid Computing

    Hi Benoit,

    I did generate the following log by enabling the tracing : So something has been caught by the Ice RunTime but it does not invoke my failure callback as I would expect it with the Type Safe AMI implementation.

    Regards

    JC


    c:\users\vagrant\workspace\ice-dist\3.7\dist-utils\build\ice\builds\ice-v141\cpp\src\ice\streamsocket.cpp:474: ::Ice::ConnectionLostException:
    connection lost: The specified network name is no longer available.
    -- 11/05/19 14:56:52.555 JSSGrid.Dispatcher-JSSGrid.Dispatcher: Network: closed tcp connection
    local address = 160.219.72.194:1471
    remote address = 160.219.72.204:55073
    c:\users\vagrant\workspace\ice-dist\3.7\dist-utils\build\ice\builds\ice-v141\cpp\src\ice\streamsocket.cpp:474: ::Ice::ConnectionLostException:
    connection lost: The specified network name is no longer available.
    -- 11/05/19 14:56:52.555 JSSGrid.Dispatcher-JSSGrid.Dispatcher: Network: closed tcp connection
    local address = 160.219.72.194:63787
    remote address = 160.219.72.204:1480
    c:\users\vagrant\workspace\ice-dist\3.7\dist-utils\build\ice\builds\ice-v141\cpp\src\ice\streamsocket.cpp:474: ::Ice::ConnectionLostException:
    connection lost: The specified network name is no longer available.
    -! 11/05/19 14:56:52.555 JSSGrid.Dispatcher-JSSGrid.Dispatcher: warning: Ice::Exception raised by AMI callback:
    c:\users\vagrant\workspace\ice-dist\3.7\dist-utils\build\ice\builds\ice-v141\cpp\src\ice\streamsocket.cpp:474: ::Ice::ConnectionLostException:
    connection lost: The specified network name is no longer available.
    -- 11/05/19 14:57:03.615 JSSGrid.Dispatcher-JSSGrid.Dispatcher: Network: closed tcp connection
    local address = 160.219.72.194:63778
    remote address = 160.219.72.194:59137
    -- 11/05/19 14:57:14.398 JSSGrid.Dispatcher-JSSGrid.Dispatcher: Network: closed tcp connection
    local address = 160.219.72.194:63777
    remote address = 160.219.72.194:4062
    -- 11/05/19 14:57:14.398 JSSGrid.Dispatcher: Network: closed tcp connection
    local address = 160.219.72.194:63781
    remote address = 160.219.72.194:4062
    -- 11/05/19 14:57:18.617 JSSGrid.Dispatcher: Network: closed tcp connection
    local address = 160.219.72.194:63782
    remote address = 160.219.72.194:59137
    -- 11/05/19 14:58:13.483 JSSGrid.Dispatcher: Network: closed tcp connection
    local address = ::1:63780
    remote address = ::1:63783

  • benoitbenoit Rennes, FranceAdministrators, ZeroC Staff Benoit FoucherOrganization: ZeroC, Inc.Project: Ice ZeroC Staff

    Did you notice this warning from your tracing?

    -! 11/05/19 14:56:52.555 JSSGrid.Dispatcher-JSSGrid.Dispatcher: warning: Ice::Exception raised by AMI callback:
    c:\users\vagrant\workspace\ice-dist\3.7\dist-utils\build\ice\builds\ice-v141\cpp\src\ice\streamsocket.cpp:474: ::Ice::ConnectionLostException:
    connection lost: The specified network name is no longer available.
    

    It indicates that an AMI callback was called but it raised Ice::ConnectionLostException. Why isn't the callback catching this exception?

  • jcbellonjcbellon Member BellonOrganization: J. Safra Sarasin BankProject: Grid Computing
    edited November 5

    Hi Benoit,

    This sounds crazy but I do catch this exception via my callback and this callback works very well when introducing a delay server-side to trigger Ice::InvocationTimeoutException (my mystake, it looks like it does fails either in this case, currently investigating....). And I do log on standard output when entering the callback, so I can tell in this specific case, it is not called !

    Here is more or less the code I use :


    // Error Callback
    void RequestCallback::FailureTypeSafe(const Ice::Exception& ex, const ClientRequestCookiePtr& cookie) {
    
        cout << "FAILURE" << endl;
    
        // We do rethrow the error to process it according its type
        try {
            ex.ice_throw();
        }
        // Request took too long
        catch (Ice::InvocationTimeoutException& ex) {
            /* Do something */
        }
        // General Network issue
        catch (Ice::ConnectionLostException& ex) {
            /* Do something */
        }
        // Unknown problem
       catch (Ice::Exception& ex) {
            /* Do something */
        }
    }
    
  • jcbellonjcbellon Member BellonOrganization: J. Safra Sarasin BankProject: Grid Computing

    By the way the callback "sent" is invoked with success and I have information from server side the request has been fully received and processing started when abruptly shut down.

  • jcbellonjcbellon Member BellonOrganization: J. Safra Sarasin BankProject: Grid Computing

    I do not get why all my callbacks works but not the "failure" one

    class RequestCallback : public IceUtil::Shared {
    public:
        // ctor
        RequestCallback(int traceLevel, unsigned long requestId);
        // dtor
        ~RequestCallback();
    
        void RequestResultTypeSafe(const RequestResult& RequestResult, const ClientRequestCookiePtr& cookie);
        void RequestFailureTypeSafe(const Ice::Exception& ex, const ClientRequestCookiePtr& cookie);
        void RequestSentTypeSafe(bool sentSynchronously, const ClientRequestCookiePtr& cookie);
    private:
        // debug
        int            _traceLevel;
        unsigned long  _callBackId;
        unsigned long  _requestId;
    };
    typedef IceUtil::Handle<RequestCallback> RequestCallbackPtr;
    

    then in my code

    // instanciate a call back
    RequestCallbackPtr cb = new RequestCallback(_traceLevel,
                                                clientRequest->requestId);
    
    // create an Ice call with providing our implementation
    // We did chose a Type Safe API that is more safe and straightforward
    InterfaceGrid::Callback_RequestCompute_RequestComputePtr  requestCb = InterfaceGrid::newCallback_RequestCompute_RequestCompute(cb,
                                                                                                &RequestCallback::RequestResultTypeSafe,
                                                                                                &RequestCallback::RequestFailureTypeSafe,
                                                                                                &RequestCallback::RequestSentTypeSafe);
    
    
    
    // instanciate new cookie
    ClientRequestCookiePtr cookie = new ClientRequestCookie(clientRequest,
                                _traceLevel);
    
    
    // forward grid Request to server
    RequestComputePrx->begin_RequestCompute(clientRequest->gridRequest, requestCb, cookie);
    
  • benoitbenoit Rennes, FranceAdministrators, ZeroC Staff Benoit FoucherOrganization: ZeroC, Inc.Project: Ice ZeroC Staff

    It's not clear why your callback is not called, it should be called. I think you should concentrate on first figure out the reason of the warning which indicates that an exception was raised by an AMI callback. Did you try to add Ice.Trace.Protocol=2 to your client's configuration to trace the protocol request/replies? This should help figuring out why reply triggers the AMI callback warning.

  • jcbellonjcbellon Member BellonOrganization: J. Safra Sarasin BankProject: Grid Computing

    The problem is the following: I have to upgrade a 3rd party library in the server and this version has a problem that can raise a non recoverable exception. There were never problems so far and that is why I never noticed that the failure callback didn't worked. The logic behind was the server was restarted in case of issue, the request got the timeout response and was resent to an another available computation node. So we could workaround the issue we currently experience with this 3rd party library (that we have to deal with waiting for a potential fix from the editor).

  • jcbellonjcbellon Member BellonOrganization: J. Safra Sarasin BankProject: Grid Computing

    In conclusion, being not able to avoid this crash from time to time, I need to enhance robustness on the client side to ensure continuity of service. Anyway I need to be sure that in case of problem, the failure callback can cope with the issues in general.

    Since I cannot test before tomorrow your suggestion to enable tracing the protocol, I will send you only then the output. But I am really short of ideas, it has already been some days testing and checking how to make it work without success before I decide myself to seek help on the forums. I must have done something wrong somewhere. But I did my best to stick to the examples and documentation as much as I could and I do not see what else to do to get more clues in order to solve this for good.

    Thanks again for your kind help, it is always a pleasure to post here as you're very reactive and knowledgeable,
    Regards,

    Jean-Christophe

  • benoitbenoit Rennes, FranceAdministrators, ZeroC Staff Benoit FoucherOrganization: ZeroC, Inc.Project: Ice ZeroC Staff
    edited November 6

    Hi Jean-Christophe,

    An Ice server crash, a cable disconnection, etc is something that an Ice client can handle. The AMI callbacks should definitely be called in this case.

    The warning indicates that you have an AMI callback that doesn't handle an exception. I don't know if it's related but it would be good to figure that one out first so that we can make sure it's not related.

    Does the problem also occurs with a debug client? If it's easy to reproduce, you could also try adding a breakpoint where the warning is printed out here: https://github.com/zeroc-ice/ice/blob/e9e8a21c572af0761b69f7c30e7fc3592280bf46/cpp/src/Ice/OutgoingAsync.cpp#L568 The data members of the OutgoingAsync object should provide information on which Ice operation's callback raised this exception.

    Also, a possible reason for a callback not being called is that there are no more threads available in the client thread pool to call the callback. So when this occurs, you could check the state of all the threads from your client and see if the client thread pool threads are busy or not.

    Cheers,
    Benoit

  • jcbellonjcbellon Member BellonOrganization: J. Safra Sarasin BankProject: Grid Computing
    edited November 6

    Hi Benoit,

    Actually the Ice.Trace.Protocol=2 didn't print any new information or I did it wrong ....

    but I did check the generated code and added logging (with flush to file) to know what is going on and this shoed a a random behavior : Sometimes it goes through, and that is why I said the Ice Invocation Timeout, because sometime it gets called sometimes not. This is was I found out after many test runs.

    Now in the callback generated code we have:

    virtual void completed(const ::Ice::AsyncResultPtr& result) const
        {
            try
            {
                ret = proxy->end_ComputeRequest(result);
            }
            catch(const ::Ice::Exception& ex)
            {
                ::IceInternal::Callback<T, CT>::exception(result, ex);
                return;
            }
            if(_response)
            {
                (::IceInternal::Callback<T, CT>::_callback.get()->*_response)(ret, CT::dynamicCast(result->getCookie()));
            }
        }
    

    adding logging shows that sometimes it stops at end_ComputeRequest() and not processing further !!!

  • jcbellonjcbellon Member BellonOrganization: J. Safra Sarasin BankProject: Grid Computing
    edited November 6

    I did the check with an empty callback so not exception can be raised from there.

    Unfortunately I couldn't put the breakpoint where you suggested because the server having the necessary resource to run the program does not have Visual Studio installed.

    About the busy threads it is unlikely to happen since there is only one client doing only one request triggering the problem. For the reproducibility, I do shut down the server when starting processing the request to generate this connection lost.

  • benoitbenoit Rennes, FranceAdministrators, ZeroC Staff Benoit FoucherOrganization: ZeroC, Inc.Project: Ice ZeroC Staff
    edited November 6

    Sorry, I'm confused now. I was suggesting to add the breakpoint to the client not to the server. Isn't the warning printed out by your client? Reading again your post, it sounds like it's the server "JSSGrid.Dispatcher" which is printing the warning? So this server also makes AMI client invocations?

    Can you please clarify the architecture of your application? Is it composed of just a client and server or is it more like client <-> middle tier <-> server where the middle-tier acts both as a client and server?

    The end_ComputeRequest operation blocking from the completed method isn't un-expected. Can you reproduce and provide a thread dump of all the threads of the program?

  • jcbellonjcbellon Member BellonOrganization: J. Safra Sarasin BankProject: Grid Computing
    edited November 6

    You spotted right ! This dispatcher that I told you was a client is intermediate server acting as a client to compute nodes. Sorry having added this to the confusion.

    Client (Workstation) -> Dispatcher (server that I called client previously) -> computation node (server)

    The end_ComputeRequest operation blocking from the completed method isn't un-expected
    why do you say it isn't unexpected ? I will do a thread dump but how can I upload the minidump to you ?

  • benoitbenoit Rennes, FranceAdministrators, ZeroC Staff Benoit FoucherOrganization: ZeroC, Inc.Project: Ice ZeroC Staff

    Sorry, I meant it's not expected: end_xxx should not block when it's called from the completed callback. Can you upload the thread stacks as as text file rather than a mini dump?

  • jcbellonjcbellon Member BellonOrganization: J. Safra Sarasin BankProject: Grid Computing

    Sorry but actually I do not know how to generate the thread stack as text.
    Do you mean I should generate a dump and extract the thread stack with visual studio ?

  • benoitbenoit Rennes, FranceAdministrators, ZeroC Staff Benoit FoucherOrganization: ZeroC, Inc.Project: Ice ZeroC Staff

    Hi,

    The easiest would be to reproduce this in a development environment where you can easily debug the middle-tier application by attaching the Visual Studio debugger when it hangs.

    You could also look into generating and reading dump files otherwise: https://docs.microsoft.com/en-us/visualstudio/debugger/using-dump-files?view=vs-2019

    Cheers,
    Benoit

  • jcbellonjcbellon Member BellonOrganization: J. Safra Sarasin BankProject: Grid Computing

    Hi Benoit,

    This is not I do ignore your suggestion, it is just that I struggle a lot to make it. This is kind of difficult because I miss a lot of symbols (ice and ntdll for example) and I miss all outer frames in my dump. Furthermore very invasive firewall rules at my workplace makes development a nightmare. I try to install Visual Studio on the machine running the program I want to debug, but there is not enough disk space, and so on and so on... really frustrating.

    I still wonder why with almost the same client program, when I kill the server, sometimes I get the notification, sometimes not:

    I did made a callback on failure do only (cout << "Failure << endl;)
    So if it works, then it works reliability but if I recompile changing almost nothing, adding the same line for example, then it may reliably fail to call failure callback because It will get stuck systematically in the XXX_end() in the complete() method. On the functional side nothing has changed, It looks it is dependent of something at compilation time. I cannot guarantee this is the real problem, but it showed me this behavior so far.

    Is there some special compilation setting we should be aware ?
    Best regards,

    JC

  • benoitbenoit Rennes, FranceAdministrators, ZeroC Staff Benoit FoucherOrganization: ZeroC, Inc.Project: Ice ZeroC Staff

    Hi,

    I don't think there are special compilation settings which could affect this.

    So even with this other client program where you can reproduce the problem, you don't have the possibility to attach a debugger?

    Can you provide the definition of your cookie class? Could you perhaps try to to simplify your code to eliminate the cookie? Rather than providing it to the AMI operation, you could provide it to the RequestCallback constructor and cache it as a data member of the callback. I wonder if a memory management issue related to the cookie object could be the cause of this random failure.

    Cheers,
    Benoit.

  • jcbellonjcbellon Member BellonOrganization: J. Safra Sarasin BankProject: Grid Computing

    Hello Benoit,

    I will try to install Visual Studio on the server once the disk capacity will be increased because I can't attach the debugger on another machine due to internal issues.

    I did what you suggested, using the cookie as a member of the RequestCallback: Since the cookie is necessary to my algo, I cannot get rid of it. Unfortunately it didn't helped and still show the same behavior.

    What puzzle me the most is that the Error Log states "warning: Ice::Exception raised by AMI callback:" (for both ConnectionLostException and InvocationTimeoutException) but it does not even enter in the callback. As said before I inserted some trace in the complete() method and it shows clearly that it get stuck in the XXXX_end(). So why does it log such error, i.e. the callback raise an exception, without even having tried to invoke it ?

    I keep you posted as soon as I get some results with the debugger once ready.

    Regards,

    JC

  • benoitbenoit Rennes, FranceAdministrators, ZeroC Staff Benoit FoucherOrganization: ZeroC, Inc.Project: Ice ZeroC Staff

    I also don't see how the thread could be waiting on the end_xxx call and still print this message given that this message is printed only if the completed method raises an exception (and it can't be raising an exception if it's waiting on end_xxx...). To me, the message is printed by a different run of the completed method.

    How do you know actually that end_xxx doesn't complete if you can't attach the process with a debugger? Did you decorate all the calls from the completed method implementation with some tracing such as the following?

    cout << this << " : calling end_ComputeRequest" << endl;
    try
    {
        ret = proxy->end_ComputeRequest(result);
    }
    catch(const ::Ice::Exception& ex)
    {
        cout << this << " : end_ComputeRequest completed with exception:\n" << ex << endl;
        try
        {
             ::IceInternal::Callback<T, CT>::exception(result, ex);
        }
        catch(...)
        {
             cout << this << " : ComputeRequest completed with exception in the exception callback" << endl;
             throw;
        }
        return;
    }
    catch(...)
    {
         cout << this << " : end_ComputeRequest completed with unexpected exception" << endl;
         throw;
    }
    cout << this << " : end_ComputeRequest completed" << endl;
    
    if(_response)
    {
        try
        {
             (::IceInternal::Callback<T, CT>::_callback.get()->*_response)(ret, CT::dynamicCast(result->getCookie()));
        }
        catch(...)
        {
             cout << this << " : ComputeRequest completed with exception in the response callback" << endl;
             throw;
        }
    }
    cout << this << " : ComputeRequest completed" << endl;
    

    Best would be to be able to attach the debugger to be able to check the state of the process's threads, hopefully you can get this working soon :wink:

Sign In or Register to comment.