Archived

This forum has been archived. Please start a new discussion on GitHub.

IceGrid on-demand activation with Ice.ServerIdleTime

I've been trying to use IceGrid to start servers on an on-demand basis with a time-out to get rid of unused servers. It's been working pretty well, except for one puzzling problem.

I have a server, RemoteAbstractServer-cl6 which contacts another server, LocalAbstractServer-cl6. Both are IceGrid on-demand servers. Both have Ice.ServerIdleTime set to 10 seconds.

When RemoteAbstractServer-cl6 is unused for 10 seconds, it times out. Then if I try to contact it again, it starts up but fails to contact LocalAbstractServer-cl6. What's really strange is that about 15 seconds later, everything is OK again. RemoteAbstractServer-cl6 can contact LocalAbstractServer-cl6. During that period, ps shows that LocalAbstractServer-cl6 is a defunct process.

The error message is:
Exception in RemoteAbstractGenerator Ice.NoEndpointException
    proxy = "LocalAbstractGenerator -t @
LocalAbstractServer-cl6.LocalAbstractGeneratorAdapter"

The app.xml server descriptions are:
<server id="LocalAbstractServer-cl6"
                    exe="./LocalAbstractGeneratorServer"
                    pwd="bin"
                     activation="on-demand">
                <property name="Ice.ServerIdleTime" value="10"/>
                <adapter name="LocalAbstractGeneratorAdapter"
                         register-process="true"
                         endpoints="tcp">
                </adapter>
            </server>
         <server id="RemoteAbstractServer-cl6"
                    exe="java"
                    pwd="bin"
                     activation="on-demand">
                <option>-jar</option>
                <option>RemoteAbstractGenerator.jar</option>
                <property name="Ice.ServerIdleTime" value="10"/>
                <adapter name="RemoteAbstractGeneratorAdapter"
                         register-process="true"
                         endpoints="tcp">
                </adapter>
            </server>

Any ideas why RemoteAbstractServer-cl6 would not be able to contact LocalAbstractServer-cl6 after a timeout? Why would it be a defunct process?
It's parent process is icegridnode. What would prevent icegridnode from wait()ing on it? Any workarounds?

Comments

  • bernard
    bernard Jupiter, FL
    It looks like for some reason LocalAbstractServer-cl6 does not exit properly. I suspect that while it's "half-dead" (defunct) IceGrid (the icegridnode) does not detect that it's dead.

    I would first check that you properly destroy [not just shutdown] the Ice communicator in this process. Also, which version of Ice do you use?

    Cheers,
    Bernard
  • I put printfs/printlns in Both servers around the relevant procedure calls. Here's what I'm seeing:

    LocalAbstractGeneratorServer: Got Shutdown
    LocalAbstractGeneratorServer: calling Communicator::destroy
    LocalAbstractGeneratorServer: return from Communicator::destroy
    LocalAbstractGeneratorServer: done, returning 0
    RemoteAbstractGenerator: Got shutdown
    RemoteAbstractGenerator: calling Communicator::destroy
    RemoteAbstractGenerator: return from Communicator::destroy
    RemoteAbstractGenerator: exit with 0

    So I think that the Communicator is getting destroyed in both servers.

    One strange thing is that the zombie process eventually gets reaped. It just takes a long time - around 15 seconds.
  • The problem occurs on icegridnode version 3.1.0 on a Gentoo system.
  • benoit
    benoit Rennes, France
    Hi Bruce,

    It shouldn't take so long for the IceGrid node to detect the process termination -- I don't know why you are seeing this defunct process for so long. I'm afraid I can't say what could be wrong wihtout more information though.

    Could you run the icegridnode with the properties Ice.Logger.Timestamp=1, IceGrid.Node.Trace.Activator=2 and IceGrid.Node.Trace.Server=3 and post the trace output here once you reproduced the problem?

    Also, are the 2 servers running on the same node or on different nodes?

    Cheers,
    Benoit.
  • benoit
    benoit Rennes, France
    Btw, are you still by any chance using LinuxThreads on your Gentoo system? Ice doesn't support anymore LinuxThreads, only NPTL (among other things, there was some issues with waitpid() and LinuxThreads, see for example [thread=2472]this thread[/thread]).

    To check the version of your pthread library you can try to the following command:

    $ getconf "GNU_LIBPTHREAD_VERSION"

    Cheers,
    Benoit.
  • Thank you for looking into this.

    The problem occurs running on a single server. The problem occurs on an NPTL 2.4 system.

    Here is the output of the first (successful) run with the requested tracing turned on:
    [ 10/09/06 09:31:07.558 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `Activating' ]
    [ 10/09/06 09:31:07.558 icegridnode: Activator: activating server `RemoteAbstractServer-rack2'
      path = java
      pwd = bin
      uid/gid = 2193/100
      args = java -jar RemoteAbstractGenerator.jar --Ice.Config=/etc/wink/Ice/node/servers/RemoteAbstractServer-rack2/config/config ]
    [ 10/09/06 09:31:07.563 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `WaitForActivation' ]
    In RemoteAbstractGenerator
    [ 10/09/06 09:31:08.039 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `Active' ]
    In RemoteAbstractGenerator
    nrequests = 1
    url:   http://www.google.com/
    query: google
    DbConnection::do_stmt_select_url_table before executeQuery ff90821feeb2b02a33a6f9fc8e5f3fcd
    DbConnection::do_stmt_select_url_table rs.first is true
    stmt_select_url_table returned true
    urlte.docuid 2818802
    DbConnection::do_stmt_select_doc_table before executeQuery e9eabf20eee3312f1440b3ccc3b48518
    DbConnection::do_stmt_select_doc_table rs.first is true
    stmt_select_doc_table returned true
    docte.path 2006/06/21/07/25/
    The proxystring is LocalAbstractGenerator@LocalAbstractServer-rack2.LocalAbstractGeneratorAdapter
    [ 10/09/06 09:31:08.541 icegridnode: Server: changed server `LocalAbstractServer-rack2' state to `Activating' ]
    [ 10/09/06 09:31:08.541 icegridnode: Activator: activating server `LocalAbstractServer-rack2'
      path = ./LocalAbstractGeneratorServer
      pwd = bin
      uid/gid = 2193/100
      args = ./LocalAbstractGeneratorServer --Ice.Config=/etc/wink/Ice/node/servers/LocalAbstractServer-rack2/config/config ]
    [ 10/09/06 09:31:08.544 icegridnode: Server: changed server `LocalAbstractServer-rack2' state to `WaitForActivation' ]
    Xlib:  extension "XInputExtension" missing on display ":0.0".
    Failed to get list of devices
    [ 10/09/06 09:31:08.838 icegridnode: Server: changed server `LocalAbstractServer-rack2' state to `Active' ]
    [ 10/09/06 09:31:08.842 icegridnode: Server: changed server `LocalGetDocumentsServer-rack2' state to `Activating' ]
    [ 10/09/06 09:31:08.842 icegridnode: Activator: activating server `LocalGetDocumentsServer-rack2'
      path = java
      pwd = bin
      uid/gid = 2193/100
      args = java -jar LocalGetDocumentsServer.jar --Ice.Config=/etc/wink/Ice/node/servers/LocalGetDocumentsServer-rack2/config/config ]
    [ 10/09/06 09:31:08.844 icegridnode: Server: changed server `LocalGetDocumentsServer-rack2' state to `WaitForActivation' ]
    In LocalGetDocuments
    [ LocalGetDocumentsServer-rack2: Network: accepting tcp connections at 127.0.0.1:38404 ]
    [ LocalGetDocumentsServer-rack2: Network: accepting tcp connections at 192.168.1.33:38405 ]
    [ LocalGetDocumentsServer-rack2: Network: accepting tcp connections at 192.168.1.32:38406 ]
    [ LocalGetDocumentsServer-rack2: Network: tcp connection established
      local address = 192.168.1.33:48108
      remote address = 192.168.1.33:9018 ]
    [ LocalGetDocumentsServer-rack2: Network: tcp connection established
      local address = 192.168.1.32:53759
      remote address = 192.168.1.32:38363 ]
    [ 10/09/06 09:31:09.193 icegridnode: Server: changed server `LocalGetDocumentsServer-rack2' state to `Active' ]
    [ LocalGetDocumentsServer-rack2: Network: accepted tcp connection
      local address = 192.168.1.32:38406
      remote address = 192.168.1.32:49722 ]
    secondsSinceEpoch=1150899906 2006-06-21 07:25:06
    Xlib:  extension "XInputExtension" missing on display ":0.0".
    Failed to get list of devices
    kbuildsycoca running...
    [ LocalGetDocumentsServer-rack2: Network: stopping to accept tcp connections at 127.0.0.1:38404 ]
    [ LocalGetDocumentsServer-rack2: Network: stopping to accept tcp connections at 192.168.1.33:38405 ]
    [ LocalGetDocumentsServer-rack2: Network: stopping to accept tcp connections at 192.168.1.32:38406 ]
    [ LocalGetDocumentsServer-rack2: Network: closing tcp connection
      local address = 192.168.1.32:38406
      remote address = 192.168.1.32:49722 ]
    [ LocalGetDocumentsServer-rack2: Network: closing tcp connection
      local address = 192.168.1.33:48108
      remote address = 192.168.1.33:9018 ]
    [ LocalGetDocumentsServer-rack2: Network: closing tcp connection
      local address = 192.168.1.32:53759
      remote address = 192.168.1.32:38363 ]
    [ 10/09/06 09:31:19.364 icegridnode: Activator: detected termination of server `LocalGetDocumentsServer-rack2' ]
    [ 10/09/06 09:31:19.364 icegridnode: Server: changed server `LocalGetDocumentsServer-rack2' state to `Deactivating' ]
    [ 10/09/06 09:31:19.364 icegridnode: Server: changed server `LocalGetDocumentsServer-rack2' state to `Inactive' ]
    LocalAbstractGeneratorServer: ActivatingLocalAbstractGeneratorServer: Waiting For Shutdownlocalhost: rack2
    1 requests
    Request 0
        localpath: 2006/06/21/07/25//2818802
        query:     google
    nitems = 1
    LocalAbstractGeneratorServer: Got Shutdown[ 10/09/06 09:31:20.008 icegridnode: Activator: detected termination of server `RemoteAbstractServer-rack2' ]
    [ 10/09/06 09:31:20.008 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `Deactivating' ]
    [ 10/09/06 09:31:20.008 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `Inactive' ]
    

    The second run fails with an Ice.NoEndpointException:
    [ 10/09/06 09:31:22.897 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `Activating' ]
    [ 10/09/06 09:31:22.898 icegridnode: Activator: activating server `RemoteAbstractServer-rack2'
      path = java
      pwd = bin
      uid/gid = 2193/100
      args = java -jar RemoteAbstractGenerator.jar --Ice.Config=/etc/wink/Ice/node/servers/RemoteAbstractServer-rack2/config/config ]
    [ 10/09/06 09:31:22.900 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `WaitForActivation' ]
    In RemoteAbstractGenerator
    [ 10/09/06 09:31:23.379 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `Active' ]
    In RemoteAbstractGenerator
    nrequests = 1
    url:   http://www.google.com/
    query: google
    DbConnection::do_stmt_select_url_table before executeQuery ff90821feeb2b02a33a6f9fc8e5f3fcd
    DbConnection::do_stmt_select_url_table rs.first is true
    stmt_select_url_table returned true
    urlte.docuid 2818802
    DbConnection::do_stmt_select_doc_table before executeQuery e9eabf20eee3312f1440b3ccc3b48518
    DbConnection::do_stmt_select_doc_table rs.first is true
    stmt_select_doc_table returned true
    docte.path 2006/06/21/07/25/
    The proxystring is LocalAbstractGenerator@LocalAbstractServer-rack2.LocalAbstractGeneratorAdapter
    Exception in RemoteAbstractGenerator Ice.NoEndpointException
        proxy = "LocalAbstractGenerator -t @ LocalAbstractServer-rack2.LocalAbstractGeneratorAdapter"
    [ 10/09/06 09:31:33.937 icegridnode: Activator: detected termination of server `RemoteAbstractServer-rack2' ]
    [ 10/09/06 09:31:33.937 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `Deactivating' ]
    [ 10/09/06 09:31:33.938 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `Inactive' ]
    [ 10/09/06 09:31:40.010 icegridnode: Activator: detected termination of server `LocalAbstractServer-rack2' ]
    [ 10/09/06 09:31:40.010 icegridnode: Server: changed server `LocalAbstractServer-rack2' state to `Deactivating' ]
    [ 10/09/06 09:31:40.010 icegridnode: Server: changed server `LocalAbstractServer-rack2' state to `Inactive' ]
    
  • benoit
    benoit Rennes, France
    Hi Bruce,

    Thanks for the traces but unfortunately I still don't know what's happening. All I can say is that for some reasons the LocalAbstractServer-rack2 server termination appears to be detected well after the "Got Shutdown" message.

    I don't see the other messages that you mentioned in your first post ("calling Communicator::destroy", etc), are you sure that the server properly shuts down and terminate?

    Could you run again your test with the following trace properties enabled on the node:
    Ice.Logger.Timestamp=1
    IceGrid.Node.Trace.Activator=2
    IceGrid.Node.Trace.Server=3
    IceGrid.node.Trace.Adapter=3

    And the following poperty for the IceGrid regsitry:
    IceGrid.Registry.Trace.Locator=3

    Ideally, the easiest for me to investigate this problem would be a small self-contained test case. I tried to reproduce your issue with the IceGrid simple demo by introducing a dependency between 2 servers but it worked as expected.

    Cheers,
    Benoit.
  • Hello, I've attached the test rerun with the requested trace properties. Also, I've restored the debugging print statements indicating waiting for shutdown, go t shutdown, destroying communicator and communicator destroyed.

    One other possible clue. In this case, there are actually 3 servers and a client involved:

    TestClient ->
    RemoteAbstractGenerator ->
    LocalAbstractGenerator ->
    LocalGetDocumentServer

    The adapter timeout messages at 10/09/06 12:56:04.743 look very interesting.
  • A clue?

    I put together a simple test case:

    t1.java-> (Client)
    l1.java ->
    l2.cpp ->
    l3.java

    Unfortunately, the test case in its simplest form does not exhibit the problem. However, when I put in a delay between waitForShutdown and destroy in l2.cpp, I do got a consistent Ice.NoEndpointException. I'm not sure that this is the same error that I see in my application, (that's the problem with simplified test cases) but it very well might be and it looks like problem behaviour.

    The reason that I think this might be a clue is that the second server in my application (LocalAbstractGenerator) is a huge slow application that could take a while to shut down. It looks like it might be the case that IceGrid has problems with applications that shut down slowly.
  • benoit
    benoit Rennes, France
    Hi Bruce,

    Thanks, I think I better understand what is happening now.

    IceGrid doesn't know when a process is being deactivated until it actually detects its termination. So if for some reasons it takes a long time for the process to exit after the communicator shutdown, clients might get Ice::NoEndpointException exceptions as you are seeing.

    When the communicator is shutdown, the object adapters are deactivated. Upon deactivation, each object adapter clears its endpoints from the IceGrid registry (that's when you see the "server `...' adapter `..' deactivated" messages). If a request on an indirect proxy from the deactivated object adapter is made at that time, IceGrid sees that the object adapter endpoints are null and tries to activate the server but since the server is still running it believes that the adapter activation actually timed out... that's when you see the adapter activation timeout messages.

    I still don't know why it takes so long for the node to detect the termination of your server. If the machine is heavily loaded when your server terminates or if it takes a long time for the process to terminate after main() returns (is your process perhaps consuming a lot of memory causing a lot of swapping on the disk?), it's possible that the node doesn't detect the process termination in a timely manner.

    One way to prevent the Ice::NoEndpointException exceptions to occur would be to have a notification of the server deactivation before the process actually terminates. This would allow to queue requests from clients until the server terminates and is re-activated. I'll add this on our TODO list!

    In the meantime, the best is probably to not use Ice.ServerIdleTime for your slow server. If its activation/deactivation takes a long time, it's probably best to keep it active all the time.

    Let me know if you have more questions about this!

    Cheers,
    Benoit.