Archived
This forum has been archived. Please start a new discussion on GitHub.
IceGrid on-demand activation with Ice.ServerIdleTime
I've been trying to use IceGrid to start servers on an on-demand basis with a time-out to get rid of unused servers. It's been working pretty well, except for one puzzling problem.
I have a server, RemoteAbstractServer-cl6 which contacts another server, LocalAbstractServer-cl6. Both are IceGrid on-demand servers. Both have Ice.ServerIdleTime set to 10 seconds.
When RemoteAbstractServer-cl6 is unused for 10 seconds, it times out. Then if I try to contact it again, it starts up but fails to contact LocalAbstractServer-cl6. What's really strange is that about 15 seconds later, everything is OK again. RemoteAbstractServer-cl6 can contact LocalAbstractServer-cl6. During that period, ps shows that LocalAbstractServer-cl6 is a defunct process.
The error message is:
Exception in RemoteAbstractGenerator Ice.NoEndpointException proxy = "LocalAbstractGenerator -t @ LocalAbstractServer-cl6.LocalAbstractGeneratorAdapter"
The app.xml server descriptions are:
<server id="LocalAbstractServer-cl6" exe="./LocalAbstractGeneratorServer" pwd="bin" activation="on-demand">
  <property name="Ice.ServerIdleTime" value="10"/>
  <adapter name="LocalAbstractGeneratorAdapter" register-process="true" endpoints="tcp">
  </adapter>
</server>
<server id="RemoteAbstractServer-cl6" exe="java" pwd="bin" activation="on-demand">
  <option>-jar</option>
  <option>RemoteAbstractGenerator.jar</option>
  <property name="Ice.ServerIdleTime" value="10"/>
  <adapter name="RemoteAbstractGeneratorAdapter" register-process="true" endpoints="tcp">
  </adapter>
</server>
Any ideas why RemoteAbstractServer-cl6 would not be able to contact LocalAbstractServer-cl6 after a timeout? Why would it be a defunct process?
Its parent process is icegridnode. What would prevent icegridnode from wait()ing on it? Any workarounds?
Comments
-
It looks like for some reason LocalAbstractServer-cl6 does not exit properly. I suspect that while it's "half-dead" (defunct) IceGrid (the icegridnode) does not detect that it's dead.
I would first check that you properly destroy [not just shutdown] the Ice communicator in this process. Also, which version of Ice do you use?
Cheers,
Bernard
-
I put printfs/printlns in both servers around the relevant procedure calls. Here's what I'm seeing:
LocalAbstractGeneratorServer: Got Shutdown
LocalAbstractGeneratorServer: calling Communicator::destroy
LocalAbstractGeneratorServer: return from Communicator::destroy
LocalAbstractGeneratorServer: done, returning 0
RemoteAbstractGenerator: Got shutdown
RemoteAbstractGenerator: calling Communicator::destroy
RemoteAbstractGenerator: return from Communicator::destroy
RemoteAbstractGenerator: exit with 0
So I think that the Communicator is getting destroyed in both servers.
One strange thing is that the zombie process eventually gets reaped. It just takes a long time - around 15 seconds.
-
The problem occurs on icegridnode version 3.1.0 on a Gentoo system.
-
Hi Bruce,
It shouldn't take so long for the IceGrid node to detect the process termination -- I don't know why you are seeing this defunct process for so long. I'm afraid I can't say what could be wrong without more information though.
Could you run the icegridnode with the properties Ice.Logger.Timestamp=1, IceGrid.Node.Trace.Activator=2, and IceGrid.Node.Trace.Server=3 and post the trace output here once you've reproduced the problem?
Also, are the 2 servers running on the same node or on different nodes?
Cheers,
Benoit.
-
Btw, are you by any chance still using LinuxThreads on your Gentoo system? Ice no longer supports LinuxThreads, only NPTL (among other things, there were issues with waitpid() and LinuxThreads; see for example [thread=2472]this thread[/thread]).
To check the version of your pthread library, you can run the following command:
$ getconf "GNU_LIBPTHREAD_VERSION"
Cheers,
Benoit.
-
Thank you for looking into this.
The problem occurs with both servers running on a single node, on an NPTL 2.4 system.
Here is the output of the first (successful) run with the requested tracing turned on:

[ 10/09/06 09:31:07.558 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `Activating' ]
[ 10/09/06 09:31:07.558 icegridnode: Activator: activating server `RemoteAbstractServer-rack2' path = java pwd = bin uid/gid = 2193/100 args = java -jar RemoteAbstractGenerator.jar --Ice.Config=/etc/wink/Ice/node/servers/RemoteAbstractServer-rack2/config/config ]
[ 10/09/06 09:31:07.563 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `WaitForActivation' ]
In RemoteAbstractGenerator
[ 10/09/06 09:31:08.039 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `Active' ]
In RemoteAbstractGenerator
nrequests = 1
url: http://www.google.com/
query: google
DbConnection::do_stmt_select_url_table before executeQuery ff90821feeb2b02a33a6f9fc8e5f3fcd
DbConnection::do_stmt_select_url_table rs.first is true
stmt_select_url_table returned true
urlte.docuid 2818802
DbConnection::do_stmt_select_doc_table before executeQuery e9eabf20eee3312f1440b3ccc3b48518
DbConnection::do_stmt_select_doc_table rs.first is true
stmt_select_doc_table returned true
docte.path 2006/06/21/07/25/
The proxystring is LocalAbstractGenerator@LocalAbstractServer-rack2.LocalAbstractGeneratorAdapter
[ 10/09/06 09:31:08.541 icegridnode: Server: changed server `LocalAbstractServer-rack2' state to `Activating' ]
[ 10/09/06 09:31:08.541 icegridnode: Activator: activating server `LocalAbstractServer-rack2' path = ./LocalAbstractGeneratorServer pwd = bin uid/gid = 2193/100 args = ./LocalAbstractGeneratorServer --Ice.Config=/etc/wink/Ice/node/servers/LocalAbstractServer-rack2/config/config ]
[ 10/09/06 09:31:08.544 icegridnode: Server: changed server `LocalAbstractServer-rack2' state to `WaitForActivation' ]
Xlib: extension "XInputExtension" missing on display ":0.0".
Failed to get list of devices
[ 10/09/06 09:31:08.838 icegridnode: Server: changed server `LocalAbstractServer-rack2' state to `Active' ]
[ 10/09/06 09:31:08.842 icegridnode: Server: changed server `LocalGetDocumentsServer-rack2' state to `Activating' ]
[ 10/09/06 09:31:08.842 icegridnode: Activator: activating server `LocalGetDocumentsServer-rack2' path = java pwd = bin uid/gid = 2193/100 args = java -jar LocalGetDocumentsServer.jar --Ice.Config=/etc/wink/Ice/node/servers/LocalGetDocumentsServer-rack2/config/config ]
[ 10/09/06 09:31:08.844 icegridnode: Server: changed server `LocalGetDocumentsServer-rack2' state to `WaitForActivation' ]
In LocalGetDocuments
[ LocalGetDocumentsServer-rack2: Network: accepting tcp connections at 127.0.0.1:38404 ]
[ LocalGetDocumentsServer-rack2: Network: accepting tcp connections at 192.168.1.33:38405 ]
[ LocalGetDocumentsServer-rack2: Network: accepting tcp connections at 192.168.1.32:38406 ]
[ LocalGetDocumentsServer-rack2: Network: tcp connection established local address = 192.168.1.33:48108 remote address = 192.168.1.33:9018 ]
[ LocalGetDocumentsServer-rack2: Network: tcp connection established local address = 192.168.1.32:53759 remote address = 192.168.1.32:38363 ]
[ 10/09/06 09:31:09.193 icegridnode: Server: changed server `LocalGetDocumentsServer-rack2' state to `Active' ]
[ LocalGetDocumentsServer-rack2: Network: accepted tcp connection local address = 192.168.1.32:38406 remote address = 192.168.1.32:49722 ]
secondsSinceEpoch=1150899906
2006-06-21 07:25:06
Xlib: extension "XInputExtension" missing on display ":0.0".
Failed to get list of devices
kbuildsycoca running...
[ LocalGetDocumentsServer-rack2: Network: stopping to accept tcp connections at 127.0.0.1:38404 ]
[ LocalGetDocumentsServer-rack2: Network: stopping to accept tcp connections at 192.168.1.33:38405 ]
[ LocalGetDocumentsServer-rack2: Network: stopping to accept tcp connections at 192.168.1.32:38406 ]
[ LocalGetDocumentsServer-rack2: Network: closing tcp connection local address = 192.168.1.32:38406 remote address = 192.168.1.32:49722 ]
[ LocalGetDocumentsServer-rack2: Network: closing tcp connection local address = 192.168.1.33:48108 remote address = 192.168.1.33:9018 ]
[ LocalGetDocumentsServer-rack2: Network: closing tcp connection local address = 192.168.1.32:53759 remote address = 192.168.1.32:38363 ]
[ 10/09/06 09:31:19.364 icegridnode: Activator: detected termination of server `LocalGetDocumentsServer-rack2' ]
[ 10/09/06 09:31:19.364 icegridnode: Server: changed server `LocalGetDocumentsServer-rack2' state to `Deactivating' ]
[ 10/09/06 09:31:19.364 icegridnode: Server: changed server `LocalGetDocumentsServer-rack2' state to `Inactive' ]
LocalAbstractGeneratorServer: Activating
LocalAbstractGeneratorServer: Waiting For Shutdown
localhost: rack2 1 requests
Request 0 localpath: 2006/06/21/07/25//2818802 query: google nitems = 1
LocalAbstractGeneratorServer: Got Shutdown
[ 10/09/06 09:31:20.008 icegridnode: Activator: detected termination of server `RemoteAbstractServer-rack2' ]
[ 10/09/06 09:31:20.008 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `Deactivating' ]
[ 10/09/06 09:31:20.008 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `Inactive' ]
The second run fails with an Ice.NoEndpointException:

[ 10/09/06 09:31:22.897 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `Activating' ]
[ 10/09/06 09:31:22.898 icegridnode: Activator: activating server `RemoteAbstractServer-rack2' path = java pwd = bin uid/gid = 2193/100 args = java -jar RemoteAbstractGenerator.jar --Ice.Config=/etc/wink/Ice/node/servers/RemoteAbstractServer-rack2/config/config ]
[ 10/09/06 09:31:22.900 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `WaitForActivation' ]
In RemoteAbstractGenerator
[ 10/09/06 09:31:23.379 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `Active' ]
In RemoteAbstractGenerator
nrequests = 1
url: http://www.google.com/
query: google
DbConnection::do_stmt_select_url_table before executeQuery ff90821feeb2b02a33a6f9fc8e5f3fcd
DbConnection::do_stmt_select_url_table rs.first is true
stmt_select_url_table returned true
urlte.docuid 2818802
DbConnection::do_stmt_select_doc_table before executeQuery e9eabf20eee3312f1440b3ccc3b48518
DbConnection::do_stmt_select_doc_table rs.first is true
stmt_select_doc_table returned true
docte.path 2006/06/21/07/25/
The proxystring is LocalAbstractGenerator@LocalAbstractServer-rack2.LocalAbstractGeneratorAdapter
Exception in RemoteAbstractGenerator Ice.NoEndpointException proxy = "LocalAbstractGenerator -t @ LocalAbstractServer-rack2.LocalAbstractGeneratorAdapter"
[ 10/09/06 09:31:33.937 icegridnode: Activator: detected termination of server `RemoteAbstractServer-rack2' ]
[ 10/09/06 09:31:33.937 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `Deactivating' ]
[ 10/09/06 09:31:33.938 icegridnode: Server: changed server `RemoteAbstractServer-rack2' state to `Inactive' ]
[ 10/09/06 09:31:40.010 icegridnode: Activator: detected termination of server `LocalAbstractServer-rack2' ]
[ 10/09/06 09:31:40.010 icegridnode: Server: changed server `LocalAbstractServer-rack2' state to `Deactivating' ]
[ 10/09/06 09:31:40.010 icegridnode: Server: changed server `LocalAbstractServer-rack2' state to `Inactive' ]
-
Hi Bruce,
Thanks for the traces, but unfortunately I still don't know what's happening. All I can say is that for some reason the LocalAbstractServer-rack2 server termination appears to be detected well after the "Got Shutdown" message.
I don't see the other messages that you mentioned in your first post ("calling Communicator::destroy", etc.). Are you sure that the server properly shuts down and terminates?
Could you run your test again with the following trace properties enabled on the node:

Ice.Logger.Timestamp=1
IceGrid.Node.Trace.Activator=2
IceGrid.Node.Trace.Server=3
IceGrid.Node.Trace.Adapter=3

And the following property for the IceGrid registry:

IceGrid.Registry.Trace.Locator=3
Ideally, the easiest for me to investigate this problem would be a small self-contained test case. I tried to reproduce your issue with the IceGrid simple demo by introducing a dependency between 2 servers but it worked as expected.
Cheers,
Benoit.
-
Hello, I've attached the test rerun with the requested trace properties. Also, I've restored the debugging print statements indicating waiting for shutdown, got shutdown, destroying communicator, and communicator destroyed.
One other possible clue. In this case, there are actually 3 servers and a client involved:
TestClient ->
RemoteAbstractGenerator ->
LocalAbstractGenerator ->
LocalGetDocumentServer
The adapter timeout messages at 10/09/06 12:56:04.743 look very interesting.
-
A clue?
I put together a simple test case:
t1.java-> (Client)
l1.java ->
l2.cpp ->
l3.java
Unfortunately, the test case in its simplest form does not exhibit the problem. However, when I put a delay between waitForShutdown and destroy in l2.cpp, I do get a consistent Ice.NoEndpointException. I'm not sure that this is the same error that I see in my application (that's the problem with simplified test cases), but it very well might be, and it looks like problem behaviour.
The reason that I think this might be a clue is that the second server in my application (LocalAbstractGenerator) is a huge, slow application that could take a while to shut down. It looks like IceGrid might have problems with applications that shut down slowly.
-
Hi Bruce,
Thanks, I think I better understand what is happening now.
IceGrid doesn't know when a process is being deactivated until it actually detects its termination. So if for some reason it takes a long time for the process to exit after the communicator shutdown, clients might get Ice::NoEndpointException exceptions as you are seeing.
When the communicator is shutdown, the object adapters are deactivated. Upon deactivation, each object adapter clears its endpoints from the IceGrid registry (that's when you see the "server `...' adapter `..' deactivated" messages). If a request on an indirect proxy from the deactivated object adapter is made at that time, IceGrid sees that the object adapter endpoints are null and tries to activate the server but since the server is still running it believes that the adapter activation actually timed out... that's when you see the adapter activation timeout messages.
I still don't know why it takes so long for the node to detect the termination of your server. If the machine is heavily loaded when your server terminates or if it takes a long time for the process to terminate after main() returns (is your process perhaps consuming a lot of memory causing a lot of swapping on the disk?), it's possible that the node doesn't detect the process termination in a timely manner.
One way to prevent the Ice::NoEndpointException exceptions from occurring would be to have a notification of the server deactivation before the process actually terminates. This would allow IceGrid to queue client requests until the server terminates and is re-activated. I'll add this to our TODO list!
In the meantime, it's probably best not to use Ice.ServerIdleTime for your slow server. If its activation/deactivation takes a long time, keeping it active all the time is the safer choice.
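Concretely, following this advice with the descriptors from the original post, the slow server's descriptor could simply drop the Ice.ServerIdleTime property so that, once activated on demand, it is never idled out. This is a sketch based on the app.xml shown earlier:

```xml
<server id="LocalAbstractServer-cl6" exe="./LocalAbstractGeneratorServer"
        pwd="bin" activation="on-demand">
  <!-- No Ice.ServerIdleTime: once started, the server stays active. -->
  <adapter name="LocalAbstractGeneratorAdapter" register-process="true"
           endpoints="tcp">
  </adapter>
</server>
```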
Let me know if you have more questions about this!
Cheers,
Benoit.