adapter not active across some boxes in deployment

We have an 'interesting' situation in our production grid deployment:

We have 3 central boxes, each hosting 2 servers: one for HA-IceStorm, and one for the 3 services (call them A, B, and C) that constitute our application's functionality.

A, B, and C are in distinct replica-groups (call them ARG, BRG, and CRG).
ARG, BRG, and CRG have replica-group elements that look like this:
<replica-group id="ARG">
<load-balancing type="random" n-replicas="0"/>
</replica-group>
<replica-group id="BRG">
<load-balancing type="random" n-replicas="0"/>
</replica-group>
<replica-group id="CRG">
<load-balancing type="random" n-replicas="0"/>
</replica-group>

A few days ago (5/29), a bad router in our network generated a few hours of NoRouteToHost exceptions until it was addressed. Since then, all invocations against one of the services (A), from a few hundred clients, have targeted only one of the 3 available central boxes. In other words, all three central boxes are up, and invocations against the other two services (B and C) are distributed across all three central boxes, as expected. But all invocations against A are hitting only one of the central boxes. We can't figure out why.

Bumping up the logging on one of the clients (Ice.Trace.Location=2, Ice.Trace.Network=3, and Ice.Trace.Protocol=1) shows that the resolution of ARG returns all three central boxes. Likewise, running icegridadmin and typing
adapter endpoints ARG
also indicates that A has an adapter deployed on all three boxes.
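
For reference, here is roughly what that icegridadmin check looks like; the config file name and the output shown are only illustrative, not our actual deployment:

$ icegridadmin --Ice.Config=client.cfg
>>> adapter endpoints ARG
<endpoints of the A adapter on each of the three central boxes>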

The logging indicates that when the client attempts to connect to A on two of the central boxes, an SSL connection is successfully established. This is confirmed by the SSL logs on those two central boxes. However, the call does not seem to get routed to A on these two boxes, and so the client closes the SSL connection exactly two seconds later, as we have the clients configured with:

Ice.Override.ConnectTimeout=2000

Connecting to the TCP endpoints yields the same behavior, so it is not an SSL issue.

The one thing that is different for service A is that its adapter has
register-process="true"
whereas B's and C's adapters have it set to "false". I don't believe this is an issue, as only one of the adapters in a server should have register-process set to "true" (section 39.16.1 of the Ice 3.2.1 documentation).
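
Roughly, the A adapter descriptor looks like this; the adapter name and endpoints are placeholders rather than our exact deployment:

<adapter name="AAdapter" endpoints="ssl" replica-group="ARG" register-process="true"/>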

Do you have any ideas on what could trigger this behavior? And how could we fix it? There does not seem to be any way to bounce an individual adapter; only the whole server can be bounced. Is this correct? I'm thinking there is an issue mapping servant A in the Active Servant Map on the two failing central boxes, and that bouncing the whole server would (hopefully) address the issue.

Thanks

Dirk

Comments

  • benoit
    Rennes, France
    Hi Dirk,

    It sounds like service A is no longer servicing requests on 2 of your central boxes. From your description, it sounds like services A, B, and C are IceBox services running within the same IceBox server. Is this correct?

    Did you try connecting to one of the faulty service's TCP endpoints with telnet to see if it still accepts connections (e.g. telnet <ip> <port>)? If the service still accepts connections, you should see the "IceP" message in your telnet client.
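
    For example, a healthy endpoint answers right away with the connection validation message, which begins with the "IceP" bytes; the host and port below are placeholders:

       $ telnet <ip> <port>
       Trying <ip>...
       Connected to <ip>.
       Escape character is '^]'.
       IceP<a few non-printable bytes>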

    You could try deactivating service A with the iceboxadmin "service stop A" command, but if the service hangs because of a deadlock, this command will also most likely hang. If you use the iceboxadmin command from Ice 3.2.1, you can connect to the IceBox server hosting service A by configuring the IceBox.InstanceName and IceBox.ServiceManager.Endpoints properties (or the IceBox.InstanceName, Ice.Default.Locator, and IceBox.ServiceManager.AdapterId properties).
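
    A minimal sketch of such an iceboxadmin configuration; the instance name, host, port, locator proxy, and adapter id below are all hypothetical, so substitute the values from your deployment:

       # direct connection to the IceBox ServiceManager endpoints
       IceBox.InstanceName=CentralIceBox
       IceBox.ServiceManager.Endpoints=tcp -h central-box-2 -p 10000

       # or resolve the ServiceManager through the IceGrid locator instead:
       # Ice.Default.Locator=IceGrid/Locator:tcp -h registry-host -p 4061
       # IceBox.ServiceManager.AdapterId=CentralIceBox.ServiceManager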

    If you suspect a deadlock in the service or a hang, the best way to investigate is to attach a debugger to one of the faulty IceBox servers and dump the stack traces. However, this means that while the debugger is attached, services B and C will also stop dispatching requests from clients.

    Unless there's a good reason to collocate services in the same IceBox server, I would recommend isolating each service in its own IceBox server to avoid this kind of issue.

    Cheers,
    Benoit.
  • Benoit-
    Attempting the telnet resulted in 'IceP' for the live service, and nothing for the other two.

    I'm not sure what you mean by deadlock. I don't believe this has anything to do with the other services deployed in the same container. Clients simply cannot connect to the adapters corresponding to service A on two of the boxes, even though they successfully do so for services B and C running on the same two boxes. A is not speaking to B or C in the failed invocation; the communication is solely from a client to A. The SSL connection from the client to the two boxes works fine, but the invocation simply does not seem to be routed from the Ice runtime that validates the SSL connection to the actual servant corresponding to service A. Instead, the call seems to hang until the client times out the connection.

    I am not sure what you mean by 'this kind of issue'. Certainly any container (even IceBox) should be able to host more than a single endpoint. Please explain.

    Dirk
  • benoit
    Rennes, France
    Hi Dirk,

    So if you don't see the 'IceP' message, it appears that the service A communicator no longer accepts new connections.

    This indicates that no more threads from the Ice server thread pool are available to accept new connections. To figure out why service A doesn't accept new connections, we need to find out what these threads are doing. Does service A call on other Ice services? Does it perhaps call on services that were unreachable because of the routing issue? The best way to figure this out is to attach to the IceBox server and get a thread dump: the dump should show whether there's a deadlock or whether the server threads are stuck calling another service. The easiest way to get and save a thread dump with gdb is to do something like the following:
       $ script
       $ gdb icebox <pid>
       (gdb) thread apply all bt
       <thread dump> 
       (gdb) detach
       (gdb) quit
       $ exit
    

    The thread dump gets saved in the "typescript" file. Note that this can disrupt connections from clients to services B and C, and it will pause the process for the duration of the gdb session. It's also possible that you'll need to restart the IceBox server if there's a problem when you attach or detach the debugger.

    I didn't mean that the deadlock involves B or C. Apparently B and C are working just fine and still accept new connections, which is expected if they use separate Ice communicators (and therefore separate server thread pools). This is the IceBox default configuration unless you configure IceBox services to share a communicator (with the IceBox.UseSharedCommunicator property).
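
    For reference, that setting is enabled per service in the IceBox server configuration, along these lines (using your service name A):

       # opt service A into the IceBox shared communicator (not the default)
       IceBox.UseSharedCommunicator.A=1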

    IceBox can indeed host multiple services, and each service uses its own Ice communicator. However, when a service misbehaves and requires the IceBox server to be restarted, all the other services have to be restarted as well. This is why deploying each service within its own IceBox server can be more convenient in such situations.
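
    A rough sketch of that kind of deployment in the XML descriptor, with one IceBox server per service; the server id, entry point, and adapter details are hypothetical:

       <icebox id="IceBoxA" exe="icebox" activation="on-demand">
          <service name="A" entry="AService:createA">
             <adapter name="AAdapter" endpoints="ssl" replica-group="ARG" register-process="true"/>
          </service>
       </icebox>
       <!-- and similar icebox servers for services B and C -->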

    Cheers,
    Benoit.
  • Benoit-
    We bounced the server on the two boxes, and service A came back.

    I could not attach a debugger because I could not disrupt service B and C.

    I still doubt that it was a thread issue; we have 100 maximum server threads configured. I would imagine that if invocations were rejected because our thread pool was exhausted, logs would be generated. I went through our logs very carefully and found nothing. Also, we run a tight service business where customers get serious money back in the event of QoS violations, so we would definitely know if 100 invocations went off in the weeds.

    Dirk
  • benoit
    Rennes, France
    Yes, if you have set Ice.ThreadPool.Server.SizeMax=100 and didn't configure the Ice.ThreadPool.Server.SizeWarn property, Ice will by default print a warning when the server thread pool reaches 80 threads (see this FAQ). Note that these thread pool properties must be set in the service descriptor or service configuration file if your services don't use the IceBox shared communicator.
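
    For example, in the service descriptor this would look something like the following; the entry point and adapter are placeholders:

       <service name="A" entry="AService:createA">
          <property name="Ice.ThreadPool.Server.SizeMax" value="100"/>
          <property name="Ice.ThreadPool.Server.SizeWarn" value="80"/>
          <adapter name="AAdapter" endpoints="ssl" replica-group="ARG" register-process="true"/>
       </service>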

    Cheers,
    Benoit.