Communication problems between Glacier2 router & PermissionsVerifier & SessionManager

As the headline says, I'm experiencing some communication problems between various parts of my Ice routing setup. There may be several problems here, but since I currently don't know where one ends and another begins, I'll just put it all in one posting.

The setup is several clients talking to some servers via a Glacier2 router. Except for the clients' endpoint at the Glacier2 router, all servers know each other through a locator service handled by an IceGrid. However, most of the servers are not hosted on the IceGrid, but simply register their adapters dynamically. Everything is running on Ice-3.0.0. The clients are running on Windows and the servers are running on Fedora Core 4 Linux.
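
For reference, the dynamic registration boils down to configuration along these lines on each server (the adapter name here is just a placeholder):

# Placeholder names; each server points at the IceGrid locator and
# registers its adapter under an AdapterId at activation. The server
# need not be deployed on the IceGrid, but the registry must allow
# this via IceGrid.Registry.DynamicRegistration=1.
Ice.Default.Locator=IceGrid/Locator:tcp -p 5000
MyAdapter.Endpoints=tcp
MyAdapter.AdapterId=MyAdapterId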

The Glacier2 router uses the same service (called AccountManager) both as its PermissionsVerifier and as its SessionManager. All sessions (which work as facades for the clients) are currently registered on the same adapter as the interface to the AccountManager.

Now for the problems.

Sometimes the Glacier2 router will contact the PermissionsVerifier and (as far as our logs tell us) get the go-ahead to create a session. But the SessionManager is never called, so no session is created. Instead the client simply hangs.

Other times a similar thing happens, only there is an extreme delay (on the order of 15 minutes) before the SessionManager is contacted.

And once it seemed like a call to the SessionManager was being replayed several times: different accounts would have their permissions verified, but the same user name would be contacting the SessionManager.

We have also had a problem where sessions never seemed to time out. The session timeout was set to 60 seconds, but sessions would stay alive for 10 minutes or more, seemingly because the Glacier2 router never contacted the SessionManager.

Possibly related issues:

The IceGrid reports that some adapters get an activation timeout. I don't think this is too pertinent, as it doesn't directly involve the AccountManager.

The Glacier2 router reports various dispatch exceptions, such as:

operation: changeMode
glacier2router: warning: dispatch exception: TcpTransceiver.cpp:285: Ice::ConnectionLostException:
connection lost: Connection reset by peer
identity: A6Rp^9]@Co;[f[&\\M{Yh/HEART_RADLOCKBDOORS_669a3efe-c684-430f-84cc-f1ec575f06a1
facet:
operation: changeMode
glacier2router: warning: dispatch exception: ConnectionI.cpp:2030: Ice::CloseConnectionException:
protocol error: connection closed
identity: &Hlm~y?;m5v;98]D2E#l/HEART_RADLOCKBDOORS_ae7e12c7-292c-41f9-9c9a-a35f82a6bd55
facet:
operation: changeMode

However, they all seem to be related to reverse routing back to clients that have crashed.

The AccountManager (the service responsible for permission verification and session management) occasionally reports thread allocation errors like so:

./unix/accountmanager: error: cannot create thread for
`Ice.ThreadPool.Server':
Thread.cpp:551: IceUtil::ThreadSyscallException:
thread syscall exception: Cannot allocate memory

However, it seems to carry on running, and the error doesn't happen consistently. These problems sometimes appear only after the server setup has been running for hours, but once they start they persist even after a full restart of the server setup, and the system will often lock up within 10 minutes. That may have to do with logon patterns, or with system resources not being freed properly; I don't know.

If it would be useful I can attach traces and enable more tracing on the various servers, but since they may have to run for hours before reproducing the bug, I'd like at least some hints as to which services you think it would be useful to trace and log.

Most calls in the system are oneway calls. All connection management is disabled on both the client and server side, and retries are set to -1. The Glacier2 router is currently not set up to batch communications. I'd be happy to provide configuration files, but since there are quite a few for the various servers, I'd like to know which ones would be interesting before posting them all.
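
Concretely, the retry and connection management settings amount to something like this (assuming the usual Ice 3.0 property names):

# -1 disables automatic retries entirely.
Ice.RetryIntervals=-1
# 0 disables active connection management, so idle connections are
# never reaped automatically.
Ice.ACM.Client=0
Ice.ACM.Server=0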

Comments

  • matthew
    matthew NL, Canada
    Hi Nis,

    It looks to me like your account manager has an issue. If the session is created 15 minutes after the permissions verifier call, it's most likely because the permissions verifier is taking forever to respond. Furthermore, the fact that your account manager is reporting memory allocation problems is more evidence that it is the source of your headaches.

    I would look closer at the load you are imposing on this server. What is the thread pool configuration of the server? Most likely, this "cannot create thread" error occurs when creating additional threads once all the threads in the server-side thread pool are exhausted. What are the settings of the server thread pool size, size max and size warn? Are you getting warnings emitted in the account manager?
    Warning out(_instance->logger());
    out << "thread pool `" << _prefix << "' is running low on threads\n"
         << "Size=" << _size << ", " << "SizeMax=" << _sizeMax << ", " << "SizeWarn=" << _sizeWarn;
    

    Have you tried attaching a debugger to the account manager server when you see the memory allocation error and seeing what all the threads are doing? This would most likely help you locate the source of your problems.
  • matthew wrote:
    Hi Nis,
    It looks to me like your account manager has an issue. If the session is created 15 minutes after the permissions verifier call, it's most likely because the permissions verifier is taking forever to respond. Furthermore, the fact that your account manager is reporting memory allocation problems is more evidence that it is the source of your headaches.

    I would agree with that assessment.
    matthew wrote:
    I would look closer at the load you are imposing on this server. What is the thread pool configuration of the server? Most likely, this "cannot create thread" error occurs when creating additional threads once all the threads in the server-side thread pool are exhausted. What are the settings of the server thread pool size, size max and size warn? Are you getting warnings emitted in the account manager?
    Warning out(_instance->logger());
    out << "thread pool `" << _prefix << "' is running low on threads\n"
         << "Size=" << _size << ", " << "SizeMax=" << _sizeMax << ", " << "SizeWarn=" << _sizeWarn;
    

    No, we are not getting any thread pool warnings from the AccountManager. Or rather, we did when the thread pool had a max size of 5. We have since increased it to 1000 (there didn't seem to be any guidelines for thread pool sizes in the Ice manual, so having aimed low on the first try, we decided to aim high on the next), and no longer get any of those warnings. We could reduce the thread pool size, I guess - presumably then we would at least be getting a warning before we get allocation errors. Any idea what would cause an allocation error when the machine seems to have a lot of memory? I must admit that pthreads are not fresh in my memory, so I don't know what other resources they may reserve.
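
    For reference, the properties in play are these (the values are illustrative, not a recommendation):

    Ice.ThreadPool.Server.Size=5        # threads created up front
    Ice.ThreadPool.Server.SizeMax=1000  # ceiling the pool may grow to
    Ice.ThreadPool.Server.SizeWarn=800  # logs the "running low on threads" warning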

    Would it have any effect to use different adapters for the Sessions and for the AccountManager? Apart from total machine load, does it have any significance that all the services are currently running on the same machine? The machine only has a ~20% CPU load and ~30% memory load, so in total it is not heavily loaded. Nor is the network load anywhere near the limit of the machine or the network it is on. But the AccountManager may be under some strain, as all client communication goes through their Sessions (their facades). I don't really know how to reduce the load here, as we do need the facades. Would it be any help to use another adapter for the sessions? Even if it wouldn't relieve load, would it make it easier to track where the stress is occurring? Would it make sense to use direct rather than indirect proxies for the internal server connections - just to remove a middleman in the proxy resolution?
    matthew wrote:
    Have you tried attaching a debugger to the account manager server when you see the memory allocation error and seeing what all the threads are doing? This would most likely help you locate the source of your problems.

    No, we haven't done that yet. I was initially suspecting the Glacier2 router, since that was the service which suddenly dropped the connection, and I didn't know what it was supposed to be doing, so debugging it seemed a bit daunting. I'll try debugging. I assume I should be looking for threads that for some reason haven't been freed and returned to the thread pool. How would that look? All the calls I can see the AccountManager handling seem to complete just fine (going by the output in the log), and most of what the Sessions are doing is simply dispatching oneway calls along the correct proxies to the servers, so I can't see why they would take up a lot of resources. Will threads or connections have a tendency to hang around if the client end crashes? Is there a timeout which should be set up to collect those kinds of resources?
  • benoit
    benoit Rennes, France
    Hi,
    I would agree with that assessment.

    No, we are not getting any thread pool warnings from the AccountManager. Or rather, we did when the thread pool had a max size of 5. We have since increased it to 1000 (there didn't seem to be any guidelines for thread pool sizes in the Ice manual, so having aimed low on the first try, we decided to aim high on the next), and no longer get any of those warnings. We could reduce the thread pool size, I guess - presumably then we would at least be getting a warning before we get allocation errors. Any idea what would cause an allocation error when the machine seems to have a lot of memory? I must admit that pthreads are not fresh in my memory, so I don't know what other resources they may reserve.

    Threads might consume a lot of memory depending on the default stack size for each thread on your OS (I believe on Linux this is the value returned by "ulimit -s"). You can tune the stack size with the Ice.ThreadPool.Server.StackSize property, but I think 1000 threads might be too many unless most of these threads are almost always inactive. Why did you have to bump the number of threads for the thread pool in the first place? Did the problem start to show up when you changed this configuration? Do you have an idea of how many threads are getting created in your server? (You could set the thread pool SizeWarn property to a much lower limit, 50 for example, to see if you reach this limit.)
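
    To put rough numbers on it: with a default 8 MB stack (a common "ulimit -s" value), 1000 threads can reserve on the order of 8 GB of address space, which a 32-bit process cannot map, so pthread_create fails with ENOMEM -- which is exactly the "Cannot allocate memory" in your log. Something like this would keep the reservation bounded (example values only):

    # 256 KB stack per thread instead of the multi-MB OS default.
    Ice.ThreadPool.Server.StackSize=262144
    # Warn well before thread creation starts failing.
    Ice.ThreadPool.Server.SizeWarn=50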
    Would it have any effect to use different adapters for the Sessions and for the AccountManager? Apart from total machine load, does it have any significance that all the services are currently running on the same machine? The machine only has a ~20% CPU load and ~30% memory load, so in total it is not heavily loaded. Nor is the network load anywhere near the limit of the machine or the network it is on. But the AccountManager may be under some strain, as all client communication goes through their Sessions (their facades). I don't really know how to reduce the load here, as we do need the facades. Would it be any help to use another adapter for the sessions? Even if it wouldn't relieve load, would it make it easier to track where the stress is occurring? Would it make sense to use direct rather than indirect proxies for the internal server connections - just to remove a middleman in the proxy resolution?

    Yes, you could try to use a different adapter with its own thread pool for the sessions. This way, if the issue is caused by a problem with your session objects (e.g.: calls not returning and causing some threads from the thread pool to be consumed for a long time), the rest of the functionality of your server shouldn't be affected.

    Using direct vs. indirect proxies should have only a very small impact on performance, as a cache is used by the Ice runtime to minimize calls on the Ice location service. For example, if you invoke on a proxy dummy@MyAdapterId, the Ice client runtime will get the endpoints for the adapter "MyAdapterId" from the location service and will cache these endpoints. Further invocations on an indirect proxy with the "MyAdapterId" adapter id will be done directly using the endpoints retrieved from the cache.
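
    For instance (a sketch with a hypothetical identity "dummy"):

    #include <Ice/Ice.h>

    // Only the first invocation on an indirect proxy pays for a locator
    // lookup; the resolved endpoints are then cached by the communicator.
    void
    pingTwice(const Ice::CommunicatorPtr& communicator)
    {
        Ice::ObjectPrx obj = communicator->stringToProxy("dummy@MyAdapterId");
        obj->ice_ping(); // locator lookup, endpoints cached
        obj->ice_ping(); // served from the endpoint cache, no locator call
    }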
    No, we haven't done that yet. I was initially suspecting the Glacier2 router, since that was the service which suddenly dropped the connection, and I didn't know what it was supposed to be doing, so debugging it seemed a bit daunting. I'll try debugging. I assume I should be looking for threads that for some reason haven't been freed and returned to the thread pool. How would that look? All the calls I can see the AccountManager handling seem to complete just fine (going by the output in the log), and most of what the Sessions are doing is simply dispatching oneway calls along the correct proxies to the servers, so I can't see why they would take up a lot of resources. Will threads or connections have a tendency to hang around if the client end crashes? Is there a timeout which should be set up to collect those kinds of resources?

    No, threads or connections shouldn't hang around if a client crashes.

    Your session objects forward requests from your clients to other backend servers. Is it possible that the rate of incoming requests from your clients is too high and your backend servers can't keep up? If that's the case, it's possible that your session objects end up creating lots of threads to try to keep up with the incoming requests. But in the end this brings the server to a crawl, because too many threads cause lots of overhead -- especially if all these threads are runnable (memory allocation, thread context switches, etc.).

    If your session objects just forward requests to your backend servers, I would recommend using a fixed-size thread pool instead.

    In any case, the best way to investigate these problems is to attach to the suspicious processes with the debugger when you see that your application starts to hang. By going through the stacks of the different threads, you should see what eventually hangs (calls to your backend objects, incoming calls on the session objects, etc.).

    Cheers,
    Benoit.
  • matthew
    matthew NL, Canada
    I would also add that if you are forwarding lots of twoway invocations from your session facade to the backend, you should use chained AMI and AMD to avoid holding threads in the facade. You can see a concrete example of what I'm talking about in the asynchronous invocations article that I wrote for the Connections newsletter.
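
    The shape of the pattern is roughly this -- a sketch with hypothetical Facade and Backend interfaces, assuming ["amd"] metadata on the facade operation and ["ami"] on the backend one:

    #include <Ice/Ice.h>
    #include <Facade.h>  // generated from: ["amd"] interface Facade { string op(string s); };
    #include <Backend.h> // generated from: ["ami"] interface Backend { string op(string s); };

    // AMI callback that completes the pending AMD dispatch when the
    // backend answers, so no facade thread waits in between.
    class OpCB : public AMI_Backend_op
    {
    public:
        OpCB(const AMD_Facade_opPtr& cb) : _cb(cb) {}

        virtual void ice_response(const std::string& result)
        {
            _cb->ice_response(result); // forward the backend's answer to the client
        }

        virtual void ice_exception(const Ice::Exception& ex)
        {
            _cb->ice_exception(ex); // propagate failures the same way
        }

    private:
        const AMD_Facade_opPtr _cb;
    };

    class FacadeI : public Facade
    {
    public:
        FacadeI(const BackendPrx& backend) : _backend(backend) {}

        virtual void op_async(const AMD_Facade_opPtr& cb, const std::string& s,
                              const Ice::Current&)
        {
            // Fire the backend call and return immediately; the dispatch
            // thread goes straight back to the thread pool.
            _backend->op_async(new OpCB(cb), s);
        }

    private:
        const BackendPrx _backend;
    };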
  • matthew wrote:
    I would also add that if you are forwarding lots of twoway invocations from your session facade to the backend, you should use chained AMI and AMD to avoid holding threads in the facade. You can see a concrete example of what I'm talking about in the asynchronous invocations article that I wrote for the Connections newsletter.

    Thanks - we do have some chained twoway invocations, and I'll try cleaning those up. Nice to have an article to look at.
  • benoit wrote:
    Hi,

    Threads might consume a lot of memory depending on the default stack size for each thread on your OS (I believe on Linux this is the value returned by "ulimit -s"). You can tune the stack size with the Ice.ThreadPool.Server.StackSize property, but I think 1000 threads might be too many unless most of these threads are almost always inactive. Why did you have to bump the number of threads for the thread pool in the first place? Did the problem start to show up when you changed this configuration? Do you have an idea of how many threads are getting created in your server? (You could set the thread pool SizeWarn property to a much lower limit, 50 for example, to see if you reach this limit.)

    The thread pool was increased because the 5 threads we started out with were obviously too conservative. It was increased to 1000 as a sort of first step in a search procedure: we aimed too low at first, then we tried to see what effect it had to aim high. We will try setting it at a more reasonable level.
    benoit wrote:
    Yes, you could try to use a different adapter with its own thread pool for the sessions. This way, if the issue is caused by a problem with your session objects (e.g.: calls not returning and causing some threads from the thread pool to be consumed for a long time), the rest of the functionality of your server shouldn't be affected.

    Great. I'll try doing that then.
    benoit wrote:
    Using direct vs. indirect proxies should have only a very small impact on performance, as a cache is used by the Ice runtime to minimize calls on the Ice location service. For example, if you invoke on a proxy dummy@MyAdapterId, the Ice client runtime will get the endpoints for the adapter "MyAdapterId" from the location service and will cache these endpoints. Further invocations on an indirect proxy with the "MyAdapterId" adapter id will be done directly using the endpoints retrieved from the cache.

    I figured it wouldn't make much difference, but just wanted to check.
    benoit wrote:
    No, threads or connections shouldn't hang around if a client crashes.

    Your session objects forward requests from your clients to other backend servers. Is it possible that the rate of incoming requests from your clients is too high and your backend servers can't keep up? If that's the case, it's possible that your session objects end up creating lots of threads to try to keep up with the incoming requests. But in the end this brings the server to a crawl, because too many threads cause lots of overhead -- especially if all these threads are runnable (memory allocation, thread context switches, etc.).

    It is possible that there are too many requests for the server to handle, although as I said the machine isn't complaining about the load, so it wasn't really my first suspicion.
    benoit wrote:
    If your session objects just forward requests to your backend servers, I would recommend using a fixed-size thread pool instead.

    Ok - I'll try that. To set a fixed size, do I just need to set the Size, or should I set SizeMax and SizeWarn to the same number?
    benoit wrote:
    In any case, the best way to investigate these problems is to attach to the suspicious processes with the debugger when you see that your application starts to hang. By going through the stacks of the different threads, you should see what eventually hangs (calls to your backend objects, incoming calls on the session objects, etc.).
  • matthew wrote:
    I would also add that if you are forwarding lots of twoway invocations from your session facade to the backend, you should use chained AMI and AMD to avoid holding threads in the facade. You can see a concrete example of what I'm talking about in the asynchronous invocations article that I wrote for the Connections newsletter.

    Btw, is there any reason to use AMI and AMD for the oneway calls as well? I assume not, but just want to be sure.
  • benoit wrote:
    In any case, the best way to investigate these problems is to attach to the suspicious processes with the debugger when you see that your application starts to hang. By going through the stacks of the different threads, you should see what eventually hangs (calls to your backend objects, incoming calls on the session objects, etc.).

    I'll get on the debugger, but it'll take a short while to set up a practical way of attaching it to our server in the US, so I'll just bug you some more with a few protocol traces.

    When the session login goes according to plan, it looks like this:
    [ ./unix/accountmanager: Protocol: received request
      message type = 0 (request)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 101
      request id = 4
      identity = AccountManager
      facet = 
      operation = checkPermissions
      mode = 1 (nonmutating)
      context =  ]
    AccountManagerImpl::checkPermissions - start: 
    [ ./unix/accountmanager: Protocol: sending request
      message type = 0 (request)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 65
      request id = 3
      identity = WorldDataManager
      facet = 
      operation = loadAccount
      mode = 1 (nonmutating)
      context =  ]
    [ ./unix/accountmanager: Protocol: received reply
      message type = 2 (reply)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 329
      request id = 3
      reply status = 0 (ok) ]
    AccountManagerImpl::checkPermissions - end well
    [ ./unix/accountmanager: Protocol: sending reply
      message type = 2 (reply)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 39
      request id = 4
      reply status = 0 (ok) ]
    [ ./unix/accountmanager: Protocol: received request
      message type = 0 (request)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 58
      request id = 5
      identity = AccountManager
      facet = 
      operation = create
      mode = 0 (normal)
      context =  ]
    AccountManagerImpl::create (session) for user id fenryll - start 
    [ ./unix/accountmanager: Protocol: sending request
      message type = 0 (request)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 65
      request id = 4
      identity = WorldDataManager
      facet = 
      operation = loadAccount
      mode = 1 (nonmutating)
      context =  ]
    [ ./unix/accountmanager: Protocol: received reply
      message type = 2 (reply)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 329
      request id = 4
      reply status = 0 (ok) ]
    GameClientFacadeImpl::setThisProxy
    AccountManagerImpl::created session 1 for id fenryll5BDFB311-355D-43BC-B51C-30F794C72B1B with proxy fenryll5BDFB311-355D-43BC-B51C-30F794C72B1B -t @ AccountManager
    AccountManagerImpl::create (session) - end 
    [ ./unix/accountmanager: Protocol: sending reply
      message type = 2 (reply)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 89
      request id = 5
      reply status = 0 (ok) ]
    
    

    And the corresponding Glacier2 trace is:
    [ glacier2router: Protocol: received request
      message type = 0 (request)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 98
      request id = 2
      identity = Glacier2/router
      facet = 
      operation = createSession
      mode = 0 (normal)
      context =  ]
    [ glacier2router: Protocol: sending request
      message type = 0 (request)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 101
      request id = 4
      identity = AccountManager
      facet = 
      operation = checkPermissions
      mode = 1 (nonmutating)
      context =  ]
    [ glacier2router: Protocol: received reply
      message type = 2 (reply)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 39
      request id = 4
      reply status = 0 (ok) ]
    [ glacier2router: Protocol: sending request
      message type = 0 (request)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 58
      request id = 5
      identity = AccountManager
      facet = 
      operation = create
      mode = 0 (normal)
      context =  ]
    [ glacier2router: Protocol: received reply
      message type = 2 (reply)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 89
      request id = 5
      reply status = 0 (ok) ]
    [ glacier2router: Glacier2: created session
      user-id = fenryll
      category = hM-nR^;P-a+|mR6[a|+1
      local address = 208.64.64.66:10005
      remote address = 82.123.64.132:3627 ]
    [ glacier2router: Protocol: sending reply
      message type = 2 (reply)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 89
      request id = 2
      reply status = 0 (ok) ]
    
    

    When things go wrong the AccountManager says:
    [ ./unix/accountmanager: Protocol: received request
      message type = 0 (request)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 105
      request id = 1293
      identity = AccountManager
      facet = 
      operation = checkPermissions
      mode = 1 (nonmutating)
      context =  ]
    AccountManagerImpl::checkPermissions - start: <metalpuppet>
    [ ./unix/accountmanager: Protocol: sending request
      message type = 0 (request)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 69
      request id = 326
      identity = WorldDataManager
      facet = 
      operation = loadAccount
      mode = 1 (nonmutating)
      context =  ]
    [ ./unix/accountmanager: Protocol: received reply
      message type = 2 (reply)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 341
      request id = 326
      reply status = 0 (ok) ]
    AccountManagerImpl::checkPermissions - end well
    [ ./unix/accountmanager: Protocol: sending reply
      message type = 2 (reply)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 39
      request id = 1293
      reply status = 0 (ok) ]
    
    

    And Glacier2 says:
    [ glacier2router: Protocol: received request
      message type = 0 (request)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 102
      request id = 2
      identity = Glacier2/router
      facet = 
      operation = createSession
      mode = 0 (normal)
      context =  ]
    [ glacier2router: Protocol: sending request
      message type = 0 (request)
      compression status = 0 (not compressed; do not compress response, if any)
      message size = 105
      request id = 1290
      identity = AccountManager
      facet = 
      operation = checkPermissions
      mode = 1 (nonmutating)
      context =  ]
    
    

    So as far as I can see from the trace, the AccountManager is returning just fine from the permission verification, but never gets asked to create a session.

    Glacier2 continues to handle other routing for the already created Session proxies, but doesn't seem to follow up on the Session creation it was asked to start.

    Neither Glacier2 nor the AccountManager show any delays in responding to the createSession / checkPermissions calls, and all the sessions that have been created are responding without delays.

    I tried reducing the thread pool warning size to 50, and neither log has any warnings. The server isn't stressed in any way, and this time the problem appeared with just 5 connecting clients, even though we have successfully had 30 connections (using the same setup except for the addition of tracing and the lower setting for thread pool warnings) without seeing the problem.
  • benoit
    benoit Rennes, France
    The thread pool was increased because the 5 threads we started out with were obviously too conservative. It was increased to 1000 as a sort of first step in a search procedure: we aimed too low at first, then we tried to see what effect it had to aim high. We will try setting it at a more reasonable level.

    Using 5 threads might be perfectly fine if you're just forwarding requests with oneway or twoway AMI calls, because there will be only a little waiting in the server thread pool. That is, these 5 threads will be runnable most of the time.

    It's interesting to increase the number of threads if the methods being dispatched have to wait for some time (wait for a twoway call to return, wait for some resources, etc.) and if you see that incoming calls are being delayed because there are no more threads available in the server thread pool.
    Ok - I'll try that. To set a fixed size, do I just need to set the Size, or should I set SizeMax and SizeWarn to the same number?

    You can just set the Size property. By default, if it's not specified, SizeMax will have the value of the Size property (i.e., a fixed-size thread pool).
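
    I.e., for a fixed pool of five threads this is enough:

    # SizeMax defaults to Size, so this alone gives a fixed-size pool.
    Ice.ThreadPool.Server.Size=5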

    Benoit.
  • benoit
    benoit Rennes, France
    Btw, is there any reason to use AMI and AMD for the oneway calls as well? I assume not, but just want to be sure.

    Correct, AMI/AMD is only useful for twoway calls -- you don't need to use it for oneway calls.

    Cheers,
    Benoit.
  • benoit wrote:
    Using 5 threads might be perfectly fine if you're just forwarding requests with oneway or twoway AMI calls, because there will be only a little waiting in the server thread pool. That is, these 5 threads will be runnable most of the time.

    It's interesting to increase the number of threads if the methods being dispatched have to wait for some time (wait for a twoway call to return, wait for some resources, etc.) and if you see that incoming calls are being delayed because there are no more threads available in the server thread pool.

    You can just set the Size property. By default, if it's not specified, SizeMax will have the value of the Size property (i.e., a fixed-size thread pool).

    Benoit.

    Well, we did have some twoway calls that could take some time to finish (database reads etc.), and since they weren't handled using AMD and AMI, they probably resulted in some holdups in the AccountManager. At least we were seeing some long delays (though not the complete failure to communicate that we sometimes see now) when we were just using the size 5 thread pool. Of course that may become less of a problem once I get it rewritten to use AMD and AMI for the twoway calls.
  • benoit
    benoit Rennes, France
    So as far as I can see from the trace, the AccountManager is returning just fine from the permission verification, but never gets asked to create a session.

    Right, from these traces it looks like Glacier2 doesn't actually receive or read the reply from the checkPermissions request it sent. The only reason why Glacier2 wouldn't read this reply is that all the threads from its client thread pool are blocked on something (which shouldn't happen). Can you attach the debugger to the glacier2router process and get a dump of the thread stacks? Could you also post the configuration of your Glacier2 service?
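
    For example, with gdb:

    $ gdb -p $(pidof glacier2router)
    (gdb) thread apply all bt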
    Glacier2 continues to handle other routing for the already created Session proxies, but doesn't seem to follow up on the Session creation it was asked to start.

    Neither Glacier2 nor the AccountManager show any delays in responding to the createSession / checkPermissions calls, and all the sessions that have been created are responding without delays.

    I tried reducing the thread pool warning size to 50, and neither log has any warnings. The server isn't stressed in any way, and this time the problem appeared with just 5 connecting clients, even though we have successfully had 30 connections (using the same setup except for the addition of tracing and the lower setting for thread pool warnings) without seeing the problem.

    Once a createSession call hangs, can other clients still create sessions?

    Benoit.
  • benoit wrote:
    Right, from these traces it looks like Glacier2 doesn't actually receive or read the reply from the checkPermissions request it sent. The only reason why Glacier2 wouldn't read this reply is that all the threads from its client thread pool are blocked on something (which shouldn't happen). Can you attach the debugger to the glacier2router process and get a dump of the thread stacks? Could you also post the configuration of your Glacier2 service?

    As I said, it'll take me a while to set up remote debugging for our American servers, and we haven't been able to reproduce the bug in-house. When I get it up and running I'll try to attach a debugger to the Glacier2 service.

    Our configuration settings for the Glacier2 service are:
    #Ice.Default.Host=192.168.0.31
    Ice.Default.Locator=IceGrid/Locator:tcp -p 5000
    
    # We must set the stack size of new threads created by Glacier2. The
    # default on Linux is typically in the 10MB range, which is way too
    # high.
    #
    # Since Glacier2 always uses thread-per-connection mode, we must use
    # the property below to set the thread stack size. Internal Glacier2
    # threads also use this property value.
    
    Ice.ThreadPerConnection.StackSize=262144
    
    # The client-visible endpoint of Glacier2. This should be an endpoint
    # visible from the public Internet, and it should be secure.
    #should be ssl - but that is for later
    #Glacier2.Client.PublishedEndpoints=tcp -h 83.91.134.45 -p 10005
    Glacier2.Client.Endpoints=tcp -p 10005
    
    
    # The server-visible endpoint of Glacier2. This endpoint is only
    # required if callbacks are needed (leave empty otherwise). This
    # should be an endpoint on an internal network (like 192.168.x.x), or
    # on the loopback, so that the server is not directly accessible from
    # the Internet.
    
    Glacier2.Server.Endpoints=tcp 
    
    # This configures the session manager. If no external session manager
    # is used, sessions are only handled internally by Glacier2.
    Glacier2.SessionManager=AccountManager:default -p 6500
    
    # The permissions verifier. In our setup the AccountManager verifies
    # the user-id / password combinations.
    
    Glacier2.PermissionsVerifier=AccountManager:default -p 6500
    
    # The timeout for inactive sessions. If any client session is inactive
    # for longer than this value, the session expires and is removed. The
    # unit is seconds.
    #
    #Glacier2.SessionTimeout=20000
    Glacier2.SessionTimeout=60
    
    
    # Glacier can forward requests buffered or unbuffered. Unbuffered
    # means a lower resource consumption, as buffering requires one
    # additional thread per connected client or server. However, without
    # buffering, messages cannot be batched and message overriding doesn't
    # work either. Also, with unbuffered request forwarding, the caller
    # thread blocks for twoway requests.
    #
    #Glacier2.Client.Buffered=1
    #Glacier2.Server.Buffered=1
    
    
    # These two lines instruct Glacier2 to forward contexts both for
    # regular routing, as well as for callbacks (reverse routing).
    #
    Glacier2.Client.ForwardContext=1
    Glacier2.Server.ForwardContext=1
    
    # To prevent Glacier2 from being flooded with requests from or to one
    # particular client, Glacier2 can be configured to sleep for a certain
    # period after all current requests for this client have been
    # forwarded. During this sleep period, new requests for the client are
    # queued. These requests are then all sent once the sleep period is
    # over. The unit is milliseconds.
    
    #Glacier2.Client.SleepTime=500
    #Glacier2.Server.SleepTime=500
    
    # With the two settings below, Glacier2 can be instructed to always
    # batch oneways, even if they are sent with a _fwd/o instead of a
    # _fwd/O context.
    #
    #Glacier2.Client.AlwaysBatch=0
    #Glacier2.Server.AlwaysBatch=0
    
    
    # Glacier2 always disables active connection management so there is no
    # need to configure this manually. Connection retry does not need to
    # be disabled, as it's safe for Glacier2 to retry outgoing connections
    # to servers. Retry for incoming connections from clients must be
    # disabled in the clients.
    #
    
    
    # Various settings to trace requests, overrides, etc.
    #
    Glacier2.Client.Trace.Request=1
    Glacier2.Server.Trace.Request=1
    Glacier2.Client.Trace.Override=1
    Glacier2.Server.Trace.Override=1
    Glacier2.Client.Trace.Reject=1
    Glacier2.Trace.Session=1
    Glacier2.Trace.RoutingTable=1
    
    
    # Other settings.
    
    Ice.Trace.Network=1
    Ice.Trace.Protocol=1
    Ice.Warn.Connections=1
    Ice.MessageSizeMax=2048
    
    #Ice.Plugin.IceSSL=IceSSL:create
    #IceSSL.Client.CertPath=../../../certs
    #IceSSL.Client.Config=sslconfig.xml
    #IceSSL.Server.CertPath=../../../certs
    #IceSSL.Server.Config=sslconfig.xml
    #IceSSL.Trace.Security=1
    

    Most of it is just taken directly from the Glacier2 examples you have provided. Some of it may be spurious, so if anything looks completely wrong, please tell me.
    benoit wrote:
    Once a createSession call hangs, can other clients still create sessions?

    Usually, once it stops creating sessions, no new sessions will be created. I do think that I once saw it resume session creation, but that may just have been me misreading a log. Is Glacier2 keeping any kind of internal database (like the topic manager or IceGrid)? Because once the problems start they seem to persist even if we restart the servers. But if it gets a long rest it sometimes works fine for hours, even with many clients connecting and disconnecting.

    Another thing: once the system starts to reject making new sessions, old sessions also seem to stop timing out. They can still be ended manually, though.
  • benoit
    benoit Rennes, France
    I said earlier that Glacier2 could stop reading answers if all the threads from its client thread pool were blocked. That's actually not correct, since Glacier2 doesn't use any thread pools -- it only uses thread-per-connection. I'm afraid I really don't see any reason why Glacier2 wouldn't read the reply from your account manager server (unless something goes wrong with the connection between your server and Glacier2 -- but I suspect we would see some warnings if this were the case).
    Most of it is just taken directly from the Glacier2 examples you have provided. Some of it may be spurious, so if anything looks completely wrong, please tell me.

    I can't see anything wrong with this configuration; server buffering is correctly enabled (i.e., you didn't set Glacier2.Server.Buffered to 0), so even if something goes wrong with a client it shouldn't affect other clients.
    Usually, once it stops creating sessions, no new sessions will be created. I do think that I once saw it resume session creation, but that may just have been me misreading a log. Is Glacier2 keeping any kind of internal database (like the topic manager or IceGrid)? Because once the problems start they seem to persist even if we restart the servers. But if it gets a long rest it sometimes works fine for hours, even with many clients connecting and disconnecting.

    No, Glacier2 doesn't keep any databases, its state is totally transient.
    Another thing: once the system starts to reject making new sessions, old sessions also seem to stop timing out. They can still be ended manually, though.

    Do you mean that old sessions aren't getting destroyed automatically by Glacier2 anymore if the client vanished without destroying the session explicitly?

    Cheers,
    Benoit.
  • benoit wrote:
    I said earlier that Glacier2 could stop reading answers if all the threads from its client thread pool were blocked. That's actually not correct, since Glacier2 doesn't use any thread pools -- it only uses thread-per-connection. I'm afraid I really don't see any reason why Glacier2 wouldn't read the reply from your account manager server (unless something goes wrong with the connection between your server and Glacier2 -- but I suspect we would see some warnings if this were the case).

    I haven't been able to find any connection warnings that seemed related to the connection between Glacier2 and the AccountManager, but I may have overlooked them. As far as I know I've enabled all the relevant tracing (Network, Protocol and Warn.Connections). I'll try to see if I can find anything in the logs. Would it be possible for a connection error to happen some time before the symptom appears, or should it be immediately related to the calls?

    I might be able to set up some extra communication tracing by placing the AccountManager on a different machine, and then use Ethereal to look at the communication.
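
    Presumably something along these lines would capture the relevant traffic (port 6500 being the AccountManager endpoint from our config; the interface name is a guess):

    # Capture Glacier2 <-> AccountManager traffic for offline analysis.
    $ tcpdump -i eth0 -s 0 -w accountmanager.pcap port 6500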

    If I want to debug the Glacier2 process, what would be the relevant methods to insert watches and breakpoints in?
    benoit wrote:
    I can't see anything wrong with this configuration, server buffering is correctly enabled (i.e.: you didn't set Glacier2.Server.Buffered to 0) so even if something goes wrong with a client it shouldn't affect other clients.

    As I said, it was stolen almost completely from your examples, so it should be relatively correct. I may want to explicitly enable the server buffering again, just to make it clearer to myself what happens.
    benoit wrote:
    No, Glacier2 doesn't keep any databases, its state is totally transient.

    Ok. I hadn't found any signs of it either, I just wanted to check. In that case it is probably just a case of some system resources (sockets and the like) not always being freed during an automatic restart.
    benoit wrote:
    Do you mean that old sessions aren't getting destroyed anymore automatically by Glacier2 if the client vanished without destroying the session explicitly?

    Yes. Crashed clients don't have their sessions destroyed automatically by Glacier2. The timeout is set to 60 seconds, but once the trouble starts sessions can survive for 10 minutes or more - we eventually just restart the AccountManager, so we don't know if they ever time out.

    On the subject of destroying sessions: if I want to destroy a session behind the back of a client, I assume I can simply call destroy on it in the AccountManager? And then Glacier2 will figure it out next time the client tries to use the session, if not earlier.
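
    I.e., presumably something like this sketch, where findSessionFor is hypothetical bookkeeping of our own:

    #include <Glacier2/Session.h>

    // Forcibly end a user's session from within the AccountManager.
    void
    AccountManagerImpl::kickUser(const std::string& userId)
    {
        Glacier2::SessionPrx session = findSessionFor(userId); // our own lookup
        try
        {
            session->destroy(); // the same operation Glacier2 calls on expiry
        }
        catch(const Ice::ObjectNotExistException&)
        {
            // Session already gone; nothing to do.
        }
    }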
  • benoit
    benoit Rennes, France
    If I want to debug the Glacier2 process, what would be the relevant methods to insert watches and breakpoints in?

    The best thing is to send us the stack traces; in general that's enough to see where the hangs occur.
    Yes. Crashed clients don't have their sessions destroyed automatically by Glacier2. The timeout is set to 60 seconds, but once the trouble starts sessions can survive for 10 minutes or more - we eventually just restart the AccountManager, so we don't know if they ever time out.

    On the subject of destroying sessions: if I want to destroy a session behind the back of a client, I assume I can simply call destroy on it in the AccountManager? And then Glacier2 will figure it out next time the client tries to use the session, if not earlier.

    Yes, this is fine.

    I've been looking at the possible reasons for your problem and I suspect that one recent Glacier2 fix might be responsible here. I'm preparing a source patch for Glacier2 for you to try, stay tuned!

    Cheers,
    Benoit.
  • benoit
    benoit Rennes, France
    Nis,

    Could you try out the patch attached to this post? I'm hopeful it will fix your issue. Note, however, that this patch won't allow you to make nested twoway calls through the Glacier2 router. I suspect this shouldn't be a problem for your application if it used to work fine with Glacier2 from Ice versions < 2.1.2. We're looking into providing a better fix for this issue.

    To apply the patch:

    $ cd Ice-3.0.1 (it should also work with 3.0)
    $ patch -p0 < glacier2.patch.txt


    Let us know how it goes after applying this patch.

    Cheers,
    Benoit.
  • benoit wrote:
    Nis,

    Could you try out the patch attached to this post? I'm hopeful it will fix your issue. Note, however, that this patch won't allow you to make nested twoway calls through the Glacier2 router. I suspect this shouldn't be a problem for your application if it used to work fine with Glacier2 from Ice versions < 2.1.2. We're looking into providing a better fix for this issue.

    To apply the patch:

    $ cd Ice-3.0.1 (it should also work with 3.0)
    $ patch -p0 < glacier2.patch.txt


    Let us know how it goes after applying this patch.

    We will start patching and recompiling and report back. I can't say when that will be, as it will take some time to recompile, and it is an intermittent error in the first place. But we will be back as soon as we feel able to say something useful. I'm also informed that we have had some issues compiling natively on the 64-bit Xeons on our US servers, so we may need to do some ironing out.

    Regarding the nested twoway calls, what will that imply exactly? The Facade receives twoway calls and makes twoway calls through to the actual servers. However, the calls from the Facade to the servers are not routed. Are these the kind of calls that will be impossible?

    Or do you mean that I can't call twoway back and forth over the routed connection?

    If it is the first case, would it help to handle the twoway calls via AMD and AMI?

    In regard to whether or not we had the problem using Ice-2.1.0, I can't say. Even using Ice-3.0.0 we haven't had the problem on our own in-house servers, possibly because we haven't tested with the same kind of load as we now have. So we might have had the problem before, and we might not.

    Best regards,

    Nis
  • benoit
    benoit Rennes, France
    We will start patching and recompiling and report back. I can't say when that will be, as it will take some time to recompile, and it is an intermittent error in the first place. But we will be back as soon as we feel able to say something useful. I'm also informed that we have had some issues compiling natively on the 64-bit Xeons on our US servers, so we may need to do some ironing out.

    Let us know if you need any help with these compilation issues!
    Regarding the nested twoway calls, what will that imply exactly? The Facade receives twoway calls and makes twoway calls through to the actual servers. However, the calls from the Facade to the servers are not routed. Are these the kind of calls that will be impossible?

    Or do you mean that I can't call twoway back and forth over the routed connection?

    If it is the first case, would it help to handle the twoway calls via AMD and AMI?

    The scenario of a nested twoway call through Glacier2 is the following:

    - Client calls with a twoway request on a facade object.
    - As a result of this request, the facade calls back the client with a twoway request.
    - As a result of this request, the client calls again on the router with a twoway request. Here, this "nested" call would hang.
    In regard to whether or not we had the problem using Ice-2.1.0, I can't say. Even using Ice-3.0.0 we haven't had the problem on our own in-house servers, possibly because we haven't tested with the same kind of load as we now have. So we might have had the problem before, and we might not.

    Best regards,

    Nis

    Is it the case that you can only reproduce this issue with your servers in the US? How quickly a dead connection is detected depends a lot on the network and on how the connection was lost (it might be quickly detected on a local network and take a lot of time over the internet). This Glacier2 problem would occur if a client crashes and the router doesn't detect in a timely manner that the connection is dead... With this patch, this shouldn't be an issue anymore; each client will be isolated from the others. Let us know if you need more information!

    Cheers,
    Benoit.
  • benoit wrote:
    Let us know if you need any help with these compilation issues!

    Don't worry, we will.
    benoit wrote:
    The scenario of a nested twoway call through Glacier2 is the following:

    - Client calls with a twoway request on a facade object.
    - As a result of this request, the facade calls back the client with a twoway request.
    - As a result of this request, the client calls again on the router with a twoway request. Here, this "nested" call would hang.

    Ok - we aren't doing that and currently have no plans to, so that shouldn't be a problem. I'm in the process of AMD/AMI chaining our facade twoway calls, but just wanted to know if it was absolutely necessary for it to work with this patch. It doesn't seem so, so I'll proceed with that at a leisurely pace.
    benoit wrote:
    Is it the case that you can only reproduce this issue with your servers in the US?

    Correct. As I said, it may have to do with the difference in load, but it could also very well be because of the difference between local network and internet connections. However, since all the servers are still on the same internal network, I just didn't think of it. That internal network may be configured slightly differently than our local one.
    benoit wrote:
    How quickly a dead connection is detected depends a lot on the network and on how the connection was lost (it might be quickly detected on a local network and take a lot of time over the internet). This Glacier2 problem would occur if a client crashes and the router doesn't detect in a timely manner that the connection is dead... With this patch, this shouldn't be an issue anymore; each client will be isolated from the others. Let us know if you need more information!

    I'll get back to you as soon as we either have any problems compiling or have the patched code up and running.
  • Btw, I can also report that when the problem occurs, Glacier2 also becomes nonresponsive to interrupts, and we have to use kill -9 to kill the process. We still haven't tried the patch, but it is the first thing on our list tomorrow. Just wanted to mention the extra detail.

    Best regards,

    Nis
  • We've patched and recompiled, and so far things seem to be working a lot more smoothly than before. I'll report back if the problem begins manifesting again.