
One-way call sometimes takes very long to complete

We currently have a problem where a one-way proxy call sometimes takes over 189 seconds to complete.

The setup is as follows.

The client is connected via Glacier2 to a facade, which in turn is connected to the actual servers.

However, the servers also sometimes need to push updates, so the client is also a server: it creates a proxy to itself (clientProxy) which it registers with the servers when it connects. Some of these updates are multicast via IceStorm, and some just need to be delivered to a single client. These updates are always pushed as oneway calls.
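
Roughly, the callback registration looks like this (sketched here in Java purely for illustration; the servant, interface and facade names are made up, not our actual code):

    // Create the callback adapter with the default router (Glacier2),
    // so that callbacks can be routed back to the client.
    Ice.RouterPrx defaultRouter = communicator.getDefaultRouter();
    Ice.ObjectAdapter adapter =
        communicator.createObjectAdapterWithRouter("Callback", defaultRouter);

    // Give the callback object an identity in the Glacier2 session category.
    Glacier2.RouterPrx router = Glacier2.RouterPrxHelper.uncheckedCast(defaultRouter);
    Ice.Identity ident = new Ice.Identity();
    ident.name = java.util.UUID.randomUUID().toString();
    ident.category = router.getCategoryForClient();

    Ice.ObjectPrx clientProxy = adapter.add(new CallbackI(), ident);
    adapter.activate();

    // Register the callback proxy with the backend (via the facade) as oneway.
    facade.registerClient(CallbackPrxHelper.uncheckedCast(clientProxy.ice_oneway()));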

But sometimes, when the server wants to push an update to a single client using the clientProxy directly, the call takes 189 seconds (it always seems to be 189 seconds) to complete, even though it is using a oneway proxy. The call does eventually complete and no updates are lost, but the server is busy while trying to make that call, and thus builds up a queue of requests.

So much for the problem. Now a bit of theorising on my part, which may be totally wrong.

The pushed updates obviously need to be reverse routed back through Glacier2, and currently I haven't set up an explicit forwarding policy in any call context, so presumably they are forwarded using the default two-way policy. So what happens when you use a oneway proxy to call via Glacier2, which then forwards the request as twoway? I was wondering if I was running into the problem described in section 30.11, where you say it is possible for a oneway call to block because the transport buffer is filled faster than the server (which would be the Glacier2 router in this case) can empty it. However, if that is the case, I don't understand why I consistently get the same wait time of 189 seconds.
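
(For reference, my understanding is that an explicit forwarding policy is set per invocation through the Glacier2 "_fwd" request context, roughly like this, again sketched in Java with placeholder names:)

    // Ask Glacier2 to forward this invocation as oneway.
    // "_fwd" values include "t" (twoway), "o" (oneway), "O" (batch oneway).
    java.util.Map<String, String> ctx = new java.util.HashMap<String, String>();
    ctx.put("_fwd", "o");
    clientPrx.someUpdate(update, ctx); // clientPrx is the oneway callback proxy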

The reason I haven't set up an explicit forwarding policy is that I have encountered some problems - probably because I don't understand it well enough. I'd be happy to explain in greater detail, but I wasn't sure if it belonged in this post.

Comments

  • benoit (Rennes, France)
    Hi Nis,

    Whether or not the oneway call is forwarded as a twoway or oneway shouldn't matter if you use the buffered mode (forwarding as oneway or batch oneway would, however, be more efficient for oneway requests). In buffered mode, the router uses a request queue to forward server->client requests (and client->server requests). So when the router receives a request over a connection, it adds it to the queue and releases the thread that reads requests from the connection. The request is then sent by the request queue thread.
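
    For reference, buffered mode corresponds to the following Glacier2 configuration properties (both default to 1, so it's on unless you explicitly disabled it):

        # buffer client->server requests
        Glacier2.Client.Buffered=1
        # buffer server->client (callback) requests
        Glacier2.Server.Buffered=1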

    Does this occur consistently? Is your router under some stress when this happens?

    If the router is under some stress, it's possible that the router thread responsible for reading requests (over the connection between your server and the router) doesn't get enough CPU time and can't read the incoming requests fast enough. This would cause what you mentioned: the server would block trying to send data over the TCP/IP connection.

    It could also be that your server blocks on the connection when it tries to send the oneway request because another thread in your server is already sending a large request over the connection (you can't send two requests at the same time over the same connection). Can you easily figure out when it blocks? If that's the case, you could try to attach a debugger to your server and see what the other threads are doing.

    Cheers,
    Benoit.
  • benoit wrote:
    Whether or not the oneway call is forwarded as a twoway or oneway shouldn't matter if you use the buffered mode (forwarding as oneway or batch oneway would, however, be more efficient for oneway requests).

    Glacier2 is configured to run in buffered mode so that should be ok. I do realise that forwarding one-way would be more efficient for our needs but I had some problems getting it to work. I'll detail those problems in another thread though, since they don't seem to relate to this issue.
    benoit wrote:
    Does this occur consistently? Is your router under some stress when this happens?

    If the router is under some stress, it's possible that the router thread responsible for reading requests (over the connection between your server and the router) doesn't get enough CPU time and can't read the incoming requests fast enough. This would cause what you mentioned: the server would block trying to send data over the TCP/IP connection.

    It happens consistently, in the sense that the problem will occur several times if the servers are running for hours with 30+ users. It also consistently seems to be one particular method that takes a long time to dispatch. However, it doesn't happen every time this method is called.

    It also does seem to take some stress to cause the problem, as we can't really provoke it on our internal test servers. On the other hand, we are only creating a CPU load of about 20%, and the network also has plenty of capacity, so the hardware doesn't seem that strained. So if stress is the problem, I'm a bit at a loss as to how to improve the performance of the Glacier2 service.

    Also, even with full network and protocol tracing enabled, the problematic call does not appear in the Glacier2 traces when I run it on our test server. The call still gets through to the client no problem, there just isn't a trace of it in the router. Unfortunately, if we enable tracing on the live servers, too much time is spent on I/O operations to be able to judge server performance, so I can't tell you whether this is also the case on the live servers.
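
    (By "full network and protocol tracing" I mean the standard Ice trace properties in the Glacier2 configuration, along the lines of:)

        Ice.Trace.Network=2
        Ice.Trace.Protocol=1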
    benoit wrote:
    It could also be that your server blocks on the connection when it tries to send the oneway request because another thread in your server is already sending a large request over the connection (you can't send two requests at the same time over the same connection). Can you easily figure out when it blocks? If that's the case, you could try to attach a debugger to your server and see what the other threads are doing.

    Well, as I said, we know which call typically blocks, but since the problem only occurs on the live servers, attaching a debugger is not that easy. I might be able to look into it with some tracing, though. However, most of the server activity is handled in a single-threaded event queue, so there shouldn't be any other threads sending requests over the connection at the same time. I will look into it anyway.
  • benoit (Rennes, France)
    Hi Nis,
    Also, even with full network and protocol tracing enabled, the problematic call does not appear in the Glacier2 traces when I run it on our test server. The call still gets through to the client no problem, there just isn't a trace of it in the router.

    This is a bit suspicious -- if you enabled protocol tracing in Glacier2 you should see the call. If you don't see it, it would mean that your server is sending the call directly to the client. Did you try to stringify and print the proxy of the problematic call to see how it looks? Does your client set endpoints on its object adapter?
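
    For example, in Java you could print the stringified proxy like this (the variable name is just a placeholder):

        // Print the callback proxy to see whether it contains direct endpoints.
        System.out.println(communicator.proxyToString(clientPrx));

    If the stringified proxy shows an endpoint on the client machine rather than (or in addition to) the Glacier2 router's endpoint, that would explain why the call never shows up in the router traces.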

    Cheers,
    Benoit.
  • benoit wrote:
    This is a bit suspicious -- if you enabled protocol tracing in Glacier2 you should see the call. If you don't see it, it would mean that your server is sending the call directly to the client. Did you try to stringify and print the proxy of the problematic call to see how it looks? Does your client set endpoints on its object adapter?

    Ahh, yes. I think I might still be setting an endpoint on the client adapter, left over from when we didn't use Glacier2. I'll try to remove that and see what happens on the internal test server.

    And I'll try printing out the proxy as well.
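
    (Concretely, I think the fix on our side is just removing the leftover endpoint property for the callback adapter from the client configuration; the adapter name below is illustrative:)

        # left over from the pre-Glacier2 setup -- this is the line I'll remove:
        Callback.Endpoints=tcp -p 10001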
  • benoit (Rennes, France)
    Ok, this could be the problem if the backend server tries to send the request to the client directly. The connection attempt to the client could fail after some time (the timeout?) because the client is behind a firewall and doesn't accept incoming connections. The Ice runtime would then retry automatically using the Glacier2 router server endpoint instead, and the call would succeed. You should be able to see these connection attempts and retries if you enable network and retry tracing on the backend server.
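
    For example, setting the following standard Ice trace properties in the backend server's configuration should show the failed connection attempt and the automatic retry:

        Ice.Trace.Network=2
        Ice.Trace.Retry=2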

    Cheers,
    Benoit.