
Throughput with concurrent calls

We are doing some performance tests to compare the proxy/wire-communication/service-dispatching overhead of different remoting frameworks. To do that we created a multi-threaded client that dispatches simple requests with a 1 KB payload to a no-op service that returns a 1 KB response.
For every N threads on the client side we configure the server thread pool to have a matching number of threads.
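
For reference, the server thread pool can be sized with the standard Ice thread-pool properties; a minimal sketch (values illustrative, shown here for the 10-thread run):

    # server configuration (illustrative values)
    Ice.ThreadPool.Server.Size=10
    Ice.ThreadPool.Server.SizeMax=10
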
Here are the results for Ice and plain RMI:

Ice

Server  Client  Throughput  Latency (ms)
1       1        3250/s     Avg: 0.31   <1 ms: 68.3%   1-2 ms: 31.6%
2       2        6350/s     Avg: 0.32   <1 ms: 68.1%   1-2 ms: 31.8%
5       5       14900/s     Avg: 0.34   <1 ms: 65.6%   1-2 ms: 34.3%
10      10      21900/s     Avg: 0.47   <1 ms: 53.0%   1-2 ms: 46.6%
20      20      22800/s     Avg: 0.93   <1 ms: 8.8%    1-2 ms: 89.5%


Plain RMI

Server  Client  Throughput  Latency (ms)
1       1        2560/s     Avg: 0.38   <1 ms: 61%     1-2 ms: 28%
2       2        5450/s     Avg: 0.39   <1 ms: 60%     1-2 ms: 39%
5       5       12180/s     Avg: 0.41   <1 ms: 58%     1-2 ms: 41%
10      10      18880/s     Avg: 0.52   <1 ms: 48.8%   1-2 ms: 50%
20      20      35050/s     Avg: 0.63   <1 ms: 41%     1-2 ms: 55.6%   2-3 ms: 2.89%

CPU utilization was not high in either case (we were using two dual-core machines connected by a high-capacity network).
It seems that Ice hits some bottleneck beyond 10 threads, as the increase in throughput from 10 to 20 threads is small.
Code is attached.

Comments

  • The more threads you add to the pool beyond a certain point, the worse things will get because of the cost of context switching. In particular, all the thread stacks need to be switched in and out, which causes page faults and decreases cache hits.

    That's not to say that you should have only two threads on a dual-core CPU: a few extra threads will still increase performance due to I/O interleaving. In particular, while one thread is waiting for an I/O to complete, another thread can have the CPU. But, once you have enough threads and workload to keep the CPU occupied with one thread while other threads are doing I/O, adding more threads won't make things better, but will make them worse instead.

    One thing to try would be to see how things scale if you increase the number of clients but hold the number of threads constant in the server, for various thread pool sizes. In other words, have more clients than you have threads in the server pool.

    I haven't tried this, but I would expect it to scale better than having a one-to-one mapping of client- and server-side threads.

    Cheers,

    Michi.
  • matthew (NL, Canada)
    Could you also let us know what operating system and JDK version you are using?

    It is also important to realize that you are not comparing apples to apples. Sun's implementation of RMI uses a thread-per-connection model and therefore avoids the overhead of a thread pool. However, this comes at a cost, since threads are not cheap! A thread-per-connection model does not scale to large numbers of clients.

    If you want to directly compare RMI to Ice 3.2.1 you should therefore enable thread per connection (Ice.ThreadPerConnection=1). However, please note that due to scalability concerns we've removed this altogether from Ice 3.3.
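
    For example, a minimal sketch of enabling this programmatically in Java with Ice 3.2 (the property can equally go in a configuration file or on the command line):

        // Sketch: enable thread per connection before creating the communicator.
        Ice.InitializationData initData = new Ice.InitializationData();
        initData.properties = Ice.Util.createProperties();
        initData.properties.setProperty("Ice.ThreadPerConnection", "1");
        Ice.Communicator communicator = Ice.Util.initialize(args, initData);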
  • matthew (NL, Canada)
    I've been looking into this in more detail. I'm not sure what you really want to test, but assuming you want to find out how your server will perform under the load of multiple concurrent clients, your test has a big flaw. Due to connection sharing, the test only establishes a single connection to the server from the client. This means that everything sent is serialized on that single connection. For more information on how Ice connection management works, I recommend reading my article "Connection Management with Ice" in http://www.zeroc.com/newsletter/issue24.pdf.

    I've been running some tests of my own on a single-CPU 3.2 GHz RHEL4 Linux machine. The client establishes a connection per thread. Assuming the number of threads is reasonable, I don't see any real slow-down in the overall throughput as I pile on the load.

    There is, as expected, a small but measurable performance difference when using thread per connection. However, as I previously said, thread per connection won't scale past a few hundred clients due to increased context switching between the threads. With the thread pool model, to get better scalability under Linux with a large number of clients you should also use a JDK that uses epoll internally, such as JDK 6.

    I've attached my test-case so you can see what I've done.
  • Answers to all replies:

    Information about the machine & OS:

    /etc/issue: Welcome to SUSE Linux Enterprise Server 10 (x86_64) - Kernel \r (\l).

    uname -a: Linux snv1graphdb002 2.6.16.21-0.8-smp #1 SMP Mon Jul 3 18:25:39 UTC 2006 x86_64 x86_64 x86_64 GNU/Linux

    dmesg is attached.

    Running with two different clients (10 threads each) had a positive effect, but not a significant one.

    Client 1:
    The client sent 185883 messages in 30.024 s (an average of about 6191/s).

    Client 2:
    The client sent 624324 messages in 30.024 s (an average of about 20794/s).

    Total:
    Server  Client  Throughput  Latency (ms)
    20      10+10   26985/s     Avg: 0.73   <1 ms: 43%   1-2 ms: 45%

    Running the test code, multi-throughput.tar, yielded similar results:
    Threads Rep Throughput
    1 5000 3272/s
    1 10000 3367/s
    1 20000 3367/s
    2 5000 6382/s
    2 10000 6644/s
    2 20000 6653/s
    5 5000 15413/s
    5 10000 15537/s
    5 20000 15703/s
    10 5000 24295/s
    10 10000 24636/s
    10 20000 24740/s
    20 5000 22522/s
    20 10000 23500/s
    20 20000 23161/s
    As I mentioned before, CPU utilization is low even when running with 20 threads (though we clearly see a significant increase in context switching, but this happens in the RMI case as well):

    When using 10 threads, 20000 reps:
    Server:
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
     r  b   swpd     free   buff    cache   si   so   bi   bo    in    cs us sy  id wa st
     0  0    152    69400 141304 32284900    0    0    0    0   323   251  0  0 100  0  0
     0  0    152    69400 141304 32284900    0    0    0    0   293   182  0  0 100  0  0
     0  0    152    69264 141304 32284900    0    0    0    5  3731  5343  0  0  99  0  0
     2  0    152    69264 141304 32284900    0    0    0    0  9032 13251  1  1  98  0  0
     2  0    152    69228 141304 32284900    0    0    0    5 22581 18846  5  7  88  0  0
     3  0    152    69224 141304 32284900    0    0    0    5 24996 21134  6  8  86  0  0
     0  0    152    68744 141304 32284900    0    0    0    0 16735 20305  4  5  91  0  0
     0  0    152    68492 141352 32284852    0    0    0   33   290   182  0  0 100  0  0
     0  0    152    69364 141352 32284852    0    0    0    0   323   247  0  0 100  0  0

    Client:
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
     r  b   swpd     free   buff    cache   si   so   bi   bo    in    cs us sy  id wa st
     0  0      0 32263096  98124   327468    0    0    0    0   317   196  0  0 100  0  0
     0  0      0 32262968  98124   327468    0    0    0   29   305   162  0  0 100  0  0
     1  0      0 32251664  98124   328496    0    0    0   45   332   283  1  0  99  0  0
     0  0      0 32232780  98124   328496    0    0    0    0  7292 11253  7  1  92  0  0
     2  0      0 32179916  98168   328452    0    0    0   35 13973 26794  6  4  90  0  0
     3  0      0 32125340  98168   328452    0    0    0    0 25258 62905  8 11  81  0  0
     4  0      0 32113644  98184   328436    0    0    0   27 25179 63495  9 10  81  0  0
     0  0      0 32259064  98184   328436    0    0    0   21  5954 14896  1  2  97  0  0
     0  0      0 32259860  98184   328436    0    0    0    0   317   198  0  0 100  0  0
     0  0      0 32262020  98184   328436    0    0    0    9   296   145  0  0 100  0  0


    When using 20 threads, 20000 reps:
    Server:
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
     r  b   swpd     free   buff    cache   si   so   bi   bo    in    cs us sy  id wa st
     0  0    152    67436 141352 32284852    0    0   29    9     1     1  0  0 100  0  0
     0  0    152    67548 141352 32284852    0    0    0    7  1349  1759  0  0 100  0  0
     0  0    152    67548 141352 32284852    0    0    0    0  8353 12298  1  1  98  0  0
     2  0    152    66500 141352 32284852    0    0    0    0 17604  8376  3  5  92  0  0
     2  0    152    66480 141352 32284852    0    0    0    0 25544  3092  6  8  86  0  0
     2  0    152    66676 141352 32284852    0    0    0    0 25492  2857  6  9  85  0  0
     2  0    152    67172 141352 32284852    0    0    0    0 25295  6339  6  8  86  0  0
     2  0    152    67260 141352 32284852    0    0    0    5 23684 19282  6  9  86  0  0
     0  0    152    67264 141352 32284852    0    0    0    0 15051 25333  3  4  93  0  0
     0  0    152    68488 141352 32284852    0    0    0    4  2181  3036  0  0  99  0  0
     0  0    152    69184 141352 32284852    0    0    0    0   303   201  0  0 100  0  0
     0  0    152    69460 141352 32284852    0    0    0    4   309   227  0  0 100  0  0


    Client:
    procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
     r  b   swpd     free   buff    cache   si   so   bi   bo    in    cs us sy  id wa st
     0  0      0 32262448  98184   328436    0    0    0    1    21    15  0  0 100  0  0
     1  0      0 32237056  98184   328436    0    0    0    0  3740  6655  6  0  94  0  0
     0  0      0 32226552  98228   329420    0    0    0   73  9022 13269  3  1  96  0  0
     4  0      0 32130144  98228   329420    0    0    0    0 23509 48887  9 10  81  0  0
     3  0      0 32115296  98244   329404    0    0    0   23 26182 55876  9 11  80  0  0
     2  0      0 32112500  98244   329404    0    0    0   12 26024 55876  9 12  79  0  0
     2  0      0 32111392  98244   329404    0    0    0    0 25396 55721  9 11  80  0  0
     3  0      0 32110824  98244   329404    0    0    0   16 22635 57084  8 10  82  0  0
     0  0      0 32259480  98244   329404    0    0    0    0 10422 23581  2  3  95  0  0
     0  0      0 32259992  98244   329404    0    0    0   19   316   184  0  0 100  0  0
     0  0      0 32261532  98244   329404    0    0    0    9   299   159  0  0 100  0  0
     0  0      0 32261736  98244   329404    0    0    0    0   314   180  0  0 100  0  0
     0  0      0 32261744  98244   329404    0    0    0   31   304   157  0  0 100  0  0
     0  0      0 32261744  98244   329404    0    0    0    0   314   180  0  0 100  0  0
    In the case of our application we expect the clients to handle more than 50 concurrent calls (spanning multiple services), and a servant (probably running on a dedicated, replicated group of machines) to handle about 20 concurrent calls.
  • Forgot to mention JVM info:
    java version "1.6.0_03"
    Java(TM) SE Runtime Environment (build 1.6.0_03-b05)
    Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_03-b05, mixed mode)
  • Some more information.
    I tried reducing the number of threads on the client side via AMI, but that didn't change much either (a rough sketch of the AMI call pattern follows the tables below):

    NON-AMI:
    Threads Throughput CPU Usage (Server, Client)
    5 14900/s 8%, 16%
    10 21900/s 14%, 28%
    20 22800/s 14%, 28%
    30 21200/s 14%, 28%
    40 21100/s 14%, 28%

    AMI:
    Threads Batch Throughput CPU Usage (Server, Client)
    1 5 12000/s 6%, 8%
    2 5 21500/s 12%, 17%
    4 5 26000/s 14%, 26%
    6 5 25850/s 14%, 26%
    3 10 26500/s 14%, 26%
    4 10 26700/s 14%, 26%
    8 5 26000/s 14%, 26%
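
    For reference, a rough sketch of the AMI call pattern in Ice 3.2 for Java. The Slice interface Perf, its operation ping(), and the generated AMI_Perf_ping callback class are placeholders for our actual test interface (the operation needs ["ami"] metadata for slice2java to generate the async method):

        // Sketch only; Perf/ping/AMI_Perf_ping stand in for the real test interface.
        Ice.Communicator ic = Ice.Util.initialize(args);
        Ice.ObjectPrx base = ic.stringToProxy("perf:tcp -h server -p 10000");
        PerfPrx perf = PerfPrxHelper.uncheckedCast(base);

        perf.ping_async(new AMI_Perf_ping()
        {
            public void ice_response()
            {
                // reply arrived; runs in an Ice client thread, so the calling
                // thread never blocks waiting for the response
            }

            public void ice_exception(Ice.LocalException ex)
            {
                ex.printStackTrace();
            }
        });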
  • matthew (NL, Canada)
    Did you change your test to establish multiple connections? If not, then all threads in the client will send to the server through a single connection, and you will not get the results you expect. You must force the client to establish a connection per thread. The simplest way to do this is to assign a unique connection id to the proxy used in each thread -- you can do this by calling ice_connectionId with a unique string (the thread-id or a UUID, for example).

    For an example of how to do this please review the test that I attached to my earlier reply.
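
    To illustrate, a minimal sketch of the connection-id approach in Java (the proxy string and the choice of the thread name as the id are illustrative):

        // Sketch: give each client thread its own connection by tagging its
        // proxy with a unique connection id.
        Ice.ObjectPrx base = communicator.stringToProxy("perf:tcp -h server -p 10000");
        Ice.ObjectPrx perThread = base.ice_connectionId(Thread.currentThread().getName());
        // Invocations made through perThread (or any proxy derived from it) use a
        // dedicated connection rather than the shared one, so client threads are
        // no longer serialized on a single socket.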
  • Yes, we did run the test client you provided (and the numbers were not much different, as mentioned above).
    As a side note, epoll is supposed to be able to handle thousands of active connections without much sweat (see memcached as an example), so I am not sure why connection per thread is deprecated (especially if clients pool the open connections for re-use).
  • matthew (NL, Canada)
    Thread per connection has been deprecated in Ice 3.3, not connection per thread :) Thread per connection establishes a thread for each connection on the server (note that this is the only model that RMI supports), which clearly does not scale very well, since the cost of context switching becomes very high (as does the memory cost of all the thread stacks).

    Are you saying that the RMI version does scale linearly past 10 threads? Can you also attach an RMI version of the same test?
  • Yes, that's clear, thanks. Actually, to play devil's advocate, are you aware of this article:
    Mailinator(tm) Blog: Kill the myth please. NIO is *not* faster than IO

    Yes, in our test RMI scaled fine (code attached - and stats provided before).
    rmi.zip 17.4K
  • Just to mention that Extendable RMI (Jeri), which uses NIO and a thread pool instead of thread per connection, didn't scale as well.
  • matthew (NL, Canada)
    Most likely your observed difference between RMI and Ice is caused by the different concurrency models in use.

    RMI, as I said, uses a thread per connection concurrency model. With this type of concurrency model multiple threads will be reading concurrently from different connections.

    Ice, in contrast, by default uses a thread pool concurrency model. The implementation, at present, limits reading to a single thread at a time. All remaining threads are either dispatching requests or idle. The net effect is that, if your application spends more time reading data from the connection than it does dispatching and handling the request, performance will not be as good as with multiple readers. However, since this is almost never the case outside of simplistic benchmarks (after all, your application has to do *something* with the received data), we've never attempted to change the thread pool to permit multiple I/O threads.

    I suspect that if you try the thread-per-connection concurrency model with Ice 3.2.1 you will observe a similar increase from 10 to 20 threads as you observed with RMI.

    If there really is a use case where such an increase would be important to an application, then we could improve the thread pool concurrency model to allow configuring the number of I/O threads (the number of threads that can perform concurrent reads/writes). However, to this point this has never been an issue in a real-world application.
  • matthew (NL, Canada)
    My colleague mentioned another viable technique that you might want to try -- using multiple object adapters in your server, each with its own thread pool. With this technique you would have multiple thread pools, each reading (and possibly writing) concurrently. In theory, this would allow better performance for the situation that you have outlined (see the sketch below). In this way you can have your cake and eat it too :) Scalability and high performance.
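
    In outline, a minimal sketch of this approach in Java (adapter names, endpoints, pool sizes, and the PerfI servant class are illustrative):

        // Sketch: two object adapters, each with its own thread pool, serving
        // the same servant. Each adapter's pool reads and dispatches independently.
        Ice.Communicator ic = Ice.Util.initialize(args);
        ic.getProperties().setProperty("Adapter1.ThreadPool.Size", "10");
        ic.getProperties().setProperty("Adapter2.ThreadPool.Size", "10");

        Ice.ObjectAdapter a1 = ic.createObjectAdapterWithEndpoints("Adapter1", "tcp -p 10001");
        Ice.ObjectAdapter a2 = ic.createObjectAdapterWithEndpoints("Adapter2", "tcp -p 10002");

        Ice.Object servant = new PerfI();   // PerfI: hypothetical servant implementation
        a1.add(servant, ic.stringToIdentity("perf1"));
        a2.add(servant, ic.stringToIdentity("perf2"));

        a1.activate();
        a2.activate();
        ic.waitForShutdown();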