thread pool performance (500+ clients)

Hi.
I'm developing a grid-like system. The server (coordinator) should communicate with ~1000 clients (agents). Scenario: a client connects to the server and then continuously sends data: 512-byte sequences, 33 sequences per second.
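
Roughly, each agent does something like the sketch below (illustrative only, assuming a Slice definition along the lines of "module Grid { sequence<byte> Bytes; interface DataSink { void send(Bytes data); }; };" compiled with slice2cpp, and a placeholder endpoint):

    // Sketch of an agent's send loop: one 512-byte block, ~33 times per second.
    #include <Ice/Ice.h>
    #include <IceUtil/Thread.h>
    #include <IceUtil/Time.h>
    #include <DataSink.h> // generated from the hypothetical Slice above

    int main(int argc, char* argv[])
    {
        Ice::CommunicatorPtr ic = Ice::initialize(argc, argv);

        // Placeholder proxy string; the real coordinator host/port would go here.
        Grid::DataSinkPrx sink = Grid::DataSinkPrx::uncheckedCast(
            ic->stringToProxy("sink:tcp -h coordinator -p 10000"));

        Grid::Bytes block(512, 0); // 512-byte payload

        while(true)
        {
            sink->send(block);
            // ~33 calls per second
            IceUtil::ThreadControl::sleep(IceUtil::Time::milliSeconds(30));
        }
        // shutdown/cleanup omitted in this sketch
    }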

I tried creating a simple server and client (no data processing on the server; the client just forms the stream and the server receives it). Results: the server can dispatch only ~300 agents on a 100 Mbit network, at 90% CPU load and only 50% network bandwidth utilization :(
The exact same WCF-based (sic!) sample can hold 600 clients without any problem: CPU load is about 20-30% while network utilization is ~100%.
It seems that Ice can't effectively dispatch more than ~200 connections (for example, 100-150 connections cause < 5% CPU load). I read that Ice uses simple select() polling; if so, I think that could be the cause. Maybe I missed something and just need to tune the thread pool settings? BTW, I tried TAO (without thread pool or single-thread tuning) and got results very close to Ice's...

OS: Windows XP/2003, Intel Core 2 Duo, 100 Mbit network.

Thanks!

--
Andrew Solodovnikov

Comments

  • benoit (Rennes, France)
    Hi Andrew,

    We're aware of this scalability issue on Windows (note that this is a Windows-only issue). As you discovered, Ice uses select() on Windows, which doesn't scale very well to a large number of connections. In Ice 3.3, Ice for C# will no longer use select() but will instead use .NET asynchronous I/O (which uses completion ports under the hood), so Ice for C# scalability is much improved. Making similar improvements for Ice for C++ on Windows is still on our TODO list; most likely, this will be fixed in a post-3.3 release.

    In the meantime, your best option to scale better is to limit the number of connections handled by each Ice server thread pool. There are several ways to do this:
    • deploy multiple instances of the same server on the same machine
    • create multiple object adapters, each with its own thread pool, in the server (see the sketch after this list)
    • write a connection concentrator server that forwards requests to the real server. Multiple instances of the connection concentrator can be deployed to handle many clients, leaving only a few connections between the concentrators and your server.
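
    For example, here is a minimal sketch of the second option (it reuses the hypothetical Grid::DataSink Slice interface sketched in your first post; the servant, ports and pool sizes are placeholders, and the per-adapter thread pool properties must be set before the adapter is created):

        // Sketch: partition incoming connections over several object adapters,
        // each with its own thread pool (and therefore its own select() loop).
        #include <Ice/Ice.h>
        #include <DataSink.h> // generated from the hypothetical Grid::DataSink Slice
        #include <sstream>

        class DataSinkI : public Grid::DataSink
        {
        public:
            virtual void send(const Grid::Bytes&, const Ice::Current&)
            {
                // receive-only benchmark: no processing
            }
        };

        int main(int argc, char* argv[])
        {
            Ice::CommunicatorPtr ic = Ice::initialize(argc, argv);
            Ice::PropertiesPtr props = ic->getProperties();

            const int numAdapters = 16; // e.g. ~64 clients per adapter for ~1000 clients
            for(int i = 0; i < numAdapters; ++i)
            {
                std::ostringstream name, endpoint;
                name << "Collector" << i;
                endpoint << "tcp -p " << (10000 + i);

                // Give each adapter its own thread pool; these properties must be
                // set before createObjectAdapterWithEndpoints is called.
                props->setProperty(name.str() + ".ThreadPool.Size", "1");
                props->setProperty(name.str() + ".ThreadPool.SizeMax", "4");

                Ice::ObjectAdapterPtr adapter =
                    ic->createObjectAdapterWithEndpoints(name.str(), endpoint.str());
                adapter->add(new DataSinkI, ic->stringToIdentity("sink"));
                adapter->activate();
            }

            ic->waitForShutdown();
            ic->destroy();
            return 0;
        }

    Each client is then pointed at one of the endpoints (ports 10000-10015 in this sketch), so no single thread pool ever has to watch more than its share of the connections.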

    In any case, keep in mind that the scalability of select() might not be the primary issue for a server handling 1000 clients. Surely your server will be providing some services to these clients, and the overhead caused by select() might be minor compared to the actual computation performed by the server. I would recommend designing your application so that you can deploy multiple instances of your server (on different machines) to handle the load of that many clients.

    Let us know if you need more information,

    Cheers,
    Benoit.
  • Thanks for the answer.

    I definitely need to handle all connections directly on the server. So, as I understand it, the only solution is to create a pool of adapters (for example, create a new adapter with its own thread pool for every 50-100 clients; BTW, is there any example?). But I'm not sure that this will improve performance :(
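
    For example, one way I could imagine spreading the agents over such an adapter pool (just a sketch, not code from my project; the Registry interface and the "sink" identity are made up, reusing the hypothetical Grid module from my first post) is a small allocator servant that hands out sink proxies round-robin:

        // Sketch: allocator servant that assigns each new agent to one of several
        // pre-created object adapters in round-robin order, so each adapter (and
        // its thread pool) only handles a slice of the connections.
        // Assumes an additional hypothetical Slice operation in module Grid:
        //   interface Registry { DataSink* allocate(); };
        #include <Ice/Ice.h>
        #include <IceUtil/Mutex.h>
        #include <Registry.h> // generated from the hypothetical Slice above
        #include <cstddef>
        #include <vector>

        class RegistryI : public Grid::Registry
        {
        public:
            RegistryI(const std::vector<Ice::ObjectAdapterPtr>& adapters) :
                _adapters(adapters), _next(0)
            {
            }

            virtual Grid::DataSinkPrx allocate(const Ice::Current&)
            {
                IceUtil::Mutex::Lock lock(_mutex);
                // Each adapter holds a DataSink servant under the identity "sink";
                // pick the next adapter in round-robin order.
                Ice::ObjectAdapterPtr adapter = _adapters[_next++ % _adapters.size()];
                return Grid::DataSinkPrx::uncheckedCast(
                    adapter->createProxy(adapter->getCommunicator()->stringToIdentity("sink")));
            }

        private:
            const std::vector<Ice::ObjectAdapterPtr> _adapters;
            std::size_t _next;
            IceUtil::Mutex _mutex;
        };

    The RegistryI servant itself would live on one well-known adapter; each agent would call allocate() once at startup and then stream its data to the proxy it gets back.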

    Thanks again!

    --
    Andrew Solodovnikov
  • Hi again!
    I just made one adapter per 64 clients, and now it works fine: 100% network load and 40-50% CPU.

    Thanks for the support, Ice really makes me happy :)

    --
    Andrew Solodovnikov
  • FYI, our results:

    Ice 3.2.1:

    Clients        70    140    210    280    350    420    490    560    630    700
    LAT          8030   7190   6620   5990   6640   5735   5115   4580   4160   3850
    THR          2240   4480   6720   8960  11200  13440  15680  17920  20160  22400
    CPU (ALL)       0      0      0      0   0.25   0.25      1   0.25   0.25   0.25
    CPU (KRN)       0      0      0      0   0.25   0.25      1   0.25   0.25   0.25
    NET             1      2      3      5      6      7      9     10     11     13

    omniORB 4.1.2:

    Clients        70    140    210    280    350    420    490    560    630    700
    LAT         11650  11470  11450  11350  11330  11160  10970  10820  10700  10600
    THR          2240   4480   6720   8960  11200  13440  15680  17920  20160  22400
    CPU (ALL)       0    0.1    0.1    0.1      0      0      0      0      0      0
    CPU (KRN)       0    0.1    0.1    0.1      0      0      0      0      0      0
    NET             1      2      3      5      6      7      8     10     11     12

    The top row of each table is the client count.
    LAT - latency of an empty call (calls per second)
    THR - throughput ((512 bytes * 33) per second, per client)
    NET - network load at the above THR

    configuration:

    omniORB:
    server: threadPoolWatchConnection=0
    client: all defaults

    Ice:
    server: Ice.ThreadPool.Server.SizeMax=20 Ice.Override.Compress=0
    client: Ice.ACM.Client=0 Ice.Override.Compress=0

    hardware:

    8 x (quad-core 2.33 GHz, Windows 2003 x64, 4 GB RAM, 1 Gbit network).
  • benoit (Rennes, France)
    Hi Andrew,

    It's not clear to me what CPU and NET really represent. Are you running everything on the same machine or over the network? Does each client have its own process and network connection to the server?

    In any case, it looks to me that the comparison isn't really fair. OmniORB is using the thread per connection concurrency model whereas Ice is using the thread pool concurrency model. Why didn't you compare with the same concurrency model instead (either thread pool or thread per connection)?

    Cheers,
    Benoit.
  • benoit wrote: »
    Hi Andrew,

    It's not clear to me what CPU and NET really represent. Are you running everything on the same machine or over the network? Does each client have its own process and network connection to the server?

    Hi.

    NET - network load on the server side (as a percentage of the 1 Gbit link).
    CPU - CPU load on the server side (user/kernel).

    The server runs on the first node; the other 7 nodes run the clients (the same number of clients on each node).
    benoit wrote: »
    In any case, it looks to me that the comparison isn't really fair. OmniORB is using the thread per connection concurrency model whereas Ice is using the thread pool concurrency model. Why didn't you compare with the same concurrency model instead (either thread pool or thread per connection)?

    Using the thread pool with omniORB causes heavy CPU load (the same problem as when we use Ice without separate object adapters). But we obtained the same results in the latency test with the omniORB thread pool as with the thread-per-connection model (as far as I remember); the difference was only in the CPU load in the throughput test.