Performance in local operations

Hi folks,

We are thinking about using Ice as the platform for some bioinformatics applications.
The idea is to build a kind of framework. Since we also want to ease algorithm development by providing components, performance is one of our main concerns.
So I did one simple test: a server reads lines of 20 characters each from a data file and sends them to the client (both running on the same machine). To compare results, I implemented the same program as a plain C++ application.
The result, in short, was that the Ice solution needed 70 times longer than the "normal" implementation.

The results in detail are (output of $>time):

Reading 50 000 lines:
              Simple C++ solution   Using Ice
real          0m0.059s              0m3.456s
user          0m0.054s              0m0.500s
sys           0m0.003s              0m0.623s

Reading 500 000 lines:
              Simple C++ solution   Using Ice
real          0m0.516s              0m34.133s
user          0m0.508s              0m4.085s
sys           0m0.008s              0m6.353s

Reading 5 000 000 lines:
              Simple C++ solution   Using Ice
real          0m5.234s              5m51.581s
user          0m5.092s              0m44.777s
sys           0m0.134s              1m2.860s

I also tried using IceBox, with exactly the same results.
Ice has already been compiled with optimization.
The source code of my test programs can be found here:
http://www.bioinformatica.ufsc.br/~wolfram/IcePerformance/
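
For anyone who does not want to open the link, here is a rough sketch of what the client side of such a test does. The Slice interface name, operation, host, and port below are assumptions for illustration only, not the actual code behind the link:

    #include <Ice/Ice.h>
    #include "LineReader.h" // hypothetical header generated from an assumed Slice
                            // definition: interface LineReader { string getLine(); };
    #include <string>

    int
    main(int argc, char* argv[])
    {
        // Error handling omitted for brevity.
        Ice::CommunicatorPtr ic = Ice::initialize(argc, argv);

        // Host and port are placeholders.
        Ice::ObjectPrx base = ic->stringToProxy("LineReader:tcp -h somehost -p 10000");
        LineReaderPrx reader = LineReaderPrx::checkedCast(base);

        // One remote invocation per 20-character line.
        for(int i = 0; i < 50000; ++i)
        {
            std::string line = reader->getLine();
        }

        ic->destroy();
        return 0;
    }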

I am sure that the results can be improved by sending larger data packages; however, the case of many operations each sending a small data package may also be important for us.
Maybe I can improve this with a different configuration?
I would be glad for some help.

Thanks a lot
Wolfram

Comments

  • Re: Performance in local operations
    Originally posted by wolfram
    So I did one simple test: a server reads lines of 20 characters each from a data file and sends them to the client (both running on the same machine). To compare results, I implemented the same program as a plain C++ application.
    The result, in short, was that the Ice solution needed 70 times longer than the "normal" implementation.

    The reason for this is that the invocations still go through the usual remote invocation mechanism, even though both processes run on the same machine. What address are you using in your proxies -- a host name or the machine's real IP address? If so, you could try using 127.0.0.1 instead; it is possible that the loopback interface will be faster (a sketch of that change follows below). (I have not tried this myself; it is also possible that you will see no difference at all.)

    If you don't see any difference, there is most likely little we can do. We cannot avoid marshaling the parameters for the calls because the client and server may be written in different languages, with obviously different data type layout. (So, there is no way to pass a C++ struct directly to a Java program without marshaling.) And using a different transport, such as shared memory, is also unlikely to yield improvements. That's because, even though the actual data transfer *may* be faster, the overhead of signaling between the two processes can kill any performance gain. (Note that on some platforms, doing TCP/IP over the loopback interface is actually faster than a shared memory transport.)

    So, what you are seeing here is largely the cost of doing IPC, that is, trapping into the kernel and context switching between processes. There is simply no way to make IPC as fast (or anywhere near as fast) as a local function call without special-purpose hardware and operating systems.

    I would try the 127.0.0.1 idea though -- it may help.

    Please let us know how you go (and what OS you are using).

    Cheers,

    Michi.
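
    To make the 127.0.0.1 suggestion concrete, here is a minimal sketch of forcing both sides onto the loopback interface. The object identity, adapter name, and port are assumptions for illustration, not taken from the actual test code, and communicator is assumed to be an initialized Ice::CommunicatorPtr:

        // Client side: resolve the proxy against an explicit loopback endpoint.
        Ice::ObjectPrx base =
            communicator->stringToProxy("LineReader:tcp -h 127.0.0.1 -p 10000");

        // Server side: bind the object adapter to the loopback interface only.
        Ice::ObjectAdapterPtr adapter =
            communicator->createObjectAdapterWithEndpoints(
                "LineReaderAdapter", "tcp -h 127.0.0.1 -p 10000");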
  • Using loopback

    Hello, thanks for the answer and sorry for the delay. I have been on a trip to Manaus :)

    I tried both the loopback and the eth0 interface, which gave exactly the same results.
    However, I also tried what happens if I combine several calls: instead of sending only 20 characters per call, I sent 2000 characters per call (a sketch of such a batched operation follows at the end of this post). The time needed dropped almost linearly; the remaining overhead is about 100%.

    Reading 500 000 lines with package size 2 kB (100 times larger):
                  C++ solution   Using Ice
    real          0m0.516s       0m1.125s
    user          0m0.508s       0m0.065s
    sys           0m0.008s       0m0.064s

    Playing with the number of transmitted characters, I noticed that the performance increases almost linearly up to ~2000 chars. If I increased the package size further, the results got worse, which I think must be due to the marshaling procedure.
    E.g. sending 6000 characters per call gave these results:

    real 0m1.415s
    user 0m0.025s
    sys 0m0.024s

    Ah, and I am using Debian testing Linux with kernel 2.6.7.

    Cheers + Thanks
    Wolfram
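
    A sketch of what such a batched operation could look like on the server side, assuming a Slice operation along the lines of "LineSeq getLines(int count)" where LineSeq is a sequence<string> (all names here are hypothetical, not the actual interface from the test):

        #include <fstream>
        #include <string>
        // "LineReader.h" would be the header generated from the assumed Slice
        // definition; LineReader is the generated skeleton, LineSeq maps to
        // std::vector<std::string> in the C++ mapping.

        class LineReaderI : public LineReader
        {
        public:
            LineReaderI(const std::string& path) : _in(path.c_str()) { }

            // Return up to 'count' lines per invocation, so the fixed per-call
            // dispatch cost is amortized over more data.
            virtual LineSeq getLines(Ice::Int count, const Ice::Current&)
            {
                LineSeq lines;
                std::string line;
                while(static_cast<Ice::Int>(lines.size()) < count && std::getline(_in, line))
                {
                    lines.push_back(line);
                }
                return lines; // an empty sequence tells the client it reached end of file
            }

        private:
            std::ifstream _in;
        };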
  • marc
    Does your C++ implementation simply use non-distributed, direct calls? If so, I'm surprised that it is only 70x faster than using Ice RPC calls. Direct C++ calls are of course always *a lot* faster than any RPC call. The cost of a direct C++ call is basically nothing, while RPC calls have to go through the Ice core *and* the operating system, with several mutex locks, marshaling, data copies, sending on the TCP/IP stack, thread and process context switches, etc.

    There is no RPC middleware in the world that can possibly even get close to the speed of direct C++ calls.
  • Re: Using loopback
    Originally posted by wolfram
    Playing with the number of transmitted characters, I noticed that the performance increases almost linearly up to ~2000 chars. If I increased the package size further, the results got worse, which I think must be due to the marshaling procedure.

    Hmmm... In the server, are you reading the characters that are returned to the client from a file during the operation invocation? If so, you are not only measuring the speed of Ice, but also the speed of the file system.

    The effect of getting better throughput with larger data packets is expected: up to one or two kilobytes per call, the cost of the call is dominated by latency, that is, the overhead of dispatching the call. Beyond that, the throughput is dominated by bandwidth (as the duration of each call approaches and then exceeds the call dispatch latency). But I wouldn't expect your results to get worse: if you measure overall throughput in bytes per second, you should get gradually better results as you go to larger packet sizes, up to a point where increasing the packet size further results in no significant performance gain (but also does not result in a performance loss). A rough back-of-the-envelope model of this follows below.

    Cheers,

    Michi.
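
    As a rough back-of-the-envelope model of the latency/bandwidth trade-off described above (the symbols are generic, not measured values): if each call costs a fixed dispatch latency L plus a transfer time of s/B for a payload of s bytes at an effective bandwidth B, then the overall throughput is approximately

        throughput(s) ≈ s / (L + s/B)

    While s is much smaller than L*B, the dispatch latency dominates and doubling the payload roughly doubles the throughput; once s is well above L*B, the curve flattens out near B, which is the behaviour described in the reply above.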