Archived

This forum has been archived. Please start a new discussion on GitHub.

Performance

Hi again,

Do you have any performance benchmark for Ice?

Of course, CORBA and ICE are not the same things, but the idea and purpose I believe are the same. So, have you made any performance comparisons with different ORBs?

Thanks!
Ivan

Comments

  • For the simple latency tests in demo/Ice/latency, I get the following for Ice for C++:

    - Pentium 4, 1.8 GHz, RH8.0, optimized: 0.15ms per twoway call.

    - Pentium 4, 2.4 GHz, Windows XP, optimized: 0.12ms per twoway call.

    Ice for Java has roughly twice the latency of Ice for C++.

    As for comparison with CORBA ORBs, I don't think that there are any multi-purpose ORBs out there that can match this performance. Only specialized high-speed ORBs are faster, but these then usually have a simpler threading model, and of course much fewer features.

    While I don't have any actual measurements for this, I believe that Ice is a lot faster than any CORBA ORB when it comes to request forwarding services, such as routers or event services. That's because Ice can forward requests as blobs, and does not have to unmarshal and remarshal Anys as in CORBA.
  • I just did a little simple testing on a relatively fast Windows XP machine -- an Athlon 2800+ with 1 GB of RAM.

    I tested the ice 'latency' test (in demos\ice\latency):

    100000 "pings" (round trip synchronous invocations) took 7359ms -- roughly 13600 round trip invocations per second.

    I also tested TAO and another commercial ORB.

    TAO did 100000 "pings" in 9830ms -- roughly 10170 round trip invocations/sec.

    The commercial ORB did 100000 "pings" in 9050ms -- roughly 11050 round trip invocations/sec.

    So at present ice looks to be about 20% faster than these ORBs.
  • marc (Florida)
    Thanks for the info!

    So TAO might be real-time, but doesn't seem to be real-fast :D
  • I think that's pretty accurate ;-)

    TAO is reasonably fast as far as ORBs go, but I think people frequently confuse "real-time" with "fast." In fact real-time doesn't necessarily mean fast at all -- it's more about predictability (which when you're dealing with networking is a dodgy subject but that's another matter). I don't have a Ph.D. in real-time systems research so I'm going to stop commenting on this matter before I get killed in a discussion on it!

    IMO for most systems people want a product which is fast and reliable, more than one which is predictable, so ICE has a leg up in this regard. Great work!
  • comparison of TAO 1.3.1 and Ice 1.0.1

    I have been interested in performance issues for many years.

    Here is a quick "look/see" at Ice's performance when compared to TAO. For now I consider these preliminary. (I was trying to get this done in a hurry.) However, I am reasonably confident that I have no gross error.

    TAO tests use the following IDL file:

    interface Account {
        typedef sequence<octet> opayload;
        void othruput(in opayload p);
    };

    I vary the size of the octet sequence from 4 to 64k bytes. Mean roundtrip latencies from clients to server are measured. (In fact I keep complete histograms, and they can be reached by following the link below and mousing over and clicking on vertical renderings of histograms.)

    For Ice I used the "demo/Ice/hello" as a prototype and did the same tests in Ice with the following Slice code:

    sequence<byte> seqbyte;

    class Hello
    {
        nonmutating void sayHello();
        idempotent void shutdown();
        void thruput(seqbyte payload);
    };

    I just use "thruput" in my tests for Ice; "sayHello" is ignored.

    The TAO results use the svc.conf file from performance-tests/Latency/Single_Threaded. (I have always used this in all my past TAO tests.) This svc.conf file is shown below for the record. I don't know if Ice permits similar optimizations. When Ice tests are running "top" shows 2 threads in the client and many threads in the server. (See captured output below.) It is likely that some optimizations are possible with Ice. The most intriguing thing about Ice (based on reading 20 of the 758 pages of the documentation) would have to be the architectural issues; average performance is probably a wash. (In our DoD applications we tend to care about real-time issues, hence our interest in RT CORBA.)

    I will post results of using "long" and "struct" rather than bytes later in the week.

    The attached graphic (.png) file shows the curves. If you click on the complicated link below (you need to cut and paste the entire thing, be careful of line breaks etc.) you should be able to see the small subset of results from my website that are relevant.

    http://www.atl.external.lmco.com/projects/QoS/compare/cgi-bin/left2_part1.cgi?filter=smp.*(tao.*(1.2.2$|1.3.1$)|Ice)

    The restricted set consists of TAO 1.2.2, TAO 1.3.1, and Ice 1.0.1. I include TAO 1.2.2 because TAO 1.3.1 results I have are a bit slower than TAO 1.2.2.

    http://www.atl.external.lmco.com/projects/QoS/compare/tests/Ice/1.0.1/ice_1.0.1_misty_to_misty.html

    and
    http://www.atl.external.lmco.com/projects/QoS/compare/tests/orb/tao/latency/results/results_misty_to_misty/1.3.1/tao_1.3.1_misty_to_misty.html

    However, the attached graphic is the mean values from these two tests overlaid.

    The full website is at:

    http://www.atl.external.lmco.com/projects/QoS

    The "MW_Comparator" shows the entire collection of results we have (includes other ORBs, such as Mico, ORBExpress, OpenORB, JacORB, the JDK built-in ORB, RMI, RMI-IIOP, some CCM and EJB results, some SCIOP, etc. etc.)

    Regards,

    Gautam

    # svc.conf file used in TAO tests.

    # $Id: svc.conf,v 1.2 2001/08/15 19:28:42 bala Exp $
    #
    dynamic Advanced_Resource_Factory Service_Object * TAO_Strategies:_make_TAO_Advanced_Resource_Factory() "-ORBresources global -ORBReactorMaskSignals 0 -ORBInputCDRAllocator null -ORBReactorType select_st -ORBConnectionCacheLock null"
    static Server_Strategy_Factory "-ORBPOALock null -ORBAllowReactivationOfSystemids 0"
    static Client_Strategy_Factory "-ORBTransportMuxStrategy EXCLUSIVE -ORBProfileLock null -ORBClientConnectionHandler RW"

    Output of "top" when Ice tests are running:

    PID USER PRI NI SIZE RSS SHARE STAT %CPU %MEM TIME COMMAND
    15765 gthaker 25 0 5476 5476 4768 R 32.2 1.0 5:30 client
    15767 gthaker 15 0 5476 5476 4768 S 18.5 1.0 3:25 client
    15762 gthaker 15 0 5416 5416 4748 S 6.7 1.0 0:59 server
    15756 gthaker 15 0 5416 5416 4748 S 6.5 1.0 0:58 server
    15757 gthaker 15 0 5416 5416 4748 S 5.9 1.0 0:59 server
    15764 gthaker 15 0 5416 5416 4748 S 5.9 1.0 0:59 server
    15763 gthaker 15 0 5416 5416 4748 S 5.5 1.0 0:59 server
    15759 gthaker 15 0 5416 5416 4748 S 5.3 1.0 1:00 server
    15760 gthaker 15 0 5416 5416 4748 S 5.3 1.0 0:59 server
    15761 gthaker 15 0 5416 5416 4748 S 4.9 1.0 0:59 server
    15755 gthaker 15 0 5416 5416 4748 S 4.1 1.0 0:57 server
    15758 gthaker 15 0 5416 5416 4748 S 4.1 1.0 0:57 server
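For readers who want to reproduce numbers of this kind, the measurement described in these posts boils down to timing N synchronous roundtrips and taking the mean. A minimal sketch in modern C++ (the `invoke` callable is a hypothetical stand-in for the actual ORB call, e.g. `thruput(payload)`; this is not the original test code):

```cpp
#include <chrono>
#include <cstddef>

// Invoke a roundtrip `iterations` times and return the mean latency
// in microseconds. `invoke` stands in for one synchronous twoway call.
template <typename Invoke>
double meanRoundtripMicros(Invoke invoke, std::size_t iterations)
{
    using clock = std::chrono::steady_clock;
    const clock::time_point start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        invoke(); // one synchronous roundtrip
    }
    const std::chrono::duration<double, std::micro> elapsed = clock::now() - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

For complete latency histograms (as kept on the website above) one would record each call's duration individually rather than only the running total.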
  • Re: comparison of TAO 1.3.1 and Ice 1.0.1
    Originally posted by gthaker
    I have been interested in performance issues for many years.

    Here is a quick "look/see" at Ice's performance when compared to TAO. For now I consider these preliminary. (I was trying to get this done in a hurry.) However, I am reasonably confident that I have no gross error.

    Hi Gautam,

    thanks for making this effort! It will be interesting to see more detailed results (and it's nice to have them produced by someone other than ourselves, so we can legitimately claim that we didn't massage the results -- not that we ever would, of course :) )

    BTW -- I'd like to point out that we have done essentially no performance tuning for Ice so far, so there is at least some potential for speeding things up a bit more. However, to be honest, things are so simple already and the architecture is so clean that I don't expect spectacular improvements. (For spectacular improvements, we'd have to have a pretty bad architecture to start with to get the improvements from; but, of course, the architecture is excellent already :) )

    Cheers,

    Michi.
  • marc (Florida)
    Thanks a lot for the performance tests. A few thoughts and comments:

    I think it is important to use equivalent concurrency models in the comparison. Apparently you used a single-threaded version of TAO and compared it against a multi-threaded version of Ice.

    Single-threaded middleware, everything else being equal, is always faster than multi-threaded middleware when it comes to non-concurrent performance tests. That's because you don't have thread context switches, and you can also avoid mutex locks.

    Even for multi-threaded concurrency models, there are differences. For example, with the Ice design, you can have nested method calls, because it uses a receiver thread on the client side. If you didn't use a receiver thread on the client side, no nested calls would be possible, but the performance would be higher because there is less thread context switching.

    The threads you see for Ice are as follows:

    Client: The main thread, and one thread to receive responses from the server.

    Server: The main thread (which is dormant after initialization, until waitForShutdown() returns), and the 10 threads from the thread pool to dispatch requests concurrently. (10 is just the default.)

    Unaffected by the concurrency model is of course transfer of large amounts of data. As it looks, our code to handle large byte sequences is sub-optimal. That's probably because we use std::copy in our code, and naively assumed that it would use memcpy internally whenever possible. I guess we were wrong with this assumption, and we will modify the code to use memcpy wherever possible.
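As a hedged illustration of the fix Marc describes (the function name is hypothetical; this is not ZeroC's actual marshaling code): with compilers of that era, `std::copy` over byte ranges was not guaranteed to lower to a single bulk copy, whereas an explicit `memcpy` is one:

```cpp
#include <cstring>
#include <vector>

// Append a raw byte sequence to a marshaling buffer using an explicit
// bulk copy, instead of relying on std::copy to be optimized to memcpy.
void appendBytes(std::vector<unsigned char>& buf,
                 const unsigned char* data, std::size_t len)
{
    const std::size_t old = buf.size();
    buf.resize(old + len);               // grow once...
    if (len > 0) {
        std::memcpy(&buf[old], data, len); // ...then copy in one shot
    }
}
```

Modern standard libraries do specialize `std::copy` for trivially copyable types, so today the two usually compile to the same code; in 2003 that was much less reliable.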
  • Hi,

    Thanks for your comments and explanations. Ice is new to me, so it is always possible that I am not doing something correctly. I will later send code out so it can be quickly looked over, but I started with "hello" and have kept things simple. Also, I agree about the need to compare similar concurrency models. Not sure if Ice can be configured at run time like TAO can.

    Since the number of test combinations is very large, I tend to test "the best that an ORB can do for the simplest of tests". Basically the test measures general "heaviness" of an ORB, and at times shows some divergent things like large message size costs. Also, we start with something simple like this and add all types of host and network side interference to test QoS capabilities.

    I had a couple of observations. The mapping of Slice to C++ is no doubt better than CORBA's mapping. I like the fact that STL is used. Why doesn't the OMG do a 2nd mapping? Because it is CORBA, the original and new mappings would even interoperate.

    Finally, yes, CORBA is too complex to use. Perhaps this CCM stuff, with the right tools for assembly and deployment, will make things easier. Also, CORBA does have the feeling of "design by committee". There are too many standards, esp. in the area of design, analysis, UML, etc. for real-time systems. Hopefully in time things will sort out.

    Gautam
  • marc (Florida)
    I believe your modified "hello" code is correct. There is not much that can be done differently for the byte sequence test.

    Ice currently has only one concurrency model, the thread pool (both for the server and the client side). We might consider other, simpler concurrency models if there is demand for this.

    The differences in concurrency models can be drastic. I worked in the past both on an ultra-high-speed ORB (faster than Ice), and a regular ORB (slower than Ice). The high-speed ORB used a much simpler concurrency model, and therefore came close to raw socket speed. But there is no way to achieve the same with a more elaborate concurrency model like the one in Ice.

    In practice, the simple concurrency models are of limited use. They are fine for high-speed simple request-response systems. But as soon as you have more complex setups, with nesting and parallel processing, they are not usable anymore.

    The C++ mapping in CORBA is a sad story. I don't know about how it came into existence (at this time, I was not at the OMG), but I know that for whatever reasons (political, mostly), it was not possible to get the OMG members to start to work on a new, improved mapping.

    Regarding CCM: At present, this is the realm of research projects only. AFAIK no ORB vendor offers CCM, and I believe no ORB vendor ever will. It's sad, but I believe nobody is really interested in pushing CORBA anymore, including ORB vendors, and, even more sad, including the OMG. They prefer to work on stuff like MDA ("Model Driven Architecture"), which is IMO a complete waste of time.
  • "struct" perf. measurements added

    I had a longer version of this post typed up but Mozilla 1.3b crashed on me so I will try it again.

    I added "struct" results to my previous measurements that were just based on "octet" results. I also show the results for the same struct (see .ice file listing below) being sent around with ORBexpress and with TAO 1.2.2. Both of these last two I had, as usual, configured to provide the maximum performance possible. In general this means running with reduced threading. This optimization yields about a factor of 2 (or less.)

    The attached graphic shows that, I believe, there is probably some low-hanging fruit in the way structs are shipped around in Ice. Ice is almost one order of magnitude slower for large message sizes. Factors of 2 are not so big a deal, but a factor of 10 might be worth some attention. Prob. some simple improvements might win back all of the difference.

    The Ice file is:

    sequence<byte> seqbyte;

    struct structPayload {
        int intfld;     // 4 bytes
        seqbyte b8;     // need to be sure this is 8 bytes,
        float floatfld; // 4 bytes
    };

    sequence<structPayload> seqstruct;

    class Hello
    {
        nonmutating void sayHello();
        idempotent void shutdown();
        void thruput(seqbyte payload);
        void structThruput(seqstruct payload);
    };


    As usual,

    the following link will show the subset of data from my QoS website that was used to produce this graph.

    http://www.atl.external.lmco.com/projects/QoS/compare/cgi-bin/left2_part1.cgi?filter=smp/.*(Ice|ORBexpressRT|/tao/.*(octet$|struct$))

    (If you take the trouble to follow this link, be sure it is used in its entirety and that it is not split across lines. It is also possible to use the main URL and follow down to "MW_Comparator".)

    http://www.atl.external.lmco.com/projects/QoS/

    Regards,

    Gautam
  • marc (Florida)
    Thanks for the performance tests.

    Just to make sure, you compiled Ice with optimization, right?
  • Yes, in my file Make.rules I have:

    OPTIMIZE = yes

    Gautam
  • marc (Florida)
    Thanks for the info. We must definitely look into this.

    However, for small messages, we cannot reproduce your results. We get lower latency in Ice compared to TAO for small messages. (See also CatOne's performance results.) Again, it is important to use the same concurrency models, otherwise latency comparisons are meaningless.

    I'm also confused: Your first test report showed a similar latency for small messages for TAO and Ice, but your new test shows a much larger difference. Which test is right?

    Finally, as for OrbExpress, either something is misconfigured with both Ice and TAO, or OrbExpress doesn't use TCP/IP. With the numbers from your graphics, OrbExpress would be like > 10 times faster in a latency test. This would be well beyond raw socket speed, meaning such speed is impossible with TCP/IP.
  • Marc,

    First of all, I want to reiterate that all data I have is online so these graphs can be reproduced. What I mean is that one can regenerate different graphs showing different comparisons. There are so many different ways to look at the data. That said, I will try to address the points you make.

    1) The purpose of my last post is to compare "struct" marshalling cost. So I used TAO 1.2.2 results, for which I have both octet and struct data. I don't have TAO 1.3.1 struct results yet. TAO 1.3.1 is a bit slower than TAO 1.2.2. Thus, in the prev. graph Ice 1.0.1 and TAO 1.3.1 were close at the low end. Now I show TAO 1.2.2 results, and that is a bit faster. (But as I have said, the first factor of 1.5-2 is not always that important.)

    2) I don't know how you conclude from the graph that ORBexpress is > 10 times faster than either TAO or Ice. The Y axis is indeed log-scale, but there is not an entire factor of 10 difference in the curves.

    From my website I generated a new plot comparing the time it takes for two processes to exchange octets of information. Fastest is shared memory (this is an SMP machine). It bypasses the network stacks, and at the low end it is as fast as two context switches. Next comes TCP/IP, after that ORBexpress, then TAO, and then ICE. You can see this in the attached graphic.

    I hope this is clear.
  • Since we are on this topic, let us once again return to just the octet performance. (I think Marc has prev. said that for large msg sizes Ice prob. has some optimizations that can be done; this seems esp. true when structs are involved.)

    The attached graphic shows TAO and ORBexpress (1.3.1 and 2.3.5 versions, respectively), with default concurrency and with a "single threaded" configuration. Here one sees that the TAO 1.3.1 "default" configuration is slower than Ice 1.0.1. However, there is a crossover for larger msg sizes. You will be able to gain this back with some optimizations.

    The purpose of my work is not to understand performance issues in the large and in the small. Thus, I hope the numbers speak for themselves - we have no commercial interests in any of the stuff we study.

    Gautam
  • marc (Florida)
    Originally posted by gthaker
    Marc,

    First of all, I want to reiterate that all data I have is online so these graphs can be reproduced. What I mean is that one can regenerate different graphs showing different comparisons. There are so many different ways to look at the data. That said, I will try to address the points you make.

    1) The purpose of my last post is to compare "struct" marshalling cost. So I used TAO 1.2.2 results, for which I have both octet and struct data. I don't have TAO 1.3.1 struct results yet. TAO 1.3.1 is a bit slower than TAO 1.2.2. Thus, in the prev. graph Ice 1.0.1 and TAO 1.3.1 were close at the low end. Now I show TAO 1.2.2 results, and that is a bit faster. (But as I have said, the first factor of 1.5-2 is not always that important.)


    Thanks for the clarification. It is clear that there is a problem in Ice with long sequences, and we will definitely improve this.

    I believe, however, that for latency tests, a factor of 1.5-2 is very significant. All I wanted to say is that I don't understand why TAO is 1.5-2 times faster in your tests, when Ice is faster in our tests. Again, I believe this is because of different concurrency models.
    Originally posted by gthaker

    2) I don't know how you are concluding about the graph showing ORBexpress to be > 10 faster than either TAO or Ice. The Y axis is indeed logscale but there is not an entire factor of 10 difference in the curves.

    Yes, you are right, I misread the scale. I guess it's more like a factor of 5 perhaps, which is still surprising. I know that OrbExpress is probably the fastest ORB around, but I wouldn't have expected such a big difference. However, while I don't believe that Ice can match OrbExpress performance, I think it will be a lot closer if OrbExpress uses the same concurrency models. (But this is pure speculation, as I don't have access to OrbExpress.)
    Originally posted by gthaker

    From my website I generated a new plot comparing the time it takes for two processes to exchange octets of information. Fastest is shared memory (this is an SMP machine). It bypasses the network stacks, and at the low end it is as fast as two context switches. Next comes TCP/IP, after that ORBexpress, then TAO, and then ICE. You can see this in the attached graphic.

    I hope this is clear.

    Thanks, this is very helpful.

    A great test to add to your test suite would be a nested test, which is not possible with simple concurrency models. For example, an application with a nesting level of 5, measuring the overall time needed for all calls. You'll find a nested demo in demo/Ice/nested.
  • marc (Florida)
    Originally posted by gthaker
    Since we are on this topic, let us once again return to just the octet performance. (I think Marc has prev. said that for large msg sizes Ice prob. has some optimizations that can be done; this seems esp. true when structs are involved.)

    The attached graphic shows TAO and ORBexpress (1.3.1 and 2.3.5 versions, respectively), with default concurrency and with a "single threaded" configuration. Here one sees that the TAO 1.3.1 "default" configuration is slower than Ice 1.0.1. However, there is a crossover for larger msg sizes. You will be able to gain this back with some optimizations.

    Thanks a lot! This is the result I would have expected. We will definitely work on improving the large-message problem.
    Originally posted by gthaker
    The purpose of my work is not to understand performance issues in the large and in the small. Thus, I hope the numbers speak for themselves - we have no commercial interests in any of the stuff we study.

    Gautam

    I wasn't suggesting anything like this. It's just that we take pride in our work here at ZeroC, so we want to make sure that if Ice is compared to some other product, that such comparison is "fair". This latest test definitely is fair, and it shows better performance for Ice for small messages, and better performance for TAO for large messages.

    As for OrbExpressRT, I must admit that we can't reach their performance.
  • Actually, I made a typo. Our purpose *IS* to understand performance issues in the large and small scales. Thus I try to collect data such as shared memory, TCP/IP, various ORBs and other MW, RMI, EJB etc. And when doing a lot of testing of course configurations are important. (Hopefully we are not making any gross errors but constant cross checking helps - for example, IIOP based ORBs better always be slower than TCP, etc.)

    I will try to see about the "nested" test that you mentioned.

    Gautam
  • Hi Gautam,

    I was thinking last night -- one of the things that Ice has which is "new and different" from other middleware is the ability to compress messages. For use cases such as yours (varying packet sizes) I'd guess this could make a significant difference, if the payload is compressible.

    What are you using for your messages? Any chance you could run that same benchmark with protocol compression on and see how it performs? I'm quite curious about this -- I'd be interested to know if the bzip2 compression and trading cpu cycles for network latency has a reasonable effect.

    Of course if you're sending chunked .jpeg files or something it won't help at all ;-)
  • marc (Florida)
    Over a fast network (or even loopback), compression will rather slow down requests. Compression is intended to be used for slow connections, like modems.

    Anyway, we already know what the problem with long messages is. We will fix this in the next release.
  • Marc,

    That's interesting to know. The documentation doesn't mention anything about that (that compression is intended for slow connections) -- perhaps it would be useful to note it there? Do you expect there's any payload size where it would make a difference over TCP (say, 5 MB)?

    With your post in the announcements forum about Mutable Realms and their "Wish" game I have a guess as to which customer requested this feature.
  • marc (Florida)
    Originally posted by CatOne
    Marc,

    That's interesting to know. The documentation doesn't mention anything about that fact (that compression is for an intended use case) -- perhaps it would be useful to note there? Do you expect there's any payload size where it would make a difference over TCP (say, 5 MB)?

    While I don't have any concrete measurements for this, I think for a fast network, the time to compress the data will always be longer than the actual transmission time. So compression on a fast network is not recommended.

    You can make a simple test: Take a 5MB file, and compress it with bzip2 -1, and measure the time. Compare this time to the time needed to send the 5MB file over your network. I'm pretty sure the compression time will be longer in this case.
    Originally posted by CatOne
    With your post in the announcements forum about Mutable Realms and their "Wish" game I have a guess as to which customer requested this feature.

    Your guess is correct :) For Mutable Realms' game project, lowering the bandwidth consumption is very important. They don't use compression within the backend, but it is used for all communications between the backend and the game client. Using Glacier, this can all be done transparently.
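Marc's back-of-envelope argument above can be made concrete: compression pays off only when compress time plus compressed transfer time beats plain transfer time. A sketch of that break-even check (all rates and ratios in the example are illustrative assumptions, not measurements of bzip2 or of any network):

```cpp
// Seconds to push `bytes` over a link of the given throughput.
double transferSeconds(double bytes, double bytesPerSec)
{
    return bytes / bytesPerSec;
}

// True if compressing first is faster end-to-end than sending raw.
// `ratio` is compressed size / original size (e.g. 0.5 = halved).
bool compressionPaysOff(double bytes, double ratio,
                        double compressBytesPerSec, double linkBytesPerSec)
{
    const double plain = transferSeconds(bytes, linkBytesPerSec);
    const double compressed = bytes / compressBytesPerSec              // compress
                            + transferSeconds(bytes * ratio,           // then send
                                              linkBytesPerSec);
    return compressed < plain;
}
```

With an assumed bzip2 -1 throughput of ~10 MB/s and a 2:1 ratio, a 5 MB payload loses on a 100 Mbit/s LAN (~12.5 MB/s) but wins overwhelmingly on a 56k modem (~6 KB/s), which matches Marc's point that the feature targets slow connections.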
  • Comparing Ice 1.0.1 with Ice 1.1.0

    For the sake of completeness I repeated my prev. roundtrip latency measurements with Ice version 1.1.0. In the attached graphs I show roundtrip latencies for "octet" and "struct" messages of different sizes. (Details are in prev. postings of this thread.)

    While the compiler was kept constant at GCC 3.2.1, the hardware underwent an OS change from Linux 2.4.18 to 2.4.20. So I also show on each chart the corresponding TCP/IP roundtrip latencies. While the Ice 1.1.0 octet measurement is slower than the Ice 1.0.1 measurement, the difference essentially matches the amount by which the kernel has slowed down. One sees a similar diff for 1.1.0 structs at small message sizes, but it seems that 1.1.0 has improved over 1.0.1 for transporting arrays of structs.

    Note: I just realized that I think I can only attach 1 file per posting. So I will attach the "octet" graphic here, the "struct" data with the next post, and "both" in a 3rd post. Sorry if I missed a way to attach multiple .png files in one posting.

    As usual, all data is at:

    http://www.atl.external.lmco.com/projects/QoS -> "MW_Comparator" and

    Ice and TCP subset of results can be reached more easily at:

    http://www.atl.external.lmco.com/projects/QoS/compare/cgi-bin/left2_part1.cgi?filter=Ice%7C%28smp.*transport%29

    Gautam Thaker
  • Part II of 1.0.1 and 1.1.0 comparison

    Here is the "struct" graph.

    Gautam
  • Part III of Ice 1.0.1 and 1.1.0 comparison

    Here is the final graph which shows all data together.

    Gautam
  • marc (Florida)
    Thanks a lot for the performance tests. This is very helpful for us, as we continue to try to further improve performance.
  • omniORB performance

    Hi,

    In my current project I employed TAO; after some simple tests it turned out that TAO is several times slower than omniORB. (I've moved to omniORB now.)

    Please consider the following results provided by

    http://nenya.ms.mff.cuni.cz/projects/corba/results-xampler/All

    (Distributed Systems Research Group, Charles University, Prague)

    Real time does not mean high performance. In my opinion it doesn't make much sense to compare TAO with ICE.

    An extensive comparison between omniORB and ICE would be extremely interesting. (Since I work with embedded stuff, memory footprints are of major interest too.)

    Anyway, congratulations to the ICE developers, it seems to be great^^
  • marc (Florida)
    We might include omniORB in future performance comparisons, but don't hold your breath. It's a lot of work to do these tests thoroughly.

    As for realtime, I have yet to understand what it really means in the context of non-realtime operating systems and non-realtime transports...