
Ice C++ Performance

Hi,

I was running some performance tests using a simple Ice client and server with dynamic invocation (no Slice-generated code).

Ice Server: AMD only
The Ice server simply calls ice_response(...) every time it is invoked.
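
In essence the servant looks something like this (simplified sketch, Ice 3.x C++ mapping; the class name is made up):

    #include <Ice/Ice.h>

    // AMD blobject servant: every incoming request is completed immediately
    // with "success" and an empty marshaled out-parameter sequence.
    class EchoBlobject : public Ice::BlobjectAsync
    {
    public:
        virtual void
        ice_invoke_async(const Ice::AMD_Object_ice_invokePtr& cb,
                         const std::vector<Ice::Byte>& /*inParams*/,
                         const Ice::Current& /*current*/)
        {
            cb->ice_response(true, std::vector<Ice::Byte>());
        }
    };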

Ice Client: Testing Sync as well as Async
The client measures the time it takes to complete 1,000 iterations for different data sizes (1,000 bytes; 10,000 bytes; 100,000 bytes; 1,000,000 bytes).
The client calls ice_invoke(...) and ice_invoke_async(...), passing the start and end pointers of the data buffers.
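
The synchronous loop is essentially this (simplified sketch; shown with the std::vector overload of ice_invoke rather than the pointer-pair one, and the operation name is made up):

    #include <Ice/Ice.h>

    // Invoke the blobject "iterations" times with a dummy payload of the
    // given size; the elapsed time is measured around this function.
    void
    runSyncTest(const Ice::ObjectPrx& proxy, int iterations, size_t payloadSize)
    {
        std::vector<Ice::Byte> inParams(payloadSize, 0);
        std::vector<Ice::Byte> outParams;

        for(int i = 0; i < iterations; ++i)
        {
            // A "false" return value would mean the server replied with a
            // user exception marshaled into outParams.
            proxy->ice_invoke("op", Ice::Normal, inParams, outParams);
        }
    }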

The testing is done in loopback mode on a 2.8 GHz Pentium D processor with 2 GB of RAM running Windows XP.

The following are the numbers I got:

Sync Calls
Data size          Time (msec)    Requests per sec
1,000 bytes        0.177162268    5644.542776
10,000 bytes       0.230689173    4334.837162
100,000 bytes      1.054430865    948.3789153
1,000,000 bytes    12.53972903    79.74653977

Async Calls
Data size          Time (msec)    Requests per sec
1,000 bytes        0.167158707    5982.338688
10,000 bytes       0.22936461     4359.870514
100,000 bytes      1.079821265    926.0791877
1,000,000 bytes    12.49719194    80.01797562

The number of requests per sec for larger data seems to be much better than for the smaller data sizes (roughly 44 Mbits/sec for the 1,000 bytes), which was kind of surprising to me.

Can someone shed any light on this? I was actually expecting the dynamic invocations to be better from a performance perspective.

Thanks,
Gesly

Comments

  • matthew (NL, Canada)
    The number of requests per sec for larger data seems to be much better than for the smaller data sizes (roughly 44 Mbits/sec for the 1,000 bytes), which was kind of surprising to me.

    First you talk about requests per second, and then throughput. Which are you interested in? You will get higher throughput with larger blocks of data over a local network because the runtime spends more time writing and reading data than it does waiting for replies.
    Can someone shed any light on this? I was actually expecting the dynamic invocations to be better from a performance perspective.

    What do you mean by dynamic invocations? You were testing sync and async invocations.

    I recommend reading my article "Optimizing Performance of File Transfers" in Issue 20 of Connections for some more discussion of this issue.
  • The number of requests per sec for larger data seems to be much better than for the smaller data sizes (roughly 44 Mbits/sec for the 1,000 bytes), which was kind of surprising to me.

    What you are seeing are the effects of latency. For small requests, a lot of time is wasted because the client sends the request, and then waits for an eternity (comparatively) for the reply. For example, your 10,000 byte test achieves almost the same throughput as the 1,000 byte test. What this shows is that, if you go on the wire at all, you might as well send a decent amount of data with the request because whether you send 1kB or 10kB makes hardly any difference: up to somewhere around 5-10kB, the overall throughput is dominated by latency, not by the bandwidth of the network. (The exact trade-off point depends on your network bandwidth and CPU speed.)

    Once you get to 100,000 bytes per request, what dominates overall throughput is how quickly the data can be marshaled and the network bandwidth. That is, the test is no longer limited by the call latency, but by throughput. In essence, the 100,000 byte test will run close to the network bandwidth (at least for simple data types that are cheap to marshal).

    The 1,000,000 byte test has almost the same throughput as the 100,000 byte test. The reason it is a little bit slower than the 100,000 byte test is that the run time cannot start an RPC in the server until all in-parameters have been unmarshaled by the server, and it cannot complete an RPC in the client until all the reply data has been unmarshaled in the client. This means that, for very large requests, the transfers get more "bursty", so you lose a little overall throughput in the 1,000,000 byte test as compared to the 100,000 byte test.
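
    (To put numbers on this, working back from your own table: the 1,000 byte test moves roughly 1,000 × 8 × 5,645 bits per second, about 45 Mbit/s, whereas the 1,000,000 byte test moves roughly 1,000,000 × 8 × 80 bits per second, about 640 Mbit/s. The small-request tests are limited by latency, the large-request tests by how fast the data can be pushed through.)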

    As far as sync/async and static/dynamic invocation are concerned, as your tests show, there is little difference between synchronous and AMI calls. That's because, for large requests, any differences in the implementation of the two invocation modes are swamped by the cost of transmitting the data. In other words, the test is dominated by bandwidth considerations, not invocation mode. For small requests, you also won't see any significant difference, because then the test is dominated by the latency of going on the wire, so, again, any differences in the implementation of the two invocation modes are swamped by the network latency.

    You don't show your code, so I can't tell exactly what you have done. I suspect, however, that your async version behaves like the sync version, that is, the client does not start the next AMI invocation until the result of the previous AMI invocation has arrived. If you do this, you definitely will not see much of a difference in performance because, in effect, your code simply emulates the behavior of synchronous invocations. However, you will see a difference if you allow the client to have more than one outstanding AMI call. This can provide increased overall throughput because it keeps the TCP pipe full, that is, the client can do I/O in one thread while another thread is using the CPU to process data. (IcePatch2 uses this technique to get high throughput for file transfers.)
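
    Keeping a window of outstanding calls might look roughly like this with the AMI callback classes for dynamic invocation (a sketch only; the flow-control details and names are illustrative, not tested code):

        #include <Ice/Ice.h>
        #include <IceUtil/Monitor.h>
        #include <IceUtil/Mutex.h>

        // Counts outstanding dynamic AMI invocations so the sender can keep a
        // fixed number of requests in flight (the pipe full) at all times.
        class CountingCallback : public Ice::AMI_Object_ice_invoke,
                                 public IceUtil::Monitor<IceUtil::Mutex>
        {
        public:
            CountingCallback() : _outstanding(0) {}

            void started()
            {
                IceUtil::Monitor<IceUtil::Mutex>::Lock lock(*this);
                ++_outstanding;
            }

            void waitUntilBelow(int limit)
            {
                IceUtil::Monitor<IceUtil::Mutex>::Lock lock(*this);
                while(_outstanding >= limit)
                {
                    wait();
                }
            }

            virtual void ice_response(bool, const std::vector<Ice::Byte>&)
            {
                finished();
            }

            virtual void ice_exception(const Ice::Exception&)
            {
                finished();
            }

        private:
            void finished()
            {
                IceUtil::Monitor<IceUtil::Mutex>::Lock lock(*this);
                --_outstanding;
                notify();
            }

            int _outstanding;
        };

        void
        runPipelinedTest(const Ice::ObjectPrx& proxy, int iterations,
                         const std::vector<Ice::Byte>& inParams, int window)
        {
            CountingCallback* cb = new CountingCallback; // reference counted
            Ice::AMI_Object_ice_invokePtr cbPtr = cb;

            for(int i = 0; i < iterations; ++i)
            {
                cb->waitUntilBelow(window); // allow at most "window" in flight
                cb->started();
                proxy->ice_invoke_async(cbPtr, "op", Ice::Normal, inParams);
            }
            cb->waitUntilBelow(1);          // wait for the remaining replies
        }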

    As far as static versus dynamic dispatch is concerned, any performance differences will be negligible. That is because, with AMD, the application code simply does what, with static dispatch, the generated code would do. So, AMD and static dispatch substantially do the same things, it's just that the location of the code that calls the marshaling and unmarshaling functions changes.

    Cheers,

    Michi.
  • What do you mean by dynamic invocations? You were testing sync and async invocations.

    What I meant was using the dynamic Ice mechanisms.
    I recommend reading my article "Optimizing Performance of File Transfers" in Issue 20 of Connections for some more discussion of this issue.

    Will do that

    Thanks
    Gesly
  • Thanks Michi!
    You don't show your code, so I can't tell exactly what you have done. I suspect, however, that your async version behaves like the sync version, that is, the client does not start the next AMI invocation until the result of the previous AMI invocation has arrived. If you do this, you definitely will not see much of a difference in performance because, in effect, your code simply emulates the behavior of synchronous invocations. However, you will see a difference if you allow the client to have more than one outstanding AMI call. This can provide increased overall throughput because it keeps the TCP pipe full, that is, the client can do I/O in one thread while another thread is using the CPU to process data. (IcePatch2 uses this technique to get high throughput for file transfers.)

    My code is pretty straightforward. I am making the async calls continuously in a loop n times (1,000 times for the above test results). But I am using the default thread pool size of one thread. So I guess I do not have the luxury of one thread processing I/O and another processing data. (Is this assumption correct?)
    As far as static versus dynamic dispatch is concerned, any performance differences will be negligible. That is because, with AMD, the application code simply does what, with static dispatch, the generated code would do. So, AMD and static dispatch substantially do the same things, it's just that the location of the code that calls the marshaling and unmarshaling functions changes.

    Completely understand this. Just realized it after I made the post.

    Maybe a bad idea to post in the same reply, but I had a question regarding IceStorm. Is there a quick way by which I could run IceStorm in-process, such that my code can set up an IceStorm mechanism and send published data to it without a process hop? All the subscribers will register with my IceStorm setup. Also, I should be able to do the publish/subscribe mechanisms using dynamic Ice, right? The examples are all static, but it looks like it should be similar.

    Thanks
    Gesly
  • gesly wrote: »
    But I am using the default thread pool size of one thread. So I guess I do not have the luxury of one thread processing I/O and another processing data. (Is this assumption correct?)

    Yes. If there is only one thread, then the thread cannot do anything else while it is waiting for data to arrive or while it is waiting for the kernel to accept data to be transmitted.
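
    If you want to experiment, you can raise the client-side thread pool size before initializing the communicator, for example (the sizes here are just an example; the server-side pool is configured the same way via Ice.ThreadPool.Server.*):

        Ice::PropertiesPtr props = Ice::createProperties();
        props->setProperty("Ice.ThreadPool.Client.Size", "2");
        props->setProperty("Ice.ThreadPool.Client.SizeMax", "4");

        Ice::InitializationData initData;
        initData.properties = props;
        Ice::CommunicatorPtr communicator = Ice::initialize(initData);
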
    I had a question regarding IceStorm. Is there a quick way by which I could run IceStorm in-process, such that my code can set up an IceStorm mechanism and send published data to it without a process hop? All the subscribers will register with my IceStorm setup.

    IceStorm is an IceBox service. If you write your publisher as an IceBox service and run it in the same IceBox as IceStorm, messages published to IceStorm can be sent in-process. Collocating IceStorm with your own code without using IceBox is likely to be more complicated--you would have to implement at least part of the IceBox functionality in your application, or change the source code for IceStorm (I expect--I haven't tried this myself).
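
    As a rough illustration, the IceBox configuration could look something like this (the publisher service's library and entry point names are made up, and the IceStorm entry point's version suffix depends on your Ice release):

        # Run IceStorm and your own publisher service in the same IceBox.
        IceBox.Service.IceStorm=IceStormService,32:createIceStorm --Ice.Config=config.icestorm
        IceBox.Service.MyPublisher=MyPublisherService:createMyPublisher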

    I'd be cautious about either approach though. Unless you have hard data that proves that this actually provides an overall performance advantage, it's likely to be a waste of time. If you go this way, you should first establish that it will actually be worth it...
    Also, I should be able to do the publish/subscribe mechanisms using dynamic Ice, right? The examples are all static, but it looks like it should be similar.

    Absolutely any operation invocation can be made statically or dynamically. The server is ignorant of how the client invokes an operation (statically or dynamically), and the client is ignorant of how the server dispatches it.
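
    For example, a dynamic publisher is essentially this (sketch only; the endpoint, topic name, and operation name are made up, "communicator" is your Ice::CommunicatorPtr, and you would fill inParams with the marshaled parameters, e.g. via the streaming interfaces):

        #include <Ice/Ice.h>
        #include <IceStorm/IceStorm.h>

        // Look up the topic and obtain its publisher proxy, then publish an
        // event with a dynamic oneway invocation.
        IceStorm::TopicManagerPrx manager = IceStorm::TopicManagerPrx::checkedCast(
            communicator->stringToProxy("IceStorm/TopicManager:default -p 10000"));
        IceStorm::TopicPrx topic = manager->retrieve("SomeTopic"); // throws NoSuchTopic if it does not exist
        Ice::ObjectPrx publisher = topic->getPublisher()->ice_oneway();

        std::vector<Ice::Byte> inParams;   // marshaled in-parameters go here
        std::vector<Ice::Byte> outParams;  // unused for oneway invocations
        publisher->ice_invoke("someOperation", Ice::Normal, inParams, outParams);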

    Cheers,

    Michi.
  • I understand. I am not yet sure that running IceStorm in-process has advantages.

    Does IceStorm queue the data from publishers? I understand that in IceStorm 3.2 there is a publisher thread pool and a subscriber thread pool. IceStorm must be queuing up the data from the publishers so that the subscriber pool can act on them. Is there a separate queue for each topic, or does a thread in the subscriber pool figure out which topic an event belongs to? How does IceStorm map published data to its recipients? This mechanism and its relation to the thread pools is not clear to me.

    Thanks
    Gesly
  • matthew (NL, Canada)
    Does IceStorm queue the data from publishers? I understand that in IceStorm 3.2 there is a publisher thread pool and a subscriber thread pool.

    There is no publisher thread pool. When a message is published to IceStorm, whatever concurrency model has been configured for the IceStorm service is used to process the incoming event (thread pool or thread per connection).
    IceStorm must be queuing up the data from the publishers so that the subscriber pool can act on them. Is there a separate queue for each topic, or does a thread in the subscriber pool figure out which topic an event belongs to? How does IceStorm map published data to its recipients? This mechanism and its relation to the thread pools is not clear to me.

    When an event is published, a single reference-counted event is created and then pushed onto a per-subscriber queue. Threads from the subscriber pool then process the per-subscriber queue as fast as possible.
  • matthew wrote: »
    There is no publisher thread pool. When a message is published to IceStorm, whatever concurrency model has been configured for the IceStorm service is used to process the incoming event (thread pool or thread per connection).



    When an event is published, a single reference-counted event is created and then pushed onto a per-subscriber queue. Threads from the subscriber pool then process the per-subscriber queue as fast as possible.

    Ok, so there is a per-subscriber queue. That means IceStorm will maintain a single queue for a subscribing client that is subscribed to 2 topics, and this allows IceStorm to do the optimization of batching together events (for different topics) to the client. Is this correct?
  • Adding to my post, it also means that if there is a per-subscriber queue, IceStorm can ensure horizontal ordering (ordering for each subscriber) but not vertical ordering (all subscribers for a topic need not be dispatched data at the same time). I am not sure if there is any way of ensuring vertical ordering.

    Regarding QoS, what does it mean if I use a datagram proxy and set reliability = "ordered"? I am guessing that this has no meaning.


    Thanks,
    Gesly
  • matthew (NL, Canada)
    Ok, so there is a per-subscriber queue. That means IceStorm will maintain a single queue for a subscribing client that is subscribed to 2 topics, and this allows IceStorm to do the optimization of batching together events (for different topics) to the client. Is this correct?

    That is incorrect. IceStorm topics have no knowledge of one another in this respect. Each subscriber object is unique per topic & object-id.
    Adding to my post, it also means that if there is a per-subscriber queue, IceStorm can ensure horizontal ordering (ordering for each subscriber) but not vertical ordering (all subscribers for a topic need not be dispatched data at the same time). I am not sure if there is any way of ensuring vertical ordering.

    In the context of horizontal ordering (I've never heard of this term before, btw; is this in standard use in some product?), you only have ordering guarantees in the context of a single publisher, and only if that publisher is sending the messages as twoway, or in a single oneway batch. You can see my article in issue 21 of Connections for more detail.

    There is no way to ensure vertical ordering with IceStorm.
    Regarding QoS, what does it mean if I use a datagram proxy and set reliability = "ordered"? I am guessing that this has no meaning.

    You'll get a BadQoS exception if you pass reliability="ordered" and a non-twoway proxy.