Question regarding a performance issue

Alexandre · March 2019

Dear community, Ice devs,

First I'd like to thank you for the amazing software. I've been working with Ice for 6 months now and I'm very happy that it fits my needs so well.

I've got a performance issue that I can't seem to figure out. Both of the attached Slice files are supposed to hold the same amount of data. I'm using each of them to send a 5 MB array of bytes in my test app.

I can't understand why the slow Ice file is a LOT slower, about 100 times slower than the fast one. I would have expected it to be a bit slower, but not 100x.

Is anyone able to help me understand the issue? Is it the client or the server that is slowing everything down, or maybe both? Is it a known bug, or did I do something wrong?

Is it maybe the structure encapsulation overhead that's artificially increasing the bandwidth? In this case, how can I send a huge amount of nested structures without casting all of that to a raw buffer?

Hope someone can help, it'd be a shame to pass data as raw byte arrays instead of proper structs.

Regards,
Alex

benoit · March 2019

Hi,

The first thing to check is that you're building your application with compiler optimization enabled.

Otherwise, in the fast case, the encoding/decoding can just use a single memcpy-style block copy to marshal/un-marshal the byte array. In the slow case, the generated code has to loop over each element and call a function to encode/decode the struct. That's the main difference between the two I can think of.

You could try to write a small test that does a similar copy for a 5MB data or byte array and see if you can observe the same raw performance difference.

Is the performance difference between the two significant when passing the data over the wire?

Cheers,
Benoit.

Alexandre · March 2019

Thank you for answering.

Right now my tests are done on localhost and I'm running in C#. I didn't check the actual bandwidth being used, only the maximum send rate I could achieve in a while(true) loop. I wasn't sure whether the limitation comes from the bandwidth increasing with structs for some reason, or from the added computation time on the client or the server due to the element iteration you mentioned.

What test do you suggest I should do?
Is 100x slower something you'd expect in my case? (By the way, it was tested in debug and release, with more or less the same result.)

benoit · March 2019

The size on the "wire" should be the same for the two examples. I think the difference comes from the "computation" cost of marshalling the struct (many function calls as opposed to just copying a memory block).

You can try the small attached program. Just compile it from the command line with csc Client.cs after unzipping the file. It basically shows the performance difference between copying a byte array into another byte array and copying a sequence of structs (each with a single byte member) into a byte array. Copying from the structs is a lot slower.
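The attached program is not reproduced in the thread. As a rough illustration of the same kind of comparison, here is a minimal sketch in C++ rather than C#. All names (`ByteStruct`, `marshal_bytes`, `marshal_structs`, `average_micros`) are hypothetical and are not taken from the attachment or from Ice's generated code.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a Slice struct with a single byte member.
struct ByteStruct {
    std::uint8_t b;
};

// Fast path: the whole byte sequence is copied in one block.
void marshal_bytes(const std::vector<std::uint8_t>& src, std::vector<std::uint8_t>& out) {
    out.resize(src.size());
    if (!src.empty()) {
        std::memcpy(out.data(), src.data(), src.size());
    }
}

// Slow path: every struct is visited and its member written individually.
void marshal_structs(const std::vector<ByteStruct>& src, std::vector<std::uint8_t>& out) {
    out.clear();
    out.reserve(src.size());
    for (const ByteStruct& s : src) {
        out.push_back(s.b);
    }
    assert(out.size() == src.size());
}

// Runs fn `iterations` times and returns the average duration in microseconds.
template <typename Fn>
double average_micros(Fn fn, int iterations) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        fn();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count() / iterations;
}
```

To compare the two paths, fill a 5 MB `std::vector<std::uint8_t>` and an equally sized `std::vector<ByteStruct>`, then pass a lambda calling each function to `average_micros`. An optimizing C++ compiler may reduce the per-element loop to something close to a block copy, so the ratio is not expected to match the C# figures quoted in this thread.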

That said, the computational cost of the marshalling shouldn't be your only criterion. The transfer over a real network will take time as well, and so will the processing of the data on the client or server side. It's possible that the cost of the marshalling ends up being insignificant in a real-world application.

Alexandre · March 2019

Hello Benoit, and thank you for the feedback. I've reviewed your code. Is this how things are done in the C# Ice framework?
Any reason why you couldn't do an unsafe memcpy instead? Is it because of possible structure misalignment issues?

Of course, if you loop over every item, it takes ages, especially in your sample, because you're nesting two function calls with structures as parameters (MarshalStruct, then SetByte).

If we take the benchmark's memcpy as 1 unit of time:

BlockCopy takes 1.2
MarshalStruct followed by SetByte takes 112 (yes, 112)
MarshalStruct followed by a direct safe assignment takes 50 (which demonstrates the cost of the nested calls above)
Direct safe assignment in the loop, source2[i].b = buf[i], takes 16 (the fastest safe method)
Direct unsafe assignment in the loop using pointers takes 13

Is there a way to improve that, or a reason I don't know of why it's done the way it is?
Thank you again for your help!

benoit · March 2019

Hi,

It sounds like we could improve the marshalling a bit by not calling SetByte. Thanks for the detailed numbers, we'll have a closer look at improving this!

To answer your question, the Ice marshalling doesn't make any assumptions about the memory layout of generated structs, which can differ depending on the types of the data members, the programming language, etc. Some languages might align the data members on different boundaries or add some padding, for example. So we can't just copy the array as-is. We have to make sure we encode the struct in a format that the receiver can understand, regardless of the processor executing the program and the language mapping used.
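To make the padding point concrete, here is a small C++ sketch. The struct `Sample` and the constant names are hypothetical. It compares the number of bytes the members occupy when written back to back with the number of bytes the compiler reserves for the struct in memory.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical struct mixing a 1-byte and a 4-byte member.
struct Sample {
    std::uint8_t flag;
    std::int32_t value;
};

// Bytes needed when the members are written back to back, with no padding.
constexpr std::size_t packed_size = sizeof(std::uint8_t) + sizeof(std::int32_t);

// Bytes the compiler actually reserves per element, padding included.
constexpr std::size_t in_memory_size = sizeof(Sample);

// Padding inserted by the compiler; implementation-defined, may be zero.
constexpr std::size_t padding_bytes = in_memory_size - packed_size;
```

On many mainstream compilers `in_memory_size` is 8 rather than 5 because `value` is aligned on a 4-byte boundary, but the exact figure is implementation-defined. A raw memory copy of an array of `Sample` would therefore ship bytes that the receiver may not lay out the same way.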

Can you tell us a bit more about the data you want to transmit over the wire? Are you using different languages for your clients and servers or are you only using C#?

Cheers,
Benoit.

Alexandre · March 2019

I'm doing my prototyping with C# on both sides, but the end product will be a C# server with a C++ client.
Basically I'm trying to send a point cloud "chunk" that looks like this:

class Message
{
    int timestamp;
    // irrelevant stuff
}

class RealtimePclMessage extends Message
{
    PclPointList points;
    // more irrelevant stuff
}

struct PclPoint
{
    Float3 position;
    int color;
    int intensity;
}

struct Float3
{
    float x;
    float y;
    float z;
}

sequence<PclPoint> PclPointList;
// sequence<byte> PclPointList; // <-- I changed it to this to increase performance by a factor of 100, doing an unsafe copy of the byte array onto a struct array.

I'd assume that the data alignment needs to be managed properly on the sending side, to convert to Ice "standards", and that language-specific optimisations could be made on the receiving side as long as the result fits the Ice standard.

I'd love to know more about where these bits are done in the Ice code, just to learn about the inner details.

Cheers,

benoit · March 2019

You can find all the details of the Ice data encoding in our manual here: https://doc.zeroc.com/ice/3.7/the-ice-protocol/data-encoding. Ice processes struct data members individually to marshal them into a byte stream. It doesn't add any padding or align the members on any boundary, which results in a compact encoding.

Languages, compilers or runtime environments potentially have different memory layouts for structs, and the layout can also depend on the compiler flags used to build the application. It's impossible to just rely on a memory copy for transferring an array of structs between different platforms/languages unless you know exactly how the struct is packed on both sides and you're sure they are packed the same way. If they aren't, you need code to adapt between the two different memory layouts, which in turn implies that you'll need to encode/decode structs individually as well.

Also depending on the platform, numeric types might use different endianness...

If the marshalling of the PclPointList is really a performance concern, you could consider doing the encoding/decoding yourself and passing the data through Ice as a byte sequence. However, before looking into such an optimization, I really recommend benchmarking the cost of the marshalling/un-marshalling with a real-world application. Very often marshalling is insignificant compared to the network transfers or the processing time of the data on the client or server side.
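As a sketch of what "doing the encoding/decoding yourself" could look like, the C++ below packs each point into 20 bytes (three floats and two ints, 4 bytes each, no padding) using an explicit little-endian byte order, so the result does not depend on the sender's struct layout or endianness. The helper names (`encode_points`, `decode_points`, `put_u32`, `get_u32`) and the 20-byte layout are assumptions made for illustration, not Ice API, and the code assumes 32-bit IEEE 754 floats.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

struct Float3 { float x, y, z; };
struct PclPoint { Float3 position; std::int32_t color; std::int32_t intensity; };

static_assert(sizeof(float) == 4 && std::numeric_limits<float>::is_iec559,
              "this sketch assumes 32-bit IEEE 754 floats");

// Assumed layout: x, y, z, color, intensity, 4 bytes each, back to back.
constexpr std::size_t kEncodedPointSize = 20;

// Appends v as 4 bytes, least significant byte first.
inline void put_u32(std::vector<std::uint8_t>& out, std::uint32_t v) {
    for (int shift = 0; shift < 32; shift += 8) {
        out.push_back(static_cast<std::uint8_t>(v >> shift));
    }
}

// Reads 4 little-endian bytes starting at p.
inline std::uint32_t get_u32(const std::uint8_t* p) {
    return static_cast<std::uint32_t>(p[0]) | (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Bit-pattern conversions between float and a 32-bit integer.
inline std::uint32_t float_bits(float f) { std::uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }
inline float bits_float(std::uint32_t u) { float f; std::memcpy(&f, &u, sizeof f); return f; }

std::vector<std::uint8_t> encode_points(const std::vector<PclPoint>& points) {
    std::vector<std::uint8_t> out;
    out.reserve(points.size() * kEncodedPointSize);
    for (const PclPoint& pt : points) {
        put_u32(out, float_bits(pt.position.x));
        put_u32(out, float_bits(pt.position.y));
        put_u32(out, float_bits(pt.position.z));
        put_u32(out, static_cast<std::uint32_t>(pt.color));
        put_u32(out, static_cast<std::uint32_t>(pt.intensity));
    }
    return out;
}

// Expects a buffer holding a whole number of encoded points.
std::vector<PclPoint> decode_points(const std::vector<std::uint8_t>& bytes) {
    assert(bytes.size() % kEncodedPointSize == 0);
    std::vector<PclPoint> points(bytes.size() / kEncodedPointSize);
    const std::uint8_t* p = bytes.data();
    for (PclPoint& pt : points) {
        pt.position.x = bits_float(get_u32(p));
        pt.position.y = bits_float(get_u32(p + 4));
        pt.position.z = bits_float(get_u32(p + 8));
        pt.color = static_cast<std::int32_t>(get_u32(p + 12));
        pt.intensity = static_cast<std::int32_t>(get_u32(p + 16));
        p += kEncodedPointSize;
    }
    return points;
}
```

The resulting buffer can then be sent as a plain byte sequence. The trade-off is that both ends must agree on this layout by hand, since Slice no longer describes the contents.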

Alexandre · March 2019

Thank you. Actually, I'm already using the byte sequence variant because of that performance issue, which is a real issue in a real-world use case. Being 100 times slower matters a lot when you're sending points, because there are a lot of them.

I was just hoping that I wouldn't have to trick Ice to do that efficiently. Thank you anyway for your great feedback. I'm happy to help if you need any information that might help improve things on your side as well.

Cheers,

benoit · March 2019

I also recommend trying a C++ client during prototyping, to make sure it's not too difficult to support.

Another more efficient way to transmit the points could be to use separate sequences for each data member:

sequence<float> FloatSeq;
sequence<int> IntSeq;

struct PclPointList
{
    FloatSeq x;
    FloatSeq y;
    FloatSeq z;
    IntSeq color;
    IntSeq intensity;
}

I realize it might not be as practical on the receiver/sender side but Ice will be able to more efficiently marshal this list because it can directly perform memory copies for each sequence (assuming you're only using little-endian platforms).
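On the application side, this layout means converting between an array of point structs and one array per member. A minimal C++ sketch of that conversion follows. `PclPointColumns`, `to_columns` and `from_columns` are hypothetical names that mirror the Slice definition above; they are not types generated by Ice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Float3 { float x, y, z; };
struct PclPoint { Float3 position; std::int32_t color; std::int32_t intensity; };

// One contiguous array per data member, mirroring the struct of sequences.
struct PclPointColumns {
    std::vector<float> x, y, z;
    std::vector<std::int32_t> color, intensity;
};

// Splits an array of points into per-member arrays.
PclPointColumns to_columns(const std::vector<PclPoint>& points) {
    PclPointColumns c;
    c.x.reserve(points.size());
    c.y.reserve(points.size());
    c.z.reserve(points.size());
    c.color.reserve(points.size());
    c.intensity.reserve(points.size());
    for (const PclPoint& p : points) {
        c.x.push_back(p.position.x);
        c.y.push_back(p.position.y);
        c.z.push_back(p.position.z);
        c.color.push_back(p.color);
        c.intensity.push_back(p.intensity);
    }
    return c;
}

// Rebuilds the array of points; all columns must have the same length.
std::vector<PclPoint> from_columns(const PclPointColumns& c) {
    const std::size_t n = c.x.size();
    assert(c.y.size() == n && c.z.size() == n && c.color.size() == n && c.intensity.size() == n);
    std::vector<PclPoint> points(n);
    for (std::size_t i = 0; i < n; ++i) {
        points[i].position = Float3{c.x[i], c.y[i], c.z[i]};
        points[i].color = c.color[i];
        points[i].intensity = c.intensity[i];
    }
    return points;
}
```

The conversion itself still loops over every point, so whether this is a net win depends on how the points are produced and consumed. If the application can keep its data in column form throughout, the per-point loop disappears entirely.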

Alexandre · March 2019

That is a great idea. I'll see if that can be applied to our process. Thanks.
