Archived

This forum has been archived. Please start a new discussion on GitHub.

Optimizing class representations for size

I have been playing around with the streaming API to see what sliced class representations look like on the wire (because I'm too lazy to try to compile Wireshark on my Mac). I've noticed that for classes, the fully qualified class name is transmitted as a string. Is there any way to reduce this overhead? I am thinking of a list of well-known class and interface names, shared by each of my Ice applications, that assigns a number (or a very short unique string) to each of the types I have created (or maybe just to a few of those types). These numbers would be transmitted over the wire instead of the full class names.

This level of optimization is probably overzealous, but I'm just curious whether I could implement this, or whether this ability already exists in Ice.

Comments

  • Classes support slicing: if the sender sends a derived instance, and the receiver has never heard of that type, but does understand one of its bases, the sender's derived instance is sliced down to the most-derived type that the receiver understands.

    To make this possible, a type identifier of some kind must be sent over the wire for each inheritance level of the class. We use the Slice type ID for this because the type ID uniquely identifies each Slice type.

    If we wanted something shorter than that, we would have to map Slice type IDs to a number such that no two Slice types end up with the same number. Moreover, sender and receiver would have to agree on the number that is assigned to each Slice type.

    Because the sender may be linked with Slice-generated code that the receiver is not linked with (and vice versa), we cannot use a simple hash or some such for this. Basically, independently compiled Slice definitions would have to produce the same number for each Slice type everywhere.

    So, a simple number simply won't do the job because, as far as I can see, there is no way to universally establish agreement about which number belongs with each of the possible Slice type identifiers. The best we could do would be to compress the type ID somehow (for example, by packing its restricted identifier alphabet into six bits per character, Base-64 style) so it would take up fewer bytes on the wire. However, that is hardly worth it: if you have an operation that passes a few parameters of class type around, the extra few bytes are completely irrelevant. (Up to request sizes of a kilobyte or so, performance is essentially the same because latency dominates and, once the request goes over a kilobyte, saving an extra ten or twenty bytes also makes no difference.)

    Note that, if you send multiple instances of classes of the same type in a single request, the protocol already optimizes bandwidth. For example, you might be sending a sequence of classes. In that case, Ice sends the type ID of a class with the first instance but, for subsequent instances, it only sends an integer that identifies a previously-sent type ID.

    This optimization is worthwhile because it saves bandwidth where it actually matters, namely, for requests that contain many class instances. In effect, the marshaling code already does what you suggest but, instead of relying on a global agreement for how to assign numbers to type IDs, it dynamically re-establishes that agreement for each request. (The sender assigns a number to each type and "teaches" the receiver how it has assigned the numbers.)

    You can read up about the entire thing in detail in the protocol chapter of the Ice Manual.

    Cheers,

    Michi.
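The per-request type-ID table described in the comment above can be sketched as follows. This is a simplified illustration, not the actual Ice wire format, and the module and class names (`::Demo::Telemetry`, `::Demo::Fix`) are made up: the first instance of each class carries the full Slice type ID as a string, and every later instance of the same class carries only a small integer index referring back to it.

```python
def marshal_instances(type_ids):
    """Encode a list of per-instance type IDs, reusing an index for repeats.

    Returns a list of ("string", id) or ("index", n) entries: the full ID is
    written once per request; subsequent occurrences cost roughly one byte.
    """
    table = {}   # type ID -> index assigned on first use within this request
    output = []
    for tid in type_ids:
        if tid in table:
            output.append(("index", table[tid]))   # previously-sent type ID
        else:
            table[tid] = len(table) + 1
            output.append(("string", tid))         # full type ID, sent once
    return output

# Four instances, but only two distinct types: the full IDs go out once each.
encoded = marshal_instances([
    "::Demo::Telemetry",
    "::Demo::Telemetry",
    "::Demo::Fix",
    "::Demo::Telemetry",
])
for entry in encoded:
    print(entry)
```

Note how the table is scoped to a single call: each request re-establishes the numbering from scratch, which is exactly what keeps messages self-contained.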
  • Couldn't you "remember" the IDs per connection instead of per request?
  • This brings with it a whole host of other problems, the biggest of them being that you then have to make connections stateful, i.e., messages would no longer be self-contained. This presents additional challenges and increases the complexity (and decreases the performance) of message routers such as Glacier2 or IceStorm.

    In practice, I don't think the type-ID size really matters. For small requests with single class instances, latency dominates, not message size. For large requests, where throughput matters, such as operations that transfer sequences of class instances, the message format is already very compact because the type IDs are only sent once.

    (As an example of how problematic stateful connections can be, from the not-so-distant past of middleware technology: CORBA character code set negotiation was per connection and, as a result, it never worked properly with the CORBA event or notification services.)
  • I realize that thinking about the class name overhead is probably nit-picking. Perhaps my concern would make more sense given some context: I am experimenting with using Ice to provide a language-independent, network-capable interface for controlling an autonomous underwater vehicle, and with using IceStorm to pass messages between the hard-realtime inertial navigation system and the AI, and between the AI and a GUI on a shore-side computer connected by a tether. Network bandwidth is potentially quite limited for us, but we can exercise extremely strict control over which classes may be passed back and forth over our (very small) network.
  • If your bandwidth is very limited, you might consider using compression. In particular, if you have fine-grained operations, you should consider batch oneway messages in combination with compression, since compression doesn't work well if the protocol message size is too small. Also, if you do not need class inheritance or graphs or cycles, you might consider using simple structs instead of classes.

    You can find some performance data for slow connections here.
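The point above, that compression pays off only once messages are large enough, can be sketched with Python's standard `zlib` module. The telemetry payload below is made up for illustration; repeating one message fifty times is a best case for the compressor, but real batches benefit similarly because field names and structure repeat across messages.

```python
import zlib

# Hypothetical small telemetry message; the field names and sizes are
# illustrative only, not any actual Ice encoding.
message = b'{"depth": 12.7, "heading": 271.4, "pitch": -1.3}'

single = zlib.compress(message)        # compressing one tiny message alone
batch = zlib.compress(message * 50)    # fifty messages batched, compressed once

# A tiny message barely shrinks (zlib's header and checksum can even make it
# larger), while the batch compresses to a small fraction of its raw size.
print("single:", len(message), "->", len(single))
print("batch: ", len(message) * 50, "->", len(batch))
```

This mirrors the advice in the comment: batch fine-grained oneway messages first, then compress, so the compressor has enough data to find redundancy.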
  • To get a handle on the effect of the type IDs, you can give your classes and their enclosing module single-letter names and run tests. By "tests", I mean benchmarking the performance and bandwidth consumption of your application under actual application load scenarios. Then put back the original names and repeat. Chances are that you will notice no difference whatsoever. (Note that you will, of course, notice a difference if you use tests deliberately designed to expose it. The point is that we carefully chose the Ice encoding scheme such that the difference will not be noticed by applications unless they do something quite unusual.)

    As Marc said, if the bandwidth of your link is at a premium, you can use batching and compression to reduce the number of bytes on the wire. But if you are already that close to the limits of your data link, I suspect it is unlikely that shorter type IDs would save the day.

    Remember, per request, each unique type ID is sent over the wire only once; subsequent occurrences typically take up a single byte each. Unless your class instances are very small and contain only very little data, it is very unlikely that you would ever notice this overhead.

    Cheers

    Michi