Any benchmarks on using bzip2 compression?

I was wondering what people's experience has been with using the bzip2 compression scheme in Ice (or other frameworks) to reduce bandwidth requirements on slow (1200-9600 baud) networks? We have an application that ships arrays of ints and floats around the network in a binary format and were considering adding compression to the protocol. I imagine the performance will greatly depend on the data being sent, but it'd be interesting to hear of any experiences people have had with this approach.

I noticed that the Java library for Ice doesn't support bzip2 compression. Is that due to not having a decent bzip2 compression library for Java, or is there some other reason? (BTW, I noticed that Apache's Ant distribution includes a bzip2 input/output stream for Java. I have no idea how it performs though.)

Comments

  • I think the efficiency really depends on the data contents. For example, you should not expect a high compression ratio when transferring GIF pictures, which are already compressed. So it is more reliable to run the test yourself.
    I suggest you dump your typical data to a file, then call the bzip2 command on it directly and check the results.
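The dump-and-compress test suggested above can also be sketched in a few lines of Python with the standard `bz2` module. This is only an illustration: the two payloads are made-up stand-ins (not data from this thread), `compression_ratio` is a hypothetical helper, and real application data may behave quite differently.

```python
import bz2
import random
import struct


def compression_ratio(payload: bytes) -> float:
    """Return compressed size divided by raw size (lower is better)."""
    return len(bz2.compress(payload)) / len(payload)


# Assumed stand-in data: a slowly changing integer signal ...
smooth_ints = [1000 + (i // 50) for i in range(2000)]
smooth_payload = struct.pack(f"<{len(smooth_ints)}i", *smooth_ints)

# ... and noisy 32-bit floats, whose mantissa bytes look almost random.
rng = random.Random(0)  # fixed seed so runs are repeatable
noisy_floats = [rng.gauss(0.0, 1.0) for _ in range(2000)]
noisy_payload = struct.pack(f"<{len(noisy_floats)}f", *noisy_floats)

smooth_ratio = compression_ratio(smooth_payload)
noisy_ratio = compression_ratio(noisy_payload)
print(f"smooth ints : {smooth_ratio:.2f}")
print(f"noisy floats: {noisy_ratio:.2f}")
```

Swapping in a dump of your own binary stream for the two stand-in payloads should give a first impression of what bzip2 can do for it.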
  • Compression is CPU-intensive, so it's worth it only if the link bandwidth is fairly low. Exactly where the trade-off point is depends too much on the speed of your CPU, the link bandwidth, and the nature of the data to make any general statements. You really have to try this to find out whether it's worth it for your application. (You can use the Ice.Override.Compress property to force compression on or off, so it's easy to try this and take some measurements.)

    We'll have a look at Bzip2 for Java and see whether it is suitable.

    Cheers,

    Michi.
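To make the trade-off described above concrete, here is a rough back-of-the-envelope model. It is a sketch under stated assumptions, not a measurement: it treats the line rate as plain bits per second with 8 bits per byte and no framing overhead, and the message size, compressed size, and CPU cost are invented numbers. Both function names are hypothetical.

```python
def transfer_seconds(size_bytes: int, bits_per_second: float) -> float:
    """Time to push size_bytes over the link (assumes 8 bits per byte, no framing)."""
    return size_bytes * 8 / bits_per_second


def compression_pays_off(raw_bytes: int, compressed_bytes: int,
                         cpu_seconds: float, bits_per_second: float) -> bool:
    """True if compressing, sending, and decompressing beats sending the raw bytes."""
    plain = transfer_seconds(raw_bytes, bits_per_second)
    packed = cpu_seconds + transfer_seconds(compressed_bytes, bits_per_second)
    return packed < plain


# Hypothetical numbers: an 8000-byte message that compresses to half its size
# at a total cost of 50 ms of CPU time (compress plus decompress).
slow_link = compression_pays_off(8000, 4000, 0.05, 9600)         # 9600 bit/s line
fast_link = compression_pays_off(8000, 4000, 0.05, 100_000_000)  # 100 Mbit/s Ethernet
print(slow_link, fast_link)  # prints "True False"
```

With these assumed figures the slow line saves about 3.3 seconds per message, while on the fast network the 50 ms of CPU time dwarfs the transfer time. Plugging in measured sizes and timings from your own application is what actually settles the question.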
  • I did a little simple benchmarking with bzip2 to see how much it'd compress our data streams. It really varied a lot depending on the data set. Sometimes bzip2 compression yielded only a 10% reduction in the binary stream and other times it reduced it by 50-80%. Another variable to add to the mix is the buffer size used in the compression. Reading over the Ice documentation it looks like the chances of reducing the size of the stream would be increased by batching as well. (I imagine more practical experience is needed to see if there is a real payoff.)

    I imagine batching would increase the throughput on slow connections as well, since it appears from benchmarks like:

    Benchmark Comparison

    that communication latency doesn't change much with message size until it gets fairly large (> 1k-10k)

    Have you done benchmarks with varying amounts of batching?

    FYI:

    The bzip2 library is located in the apache-ant-1.6.2\src\main\org\apache\tools\bzip2 directory of the
    distribution downloaded from http://ant.apache.org/srcdownload.cgi

    This guy broke out the library into a separate distribution:
    http://www.kohsuke.org/bzip2/

    These guys are using a modified version of the library as well:
    http://www.jaxlib.org/docs/api/jaxlib/arc/bzip2/BZip2InputStream.html
    downloaded from:
    http://sourceforge.net/project/showfiles.php?group_id=54897

    Thanks for the responses!
  • Compression really only makes sense for slow connections. If you have a fast Ethernet, then most likely compressing and uncompressing data will take longer than sending the uncompressed message.

    Compression also doesn't help much for small messages. That's why we don't compress for messages < 100 bytes. That's an experimental value: we discovered that below about 100 bytes the compressed message size is in most cases equal to or even larger than the uncompressed message size.

    Batched messages are ideal if you have slow connections and many small oneway requests. In this case, compression of individual oneways wouldn't yield any good compression ratio, but compression of a batch message consisting of many oneways typically yields a very high compression ratio.

    To summarize, you should use compression only if:
    • You use a slow connection (modem, ISDN, DSL, etc.). Don't use compression on fast internal networks.
    • Your messages are large enough so that compression makes a difference. If you have many small oneway messages, you can use batch oneways instead.
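The point about small messages and batching can be illustrated with Python's standard `bz2` module. This is a minimal sketch, not Ice itself: the 21-byte messages are invented stand-ins for small oneway requests, and simply joining them ignores the real Ice protocol headers and batch encoding.

```python
import bz2
import struct

# Hypothetical oneway requests: a fixed operation name plus two small ints
# (13 + 8 = 21 bytes each).
messages = [b"sensor/update" + struct.pack("<ii", i, 1000 + i % 5)
            for i in range(200)]

raw_total = sum(len(m) for m in messages)
# Compressing each tiny message on its own pays bzip2's fixed overhead every time.
individual_total = sum(len(bz2.compress(m)) for m in messages)
# Compressing the concatenated batch lets bzip2 exploit repetition across messages.
batch_total = len(bz2.compress(b"".join(messages)))

print(f"raw bytes:               {raw_total}")
print(f"compressed one by one:   {individual_total}")
print(f"compressed as one batch: {batch_total}")
```

For messages this small, compressing one by one makes the total larger than the raw data, while compressing the whole batch makes it smaller, which matches the reasoning behind the 100-byte cutoff and the batch oneway recommendation.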