IceStorm performance

We are currently looking into the performance of the IceStorm service to see if we can use it more wisely. We plan to run some profiling today, and I may be back with more questions once we have done that, but for now I just want to ask about some things I stumbled across while looking at the source code.

In TopicManagerI I found this:
TopicPrx
TopicManagerI::retrieve(const string& name, const Ice::Current&) const
{
    IceUtil::Mutex::Lock sync(*this);

    TopicManagerI* This = const_cast<TopicManagerI*>(this);
    This->reap();

    if(_topicIMap.find(name) != _topicIMap.end())
    {
        Ice::Identity id;
        id.name = name;
        return TopicPrx::uncheckedCast(_topicAdapter->createProxy(id));
    }

    NoSuchTopic ex;
    ex.name = name;
    throw ex;
}

The most interesting bit is the call to reap(), which does the following:
void
TopicManagerI::reap()
{
    //
    // Always called with mutex locked.
    //
    // IceUtil::Mutex::Lock sync(*this);
    //
    TopicIMap::iterator i = _topicIMap.begin();
    while(i != _topicIMap.end())
    {
        if(i->second->destroyed())
        {
            if(_traceLevels->topicMgr > 0)
            {
                Ice::Trace out(_traceLevels->logger, _traceLevels->topicMgrCat);
                out << "Reaping " << i->first;
            }

            _topics.erase(i->first);

            try
            {
                Ice::Identity id;
                id.name = i->first;
                _topicAdapter->remove(id);
            }
            catch(const Ice::ObjectAdapterDeactivatedException&)
            {
                // Ignore
            }

            _topicIMap.erase(i++);
        }
        else
        {
            ++i;
        }
    }
}

It seems to me this makes the cost of a retrieve call linear in the number of topics, all because we want to clean out destroyed topics. I was wondering why it has to be done this way - why isn't the removal simply done when the topic is destroyed?

Likewise, I found this in TopicI:
//
// TODO: Optimize
//
// It's not strictly necessary to clear the error'd subscribers on
// every publish iteration (if the subscriber validates the state
// before attempting publishing the event). This means more mutex
// locks (due to the state check in the subscriber) - but with the
// advantage that publishes can occur in parallel and less
// subscriber list iterations.
//
void
IceStorm::TopicSubscribers::publish(const EventPtr& event)

And I was wondering what the priority of this optimization is.

Finally, I was wondering if it would be possible to get an estimate of the cost of the various operations in the IceStorm system - something like a rough O(n) estimate of the performance in terms of the number of topics, the number of subscribers, etc. I know this last bit is a bit vague, and it isn't a big priority; it would just be nice to have some idea about best practices, like when to split the load over several IceStorm services, how important it is to unsubscribe and clean up unused topics, etc.

Comments

  • matthew
    matthew NL, Canada
    The reap() is there to avoid callbacks. If you want to see why this is used, take a look at the article Michi wrote in the last Connections newsletter. Since topics are rarely retrieved (at least compared to the publishing and subscription of events), this shouldn't cause a problem.
    And I was wondering what the priority of this optimization is.

    Not particularly high. Have you got some evidence that your application is suffering because of this?
    Finally, I was wondering if it would be possible to get an estimate of the cost of the various operations in the IceStorm system - something like a rough O(n) estimate of the performance in terms of the number of topics, the number of subscribers, etc. I know this last bit is a bit vague, and it isn't a big priority; it would just be nice to have some idea about best practices, like when to split the load over several IceStorm services, how important it is to unsubscribe and clean up unused topics, etc.

    IceStorm is optimized to push around events. The other operations, such as subscription/unsubscription, topic retrieval and such, are not very optimized, simply because we believe that these operations shouldn't happen very often, and therefore time spent optimizing this code is time wasted.

    I'm not sure big-O estimates are very helpful. The cost of publishing an event is (as you might expect) O(n) in the number of directly subscribed subscribers on a topic. But this isn't something that really helps you.

    What I think is a more meaningful number is the maximum throughput of your IceStorm system. If you know the maximum throughput of event flow (for example, 10,000 events per second) and you know the number of publishers, their rate, and the number of subscribers, then you can calculate roughly how much load your service can handle. If it cannot handle the load, then you must split the service somehow. How you split depends on why the service is overloaded.
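
    (As a purely hypothetical illustration with made-up numbers: if 20 publishers each produce 50 events per second into topics that average 10 subscribers, the service must perform 20 x 50 x 10 = 10,000 event deliveries per second - right at the assumed maximum.)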

    For example, do you have too many topics on one IceStorm instance? Then split the topics across multiple servers.

    If you have too many events flowing through a single topic, then you can reduce the number of events that flow through the service (batching). Or you can fan out (if you have more subscribers than publishers) through federation.
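
    To illustrate the federation idea, a minimal sketch of linking one topic to a downstream topic via the Topic::link operation (the manager proxies and the topic name here are placeholders; a link cost of 0 forwards all events):

    // Hypothetical sketch: fan events out to a second IceStorm instance
    // by linking the local topic to the remote instance's copy of it.
    IceStorm::TopicPrx upstream = upstreamManager->retrieve("events");
    IceStorm::TopicPrx downstream = downstreamManager->retrieve("events");
    upstream->link(downstream, 0); // cost 0 forwards all events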
  • matthew wrote:
    The reap() is there to avoid callbacks. If you want to see why this is used, take a look at the article Michi wrote in the last Connections newsletter. Since topics are rarely retrieved (at least compared to the publishing and subscription of events), this shouldn't cause a problem.

    Well - we do actually retrieve topics rather often, as we also use the existence of a topic as a sort of indirect registration, so we should probably change this. Though I'm still not exactly clear on why it isn't simpler to clean up the topic map when a topic is destroyed rather than waiting until a create or retrieve.
    matthew wrote:
    Not particularly high. Have you got some evidence that your application is suffering because of this?
    Not yet. As I said, we plan to run some profiling today, so if it turns out to be an issue I'll get back to you. I mostly asked because I found the comment in the code.
    matthew wrote:
    IceStorm is optimized to push around events. The other operations, such as subscription/unsubscription, topic retrieval and such, are not very optimized, simply because we believe that these operations shouldn't happen very often, and therefore time spent optimizing this code is time wasted.

    I'm not sure big-O estimates are very helpful. The cost of publishing an event is (as you might expect) O(n) in the number of directly subscribed subscribers on a topic. But this isn't something that really helps you.

    True - however, knowing that retrieve was O(number of topics) and not O(log(number of topics)) was a bit of a surprise, for instance, and I was just wondering if there were other such surprises.
    matthew wrote:
    What I think is a more meaningful number is the maximum throughput of your IceStorm system. If you know the maximum throughput of event flow (for example, 10,000 events per second) and you know the number of publishers, their rate, and the number of subscribers, then you can calculate roughly how much load your service can handle. If it cannot handle the load, then you must split the service somehow. How you split depends on why the service is overloaded.

    For example, do you have too many topics on one IceStorm instance? Then split the topics across multiple servers.

    If you have too many events flowing through a single topic, then you can reduce the number of events that flow through the service (batching). Or you can fan out (if you have more subscribers than publishers) through federation.

    In that case the obvious question seems to be whether there is a simple way to measure the number of events IceStorm is handling. We can naturally do some packet tracing, but I was wondering if there were any built-in handles in IceStorm. I can naturally guesstimate something, but that doesn't teach me much more than 'Now it is clogged' vs. 'Now it isn't'. Tracing can also be a bit problematic, since the printing alone causes quite a bit of load when documenting 10,000 events per second.

    Some further questions about how to tweak performance: what would be the difference between two IceStorm services, each with X topics, running on one machine, and a single IceStorm service with 2X topics (and possibly twice the thread pool)? Basically - if I can't move the services to different machines, do I then gain anything except for more fine-grained diagnostics regarding the performance of the various topics?
  • As previously stated, we wanted to run some profiling. However, we have some problems tracking the actual behaviour of IceStorm, as it is simply a service running on an IceBox. So we can see where we spend time in IceBox, but not in any parts of IceStorm. We are using the GNU profiler - do you have any experience with this or other profiling tools with respect to IceBox services?
  • benoit
    benoit Rennes, France
    Hi Nis,

    You'll find below some answers to your two previous posts.
    In that case the obvious question seems to be whether there is a simple way to measure the number of events IceStorm is handling. We can naturally do some packet tracing, but I was wondering if there were any built-in handles in IceStorm. I can naturally guesstimate something, but that doesn't teach me much more than 'Now it is clogged' vs. 'Now it isn't'. Tracing can also be a bit problematic, since the printing alone causes quite a bit of load when documenting 10,000 events per second.

    IceStorm doesn't provide any statistical information about its throughput. This is perhaps something that could be added, but in my experience testing is the best way to figure out the capacity of a service. Knowing that IceStorm currently handles N events/s doesn't tell you how much it can handle :). And writing realistic performance tests to figure out the capacity is often quite difficult.
    Some further questions about how to tweak performance: what would be the difference between two IceStorm services, each with X topics, running on one machine, and a single IceStorm service with 2X topics (and possibly twice the thread pool)? Basically - if I can't move the services to different machines, do I then gain anything except for more fine-grained diagnostics regarding the performance of the various topics?

    As far as throughput is concerned, you shouldn't gain much by using two IceStorm services on the same machine instead of one. The only thing you'll gain is slightly faster IceStorm::TopicManager operations, but as Matthew mentioned, these operations aren't supposed to be performance critical unless you create a large number of topics.
    As previously stated, we wanted to run some profiling. However, we have some problems tracking the actual behaviour of IceStorm, as it is simply a service running on an IceBox. So we can see where we spend time in IceBox, but not in any parts of IceStorm. We are using the GNU profiler - do you have any experience with this or other profiling tools with respect to IceBox services?

    I'm afraid we don't have any experience with the GNU profiler. It sounds like it doesn't deal with shared libraries loaded dynamically with dlopen(). I guess you could try to link the IceBox executable with the IceStorm service and see if this helps (to link icebox with IceStorm, add "-lIceStorm" at line 60 of the src/IceBox/Makefile file and make sure to build both IceBox and IceStorm first).

    Cheers,
    Benoit.
  • benoit wrote:
    IceStorm doesn't provide any statistical information about its throughput. This is perhaps something that could be added, but in my experience testing is the best way to figure out the capacity of a service. Knowing that IceStorm currently handles N events/s doesn't tell you how much it can handle :). And writing realistic performance tests to figure out the capacity is often quite difficult.

    Quite true. I guess the only reason I was curious about those kinds of statistics is that it is always nice to have some way to check the guesstimates you've made regarding the number of events being generated.
    benoit wrote:
    As far as throughput is concerned, you shouldn't gain much by using two IceStorm services on the same machine instead of one. The only thing you'll gain is slightly faster IceStorm::TopicManager operations, but as Matthew mentioned, these operations aren't supposed to be performance critical unless you create a large number of topics.

    I suppose we are generating a very large number of topics, as we generate a topic per game object to broadcast game-object-specific changes from the server to the clients. This could easily run into thousands of topics. And as mentioned, we also make quite a few retrieval operations. This was nice from an OO point of view, but we may have to see about remodelling that. I'm also considering taking advantage of the open source aspect of Ice and rewriting the TopicManager slightly to improve the performance for our use cases.
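
    (For illustration, the kind of per-object lookup involved - the usual retrieve-or-create pattern, with a hypothetical topic name:)

    IceStorm::TopicPrx topic;
    try
    {
        topic = manager->retrieve("GameObject-42"); // hypothetical name
    }
    catch(const IceStorm::NoSuchTopic&)
    {
        // Not there yet, so create it. A concurrent creator could
        // still make this throw IceStorm::TopicExists.
        topic = manager->create("GameObject-42");
    }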
    benoit wrote:
    I'm afraid we don't have any experience with the GNU profiler. It sounds like it doesn't deal with shared libraries loaded dynamically with dlopen(). I guess you could try to link the IceBox executable with the IceStorm service and see if this helps (to link icebox with IceStorm, add "-lIceStorm" at line 60 of the src/IceBox/Makefile file and make sure to build both IceBox and IceStorm first).

    Fair enough - I didn't expect you to know gprof, but figured it wouldn't hurt to ask. We'll definitely try to see if changing the linking helps when we continue the profiling tomorrow.
  • matthew
    matthew NL, Canada
    Well - we do actually retrieve topics rather often, as we also use the existence of a topic as a sort of indirect registration, so we should probably change this. Though I'm still not exactly clear on why it isn't simpler to clean up the topic map when a topic is destroyed rather than waiting until a create or retrieve.

    It's simpler because if you clean up the topic map as soon as the topic is destroyed, you have to call back on the topic manager, which means you have a circular lock, which easily leads to deadlocks. By using reaping, the locking only goes in one direction, which means deadlocks are not possible. Of course, it's possible to write the callback in a deadlock-free manner, but it's more complex and less understandable to do so.
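
    In rough outline (a simplified sketch, not the actual IceStorm call graph):

    // Immediate-cleanup approach (hypothetical):
    //   TopicI::destroy()            locks the topic mutex,
    //     -> TopicManagerI::remove()   then wants the manager mutex.
    //   TopicManagerI::retrieve()    locks the manager mutex,
    //     -> TopicI::destroyed()       then wants the topic mutex.
    // Two threads can acquire the two mutexes in opposite orders and
    // deadlock. With reaping, calls only go from the manager into the
    // topics, so the locks are always taken in one order.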
  • matthew wrote:
    It's simpler because if you clean up the topic map as soon as the topic is destroyed, you have to call back on the topic manager, which means you have a circular lock, which easily leads to deadlocks. By using reaping, the locking only goes in one direction, which means deadlocks are not possible. Of course, it's possible to write the callback in a deadlock-free manner, but it's more complex and less understandable to do so.

    Well, we seem to have handled any problems we had with retrieve and create by doing lazy clean-up.

    However, with or without those changes, there seems to be a strange build-up in the IceStorm service. Over a period of 24 hours it seems to go from a nice 1% CPU load to using around 16% CPU time, with the same number of publishers and subscribers. The actual publishers and subscribers, and the topics they use, have changed during this period, and there may be some topics we have forgotten to clean up and some dead subscribers that were never correctly unsubscribed. But even if there are, no events are sent to those topics, so I don't understand why they would be generating any load.
  • benoit
    benoit Rennes, France
    Hi Nis,

    What do you mean by lazy clean-up?

    Not cleaning up topics could eventually end up increasing the memory usage of the IceStorm service, and a lot of topics will result in slow TopicManager operations (unless you have actually changed the reaping of the topics). I would recommend either ensuring that topics are always cleaned up, or periodically checking that the IceStorm topics are still being used by your application (you can retrieve all the registered topics with the TopicManager retrieveAll method). Another option would be to improve IceStorm and add some automatic reaping of unused topics (if, for example, a topic hasn't received any events in the last N seconds).
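
    For example, such a periodic check could look something like this sketch (isStillUsed() is a hypothetical application-level predicate):

    // Enumerate all registered topics and destroy the ones the
    // application no longer uses.
    IceStorm::TopicDict topics = manager->retrieveAll();
    for(IceStorm::TopicDict::const_iterator p = topics.begin(); p != topics.end(); ++p)
    {
        if(!isStillUsed(p->first))
        {
            p->second->destroy();
        }
    }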

    Dead subscribers could also eventually result in increased memory usage and CPU usage, depending on the circumstances. Dead subscribers are automatically reaped by the IceStorm service only if IceStorm detects that the subscriber has vanished, i.e., if publishing an event to the subscriber resulted in an exception. This exception might not happen, however, if you use oneways or batch oneways to publish the events to your subscribers and there are no communication failures when sending the events (this is the case if, for example, IceStorm doesn't directly connect to the subscribers but instead goes through Glacier2...).

    Again, you should make sure to clean up subscribers (if you use Glacier2, this should be easy to do in the session destruction method). Another possibility could be to use twoway event delivery instead of oneway or batch oneway. The IceStorm service would eventually get exceptions from the subscriber if it goes through Glacier2: if the subscriber's Glacier2 session is dead, the IceStorm service will get an Ice::ObjectNotExistException when trying to publish an event and will reap the subscriber automatically.
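
    In code, forcing twoway delivery is a matter of the proxy passed at subscription time - a minimal sketch, assuming a version where the delivery mode follows the subscriber proxy's mode (qos and subscriber are placeholders):

    // Subscribe with a twoway proxy so that delivery failures (such as
    // Ice::ObjectNotExistException via Glacier2) surface in IceStorm
    // and trigger automatic reaping of the subscriber.
    IceStorm::QoS qos;
    topic->subscribe(qos, subscriber->ice_twoway());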

    If you're using batch oneway subscribers, did you try to enable tracing with the IceStorm.Trace.Flush property to see how many subscribers the flusher thread periodically flushes?

    Could you also remind us how exactly you use the IceStorm service (if you use oneway/batch oneway/twoway subscribers, if you use federation or not, if all the subscribers are Glacier2 clients, etc)?

    Cheers,
    Benoit.
  • benoit wrote:
    Hi Nis,
    What do you mean by lazy clean-up?

    Basically, the reap operations in retrieve and create have been removed. To clean up, we then do two things. First of all, we look at the topic we find in retrieve/create; if it is destroyed, we clean it up and then behave as if we didn't find it. Secondly, we do some periodic reaping to try to clean up untouched topics. This significantly improved the performance of the TopicManager operations, at least during our short-term test.
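
    In outline, the modified retrieve now does something like the following (a simplified sketch of the change described above; reapTopic() is a hypothetical helper that erases a single map entry):

    TopicPrx
    TopicManagerI::retrieve(const string& name, const Ice::Current&) const
    {
        IceUtil::Mutex::Lock sync(*this);

        TopicManagerI* This = const_cast<TopicManagerI*>(this);
        TopicIMap::iterator i = This->_topicIMap.find(name);
        if(i != This->_topicIMap.end())
        {
            if(!i->second->destroyed())
            {
                Ice::Identity id;
                id.name = name;
                return TopicPrx::uncheckedCast(_topicAdapter->createProxy(id));
            }
            This->reapTopic(i); // clean up just this entry, then fall through
        }

        NoSuchTopic ex;
        ex.name = name;
        throw ex;
    }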
    benoit wrote:
    Not cleaning up topics could eventually end up increasing the memory usage of the IceStorm service, and a lot of topics will result in slow TopicManager operations (unless you have actually changed the reaping of the topics). I would recommend either ensuring that topics are always cleaned up, or periodically checking that the IceStorm topics are still being used by your application (you can retrieve all the registered topics with the TopicManager retrieveAll method). Another option would be to improve IceStorm and add some automatic reaping of unused topics (if, for example, a topic hasn't received any events in the last N seconds).

    I'm already on the hunt for leakage of topics and subscribers, and have found some stuff.
    benoit wrote:
    Could you also remind us how exactly you use the IceStorm service (if you use oneway/batch oneway/twoway subscribers, if you use federation or not, if all the subscribers are Glacier2 clients, etc)?

    All our IceStorm communication is oneway, and everything going through Glacier2 is batched. Only communication out to the game clients goes through Glacier2; communication among the various server-side processes is direct.

    I think the problem might very well be lingering subscribers being kept alive because Glacier2 insulates the IceStorm service from communication exceptions, so I'll try to clean that up and see if that solves the problem.