IceStorm hangs

I've seen a number of other posts in the forum with similar issues, but I haven't found any with a resolution, so I wanted to bring up the issue I'm seeing here in the hope that someone has suggestions, or that this thread may help someone else in the future.

We have an IceStorm server that is handling a small to medium amount of traffic. The server has 8 or so topics, and about 15 different systems publish to the various topics once a second or so. Some of the topics get published to every second, others get published to haphazardly. This is just to give a sense of the volume of traffic.

All publishers are connected via oneway proxies.

Now, the subscribers will subscribe to the topics of interest. They are all subscribed as oneway batch (with flushing set to the one second default), so every second we see the batch of all of the published messages for that topic come in. This works very well.

However, somewhat randomly, the whole thing just freezes. That is, the subscribers stop receiving messages for anywhere from 15-30 seconds. Then, all at once, all of the messages come in again in a very large batch. We don't end up losing any of the messages (a timestamp is sent as part of the message, so we know if there are any that are missing). But the bulk just all comes at once.
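The timestamp check described above can be sketched roughly as follows. This is only an illustration, not the code we actually run: the `(sent, received)` tuple layout, the function names `find_stalls` and `find_missing`, and the thresholds are all assumptions made for the example.

```python
from typing import List, Tuple

Event = Tuple[float, float]  # (sent timestamp, received timestamp), in seconds


def find_stalls(events: List[Event], threshold: float = 5.0) -> List[Tuple[float, float]]:
    """Return (start, end) pairs where nothing was received for more than `threshold` seconds."""
    stalls = []
    for (_, prev_recv), (_, recv) in zip(events, events[1:]):
        if recv - prev_recv > threshold:
            stalls.append((prev_recv, recv))
    return stalls


def find_missing(events: List[Event], period: float = 1.0) -> List[Tuple[float, float]]:
    """Return pairs of consecutive sent timestamps that are more than 1.5 periods apart."""
    sent = sorted(s for s, _ in events)
    return [(a, b) for a, b in zip(sent, sent[1:]) if b - a > 1.5 * period]


# Hypothetical data: one message per second, but everything sent after t=2
# arrives together at t=20, like the large batch we see after a freeze.
events = [(0.0, 1.0), (1.0, 2.0), (2.0, 20.0), (3.0, 20.0), (4.0, 20.0), (5.0, 20.0)]

stalls = find_stalls(events)
missing = find_missing(events)
worst_latency = max(recv - sent for sent, recv in events)
print(stalls, missing, worst_latency)
```

A result with one stall and no missing sent timestamps matches what we observe: delivery is delayed, but nothing is dropped.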

I suspected threading issues, though I don't understand why it would happen if everything was oneway. My subscriber pool timeout is 2000 and my send timeout is 2500, so even if there was a hangup I would expect it to resolve faster than that.

I increased my subscriber pool size to 3 threads (I also had it at 5 at one point). This morning I increased my Ice.ThreadPool.Server.Size to 3 to see if that helps.

I don't know if queueing is happening on the publisher side or subscriber side. Since we are batching, I would expect that the publisher side of IceStorm would take the messages and keep them queued internally, releasing the publishers for the next round of messages. But then I don't know why it would cause the hang, since I have multiple threads in the subscriber pool. ALL subscribers hang at exactly the same time and for the same amount of time.

I suppose that one way to check would be to change a publisher to twoway, so that if it was getting hung up it would throw an exception instead of just keeping the message in the transport buffer. I might try this today.

I turned on all IceStorm related tracing this morning, but haven't seen anything in the err logs related to the traces, so I don't know if I just haven't had an event that did a trace or not.

Any thoughts from the peanut gallery are welcome. If nothing else, I can just create more IceStorm servers for each topic - it just seems like the wrong solution to me though.

Thanks,
Caleb

Comments

  • benoit (Rennes, France)
    Hi Caleb,

    In Ice 3.2.1, the subscriber pool is not used to flush the oneway batch messages. Instead, IceStorm uses a dedicated thread that calls flushBatchRequests on each subscriber connection at regular time intervals. If one connection hangs, event delivery is delayed for all the other batch oneway subscribers. If you use plain (non-batch) oneway subscribers, you won't see these delays: if a subscriber connection blocks, the subscriber pool will use another thread to continue delivering events to other subscribers.

    You should also consider switching to IceStorm from Ice 3.3b, which now relies on Ice's new background I/O feature. It doesn't have this limitation for batch oneway subscribers: a misbehaving batch oneway subscriber won't prevent other batch oneway subscribers from receiving events.

    Cheers,
    Benoit.
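The effect of a single flush thread can be illustrated with a small stand-alone simulation. It is a sketch under stated assumptions, not IceStorm code: `flush` is a hypothetical stand-in for flushing one subscriber connection, the sleep simulates a stalled peer, and the thread-per-subscriber variant only stands in for "subscribers do not wait on each other" (Ice 3.3 achieves that with background I/O, not with one thread per subscriber).

```python
import time
from concurrent.futures import ThreadPoolExecutor


def flush(block_for: float) -> None:
    """Hypothetical stand-in for flushing one connection; sleeps to simulate a stalled peer."""
    time.sleep(block_for)


def sequential_flush(subscribers):
    """One dedicated thread flushes every connection in turn (the 3.2.1 behaviour described above)."""
    start = time.monotonic()
    done = {}
    for name, block_for in subscribers:
        flush(block_for)
        done[name] = time.monotonic() - start
    return done


def independent_flush(subscribers):
    """Each connection is flushed without waiting for the others."""
    start = time.monotonic()
    done = {}

    def run(name, block_for):
        flush(block_for)
        done[name] = time.monotonic() - start

    with ThreadPoolExecutor(max_workers=len(subscribers)) as pool:
        for name, block_for in subscribers:
            pool.submit(run, name, block_for)
    return done


# One subscriber stalls for half a second; the other two are healthy.
SUBSCRIBERS = [("slow", 0.5), ("fast-1", 0.0), ("fast-2", 0.0)]
seq = sequential_flush(SUBSCRIBERS)
ind = independent_flush(SUBSCRIBERS)
print("sequential:", seq)
print("independent:", ind)
```

In the sequential run the healthy subscribers finish only after the slow one, which is the "everyone freezes, then everything arrives at once" pattern from the original post.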
  • Thanks Benoit,

    I definitely want to switch to Ice 3.3b, but this machine runs a number of Ice servers and doing the upgrade will require some facility downtime, so it's being pushed back for a little bit.

    As an aside, a few days ago our subscribers were plain (unbatched) oneways and we had the same issue, which is why I switched to batched.

    If I have non batched subscribers, do the publishers immediately return when a message is sent, or do they wait until all subscribers successfully/unsuccessfully receive the message? I'm wondering if our hangups before weren't on the publisher side.
  • benoit (Rennes, France)
    ctennis wrote: »
    If I have non batched subscribers, do the publishers immediately return when a message is sent, or do they wait until all subscribers successfully/unsuccessfully receive the message? I'm wondering if our hangups before weren't on the publisher side.

    If you use a twoway publisher proxy, the twoway call returns once IceStorm accepted and queued the event with all the subscribers. It does not wait for the queued events to be sent.

    It's not clear to me why you get such long hangs with oneway subscribers. If a subscriber pool thread is blocked on sending, then once IceStorm.SubscriberPool.Timeout expires the pool will use another thread to continue event delivery to other subscribers.

    Cheers,
    Benoit.
  • Thanks Benoit

    This is what I thought too. My only other guess is that the hangups are happening on the publisher side, since the server pool was previously running on a single thread. Since all publishers were oneway, they weren't seeing any problem, because eventually the queue cleared and all messages were sent.

    I added a few more threads to this pool to see if that fixes it, and I have turned on subscriber pool tracing so that, if we see the issue again, it might give some insight into what is going on. I have also changed one of our publishers to use a twoway proxy so that, if there is a hang on the publishing side, it might hit a timeout and show some exception.

    Caleb
  • I wanted to reply to this as I think we fixed the issue, and it might be helpful to others in the future.

    There were a number of issues. First, my changes to the IceStorm properties never took effect as I thought they had. I found this out when I added tracing and, a day later, nothing had shown up in the logs. This is because you have to use IceBox.InheritProperties in order for these to propagate through (I was using IceGridGUI via an IceStorm template). Fixing this allowed my subscriber pool thread size to actually change.

    Also, the subscribers weren't multithreaded on the Ice side. The person who wrote that software made the application multithreaded but did not understand the underlying Ice threading. The subscribers were making calls to other Ice related processes that were long running and would sometimes "hang" while waiting for a response. While the rest of the program continued to run fine, no subscriber information from IceStorm would come in during that time.

    And of course, because of this, when my subscriber pool was originally single threaded this would block all other subscribers from getting their messages as well.

    I believe this is now all cleared up and running well.
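For anyone who wants to see the subscriber-side effect in isolation, here is a minimal stand-alone sketch. It does not use Ice at all: a Python `ThreadPoolExecutor` stands in for the subscriber's dispatch thread pool, a sleep stands in for the long-running call, and `event_delay` is a hypothetical helper written for this example. It assumes the long call occupies a thread from the same pool that dispatches incoming events, which is how I understand our problem.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def event_delay(pool_size: int, blocking_call: float = 0.5) -> float:
    """Return how long an incoming event waits before a pool thread picks it up."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # A long-running call occupies one thread of the pool.
        pool.submit(time.sleep, blocking_call)
        queued_at = time.monotonic()
        # The "event handler" just records the moment it actually starts running.
        started_at = pool.submit(time.monotonic).result()
        return started_at - queued_at


single = event_delay(pool_size=1)
multi = event_delay(pool_size=2)
print(f"pool of 1: event waited {single:.2f}s")
print(f"pool of 2: event waited {multi:.2f}s")
```

With a pool of one, the event waits for the whole blocking call; with a second thread it is dispatched almost immediately, which is the difference we saw after raising the thread pool size.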