
IceStorm wait

I have a few components connected with IceStorm. All components have ACM disabled (both server and client). Server and client thread pools are set to 10. I'm using oneway calls. The transport is TCP.
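
For reference, this is roughly how each component is configured (a minimal sketch in Java; the class name is made up, it assumes the Ice 3.1 InitializationData API, and the same property values can just as well go in each component's configuration file):

    // InitExample.java -- illustrative only; shows the ACM and thread pool
    // settings described above being set programmatically.
    public class InitExample {
        public static void main(String[] args) {
            Ice.InitializationData initData = new Ice.InitializationData();
            initData.properties = Ice.Util.createProperties();
            initData.properties.setProperty("Ice.ACM.Client", "0");   // disable client-side ACM
            initData.properties.setProperty("Ice.ACM.Server", "0");   // disable server-side ACM
            initData.properties.setProperty("Ice.ThreadPool.Client.Size", "10");
            initData.properties.setProperty("Ice.ThreadPool.Server.Size", "10");
            Ice.Communicator ic = Ice.Util.initialize(args, initData);
            // ... create object adapters, proxies, etc. ...
            ic.destroy();
        }
    }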

In some cases the test runs without a problem, but sometimes the oneway calls do not return for about 20 seconds(!), then suddenly everything starts working again. It looks like a deadlock that is released by some timeout after 20 seconds. In what situations can a oneway call block for 20 seconds?
Thank you!

Comments

  • benoit (Rennes, France)
    Hi,

    See this FAQ for the reasons why a oneway call can block. Are you sending many events to the IceStorm service? Do all the subscribers receive the events even after the hang?

    You might want to enable subscriber tracing by setting the IceStorm.Trace.Subscriber property to 1. This might show some problems while trying to communicate with some subscribers.

    Cheers,
    Benoit.
  • The traffic is not heavy; I have around 40 messages per second and each message is only a few bytes.

    When the hang happens it seems all processes are hung in a oneway call. It lasts for around 20 seconds, then the calls complete successfully. There are no lost messages. I set IceStorm.Trace.Subscriber to 1 but I don't see any errors there. The components are running on three PCs connected through a dedicated hub; no other traffic goes through this network.
    Thank you!
  • benoit (Rennes, France)
    How many publishers and subscribers do you have? What is the thread pool configuration of your subscribers? I assume that the oneway calls which are blocking are the publisher calls to the IceStorm service, correct? Do the hangs perhaps happen when a new subscriber subscribes to the topic?

    Also, does it take time for the subscribers to dispatch the events? If the subscribers do some long processing for each event, it could be that they can't keep up with receiving the updates, and this eventually causes the publishers to block.

    To make sure the subscribers aren't responsible for the hangs, you could add some debugging code in the subscriber servants to ensure that the events are dispatched in a timely manner (i.e., print a warning if it takes longer than N milliseconds or seconds for a servant method to execute), along the lines of the sketch below.
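
    Something like this, purely as an illustration (it assumes a Slice interface with a single operation, here called tick, compiled with slice2java; the names are placeholders for whatever your subscribers actually implement):

        // Illustrative subscriber servant with timing instrumentation.
        class TimedSubscriberI extends _ClockDisp {
            public void tick(String time, Ice.Current current) {
                long start = System.currentTimeMillis();

                // ... normal event processing goes here ...

                long elapsed = System.currentTimeMillis() - start;
                if (elapsed > 100) {   // warn if dispatch took longer than 100 ms
                    System.err.println("tick() took " + elapsed + " ms");
                }
            }
        }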

    You could also try to enable network and retry tracing on the IceStorm service to see if there's suspicious network activity. You can enable this tracing with the properties Ice.Trace.Network and Ice.Trace.Retry set to 1.

    Btw, could you also specify which Ice version, OS and compiler you're using?

    Cheers,
    Benoit.
  • I'm using Ice 3.1.0 on Windows XP Pro. I have five components: three Java (JDK 1.5.0_08) and two C++ (VC++ .NET 2003). Two of the components are publishers and subscribers at the same time (but on different topics). All components set the client and server thread pool size to 10. It runs on three machines; if I remove one machine and the component it runs, the problem doesn't happen. That component is a subscriber.

    >Do the hangs perhaps happen when a new subscriber subscribes to the topic?

    All the subscriptions are done before any traffic. Then one message is sent which starts up the test. I've debugged the subscriber servants; they don't seem to do any long operations.

    Here is the log of the initialization phase. I put some marks between <>.

    Thank you for your help!

    [ icebox-IceStorm: Network: accepting tcp connections at 129.59.129.195:10000 ]
    [ icebox-IceStorm: Network: accepting tcp connections at 127.0.0.1:10000 ]
    [ icebox-IceStorm: Network: accepting tcp connections at 129.59.129.195:4657 ]
    [ icebox-IceStorm: Network: accepting tcp connections at 127.0.0.1:4667 ]
    [ icebox-IceStorm: TopicManager: loading topic "sim" from database ]
    [ icebox-IceStorm: TopicManager: loading topic "uav1.model" from database ]
    [ icebox-IceStorm: TopicManager: loading topic "uav2.model" from database ]
    [ icebox-IceStorm: TopicManager: loading topic "uav3.model" from database ]
    [ icebox-IceStorm: TopicManager: loading topic "uav1.sensor" from database ]
    [ icebox-IceStorm: TopicManager: loading topic "uav2.sensor" from database ]
    [ icebox-IceStorm: TopicManager: loading topic "uav1.control" from database ]
    [ icebox-IceStorm: TopicManager: loading topic "uav1.sensor2" from database ]
    [ icebox-IceStorm: TopicManager: loading topic "uav2.control" from database ]
    [ icebox-IceStorm: TopicManager: loading topic "uav2.sensor2" from database ]
    [ icebox-IceStorm: TopicManager: loading topic "uav3.control" from database ]
    [ icebox-IceStorm: Network: accepted tcp connection
    local address = 129.59.129.195:10000
    remote address = 129.59.129.195:4645 ]
    [ icebox-IceStorm: Network: accepted tcp connection
    local address = 129.59.129.195:10000
    remote address = 129.59.129.195:4652 ]
    [ icebox-IceStorm: Network: accepted tcp connection
    local address = 129.59.129.195:10000
    remote address = 129.59.129.195:4660 ]
    [ icebox-IceStorm: Topic: Subscribe: 81:3b:81:c3:-7fcbe070:10e9fe00909:-7fff ]
    [ icebox-IceStorm: Topic: Subscribe: 81:3b:81:c3:-7fcbe070:10e9fe00909:-8000 ]
    [ icebox-IceStorm: Topic: Subscribe: 81:3b:81:c3:-7fcbe070:10e9fe00909:-7fff ]
    [ icebox-IceStorm: Topic: Subscribe: 81:3b:81:c3:4b50f131:10e9fe00a03:-8000 ]
    [ icebox-IceStorm: Network: accepted tcp connection
    local address = 129.59.129.195:10000
    remote address = 129.59.129.147:2723 ]
    [ icebox-IceStorm: Topic: Subscribe: a590b83a-9442-4c29-8bc5-ca384339e854 ]
    [ icebox-IceStorm: Topic: Subscribe: 627acd00-25cf-47aa-abe7-2346053afddf ]
    [ icebox-IceStorm: Topic: Subscribe: 627acd00-25cf-47aa-abe7-2346053afddf ]
    [ icebox-IceStorm: Topic: Subscribe: 627acd00-25cf-47aa-abe7-2346053afddf ]
    [ icebox-IceStorm: Network: accepted tcp connection
    local address = 129.59.129.195:10000
    remote address = 129.59.129.213:2848 ]
    [ icebox-IceStorm: Topic: Subscribe: eca12207-925a-44b9-b7d7-97b945f371c3 ]
    [ icebox-IceStorm: Topic: Subscribe: d510df2a-3718-4b54-bec4-85545c2424bc ]
    [ icebox-IceStorm: Topic: Subscribe: d510df2a-3718-4b54-bec4-85545c2424bc ]
    [ icebox-IceStorm: Topic: Subscribe: d510df2a-3718-4b54-bec4-85545c2424bc ]
    [ icebox-IceStorm: Topic: Subscribe: 53c2c901-af32-4c98-b503-ba697b20736e ]

    <ALL SUBSCRIPTION ARE DONE, NO TRAFFIC SO FAR>

    [ icebox-IceStorm: Network: accepted tcp connection
    local address = 129.59.129.195:4657
    remote address = 129.59.129.195:4684 ]
    [ icebox-IceStorm: Network: tcp connection established
    local address = 129.59.129.195:4687
    remote address = 129.59.129.195:4673 ]
    [ icebox-IceStorm: Network: tcp connection established
    local address = 129.59.129.195:4690
    remote address = 129.59.129.195:4675 ]
    [ icebox-IceStorm: Network: tcp connection established
    local address = 129.59.129.195:4691
    remote address = 129.59.129.147:2726 ]
    [ icebox-IceStorm: Network: tcp connection established

    <20 SEC DELAY HERE SOMETIMES!!!>

    local address = 129.59.129.195:4692
    remote address = 129.59.129.213:2852 ]
    [ icebox-IceStorm: Network: accepted tcp connection
    local address = 129.59.129.195:4657
    remote address = 129.59.129.195:4693 ]
    [ icebox-IceStorm: Network: accepted tcp connection
    local address = 129.59.129.195:4657
    remote address = 129.59.129.147:2730 ]
    [ icebox-IceStorm: Network: accepted tcp connection
    local address = 129.59.129.195:4657
    remote address = 129.59.129.195:4696 ]
  • It turned out that the problem comes up only if one specific machine participates in the simulation. Now I can reproduce the behaviour with only two machines and two components: one publisher and one subscriber.

    The subscriber does the subscribing, then calls ic.waitForShutdown(). Then the publisher sends a message. Sometimes it gets there with no delay; sometimes the 20-second delay comes up. I suspect some lower-level problem here. The network configuration seems to be OK.

    I used that machine with Ice before with no problem. It seems the problem is IceStorm specific. I'll continue investigating and let you know what turns out.
    Thanks
  • I ended up with a very simple IceStorm sample; both the publisher and subscriber are in Java (roughly like the sketch below). The publisher blocks on the remote call while the subscriber sits in ic.waitForShutdown(). After 20 seconds it goes on normally. I tried both oneway and twoway calls and set the reliability QoS map. Maybe I'm missing some configuration settings. Simple Ice remote calls work fine on the same two machines; only IceStorm hangs. Where does that 20-second timeout come from? The TCP stack or something lower level? I'm running an infinite ping and it's always under 1 millisecond, even when my test is in the hung state. Where should I put diagnostic code to figure this out?
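
    The sample is roughly this (a simplified sketch, not the exact code; it assumes a Slice interface Clock with a single operation tick(string) compiled with slice2java, and the TopicManager endpoint from the log above):

        // Publisher.java -- blocks on the tick() call when the hang occurs.
        public class Publisher {
            public static void main(String[] args) {
                Ice.Communicator ic = null;
                try {
                    ic = Ice.Util.initialize(args);
                    IceStorm.TopicManagerPrx manager = IceStorm.TopicManagerPrxHelper.checkedCast(
                        ic.stringToProxy("IceStorm/TopicManager:tcp -h 129.59.129.195 -p 10000"));
                    IceStorm.TopicPrx topic = manager.retrieve("sim");
                    ClockPrx clock = ClockPrxHelper.uncheckedCast(topic.getPublisher().ice_oneway());
                    clock.tick("hello");   // this call sometimes blocks for ~20 seconds
                } catch (Exception ex) {
                    ex.printStackTrace();
                } finally {
                    if (ic != null) ic.destroy();
                }
            }
        }

        // Subscriber.java -- subscribes and then waits in waitForShutdown().
        public class Subscriber {
            public static void main(String[] args) {
                Ice.Communicator ic = null;
                try {
                    ic = Ice.Util.initialize(args);
                    IceStorm.TopicManagerPrx manager = IceStorm.TopicManagerPrxHelper.checkedCast(
                        ic.stringToProxy("IceStorm/TopicManager:tcp -h 129.59.129.195 -p 10000"));
                    IceStorm.TopicPrx topic = manager.retrieve("sim");
                    Ice.ObjectAdapter adapter = ic.createObjectAdapterWithEndpoints("ClockSubscriber", "tcp");
                    Ice.ObjectPrx proxy = adapter.addWithUUID(new ClockI());
                    adapter.activate();
                    topic.subscribe(new java.util.HashMap(), proxy);   // empty QoS map
                    ic.waitForShutdown();
                    topic.unsubscribe(proxy);
                } catch (Exception ex) {
                    ex.printStackTrace();
                } finally {
                    if (ic != null) ic.destroy();
                }
            }
        }

        // ClockI.java -- trivial servant, just prints the received event.
        class ClockI extends _ClockDisp {
            public void tick(String time, Ice.Current current) {
                System.out.println("tick: " + time);
            }
        }
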
    Thank you,
  • matthew (NL, Canada)
    Is the 20-second delay in IceStorm itself? What is it doing during the delay? If the problem is with IceStorm, the easiest thing would be to attach a debugger and get a stack trace during the hang.
  • benoit (Rennes, France)
    Hi,

    You could also perhaps try with a C++ subscriber instead of a Java subscriber. If the Java subscriber produces a lot of garbage, it's possible that the Java GC locks up the JVM for an extended period of time (you could run the JVM with -verbose:gc as well to see if the GC might be responsible).

    Cheers,
    Benoit.