Ice Registry blocking on restart

Hi,

To test the robustness of an IceGrid-based application, I tried shutting down and restarting the IceGrid registry service several times during client requests and ran into the following problem:
After restarting the IceGrid registry, the process blocked and printed the following output:
icegridnode: warning: thread pool `IceGrid.Registry.Client.ThreadPool'
is running low on threads
Size=70, SizeMax=70, SizeWarn=56

As you can see, I already increased the maximum number of client threads, but it had no effect.

Any ideas, or could this be a bug?

I'm using Ice 3.0.0 on SUSE Linux 8.1.

Grid configuration:

Node 1 + Registry:
IceGrid.InstanceName=DemoIceGrid
Ice.Default.Locator=DemoIceGrid/Locator:default -p 12000
IceGrid.Registry.Client.Endpoints=default -p 12000
IceGrid.Registry.Server.Endpoints=default
IceGrid.Registry.Internal.Endpoints=default
IceGrid.Registry.Admin.Endpoints=default
IceGrid.Registry.Data=db/registry
IceGrid.Registry.Client.ThreadPool.Size=70
IceGrid.Node.Name=fmqa07.vdlan......
IceGrid.Node.Endpoints=default
IceGrid.Node.Data=db/node
IceGrid.Node.CollocateRegistry=1
IceGrid.Node.Trace.Activator=1
IceGrid.Node.Trace.Patch=1

Node 2:
IceGrid.InstanceName=DemoIceGrid
Ice.Default.Locator=DemoIceGrid/Locator:tcp -h fmqa07.vdlan..... -p 12000
IceGrid.Registry.Client.Endpoints=tcp -h fmqa07.vdlan...... -p 12000
IceGrid.Node.Name=fmqa04.vdlan......
IceGrid.Node.Endpoints=default
IceGrid.Node.Data=db/node
IceGrid.Node.CollocateRegistry=0
IceGrid.Node.Trace.Activator=1
IceGrid.Node.Trace.Patch=1

Best Regards
Steffen
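
The numbers in the warning from the original post can be related to the configuration with a little arithmetic. The sketch below is only an illustration, not Ice code: it assumes that SizeMax falls back to Size when it is not set, that SizeWarn falls back to 80% of SizeMax, and that Size defaults to 1. Those fallbacks are assumptions chosen because they reproduce the reported values; the Ice manual for the release in use is the authority. `effective_thread_pool` is a hypothetical helper name.

```python
# Sketch: derive the thread pool values reported in the warning from the
# configured properties.
# ASSUMPTIONS (not taken from the thread): Size defaults to 1, SizeMax
# defaults to Size, SizeWarn defaults to 80% of SizeMax.

def effective_thread_pool(props, prefix):
    """Return (size, size_max, size_warn) for the thread pool named by prefix."""
    size = int(props.get(prefix + ".Size", 1))
    size_max = int(props.get(prefix + ".SizeMax", size))
    size_warn = int(props.get(prefix + ".SizeWarn", size_max * 80 // 100))
    return size, size_max, size_warn


# The only thread pool property set in the posted registry configuration.
props = {"IceGrid.Registry.Client.ThreadPool.Size": "70"}
print(effective_thread_pool(props, "IceGrid.Registry.Client.ThreadPool"))
```

Under these assumptions the result matches the warning (Size=70, SizeMax=70, SizeWarn=56). If the threads are blocked rather than busy, raising Size may only move the threshold without removing the hang, which would be consistent with what the post reports.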

Comments

  • benoit (Rennes, France)
    Hi Steffen,

    We're not aware of any problems with the icegridregistry. This could be a bug, but I'm afraid it's difficult to say for sure without more information. If the registry hangs, could you attach to the process with a debugger and get a dump of all the threads? (You should try with the default thread pool configuration, otherwise you might see lots of threads.) With gdb, you can easily get a dump of all the threads with the command "thread apply all bt". You can save the thread dump by running gdb inside a "script" session (see man script).

    Could you also perhaps provide a small test case that reproduces the problem?

    Cheers,
    Benoit.
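
As an alternative to the "script" session suggested in the comment above, the thread dump can also be captured non-interactively. The following is a minimal sketch, assuming a gdb that supports the `-p`, `-batch` and `-ex` options and a user who is allowed to attach to the process. `gdb_thread_dump_command` and `dump_threads` are hypothetical helper names, and the output path is whatever the caller chooses.

```python
import subprocess


def gdb_thread_dump_command(pid):
    """Build a gdb command line that prints a backtrace of every thread of pid."""
    # "thread apply all bt" is the gdb command that runs "bt" in each thread.
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_threads(pid, out_path):
    """Run gdb against pid and save its output to out_path.

    Requires gdb to be installed and permission to attach to the process.
    Returns gdb's exit code.
    """
    with open(out_path, "w") as out:
        result = subprocess.run(
            gdb_thread_dump_command(pid),
            stdout=out,
            stderr=subprocess.STDOUT,
            check=False,
        )
    return result.returncode
```

A call such as `dump_threads(registry_pid, "threads.txt")`, where `registry_pid` is the process ID of the hung registry, would leave the dump in a file that can be attached to a post.
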
  • Hi Benoit,
    benoit wrote:
    If the registry hangs, could you attach to the process with a debugger and get a dump of all the threads? (You should try with the default thread pool configuration, otherwise you might see lots of threads.) With gdb, you can easily get a dump of all the threads with the command "thread apply all bt". You can save the thread dump by running gdb inside a "script" session (see man script).

    Could you also perhaps provide a small test case that reproduces the problem?

    Thank you for the quick answer.

    I just reproduced the error with Ice 3.0.1 and the IceGrid "simple" demo; the thread dump is attached.

    The modified descriptor of the simple application looks as follows (adaptive load balancing instead of round robin):
    <icegrid>
      <application name="Simple">
        <server-template id="SimpleServer">
          <parameter name="index"/>
          <server id="SimpleServer-${index}" exe="./server" activation="on-demand">
            <adapter name="Hello" endpoints="tcp" register-process="true" replica-group="ReplicatedHelloAdapter"/>
            <property name="Identity" value="hello"/>
          </server>
        </server-template>
    
        <replica-group id="ReplicatedHelloAdapter">
          <load-balancing type="adaptive"/>
          <object identity="hello" type="::Demo::Hello"/>
        </replica-group>
    
        <node name="host1...">
          <server-instance template="SimpleServer" index="1"/>
        </node>
        <node name="host2...">
          <server-instance template="SimpleServer" index="2"/>
        </node>
      </application>
    </icegrid>
    

    In my implementation, the "simple" demo client calls sayHello() in a loop instead of prompting for each request as the original sample code does.

    I noticed that this bug only occurs when there are concurrent client requests, so I started several client instances.
    To reproduce the error, I needed to shut down and restart the registry several times while the clients were running.

    Cheers
    Steffen
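
The load pattern described in the comment above (several clients, each calling sayHello() in a loop) can be sketched roughly as follows. This is not the modified demo client from the thread: it uses threads in one process instead of several client processes, and `call` is a hypothetical stand-in for the proxy invocation. Failures are counted instead of aborting the loop, so the load continues while the registry is down.

```python
import threading


def run_clients(call, num_clients, requests_per_client):
    """Invoke call() concurrently from num_clients threads.

    Each thread makes requests_per_client calls. Exceptions are counted
    rather than propagated, so one failed request does not stop the load.
    Returns a tuple (succeeded, failed).
    """
    counts = {"ok": 0, "failed": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(requests_per_client):
            try:
                call()
                key = "ok"
            except Exception:
                key = "failed"
            with lock:
                counts[key] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts["ok"], counts["failed"]
```

In a real test, `call` would wrap the sayHello() invocation on the replicated "hello" object; that wiring is left out here because it depends on the Ice installation.
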
  • benoit (Rennes, France)
    Hi Steffen,

    Thanks for the thread dump! It looks like a bug; we'll look into it and report back to you as soon as we have more information!

    Cheers,
    Benoit.
  • benoit (Rennes, France)
    Hi Steffen,

    I've posted a patch for this problem here (thread 2213). I tried shutting down/restarting the registry in the scenario you described and everything ran fine. Please let me know if you encounter any other issues! And thanks again for the bug report!

    Cheers,
    Benoit.
  • Hi Benoit,
    benoit wrote:
    I've posted a patch for this problem here (thread 2213). I tried shutting down/restarting the registry in the scenario you described and everything ran fine. Please let me know if you encounter any other issues! And thanks again for the bug report!

    Thanks for your patch!
    I can't reproduce this problem any more.

    However, I ran into another problem.
    When I restarted the IceGrid registry after several rounds of shutting down and restarting (the scenario described above), it crashed with a segmentation fault; the core dump backtrace is attached.

    Cheers
    Steffen
  • benoit (Rennes, France)
    Hi Steffen,

    I'm afraid I'm not able to reproduce this crash with our mainline on Fedora Core 4 using the GCC 4.0.2 compiler. I'll try with Ice 3.0.1, but I doubt this will make a difference, as there were few changes to the code that appears in the stack trace.

    Which compiler do you use on your Suse platform?

    Looking at the stack trace you sent, the addresses of the arguments passed to IceGrid::ServerLoadCI::operator() (frame #0) are a bit suspicious. It looks like the "rhs" argument address is the address of the iterator rather than the address of the value pointed to by the iterator. If you're using an old compiler, could you try with a more recent compiler and see if you can still reproduce the issue?

    Thanks,

    Cheers,
    Benoit.
  • Hi Benoit,
    benoit wrote:
    I'm afraid I'm not able to reproduce this crash with our mainline on Fedora Core 4 using the GCC 4.0.2 compiler. I'll try with Ice 3.0.1, but I doubt this will make a difference, as there were few changes to the code that appears in the stack trace.

    Which compiler do you use on your Suse platform?

    I'm using gcc 3.2.
    benoit wrote:
    Looking at the stack trace you sent, the addresses of the arguments passed to IceGrid::ServerLoadCI::operator() (frame #0) are a bit suspicious. It looks like the "rhs" argument address is the address of the iterator rather than the address of the value pointed to by the iterator. If you're using an old compiler, could you try with a more recent compiler and see if you can still reproduce the issue?

    Compiling with gcc 3.3.6 yielded the same result.
    Restarting the IceGrid registry resulted in a segfault with nearly the same stack trace as in my post above (line 44 of AdapterCache.cpp).

    Do you have any further ideas or hints?

    Thanks.

    Cheers,
    Steffen
  • benoit (Rennes, France)
    You could try with GCC 4.x, and you could also send me a patch with the modifications to the IceGrid demo (or a tar.gz of the modified demo) and detailed instructions on how to reproduce the problem. This would make sure I'm using the same test case. I'll try with GCC 3.3 and let you know if I can reproduce the segfault.

    Thanks,

    Cheers,
    Benoit.
  • benoit wrote:
    You could try with GCC 4.x, and you could also send me a patch with the modifications to the IceGrid demo (or a tar.gz of the modified demo) and detailed instructions on how to reproduce the problem. This would make sure I'm using the same test case. I'll try with GCC 3.3 and let you know if I can reproduce the segfault.

    Unfortunately, building with gcc 4.0.2 led to the same crash of the IceGrid registry.
    The backtrace and a patch for the modified simple demo (including the deployment descriptor and the configuration files of the nodes) are attached.

    My test case configuration looks as follows:
    - 2 nodes on 2 different hosts (one of them providing the registry)
    - load balancing type "adaptive" (see the deploy descriptor bugtest.xml included in the diff)

    1) start icegridnode on both hosts (on command line, not in daemon mode)
    2) start the client
    3) kill the registry icegridnode and restart it while the client is running

    Repeat step 3 several times to reproduce the problem.

    Cheers,
    Steffen
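
Step 3 of the procedure above (kill the registry node and restart it, several times) can be automated along these lines. This is a rough sketch, not the procedure used in the thread, where the node was started and killed by hand on the command line. `restart_cycle` is a hypothetical helper, and `command` stands for whatever command line starts the registry's icegridnode; the thread does not give it.

```python
import subprocess
import sys
import time


def restart_cycle(command, cycles, uptime_seconds):
    """Start command, let it run for uptime_seconds, kill it, and repeat.

    Returns the list of exit codes, one per cycle. On POSIX systems a
    process killed this way reports a negative exit code (the signal number).
    """
    exit_codes = []
    for i in range(cycles):
        proc = subprocess.Popen(command)
        time.sleep(uptime_seconds)  # window in which clients send requests
        proc.kill()
        code = proc.wait()
        print(f"cycle {i + 1}: exit code {code}", file=sys.stderr)
        exit_codes.append(code)
    return exit_codes
```

Note that the loop ends with the process killed; to check whether the registry comes back up after the last cycle, it still has to be started once more and observed while the clients are running.
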
  • benoit (Rennes, France)
    Hi Steffen,

    I'm afraid I still can't reproduce the crash :( (tried with patched Ice-3.0.1 and GCC 4.0.2). If you still have the core file, could you print and post the contents of the "lhs" and "rhs" arguments in frame #0?

    Benoit.
  • Hi Benoit,
    benoit wrote:
    Hi Steffen,
    I'm afraid I still can't reproduce the crash :( (tried with patched Ice-3.0.1 and GCC 4.0.2). If you still have the core file, could you print and post the contents of the "lhs" and "rhs" arguments in frame #0?
    Benoit.

    I produced another core file and attached a printout of "lhs" and "rhs" together with the backtrace.
    In addition, if you wish, I could mail you the gzipped core file.

    Please let me know if you need further information.

    Thanks

    Cheers,
    Steffen
  • benoit (Rennes, France)
    Hi Steffen,

    I believe I found the problem. Could you apply the patch below and confirm that you can't reproduce the crash?

    Thanks!

    Benoit.
  • Hi Benoit,

    I applied the patch and can't reproduce the bug any more; the application ran without any problems.

    It seems that your patch solved the problem :)

    Thanks!

    Steffen
  • benoit (Rennes, France)
    Great! Thanks a lot for reporting this problem and helping to solve it! I've posted the official patch here (thread 2239).

    Cheers,
    Benoit.