Ice Registry blocking on restart

Steffen · March 2006

Hi,

To test the robustness of an IceGrid-based application, I tried shutting down and restarting the IceGrid registry service several times during client requests and ran into the following problem:
After restarting the IceGrid registry, the process blocked and printed the following output:

icegridnode: warning: thread pool `IceGrid.Registry.Client.ThreadPool'
is running low on threads
Size=70, SizeMax=70, SizeWarn=56

As you can see, I already increased the maximum number of client threads, but it made no difference.

Any ideas or is it maybe a bug?

I'm using Ice 3.0.0 and Suse Linux 8.1.

Grid configuration:

Node 1 + Registry:

IceGrid.InstanceName=DemoIceGrid
Ice.Default.Locator=DemoIceGrid/Locator:default -p 12000
IceGrid.Registry.Client.Endpoints=default -p 12000
IceGrid.Registry.Server.Endpoints=default
IceGrid.Registry.Internal.Endpoints=default
IceGrid.Registry.Admin.Endpoints=default
IceGrid.Registry.Data=db/registry
IceGrid.Registry.Client.ThreadPool.Size=70
IceGrid.Node.Name=fmqa07.vdlan......
IceGrid.Node.Endpoints=default
IceGrid.Node.Data=db/node
IceGrid.Node.CollocateRegistry=1
IceGrid.Node.Trace.Activator=1
IceGrid.Node.Trace.Patch=1

Node 2:

IceGrid.InstanceName=DemoIceGrid
Ice.Default.Locator=DemoIceGrid/Locator:tcp -h fmqa07.vdlan..... -p 12000
IceGrid.Registry.Client.Endpoints=tcp -h fmqa07.vdlan...... -p 12000
IceGrid.Node.Name=fmqa04.vdlan......
IceGrid.Node.Endpoints=default
IceGrid.Node.Data=db/node
IceGrid.Node.CollocateRegistry=0
IceGrid.Node.Trace.Activator=1
IceGrid.Node.Trace.Patch=1

Best Regards
Steffen

benoit · March 2006

Hi Steffen,

We're not aware of any problems with the icegridregistry. This could be a bug, but I'm afraid it's difficult to say for sure without more information. If the registry hangs, could you attach to the process with the debugger and get a dump of all the threads (you should try with the default thread pool configuration, otherwise you might see lots of threads)? With gdb, you can easily get a dump of all the threads with the command "thread apply all bt". You can save the thread dump by running gdb inside a "script" session (see man script).
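
If a scripted capture is more convenient than an interactive session, a minimal Python sketch along these lines might do. It assumes gdb is on the PATH and that you are allowed to attach to the process; the function names and the output path are only illustrative.

```python
import subprocess


def gdb_thread_dump_command(pid):
    """Build a gdb command line that backtraces every thread and then exits."""
    # -p attaches to a running process, -batch exits once the commands are done,
    # -ex runs a single gdb command.
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_threads(pid, out_path):
    """Run gdb against a live process and save the thread dump to out_path."""
    with open(out_path, "w") as out:
        result = subprocess.run(
            gdb_thread_dump_command(pid),
            stdout=out,
            stderr=subprocess.STDOUT,
            check=False,
        )
    return result.returncode
```

For example, `dump_threads(12345, "registry-threads.txt")` would write the dump for a (hypothetical) process ID 12345 to a text file that can be attached to a post.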

Could you also perhaps provide a small test case that reproduces the problem?

Cheers,
Benoit.

Steffen · March 2006

Hi Benoit,

benoit wrote:

If the registry hangs, could you attach to the process with the debugger and get a dump of all the threads (you should try with the default thread pool configuration, otherwise you might see lots of threads)? With gdb, you can easily get a dump of all the threads with the command "thread apply all bt". You can save the thread dump by running gdb inside a "script" session (see man script).

Could you also perhaps provide a small test case that reproduces the problem?

Thank you for the immediate answer.

I just reproduced the error with Ice 3.0.1 and the IceGrid "simple" demo; the thread dump is attached.

The modified descriptor of the simple application looks as follows (adaptive load balancing instead of round robin):

<icegrid>
  <application name="Simple">
    <server-template id="SimpleServer">
      <parameter name="index"/>
      <server id="SimpleServer-${index}" exe="./server" activation="on-demand">
        <adapter name="Hello" endpoints="tcp" register-process="true" replica-group="ReplicatedHelloAdapter"/>
        <property name="Identity" value="hello"/>
      </server>
    </server-template>

    <replica-group id="ReplicatedHelloAdapter">
      <load-balancing type="adaptive"/>
      <object identity="hello" type="::Demo::Hello"/>
    </replica-group>

    <node name="host1...">
      <server-instance template="SimpleServer" index="1"/>
    </node>
    <node name="host2...">
      <server-instance template="SimpleServer" index="2"/>
    </node>
  </application>
</icegrid>
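
To illustrate what the change from round robin to adaptive means for replica selection, here is a rough Python sketch. It is not IceGrid's implementation; the load figures and function names are made up for illustration, and the server IDs simply follow the `SimpleServer-${index}` template above.

```python
from itertools import cycle


def round_robin(replicas):
    """Return an iterator that hands out replicas in a fixed rotation, ignoring load."""
    return cycle(replicas)


def adaptive(replicas, load_of):
    """Return the replica that currently reports the lowest load."""
    return min(replicas, key=load_of)


# Hypothetical load values for the two server instances.
loads = {"SimpleServer-1": 0.8, "SimpleServer-2": 0.2}
replicas = list(loads)

rotation = round_robin(replicas)
first_three = [next(rotation) for _ in range(3)]
least_loaded = adaptive(replicas, loads.get)
```

With round robin the requests alternate between the two servers regardless of load, while the adaptive choice depends on a comparison of the current load values.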

In my implementation, the "simple" client calls sayHello() in a loop instead of prompting for each request as the original sample code does.

I noticed that this bug only occurs when there are concurrent client requests, so I started several client instances.
To reproduce the error, I needed to shut down and restart the registry several times while the clients were running.
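
The client side of this scenario can be sketched roughly as follows in Python. This is only an approximation: it uses threads rather than separate client processes, and `say_hello` stands in for the real proxy call, so every name here is an assumption rather than code from the demo.

```python
from concurrent.futures import ThreadPoolExecutor


def request_loop(say_hello, iterations):
    """Call say_hello() repeatedly and return (succeeded, failed) counts."""
    succeeded = failed = 0
    for _ in range(iterations):
        try:
            say_hello()
            succeeded += 1
        except Exception:
            # A request may fail while the registry is down; keep going.
            failed += 1
    return succeeded, failed


def run_clients(say_hello, clients, iterations):
    """Run several request loops concurrently and collect their counts."""
    with ThreadPoolExecutor(max_workers=clients) as pool:
        futures = [
            pool.submit(request_loop, say_hello, iterations)
            for _ in range(clients)
        ]
        return [future.result() for future in futures]
```

Counting failures instead of aborting keeps the clients issuing requests across each registry restart, which is the condition under which the hang showed up.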

Cheers
Steffen

benoit · March 2006

Hi Steffen,

Thanks for the thread dump! It looks like a bug, we'll look into it and report back to you as soon as we have more information!

Cheers,
Benoit.

benoit · March 2006

Hi Steffen,

I've posted a patch for this problem in thread 2213. I tried shutting down/restarting the registry in the scenario you described and everything ran fine. Please let me know if you encounter any other issues! And thanks again for the bug report!

Cheers,
Benoit.

Steffen · March 2006

Hi Benoit,

benoit wrote:

I've posted a patch for this problem in thread 2213. I tried shutting down/restarting the registry in the scenario you described and everything ran fine. Please let me know if you encounter any other issues! And thanks again for the bug report!

Thanks for your patch!
I can no longer reproduce this problem.

However, I ran into another problem.
When I restarted the IceGrid registry after several rounds of shutting down and restarting (the scenario described above), it crashed with a segmentation fault. The core dump backtrace is attached.

Cheers
Steffen

benoit · March 2006

Hi Steffen,

I'm afraid I'm not able to reproduce this crash with our mainline on Fedora Core 4 using the GCC 4.0.2 compiler. I'll try with Ice 3.0.1, but I doubt this will make a difference, as there were few changes to the code that appears in the stack trace.

Which compiler do you use on your Suse platform?

Looking at the stack trace you sent, the addresses of the arguments passed to IceGrid::ServerLoadCI::operator() (frame #0) are a bit suspicious. It looks like the "rhs" argument address is the address of the iterator rather than the address of the value pointed to by the iterator. If you're using an old compiler, could you try with a more recent compiler and see if you can still reproduce the issue?

Thanks,

Cheers,
Benoit.

Steffen · March 2006

Hi Benoit,

benoit wrote:

I'm afraid I'm not able to reproduce this crash with our mainline on Fedora Core 4 using the GCC 4.0.2 compiler. I'll try with Ice 3.0.1, but I doubt this will make a difference, as there were few changes to the code that appears in the stack trace.

Which compiler do you use on your Suse platform?

I'm using gcc 3.2.

benoit wrote:

Looking at the stack trace you sent, the addresses of the arguments passed to IceGrid::ServerLoadCI::operator() (frame #0) are a bit suspicious. It looks like the "rhs" argument address is the address of the iterator rather than the address of the value pointed to by the iterator. If you're using an old compiler, could you try with a more recent compiler and see if you can still reproduce the issue?

Compiling with gcc 3.3.6 yielded the same result.
Restarting the IceGrid registry resulted in a segfault with nearly the same stack trace as in my posting above (line 44 of AdapterCache.cpp).

Do you have any further ideas or hints?

Thanks.

Cheers,
Steffen

benoit · March 2006

You could try with GCC 4.x, and you could also send me a patch with the modifications to the IceGrid demo (or a tar.gz of the modified demo) and detailed instructions on how to reproduce the problem. This would make sure I'm using the same test case. I'll try with GCC 3.3 and let you know if I can reproduce the segfault.

Thanks,

Cheers,
Benoit.

Steffen · March 2006

benoit wrote:

You could try with GCC 4.x, and you could also send me a patch with the modifications to the IceGrid demo (or a tar.gz of the modified demo) and detailed instructions on how to reproduce the problem. This would make sure I'm using the same test case. I'll try with GCC 3.3 and let you know if I can reproduce the segfault.

Unfortunately, building with gcc 4.0.2 led to the same crash of the IceGrid registry.
The backtrace and a patch for the modified simple demo, including the deployment descriptor and the configuration files of the nodes, are attached.

My test case configuration looks as follows:
- 2 nodes on 2 different hosts (one of them providing the registry)
- load balancing type "adaptive" (see the deploy descriptor bugtest.xml included in the diff)

1) start icegridnode on both hosts (on command line, not in daemon mode)
2) start the client
3) kill the registry icegridnode and restart it while the client is running

Repeat step 3 several times to reproduce the problem.
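
Step 3 can also be automated. The following Python sketch is one possible way to do it, under the assumption that the registry node is started from a single command line; the command and config file name in the comment are placeholders, not the exact invocation used here.

```python
import subprocess
import time


def restart_repeatedly(command, cycles, uptime_seconds):
    """Start `command`, let it run for a while, kill it, and repeat.

    Returns the exit code of each killed process. The process is left
    stopped after the final cycle, so start it once more by hand if needed.
    """
    exit_codes = []
    for _ in range(cycles):
        proc = subprocess.Popen(command)
        time.sleep(uptime_seconds)  # let the clients send requests meanwhile
        proc.kill()
        exit_codes.append(proc.wait())
    return exit_codes


# Hypothetical usage (config file name is a placeholder):
# restart_repeatedly(["icegridnode", "--Ice.Config=config.grid"],
#                    cycles=10, uptime_seconds=5)
```

Run it on the host that provides the registry while the client from step 2 keeps sending requests.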

Cheers,
Steffen

benoit · March 2006

Hi Steffen,

I'm afraid I still can't reproduce the crash (tried with patched Ice-3.0.1 and GCC 4.0.2). If you still have the core file, could you print and post the contents of the "lhs" and "rhs" arguments in frame #0?

Benoit.

Steffen · March 2006

Hi Benoit,

benoit wrote:

Hi Steffen,
I'm afraid I still can't reproduce the crash (tried with patched Ice-3.0.1 and GCC 4.0.2). If you still have the core file, could you print and post the contents of the "lhs" and "rhs" arguments in frame #0?
Benoit.

I produced another core file and attached a printout of "lhs" and "rhs" together with the backtrace.
If you wish, I could also mail you the gzipped core file.

Please let me know if you need further information.

Thanks

Cheers,
Steffen

benoit · March 2006

Hi Steffen,

I believe I found the problem. Could you apply the patch below and confirm that you can't reproduce the crash?

Thanks!

Benoit.

Steffen · March 2006

Hi Benoit,

I applied the patch and can no longer reproduce the bug; the application ran without any problems.

It seems that your patch solved the problem.

Thanks!

Steffen

benoit · March 2006

Great! Thanks a lot for reporting this problem and helping to solve it! I've posted the official patch in thread 2239.

Cheers,
Benoit.
