
ICE 3.4.0 / 3.4.1 bug (segmentation fault)

Hello Ice Team,

I have probably found a bug that exists in Ice 3.4.0 and 3.4.1 (other versions not tested). Under some circumstances (not so rare!) Ice does something very wrong and the program exits with a SEGFAULT on Linux. It took me almost a week to find out that my code is not responsible for the SEGFAULT; I have created a simple program based on your example which also segfaults. Here are some details:
- program is based on demo/minimal example
- program uses AMI for client and AMD for server side
- for the AMD calls I used CallQueue (http://www.zeroc.com/newsletter/issue13/qt2.zip), which in my humble opinion should also be well suited to server-side asynchronous calls
- the client's main loop does the following (see the sketch after this list):
+ create an Ice communicator
+ create a Hello proxy with a low timeout (200ms)
+ send a few thousand AMI sayHello calls, alternating with sleeps
+ invoke
communicator->shutdown();
communicator->waitForShutdown();
communicator->destroy();
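
For reference, here is a minimal sketch of the loop described above. This is not the attached test case; the Demo module name, proxy string, iteration counts and sleep pattern are assumptions based on the "minimal" demo.

// Minimal sketch (assumptions noted above), Ice 3.4 C++ AMI (begin_/end_) mapping
#include <Ice/Ice.h>
#include <Hello.h>
#include <unistd.h>

int main(int argc, char* argv[])
{
    for(int i = 0; i < 50; ++i)                      // outer loop, repeated ~50 times
    {
        Ice::CommunicatorPtr ic = Ice::initialize(argc, argv);
        Demo::HelloPrx hello = Demo::HelloPrx::uncheckedCast(
            ic->stringToProxy("hello:tcp -h 127.0.0.1 -p 10000")->ice_timeout(200));

        for(int j = 0; j < 5000; ++j)                // a few thousand AMI calls
        {
            hello->begin_sayHello();                 // asynchronous invocation
            if(j % 1000 == 999)
            {
                sleep(1);                            // alternate bursts of calls with sleeps
            }
        }

        ic->shutdown();
        ic->waitForShutdown();
        ic->destroy();
    }
    return 0;
}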

The client's main loop is repeated about 50 times. After a few iterations the client segfaults on my machine. Here are some other observations:
- for a greater timeout value the segfault is less probable
- with synchronous server-side calls (instead of AMD) the segfault is unlikely
- the debugger shows that the error happens in ConnectionI.cpp at line 591 (copy(p, p + sizeof(Int), os->b.begin() + headerSize);); I checked that os->b.begin() points to a NULL buffer. Until today I thought that some part of my code might be writing into memory where Ice objects are located, but I have managed to isolate the problem

I put the code in the attachment. I assume the segfault might be caused by a race condition or by a code path that is executed only under rare conditions, so it might not be easy to reproduce the error immediately on your side.

Regardless of this, please check whether the code is valid and should not cause segfaults.

Best regards
Przemek

Comments

  • bernard (Jupiter, FL)
    Hi Przemek,

    I built your test with Ice 3.4.1 and Visual Studio 2008 and it ran fine - no crash. I just had to replace sleep(1) with Sleep(1000) and add CallQueue.cpp/.obj to the build.

    Maybe this is easier to reproduce on Linux. It would be helpful if you could review and improve your test case:

    - the client (which is supposed to crash)

    Do you really need to create and destroy all these communicators? Is this related to or contributing to this crash? If not, you should create a single communicator, like most applications.

    You should also remove all the cookie-related code, as it doesn't seem relevant.

    Also, do you run this client with any configuration? Your test case didn't provide any configuration. By default, the client thread pool has just one thread, so there is not much concurrency.
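
    For instance (illustrative values only), one way to get more concurrency is to start the client with larger client thread pool settings, e.g.:

    ./client --Ice.ThreadPool.Client.Size=4 --Ice.ThreadPool.Client.SizeMax=8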

    - the server

    The client is totally unaware of the implementation of the server. The client has no idea whether the server is using synchronous or asynchronous dispatch, or whether it is written in C++ or Java. The server implementation may only matter in terms of timing (when empty responses are sent back to the client).

    And if/when you get a crash, please capture and post the stack trace!

    Best regards,
    Bernard
  • Hi Bernard,

    Sorry that I didn't provide details; here they are:

    "Maybe this is easier to reproduce on Linux"

    You might be right; different thread timing, socket implementations, etc. may be important. I didn't test it on Windows; I assumed that Ice would run the same way on any supported OS. Also, system resources like CPU usage may be important (during my tests my system was a little busy doing other things).

    "Do you really need to create and destroy all these communicators? Is thhis related or contributing to this crash? If not, you should create a single communicator, like most applications"

    This is not a good argument: if the bug is related to closing the communicator, you cannot assume that Ice is allowed to crash the whole application. Also, I feel that you have confirmed that the code is valid and should not cause segfaults. If so, let's just look for the bug, because otherwise Ice's reliability is under a big question mark.
    Update: the code with one communicator also fails in the same way.

    "You should also remove all the cookie-related code, as it doesn't seem relevant."

    I'm currently testing this case and will send the results later if that helps.
    Update: the code without cookies also fails with the same stacktrace.

    "Also, do you run this client with any configuration? Your test case didn't provide any configuration. By default, the client thread pool has just 1 thread, so there is no much concurrency."

    I run the client without any configuration. If I understand the Ice framework correctly, there are at least two threads involved: the main thread and the Ice thread from the client pool. That is already enough for race conditions to occur.

    "The client is totally unaware of the implementation of the server. The client has no idea if the server is using synchronous or asynchronous dispatch, or is written in C++ or Java. This implementation may only matter in terms of timing (when empty responses are sent back to the client)."

    I'm aware that the client knows nothing about the server implementation. I just don't know (and I'm not sure I would like to know) how the protocol works, when and whether acknowledgements are sent, etc. Maybe the client fails independently of the server implementation; let's just skip it.
    Update: client with synchronous server also fails.

    Here is the stack trace captured with gdb:

    (gdb) bt
    #0 memmove () at ../sysdeps/i386/i686/memmove.S:68
    #1 0x002d1f44 in std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<unsigned char> (
    __first=0xb77e2ae8 "^\004", __last=0xb77e2aec "Tk\a\266\274\034?\220\033?\210\035?\364\217R",
    __result=0xe <Address 0xe out of bounds>) at /usr/include/c++/4.4/bits/stl_algobase.h:378
    #2 0x003209ea in std::__copy_move_a<false, unsigned char const*, unsigned char*> (
    __first=0xb77e2ae8 "^\004", __last=0xb77e2aec "Tk\a\266\274\034?\220\033?\210\035?\364\217R",
    __result=0xe <Address 0xe out of bounds>) at /usr/include/c++/4.4/bits/stl_algobase.h:397
    #3 0x0031ee8c in std::__copy_move_a2<false, unsigned char const*, unsigned char*> (
    __first=0xb77e2ae8 "^\004", __last=0xb77e2aec "Tk\a\266\274\034?\220\033?\210\035?\364\217R",
    __result=0xe <Address 0xe out of bounds>) at /usr/include/c++/4.4/bits/stl_algobase.h:436
    #4 0x0031cfda in std::copy<unsigned char const*, unsigned char*> (__first=0xb77e2ae8 "^\004",
    __last=0xb77e2aec "Tk\a\266\274\034?\220\033?\210\035?\364\217R",
    __result=0xe <Address 0xe out of bounds>) at /usr/include/c++/4.4/bits/stl_algobase.h:468
    #5 0x00310171 in Ice::ConnectionI::sendAsyncRequest (this=0xb619b778, out=..., compress=false,
    response=true) at ConnectionI.cpp:591
    #6 0x002f3a0c in IceInternal::ConnectRequestHandler::flushRequests (this=0xb61109d0)
    at ConnectRequestHandler.cpp:416
    #7 0x002f3497 in IceInternal::ConnectRequestHandler::setConnection (this=0xb61109d0, connection=...,
    compress=false) at ConnectRequestHandler.cpp:321
    #8 0x00407a9f in setConnection (this=0xb6129aa0, connection=..., compress=false) at Reference.cpp:1711
    #9 0x002fef76 in IceInternal::OutgoingConnectionFactory::ConnectCallback::setConnection (this=0xb6129e30,
    connection=..., compress=false) at ConnectionFactory.cpp:1129
    #10 0x002fd25c in IceInternal::OutgoingConnectionFactory::finishGetConnection (this=0xb61009c0,
    connectors=..., ci=..., connection=..., cb=...) at ConnectionFactory.cpp:766
    #11 0x002fe3ab in IceInternal::OutgoingConnectionFactory::ConnectCallback::connectionStartCompleted (
    this=0xb6129e30, connection=...) at ConnectionFactory.cpp:955
    #12 0x0031363a in Ice::ConnectionI::dispatch (this=0xb619b778, startCB=..., sentCBs=..., compress=0 '\000',
    requestId=0, invokeNum=0, servantManager=..., adapter=..., outAsync=..., stream=...)
    at ConnectionI.cpp:1443
    #13 0x003134a7 in Ice::ConnectionI::message (this=0xb619b778, current=...) at ConnectionI.cpp:1428
    #14 0x00446cdc in IceInternal::ThreadPool::run (this=0xb6101140, thread=...) at ThreadPool.cpp:624
    #15 0x004486ef in IceInternal::ThreadPool::EventHandlerThread::run (this=0xb6101e20) at ThreadPool.cpp:1097
    #16 0x005839f9 in startHook (arg=0xb6101e20) at Thread.cpp:413
    #17 0x005a780e in start_thread (arg=0xb77e3b70) at pthread_create.c:300
    #18 0x007c68de in clone () at ../sysdeps/unix/sysv/linux/i386/clone.S:130

    I hope this helps you a lot. I have a feeling that it has something to do with reconnection, e.g. some async request objects are stored on a list while there is no connection, but they are just copies of the original request without buffers, and that is why it fails.

    Best regards
    Przemek
  • bernard (Jupiter, FL)
    Hi Przemek,

    Could you post your simplified test-case?
    "This is not a good argument: if the bug is related to closing the communicator, you cannot assume that Ice is allowed to crash the whole application."

    It's highly desirable to create a small and simple test case: if there is a bug in the test case itself, it's easier to spot in a small test case, and if there is a bug in Ice, it's helpful to remove extra code paths unrelated to the bug in question.

    Thanks,
    Bernard
  • Hi Bernard,

    The attachment contains a simplified test case. Sorry for the late reply; it took me a lot of time to apply the small changes and run the test many times.
    The attachment contains only the code for Client.cpp of the "minimal" example, i.e. one should take the "minimal" example from the current Ice distribution and replace its Client.cpp file.

    Details of my tests (they were performed on a virtual machine - VMware):
    physical machine: Intel Core 2 Duo, 2.1GHz, 64-bit, 4GB RAM
    host OS: Windows 7 Professional
    guest machine (VMware): 32-bit, 2 cores, 2GB RAM
    guest OS (VMware): Ubuntu, kernel 2.6.31-20-generic #58-Ubuntu SMP
    gcc: v4.4.1
    Ice: 3.4.0 / 3.4.1
    server/client config for Ice: none
    CPU load: high or medium (doing other tasks)

    The test fails (segfault) with a timeout of ~250ms when the CPU load is high and ~135ms when it is medium. Please note: the segfault does not always occur; sometimes the program exits without error. You have to run the test a few times, at least 5 I think.

    In the source you will see a commented-out line; it may be important.

    My short summary:
    - the error occurs only when the client invokes AMI calls
    - the error occurs when the proxy has its timeout set to a low value (for greater values the segfault is less probable) and the proxy goes out of scope before the async calls are performed (?)
    - the error may exist only on Linux

    How else can I help? I could provide you with the virtual machine settings and the virtual hard disk file, but that may be difficult given the size of the file (a few gigabytes).

    Update: in the commented-out line in the code you will see a different port number, 10001 instead of 10000; just ignore it and put "10000" if you want to uncomment the line and run your own tests ("10000" is used for other purposes on my machine, so during all my tests I used "10001" and forgot to change it back to match the "minimal" example).

    Update 2: I have just run my test on a real (not virtual) 64-bit Linux machine with 8 cores and 64GB of RAM; the segfault occurs in the same way. If I find some time I may try a Windows version.

    Best regards
    Przemek
  • benoit (Rennes, France)
    Hi,

    Thanks for the simplified test case. I'll look further into it.

    Cheers,
    Benoit.
  • benoit (Rennes, France)
    I was able to reproduce the problem. I believe a workaround is to disable automatic retries with --Ice.RetryIntervals=-1. I will post a patch once I have a fix. Thanks for the bug report!
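
    For illustration, the same workaround can also be applied programmatically; here is a minimal sketch (it assumes the property is set before the communicator is created, and is equivalent to passing --Ice.RetryIntervals=-1 on the command line):

    // Sketch only: disable automatic retries as a workaround for this issue.
    Ice::InitializationData initData;
    initData.properties = Ice::createProperties();
    initData.properties->setProperty("Ice.RetryIntervals", "-1");
    Ice::CommunicatorPtr communicator = Ice::initialize(initData);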

    Cheers,
    Benoit.