src/IceUtil/CtrlCHandler.cpp broken on Solaris 10 x86_64 with latest patches

Hi,

Last Saturday we updated our development system with the latest Solaris patches. Since then, all of our ICE applications dump core on exit. This happens with ICE 3.1.0 and 3.2.0; we have not tested other versions.

After some sleepless nights and testing five different compiler versions, we have at least found the point where the core dump is generated:

int rc = sigwait(&ctrlCLikeSignals, &signal);

in src/IceUtil/CtrlCHandler.cpp.

--
dbx session:

(dbx) run
@2 (l@2) signal SEGV (no mapping at the fault address) in _dlamd64getunwind at 0xfffffd7fff3dd521
0xfffffd7fff3dd521: _dlamd64getunwind+0x0061: movq 0x00000000000000d0(%r14),%rdi
(dbx) where
current thread: t@2
=>[1] _dlamd64getunwind(0x0, 0xffffffffffffffff, 0xfffffd7ffcffd4e0, 0xfffffd7fff3fd8c0, 0xfffffd7fff3fd720, 0x0), at 0xfffffd7fff3dd521
[2] _Unw_EhfhLookup(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffddcc8aa
[3] complete_context(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffddccd75
[4] down_one(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffddccef0
---- hidden frames, use 'where -h' to see them all ----
[8] _thrp_unwind(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffddccbdc
[9] _thr_exit_common(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffddc7fac
[10] _thr_exit(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffddc7fee
[11] do_sigcancel(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffddc14be
[12] call_user_handler(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffddc0aac
[13] sigacthandler(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffddc0c58
---- called from signal handler with signal 36 (SIGCANCEL)
[14] ___sigtimedwait(0xfffffd7ffcffdde0, 0xfffffd7ffcffddf0, 0x0, 0xe, 0x0, 0xfffffd7ffcffdfe0, 0x4003, 0x0, 0xfffffd7ffcffde10), at 0xfffffd7ffddcd7ea
[15] __sigtimedwait(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffddbfa77
[16] _sigwait(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffddbfb4d
[17] __posix_sigwait(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffddb7be0
[18] sigwaitThread(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffe9a540c
[19] _thr_setup(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffddcb40b
[20] _lwp_start(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7ffddcb640
(dbx) threads
t@1 a l@1 ?() running in __lwp_wait()
*> t@2 a l@2 sigwaitThread() signal SIGSEGV in _dlamd64getunwind()

--
We have no idea what the problem is or how to fix it.

--
On my laptop, which I last updated on March 30, 2007, everything runs as expected. I diffed the installed patch versions between my laptop and our development system (see the end of this post for the complete list).

I suspect that these patches cause the problem:
124923 02 < 03 RS- 57 SunOS 5.10_x86: ld.so.1 patch
125101 04 < 08 RS- 12 SunOS 5.10_x86: Kernel Update patch

--
We tried the following compilers:

Sun Studio 11 (vanilla)
Sun Studio 11 (latest patches)
Sun Studio 12 (early access 12/06)
Sun Studio 12 (early access 04/07)
Sun Studio 12 (vanilla)

Do you have any idea how to fix that?

Can we do anything to help find the problem?

Best regards,
Markus





bash-3.00# ./pca
Using /var/tmp/patchdiag.xref from Jun/01/07
Host: amilo-1 (SunOS 5.10/Generic_125101-04/i386/i86pc)

Patch IR CR RSB Age Synopsis
-- - -- --- ---
118372 09 < 10 RS- 50 SunOS 5.10_x86: elfsign patch
118778 07 < 08 --- 39 SunOS 5.10_x86: Sun GigaSwift Ethernet 1.0 driver patch
119060 20 < 24 RS- 12 X11 6.6.2_x86: Xsun patch
119091 23 < 24 --- 32 SunOS 5.10_x86: Sun iSCSI Device Driver and Utilities
119116 25 < 27 -S- 4 Mozilla 1.7_x86 patch
119118 30 < 33 --- 35 Evolution 1.4.6_x86 patch
119279 12 < 15 --- 20 CDE 1.6_x86: dtlogin patch
119281 10 < 11 --- 60 CDE 1.6_x86: Runtime library patch for Solaris 10
119316 07 < 08 --- 11 SunOS 5.10_x86: Solaris Management Applications Patch
119471 08 < 09 --- 4 SunOS 5.10_x86: Sun Enterprise Network Array firmware and utilitie
119539 10 < 11 --- 60 GNOME 2.6.0_x86: Window Manager Patch
119541 05 < 06 --- 6 GNOME 2.6.0_x86: Gnome Dtlogin configuration Patch
119547 07 < 08 --- 12 APOC 1.2_x86: APOC Configuration Agent Patch
119549 07 < 08 -S- 61 GNOME 2.6.0_x86: Gnome Multi-protocol instant messaging client Pat
119686 10 < 11 R-- 46 SunOS 5.10_x86: svc.startd patch
119704 08 < 09 --- 61 SunOS 5.10_x86: Patch for localeadm issues
119789 07 < 08 --- 33 Synopsis: SunOS 5.10_x86: Sun Update Connection Proxy 1.0.9
119811 03 < 04 --- 35 SunOS 5.10_x86: International Components for Unicode Patch
119813 03 < 05 RS- 11 X11 6.6.2_x86: Freetype patch
119975 06 < 07 --- 41 SunOS 5.10_x86: fp plug-in for cfgadm
120033 02 < 03 --- 50 SunOS 5.10_x86: libresolv.so.2 patch
120037 11 < 19 RS- 4 SunOS 5.10_x86: libc nss ldap PAM zfs patch
120051 05 < 06 RS- 29 SunOS 5.10_x86: usermod patch
120086 01 < 02 RS- 50 SunOS 5.10_x86: patch usr/sbin/in.ftpd
120095 10 < 12 -S- 4 X11 6.6.2_x86: xscreensaver patch
120100 07 < 08 --- 12 APOC 1.2_x86: Sun Java(tm) Desktop System Configuration Shared Lib
120183 05 < 06 --- 41 SunOS 5.10_x86: Sun Fibre Channel Host Bus Adapter Library
120202 03 < 04 --- 57 X11 6.8.0_x86: Xorg client libraries patch
120223 16 < 17 R-- 5 SunOS 5.10_x86: Emulex-Sun LightPulse Fibre Channel Adapter driver
120273 07 < 13 RS- 7 SunOS 5.10_x86: SMA patch
120536 13 < 15 --- 4 SunOS 5.10_x86: Updated video drivers and fixes
120544 08 < 09 -SB 56 SunOS 5.10_x86: Apache 2 Patch
120630 04 < 05 --- 35 SunOS 5.10_x86: libpool patch
120759 11 < 13 --- 13 Sun Studio 11_x86: Sun Compiler Common patch for x86 backend
120846 04 < 05 --- 47 SunOS 5.10_x86: auditd patch
120888 06 < 07 --- 41 SunOS 5.10_x86: cdrw patch
121011 04 < 05 R-- 40 SunOS 5.10_x86: rpc.metad patch
121119 11 < 12 R-- 22 SunOS 5.10_x86: Sun Update Connection System Client 1.0.9
121212 01 < 02 -S- 50 SunOS 5.10_x86: Sun Java Web Console (Lockhart) Patch
121230 01 < 02 RS- 43 SunOS 5.10_x86: libssl patch
121287 01 < 02 --- 40 SunOS 5.10_x86: pcn driver patch
121289 02 < 04 RS- 11 SunOS 5.10_x86: inetd & svcs patch
121309 08 < 09 RS- 14 SunOS 5.10_x86: Solaris Management Console Patch
121429 03 < 04 --- 40 SunOS 5.10_x86: Live Upgrade Zones Support Patch
121431 13 < 14 --- 47 SunOS 5.8_x86 5.9_x86 5.10_x86: Live Upgrade Patch
121902 01 < 02 R-- 39 SunOS 5.10_x86: i.manifest r.manifest class action script patch
122148 -- < 01 --- 43 Sun Studio 11_x86: Patch for x86 update checking binary.
122184 02 < 03 --- 46 SunOS 5.10_x86: logadm timestamp patch
122205 01 < 02 --- 55 GNOME 2.6.0_x86: configuration framework patch
122213 17 < 18 -S- 35 GNOME 2.6.0_x86: GNOME Desktop Patch
122530 05 < 06 --- 61 SunOS 5.10_x86: nge patch
122647 03 < 04 --- 32 SunOS 5.10_x86: zlogin patch
122661 06 < 07 R-- 18 SunOS 5.10_x86: zones patch
123004 02 < 03 --- 55 SunOS 5.10_x86: SAM module patch
123591 03 < 05 RS- 11 SunOS 5.10_x86: PostgresSQL patch
123776 02 < 03 --- 15 SunOS 5.10_x86: pcplusmp driver patch
124253 01 < 03 --- 20 SunOS 5.10_x86: nfssrv patch
124255 03 < 04 --- 43 SunOS 5.10_x86: sockfs patch
124259 01 < 05 RS- 12 SunOS 5.10_x86: ufs and nfs driver patch
124631 03 < 07 R-- 47 SunOS 5.10_x86: System Administration Applications, Network, and C
124859 -- < 01 --- 42 Patch for SS11_x86 debuginfo handling
124923 02 < 03 RS- 57 SunOS 5.10_x86: ld.so.1 patch
125015 02 < 03 --- 36 SunOS 5.10_x86: IP filter patch
125038 03 < 06 --- 39 SunOS 5.10_x86: mpt driver patch
125101 04 < 08 RS- 12 SunOS 5.10_x86: Kernel Update patch
125107 -- < 01 --- 57 SunOS 5.10_x86: Thermal zone monitor patch
125111 -- < 01 --- 57 SunOS 5.10_x86: cut patch
125113 -- < 01 --- 60 SunOS 5.10_x86: iostat patch
125115 -- < 01 --- 57 SunOS 5.10_x86: cpustat patch
125117 -- < 01 --- 60 SunOS 5.10_x86: dld driver patch
125119 -- < 01 --- 60 SunOS 5.10_x86: netstat patch
125121 -- < 02 --- 32 SunOS 5.10_x86: e1000g driver patch
125130 -- < 01 --- 25 SunOS 5.10_x86: specfs patch
125165 01 < 02 --- 60 SunOS 5.10_x86: Qlogic ISP Fibre Channel Device Driver
125185 -- < 03 --- 6 SunOS 5.10_x86: Sun Fibre Channel Device Drivers
125412 -- < 01 --- 25 SunOS 5.10_x86: bge driver patch
125429 01 < 02 --- 56 SunOS 5.10_x86: Kerberos patch
125466 -- < 02 --- 50 SunOS 5.10_x86: PKCS provider patch
125475 -- < 01 --- 25 X11 6.8.0_x86: Xorg client libraries patch
125532 -- < 01 --- 6 Gnome 2.6.0_x86: File System Examiner Patch
125720 -- < 03 RS- 12 X11 6.8.0_x86: Xorg server patch
125726 -- < 02 --- 18 X11 6.6.2_x86: xinerama patch
125732 -- < 01 --- 50 SunOS 5.10_x86: XML and XSLT libraries patch
125794 -- < 01 --- 56 SunOS 5.10_x86: cryptmod patch
125798 -- < 01 --- 60 SunOS 5.10_x86: libaio.so.1 patch
125805 -- < 01 --- 22 SunOS 5.10_x86: uucp patch
125809 -- < 01 --- 60 SunOS 5.10_x86: sendmail patch
125811 -- < 01 --- 40 SunOS 5.10_x86: fsckall patch
125910 -- < 01 --- 50 SunOS 5.10_x86: libcurses patch
125912 -- < 01 --- 49 SunOS 5.10_x86: prctl fails to set resource controls on some proce
125913 -- < 01 --- 18 SunOS 5.10_x86: ixgb port tx hang on Sun Fire X4x00
126118 -- < 01 --- 57 CDE 1.6_x86: DtPower patch
126120 -- < 01 --- 19 CDE 1.6_x86: sys-suspend patch
126207 -- < 01 --- 57 SunOS 5.10_x86: zebra ripd patch
126547 -- < 01 --- 25 SunOS 5.10_x86: Bash patch
126631 -- < 01 --- 25 SunOS 5.10_x86: tcsh patch

Comments

  • Problem is reproducible with test/IceUtil/ctrlCHandler/client

    We have added the following trace to CtrlCHandler.cpp:

    + printf("before sigwait\n");
    int rc = sigwait(&ctrlCLikeSignals, &signal);
    + printf("after sigwait\n");


    --
    Run of test/IceUtil/ctrlCHandler/client without pressing Ctrl+C:

    -bash-3.00$ build/Ice-3.1.0/test/IceUtil/ctrlCHandler/client
    First ignore CTRL+C and the like for 10 seconds (try it!)
    before sigwait
    Then handling them for another 30 seconds (try it)
    Segmentation fault (core dumped)


    --
    Now pressing CTRL+C in the first 10 seconds:

    -bash-3.00$ build/Ice-3.1.0/test/IceUtil/ctrlCHandler/client
    First ignore CTRL+C and the like for 10 seconds (try it!)
    before sigwait
    ^Cafter sigwait
    before sigwait
    Then handling them for another 30 seconds (try it)
    Segmentation fault (core dumped)


    --
    Now pressing CTRL+C after the first 10 seconds:

    -bash-3.00$ build/Ice-3.1.0/test/IceUtil/ctrlCHandler/client
    First ignore CTRL+C and the like for 10 seconds (try it!)
    before sigwait
    Then handling them for another 30 seconds (try it)
    ^Cafter sigwait
    Handling signal 2
    before sigwait
    Segmentation fault (core dumped)


    --
    Hope that helps somehow.

    Markus
  • bernard
    bernard Jupiter, FL
    Hi Markus,

    That's an interesting problem, and it is most certainly pthread-related.

    On POSIX platforms, the CtrlCHandler class blocks a number of CTRL-C like signals and starts a background thread to "catch" these signals and run an application-registered callback function upon catching such a signal (typically the callback shuts down or destroys a communicator).

    When the CtrlCHandler object is destroyed, it cancels this background thread, using thread cancellation (that's the only part of Ice that uses thread cancellation). This works well on all our platforms so far (with a little work-around for MacOS X).

    But apparently on your patched Solaris x86 system, this thread cancellation triggers a core dump in the cancelled thread :( .

    If you can't find a Sun patch that restores thread cancellation, you could try to update the CtrlCHandler code to avoid thread cancellation altogether, e.g. the CtrlCHandler destructor would set a flag and then pthread_kill the sigwait thread; when the sigwait thread wakes up, it would check this flag and exit.

    If you like, I'd be happy to prepare such a CtrlCHandler patch for you.

    Best regards,
    Bernard
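
The flag-plus-pthread_kill idea described above could look roughly like the following. This is a minimal sketch under stated assumptions, not Ice's CtrlCHandler code: every name in it (stopRequested, handledCount, sigwaitLoop, startWaiter, stopWaiter) is hypothetical, the callback is replaced by a counter, and SIGTERM is assumed as the wake-up signal.

```cpp
#include <atomic>
#include <pthread.h>
#include <signal.h>

// All names below are hypothetical; this is not Ice's CtrlCHandler code.
static std::atomic<bool> stopRequested(false); // set before waking the thread
static std::atomic<int> handledCount(0);       // signals passed to the "callback"
static sigset_t ctrlCLikeSignals;              // SIGHUP, SIGINT, SIGTERM
static pthread_t waiterThread;

// Background thread: waits for the blocked signals and exits when asked to.
static void* sigwaitLoop(void*)
{
    for (;;)
    {
        int sig = 0;
        if (sigwait(&ctrlCLikeSignals, &sig) != 0)
        {
            continue; // sigwait reported an error; wait again
        }
        if (stopRequested.load())
        {
            return nullptr; // woken by stopWaiter(): return normally, no cancellation
        }
        ++handledCount; // a real handler would invoke the registered callback here
    }
}

// Blocks the signals in the calling thread and starts the waiter thread,
// which inherits the calling thread's signal mask.
bool startWaiter()
{
    sigemptyset(&ctrlCLikeSignals);
    sigaddset(&ctrlCLikeSignals, SIGHUP);
    sigaddset(&ctrlCLikeSignals, SIGINT);
    sigaddset(&ctrlCLikeSignals, SIGTERM);
    if (pthread_sigmask(SIG_BLOCK, &ctrlCLikeSignals, nullptr) != 0)
    {
        return false;
    }
    stopRequested.store(false);
    return pthread_create(&waiterThread, nullptr, sigwaitLoop, nullptr) == 0;
}

// Stands in for pthread_cancel: set the flag, wake the thread, then join it.
bool stopWaiter()
{
    stopRequested.store(true);
    if (pthread_kill(waiterThread, SIGTERM) != 0)
    {
        return false;
    }
    return pthread_join(waiterThread, nullptr) == 0;
}
```

In this sketch a destructor would call stopWaiter() where it previously called pthread_cancel. Because the flag is set before the signal is sent, the waiter thread sees it as soon as sigwait returns and leaves through an ordinary return, so no cancellation-driven unwinding takes place.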
  • Workaround

    Hi Bernard,

    I would be happy to try this workaround.

    Another question: is it possible that this has never worked 100% safely under Solaris? I ask because we started patching everything (OS, compiler, libraries, ICE, boost, etc.) while hunting an even more mysterious bug. We make heavy use of smart pointers (boost::scoped_ptr), and the profiler shows memory leaks at program exit: memory blocks that should have been deleted by the smart pointers.

    A wild (stupid?) guess: is it possible that this signal cancels other threads while they are running through the smart pointers' destructors, without the program dumping core on an older OS patch level?

    Thanks for any help,
    Markus
  • bernard
    bernard Jupiter, FL
    Hi Markus,

    Ice uses thread cancellation only for this "sigwait" thread, and there is no possible memory leak in this code (the thread is cancelled while blocked on sigwait, not while it runs the callback).

    Naturally, using thread cancellation on a thread that has allocated various things is a bad idea and could result in leaks or even deadlocks (if the thread has some mutex locked). However, I doubt you use thread cancellation in your own code.

    Overall, it's unlikely that an OS bug causes memory leaks in your C++ code ... it's more likely a bug in your code, and otherwise in the C++ compiler.

    Best regards,
    Bernard
  • bernard
    bernard Jupiter, FL
    Of course, if previously the program exited silently (and apparently successfully) during this thread cancellation, you would get some leaks since the CtrlCHandler may not be the last object destroyed.

    This would be a very strange bug!

    Cheers,
    Bernard
  • Hi Bernard,

    - Naturally, using thread cancellation on a thread that has allocated various
    - things is a bad idea and could result in leaks or even deadlocks (if the
    - thread has some mutex locked). However, I doubt you use thread
    - cancellation in your own code.

    Exactly.

    - Overall, it's unlikely that an OS bug causes memory leaks in your
    - C++ code ... it's more likely a bug in your code, and otherwise in
    - the C++ compiler.

    I thought you would say that. :cool:

    Best regards,
    Markus
  • dbx session

    Hi Bernard,

    I thought about what you wrote. I started a dbx session and set a breakpoint into the destructor of boost::scoped_ptr.

    Then I shut down the application.

    t@1 (l@1) stopped in boost::scoped_ptr<de::scmb::bm::core::Instance>::~scoped_ptr at line 77 in file "scoped_ptr.hpp"
    77 boost::checked_delete(ptr);
    (dbx) cont
    t@1 (l@1) stopped in boost::scoped_ptr<de::scmb::bm::core::CommandObjectFactory>::~scoped_ptr at line 77 in file "scoped_ptr.hpp"
    77 boost::checked_delete(ptr);
    (dbx) cont
    t@2 (l@2) signal SEGV (no mapping at the fault address) in _dlamd64getunwind at 0xfffffd7fff3dd521
    0xfffffd7fff3dd521: _dlamd64getunwind+0x0061: movq 0x00000000000000d0(%r14),%rdi
    Current function is sigwaitThread
    126 int rc = sigwait(&ctrlCLikeSignals, &signal);
    (dbx) threads
    t@1 a l@1 ?() running in __lwp_wait()
    *> t@2 a l@2 sigwaitThread() signal SIGSEGV in _dlamd64getunwind()


    Thread 1 is starting the chain of destructors.
    Thread 2 is getting the SIGCANCEL

    I don't understand where this SIGCANCEL comes from at this point. My application is a daemon derived from Ice::Service. I have overridden the method "bool Ice::Service::stop()" to shut down my server instance. Can the signal be raised before stop() returns?

    Oops. Perhaps I have found the other bug now... :eek:

    Markus
  • bernard
    bernard Jupiter, FL
    Hi Markus,

    Ice::Service creates a CtrlCHandler and destroys it in its destructor (the CtrlCHandler calls pthread_cancel on the sigwait thread).

    So you should see this CANCEL signal when the last Ptr on Service is released.

    I'll post a patch a little later (after some testing) that removes this thread cancellation in CtrlCHandler.

    Cheers,
    Bernard
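
For readers unfamiliar with the mechanism, the cancellation described above can be sketched as follows. This is an illustration only, with hypothetical names (waitedSignals, blockedInSigwait, cancelSigwaitThread), and it is not taken from the Ice sources. It relies on sigwait being a cancellation point under POSIX, so a pending pthread_cancel request takes effect while the thread is blocked there.

```cpp
#include <pthread.h>
#include <signal.h>

// Hypothetical names; a sketch of the mechanism, not Ice's source.
static sigset_t waitedSignals;

// Blocks in sigwait(), which POSIX defines as a cancellation point.
static void* blockedInSigwait(void*)
{
    for (;;)
    {
        int sig = 0;
        sigwait(&waitedSignals, &sig);
    }
}

// Starts the thread, cancels it, and reports whether it ended by cancellation.
bool cancelSigwaitThread()
{
    sigemptyset(&waitedSignals);
    sigaddset(&waitedSignals, SIGINT);
    if (pthread_sigmask(SIG_BLOCK, &waitedSignals, nullptr) != 0)
    {
        return false;
    }

    pthread_t thread;
    if (pthread_create(&thread, nullptr, blockedInSigwait, nullptr) != 0)
    {
        return false;
    }

    // Deferred cancellation (the default) acts at the next cancellation point.
    if (pthread_cancel(thread) != 0)
    {
        return false;
    }

    void* result = nullptr;
    if (pthread_join(thread, &result) != 0)
    {
        return false;
    }
    return result == PTHREAD_CANCELED;
}
```

On a system where cancellation works, cancelSigwaitThread() is expected to return true. In the dbx trace at the top of the thread, the cancellation request shows up as SIGCANCEL, and the SEGV is raised while the cancelled thread is being unwound (_thrp_unwind).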
  • bernard
    bernard Jupiter, FL
    Hi Markus,

    I attached a CtrlCHandler.cpp that does not use thread cancellation. It's a binary compatible change so you just need to rebuild IceUtil.

    This CtrlCHandler.cpp is derived from Ice 3.2.0 (and tested only with 3.2.0) but it probably works with 3.1.x as well.

    Cheers,
    Bernard
  • Hi Bernard,

    thank you very, very much for this workaround. :D

    Our programs no longer dump core at exit. But now there's another small drawback: if I run an Ice application in dbx, it does not exit at all.


    -bash-3.00$ dbx build/Ice-3.1.0/test/IceUtil/ctrlCHandler/client
    For information about new features see `help changes'
    To remove this message, put `dbxenv suppress_startup_message 7.6' in your .dbxrc
    Reading client
    Reading ld.so.1
    Reading libIceUtil.so.3.1.0
    Reading libpthread.so.1
    Reading libstlport.so.1
    Reading librt.so.1
    Reading libCrun.so.1
    Reading libm.so.2
    Reading libthread.so.1
    Reading libc.so.1
    Reading libaio.so.1
    Reading libmd5.so.1
    (dbx) run
    Running: client
    (process id 24100)
    First ignore CTRL+C and the like for 10 seconds (try it!)
    Then handling them for another 30 seconds (try it)
    t@2 (l@2) signal TERM (Terminated) in ___sigtimedwait at 0xfffffd7fff06d7ea
    0xfffffd7fff06d7ea: ___sigtimedwait+0x000a: jb __cerror [ 0xfffffd7ffefe8800, .-0x84fea ]
    Current function is sigwaitThread
    139 int rc = sigwait(&ctrlCLikeSignals, &signal);
    (dbx) cont
    ^Cdbx: warning: Interrupt ignored but forwarded to child.
    t@2 (l@2) signal INT (Interrupt) in ___sigtimedwait at 0xfffffd7fff06d7ea
    0xfffffd7fff06d7ea: ___sigtimedwait+0x000a: jb __cerror [ 0xfffffd7ffefe8800, .-0x84fea ]
    Current function is sigwaitThread
    139 int rc = sigwait(&ctrlCLikeSignals, &signal);
    (dbx) kill


    Do I have to use a special dbx command to get past this, or can it be fixed in CtrlCHandler?

    Thanks for your help,
    Markus
  • bernard
    bernard Jupiter, FL
    Hi Markus,

    I tried with dbx on Solaris / SPARC and found two work-arounds:
    - in dbx, you can ignore 'SIGTERM' with
    (dbx) ignore SIGTERM
    This way SIGTERM is properly sent to the sigwait thread when running in dbx

    - or you can update CtrlCHandler.cpp to send a SIGHUP (instead of SIGTERM) to kill the sigwait thread. It looks like dbx does not catch SIGHUP.

    Best regards,
    Bernard
  • PERFECT!!!

    Once again, thank you, thank you, thank you.

    Markus
  • Solaris patch 124923-03 broken ???

    @Bernard: Thanks for your time and passion.

    @all: The Solaris patch "124923-03 SunOS 5.10_x86: ld.so.1 patch" seems to be broken (at least on my machine). After removing it with "patchrm 124923-03" and fully recompiling all libraries, the error goes away. The patched CtrlCHandler is no longer needed.

    Best regards,
    Markus