Bus Error under Mac OX 10.4 and IcePy 3.3.0

joshmoore · August 2008

(The following also applies to Ice 3.2.1)

We have the following setup:

Java server running inside of IceGrid, behind Glacier2
Trivial python client accessing a Glacier2 session

The client logs in and can interact with the server normally. When trying to acquire one particular servant (of many) via the session facade, the client segfaults and "Bus error" is printed.

The gdb output for the client is:

(gdb) run buserror.py
Starting program: /opt/local/bin/python buserror.py
Reading symbols for shared libraries . done

Program received signal SIGTRAP, Trace/breakpoint trap.
0x8fe01010 in __dyld__dyld_start ()
(gdb) c
Continuing.
Reading symbols for shared libraries ... done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries . done
Reading symbols for shared libraries ...... done
08/12/08 16:58:37.966 warning: deprecated property: Ice.MonitorConnections
Now

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000000
0x007224e9 in IcePy::SyncTypedInvocation::invoke ()
(gdb) bt
#0  0x007224e9 in IcePy::SyncTypedInvocation::invoke ()
#1  0x0071b6fb in operationInvoke ()
#2  0x0027fa7e in PyEval_EvalFrame (f=0x6083f0) at Python/ceval.c:3568
#3  0x00280619 in PyEval_EvalCodeEx (co=0x13a6620, globals=0x5b14b0, locals=0x0, args=0x608cfc, argcount=1, kws=0x608d00, kwcount=0, defs=0x13f263c, defcount=1, closure=0x0) at Python/ceval.c:2741
#4  0x0027e453 in PyEval_EvalFrame (f=0x608bb0) at Python/ceval.c:3661
#5  0x00280619 in PyEval_EvalCodeEx (co=0x4fe60, globals=0x20a50, locals=0x20a50, args=0x0, argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at Python/ceval.c:2741
#6  0x00280859 in PyEval_EvalCode (co=0x4fe60, globals=0x20a50, locals=0x20a50) at Python/ceval.c:484
#7  0x002a75b1 in PyRun_FileExFlags (fp=0xa000bda0, filename=0xbfffd8f3 "buserror.py", start=257, globals=0x20a50, locals=0x20a50, closeit=1, flags=0xbfffd608) at Python/pythonrun.c:1287
#8  0x002a7894 in PyRun_SimpleFileExFlags (fp=0xa000bda0, filename=0xbfffd8f3 "buserror.py", closeit=1, flags=0xbfffd608) at Python/pythonrun.c:871
#9  0x002b13dc in Py_Main (argc=1, argv=0xbfffd688) at Modules/main.c:493
#10 0x000018ee in ?? ()
#11 0x00001815 in ?? ()
(gdb)

The client code is:

import omero
c = omero.client()
s = c.createSession()
print "Now"
s.getScriptService()

The bus error is triggered at line 263 in IceInternal.ServantManager.java. (connection.sendResponse) and no exception shows up in the server.

An interesting corollary is that the log file for the server is halted (for lack of a better term) on the first bus error. I.e. our log statement "Creating servant" is printed, and afterwards nothing else even though further interactions are possible with the server.

benoit · August 2008

Hi,

Note that we no longer support Mac OS X 10.4 for Ice 3.3.0. Which python version do you use and did you build python yourself? Did you try running the IcePy tests to see if they worked?

The best would be to try on Mac OS X 10.5 using python installed with the system and see if you can reproduce the problem.

Cheers,
Benoit.

joshmoore · August 2008

Comparison to Java

Similar code in Java hangs but does not crash.

public class buserror {
    public static void main(String[] args) throws Exception {
        omero.client c = new omero.client();
        omero.api.ServiceFactoryPrx s = c.createSession();
        System.out.println("first");
        s.getQueryService();
        System.out.println("now");
        s.getScriptService();
    }
}

produces:

first
now

Sending kill -QUIT to the process returns:

Full thread dump Java HotSpot(TM) Client VM (1.5.0_13-121 mixed mode, sharing):

"DestroyJavaVM" prio=5 tid=0x00501320 nid=0x1803800 waiting on condition [0x00000000..0xb0800000]

"Ice.SelectorThread" prio=5 tid=0x00511b70 nid=0x1969000 runnable [0xb0e8d000..0xb0e8dd10]
        at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
        at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:118)
        at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:69)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
        - locked <0x266200d0> (a sun.nio.ch.Util$1)
        - locked <0x266200e0> (a java.util.Collections$UnmodifiableSet)
        - locked <0x26620088> (a sun.nio.ch.KQueueSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
        at IceInternal.Selector.select(Selector.java:139)
        at IceInternal.SelectorThread.run(SelectorThread.java:120)
        at IceInternal.SelectorThread$HelperThread.run(SelectorThread.java:273)

"Ice.ThreadPool.Client-0" prio=5 tid=0x00511860 nid=0x1968c00 runnable [0xb0e0c000..0xb0e0cd10]
        at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
        at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:118)
        at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:69)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
        - locked <0x26620228> (a sun.nio.ch.Util$1)
        - locked <0x26620238> (a java.util.Collections$UnmodifiableSet)
        - locked <0x266201e0> (a sun.nio.ch.KQueueSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
        at IceInternal.Selector.select(Selector.java:139)
        at IceInternal.ThreadPool.run(ThreadPool.java:335)
        at IceInternal.ThreadPool.access$100(ThreadPool.java:12)
        at IceInternal.ThreadPool$EventHandlerThread.run(ThreadPool.java:951)

"Ice.EndpointHostResolverThread" prio=5 tid=0x00511230 nid=0x199ba00 in Object.wait() [0xb0d8b000..0xb0d8bd10]
        at java.lang.Object.wait(Native Method)
        - waiting on <0x26620390> (a IceInternal.EndpointHostResolver)
        at java.lang.Object.wait(Object.java:474)
        at IceInternal.EndpointHostResolver.run(EndpointHostResolver.java:77)
        - locked <0x26620390> (a IceInternal.EndpointHostResolver)
        at IceInternal.EndpointHostResolver$HelperThread.run(EndpointHostResolver.java:146)

"Ice.ThreadPool.Server-0" prio=5 tid=0x005108c0 nid=0x199be00 runnable [0xb0d0a000..0xb0d0ad10]
        at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
        at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:118)
        at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:69)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
        - locked <0x26620478> (a sun.nio.ch.Util$1)
        - locked <0x26620488> (a java.util.Collections$UnmodifiableSet)
        - locked <0x26620430> (a sun.nio.ch.KQueueSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
        at IceInternal.Selector.select(Selector.java:139)
        at IceInternal.ThreadPool.run(ThreadPool.java:335)
        at IceInternal.ThreadPool.access$100(ThreadPool.java:12)
        at IceInternal.ThreadPool$EventHandlerThread.run(ThreadPool.java:951)

"Ice.Timer" prio=5 tid=0x0050fa70 nid=0x198ec00 in Object.wait() [0xb0c89000..0xb0c89d10]
        at java.lang.Object.wait(Native Method)
        - waiting on <0x26b1fd98> (a IceInternal.Timer)
        at IceInternal.Timer.run(Timer.java:217)
        - locked <0x26b1fd98> (a IceInternal.Timer)

"Low Memory Detector" daemon prio=5 tid=0x00508f60 nid=0x1816800 runnable [0x00000000..0x00000000]

"CompilerThread0" daemon prio=9 tid=0x00508560 nid=0x1816400 waiting on condition [0x00000000..0xb0b06748]

"Signal Dispatcher" daemon prio=9 tid=0x00508090 nid=0x1813200 waiting on condition [0x00000000..0x00000000]

"Finalizer" daemon prio=8 tid=0x00507830 nid=0x1811600 in Object.wait() [0xb0a04000..0xb0a04d10]
        at java.lang.Object.wait(Native Method)
        - waiting on <0x26a949e0> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
        - locked <0x26a949e0> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x00507480 nid=0x1810800 in Object.wait() [0xb0983000..0xb0983d10]
        at java.lang.Object.wait(Native Method)
        - waiting on <0x26a94a60> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:474)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked <0x26a94a60> (a java.lang.ref.Reference$Lock)

"VM Thread" prio=9 tid=0x00506c00 nid=0x1810400 runnable 

"VM Periodic Task Thread" prio=9 tid=0x00509b30 nid=0x1816c00 waiting on condition 

"Exception Catcher Thread" prio=10 tid=0x005015a0 nid=0x1803c00 runnable

joshmoore · August 2008

Reproduced under Ubuntu/Hardy with Ice 3.2.1

Using the standard packages under hardy (currently 3.2.1) the Python example above segfaults.

joshmoore · August 2008

Sorry, our posts crossed each other, Benoit.

benoit wrote: »

Note that we no longer support Mac OS X 10.4 for Ice 3.3.0. Which python version do you use and did you build python yourself? Did you try running the IcePy tests to see if they worked?

This is Python 2.4.5 from MacPorts. I did not try to run the tests, since I'm also using the Ice port which has always worked admirably. But I will do that now.

The best would be to try on Mac OS X 10.5 using python installed with the system and see if you can reproduce the problem.

Unfortunately that's not an option in the immediate term. I might should add that this setup was working under 3.2.1 on 10.4 and only started to segfault when I switched the server to AMD (This particular method, however, is not AMD, making it possibly unrelated.) The switch to 3.3.0 made no difference.

joshmoore · August 2008

python allTests.py passes using the 3.3.0 ports under 10.4.

benoit · August 2008

Hi Josh,

I suspect the Java client hangs because you don't destroy the communicator (and therefore threads created by the communicator aren't terminated and the JVM hangs on exit waiting for these threads to be gone). I'm afraid it's difficult to know what could be the problem with so little information. The fact that you changed how the server is implemented shouldn't have any effects on your Python client.

Could you perhaps narrow down the crash to a small self contained test case that we could use to try to reproduce it?

Btw, are you running this code on an Intel platform? Did you try setting the DYLD_BIND_AT_LAUNCH=1 environment variable to see it made a difference (we used to experience random crashes on 10.4 without this variable set in the environment).

Cheers,
Benoit.

joshmoore · August 2008

benoit wrote: »

I suspect the Java client hangs because you don't destroy the communicator (and therefore threads created by the communicator aren't terminated and the JVM hangs on exit waiting for these threads to be gone).

Apologies, you're of course right. In my attempt to make a minimal code sample, the shutdown was left out. The Java code does not hang.

benoit wrote: »

I'm afraid it's difficult to know what could be the problem with so little information. The fact that you changed how the server is implemented shouldn't have any effects on your Python client.

I definitely agree that it shouldn't make a difference, but it's so odd that other facade methods should work normally. The only other substantial interface difference is that the script service is forward declared (without ["ami","amd"] while the definition is in another file (with ["amd","ami"]). Am I grasping at straws?

benoit wrote: »

Could you perhaps narrow down the crash to a small self contained test case that we could use to try to reproduce it?

I'm certainly working on it, but as always, there are a terrible number of moving parts. Finding the one that has quite abruptly started causing a segfault isn't proving easy.

benoit wrote: »

Btw, are you running this code on an Intel platform? Did you try setting the DYLD_BIND_AT_LAUNCH=1 environment variable to see it made a difference (we used to experience random crashes on 10.4 without this variable set in the environment).

I had not tried DYLD_BIND_AT_LAUNCH, but it made no difference.

Thanks,
~J.

joshmoore · August 2008

By the way, here's the Ubuntu/Hardy backtrace:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f30e04986e0 (LWP 22916)]
0x00007f30dee2281a in IcePy::OperationI::invoke (this=0x8c76c0, proxy=@0x7fffe84b8c20, args=0x7f30e0458050, pyctx=0x702580) at Operation.cpp:570
570                         Py_INCREF(ret);
(gdb) bt
#0  0x00007f30dee2281a in IcePy::OperationI::invoke (this=0x8c76c0, proxy=@0x7fffe84b8c20, args=0x7f30e0458050, pyctx=0x702580) at Operation.cpp:570
#1  0x00007f30dee1d71c in operationInvoke (self=0x7f30dd7342d0, args=0x7f30e0436a50) at Operation.cpp:206
#2  0x0000000000477146 in PyEval_EvalFrame ()
#3  0x00000000004784f5 in PyEval_EvalCodeEx ()
#4  0x000000000047780d in PyEval_EvalFrame ()
#5  0x00000000004784f5 in PyEval_EvalCodeEx ()
#6  0x0000000000478622 in PyEval_EvalCode ()
#7  0x000000000049ccca in PyRun_FileExFlags ()
#8  0x000000000049cf19 in PyRun_SimpleFileExFlags ()
#9  0x00000000004125c9 in Py_Main ()
#10 0x00007f30df6b71c4 in __libc_start_main () from /lib/libc.so.6
#11 0x0000000000411ae9 in _start ()
(gdb)

benoit · August 2008

Hi,

It crashes when the getScriptService operation is invoked, is this correct? It looks like the crash occurs when un-marshalling the response from the server. What is the signature of this Slice operation?

Cheers,
Benoit.

joshmoore · August 2008

Correct.

        interface ServiceFactory extends Glacier2::Session
        {
            ...
            IScript*   getScriptService() throws ServerError;

with the ... all of the same type:

            IAdmin*    getAdminService() throws ServerError;
            IConfig*   getConfigService() throws ServerError;
            IPixels*   getPixelsService() throws ServerError;
            IPojos*    getPojosService() throws ServerError;
            IQuery*    getQueryService() throws ServerError;
            // etc.

all of which function normally. The implementation (just for reference) is:

   public IScriptPrx getScriptService(Ice.Current current) throws ServerError {
        Ice.Identity id = getIdentity(current, SCRIPTSERVICE.value);
        Ice.Object servant = current.adapter.find(id);
        if (servant == null) {
            servant = (Ice.Object) this.context.getBean(SCRIPTSERVICE.value);
            if (servant instanceof Ice.TieBase) {
                Ice.TieBase tie = (Ice.TieBase) servant;
                Object obj = tie.ice_delegate();
                if (obj instanceof ServiceFactoryAware) {
                    ((ServiceFactoryAware) obj).setServiceFactory(this);
                }
            }
            current.adapter.add(servant, id);
        }
        Ice.ObjectPrx prx = current.adapter.createProxy(id);
        return IScriptPrxHelper.checkedCast(prx);
    }

benoit · August 2008

Hi Josh,

I believe I was able to reproduce. The problem shows up if the client doesn't load the Slice definition for the proxy type returned by the invocation.

Do you forward declare the IScript interface in the definition of the ServiceFactory interface and don't load the definition of the IScript interface in the python client? If that's the case, I believe loading the Slice file where the IScript interface is defined should solve the issue.

Cheers,
Benoit.

joshmoore · August 2008

You're right. I had wondered about this after my last post, but unfortunately hadn't been able to test it during lunch. Thanks for being faster.

Do you have an idea why the server log would freeze when the client segfaults, and if there is a possible workaround?

benoit · August 2008

Could you detail a little more how logging works in your Java server? Is this just straight logging on the standard output? Can the server still process incoming requests after this "log freeze" occurs?

Cheers,
Benoit.

joshmoore · August 2008

Hi Benoit,

The Java server logs via commons-logging/log4j to standard out which is caught by icegridnode and printed to var/log/master.out:

#IceGrid.Node.Output=var/log
#IceGrid.Node.RedirectErrToOut=1

# Defines that the standard out from this node and all
# contained servers will be saved in the following files.
# If Output is set, then output from the servers will be
# redirected _after_ Ice.Communicator creation.
Ice.StdOut=var/log/master.out
Ice.StdErr=var/log/master.err

The only other possibly related tidbit is that the Java process is started via python's os.execvpe (where icegridnode runs "python ...").

And yes, the server still behaves completely normally except for stdout. The only way to "restart" the log that I've found is restarting the server.

benoit · August 2008

I don't see why this problem in the client would cause log4j to stop working. On the server side, the only thing that might happen when the client crashes is a connection warning. This warning is printed with the communicator logger which by default prints to stderr. This warning only shows up if you have Ice.Warn.Connections set to 1. If that's the case you could try disabling it to see if it makes a difference.

Cheers,
Benoit.

joshmoore · August 2008

The log was seemingly a red-herring. Due to a 3.3.0 change (the removal of Ice.stringToIdentity), a python server component was failing. Fixing that corrected the Java log, which still doesn't make sense but it is comforting that it had nothing to do with a client segfault.

If the log issue occurs again, I'll open another bug thread.

Otherwise, thanks for all the help.
Cheers,
~Josh.

Archived

Bus Error under Mac OX 10.4 and IcePy 3.3.0

Comments

Categories