Ice::Communicator not initiating a connection

afmafm Member Andreas MüllerOrganization: Hochschule RapperswilProject: Distributed objects for Astrophotography

I'm lost with the following problem. My application (a server process written in C++ running on ARM based Linux computers controlling astrophotography equipment) uses multiple servers, one of which is contacted by a GUI client. This server then coordinates access to objects on other servers and controls the complete astrophotography workflow. This works very nicely with Ice-3.7.1 on Mac OS X. However, on Ubuntu 16.04, the server never performs any network related activity but simply blocks in checkedCast. If I replace checkedCast by uncheckedCast, it simply blocks in the next call that requires a connection to be set up. I looks like the Ice::Communicator does not want to initiate a connection.

Some more details:

  1. I have made sure that the server thread pool is not limited to a single thread, but this did not change anything.
  2. The server seems to be able to talk back to the GUI client (some monitoring functions use callbacks to deliver server status changes to the GUI client).
  3. The same connect string "Modules:default -h mount.local. -p 10000" works when used in a standalone program on the Linux system, i.e. the problem shows up only in the context of the server process.
  4. The host name "mount.local." can be resolved, pinged and sshd to.
  5. Tracing the activities of the server process with strace shows that it never does any network related functions: no name resolution, it does not create a socket, it does not try to connect. It is just waiting in some futex system call presumably for some lock or some system activity to complete, but I have no idea which activity that could be.

So: what can I do to
a) understand what causes the call to checkCast to fail
b) fix the problem hopefully?

Tagged:

Answers

  • xdmxdm La Coruña, SpainAdministrators, ZeroC Staff Jose Gutierrez de la ConchaOrganization: ZeroC, Inc.Project: Ice Developer ZeroC Staff

    Hi Andreas,

    Can you provide a thread dump of the server and the output with Ice.Trace.Network=3?

  • afmafm Member Andreas MüllerOrganization: Hochschule RapperswilProject: Distributed objects for Astrophotography

    How do I get a thread dump?

  • xdmxdm La Coruña, SpainAdministrators, ZeroC Staff Jose Gutierrez de la ConchaOrganization: ZeroC, Inc.Project: Ice Developer ZeroC Staff

    You can use gdb to get a thread dump, once the process is running and get to the point that you think is blocked and doesn't make any progress do:

    gdb
    attach <pid>
    thread apply all bt
    
  • afmafm Member Andreas MüllerOrganization: Hochschule RapperswilProject: Distributed objects for Astrophotography

    Thanks, I'll be on site tomorrow and will be able to get that. Thanks for your help.

  • afmafm Member Andreas MüllerOrganization: Hochschule RapperswilProject: Distributed objects for Astrophotography
    edited October 2018

    It took a little longer to analyze the problem. Here are the results.

    There are two server instances, both using ice-3.7.1, one on a machine named mount.local on Linux ARM (Ubuntu 16.04), and one on a machine called margaret.local on macos 10.13. A client connects to mount.local and expects that the service to connect to the other server, collect some data, consolidate it and return it to the caller. However, the server on mount.local seems not to be willing to connect to the other server. Turning the roles around seems to make the thing work.

    The full thread dump is attached as file threaddump.txt, the ice debug messages are attached as icedebug.txt. Here are the (I believe) relevant portions.

    The server stalls when calling the getDevicelist method, which the ICE debug output shows as arriving:

        -- 10/22/18 21:45:57.972 /usr/local/astro/sbin/snowstar: Protocol: received request
       message type = 0 (request)
       compression status = 0 (not compressed; do not compress response, if any)
       message size = 61
       request id = 8
       identity = devicelocator/nice
       facet =
       operation = getDevicelist
       mode = 0 (normal)
       context =
       encoding = 1.1
    

    Here is the code inside the getDevicelist implementation method where the program stalls (getDevicelist first calls getModules() which contains this code fragment, both show up in the thread stack below):

                    debug(LOG_DEBUG, DEBUG_LOG, 0, "connect string: '%s'", connectstring.c_str());
                    Ice::ObjectPrx  base = ic->stringToProxy(connectstring);
                    debug(LOG_DEBUG, DEBUG_LOG, 0, "connecting to Modules: %p", base.get());
                    mprx = snowstar::ModulesPrx::checkedCast(base);
                    debug(LOG_DEBUG, DEBUG_LOG, 0, "got modules proxy");
    

    The debug log shows the message immediately before the checkedCast call. The connect string is

        Oct 22 21:45:58.079 snowstar[4363/4] NiceLocator.cpp:143: connect string: 'Modules:default -h margaret.local -p 10000'
        Oct 22 21:45:58.079 snowstar[4363/4] NiceLocator.cpp:146: connecting to Modules: 0xb0600d50
    

    and the hostname margaret.local is perfectly resolvable, in fact, from some simple test program it works perfectly. The thread dump also shows that the program is stuck in the checkedCast method, here is part of the relevant thread stack:

    Thread 5 (Thread 0xb2c302f0 (LWP 4414)):
    #0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
    #1  0xb665adba in __pthread_cond_wait (cond=0xb0600d98, mutex=0xb0600dc8)
        at pthread_cond_wait.c:186
    #2  0xb6818d36 in void IceUtil::Cond::waitImpl<IceUtil::Mutex>(IceUtil::Mutex const&) const () from /usr/local/Ice/lib/arm-linux-gnueabihf/libIce.so.37
    #3  0xb6815b66 in IceUtil::Monitor<IceUtil::Mutex>::wait() const ()
       from /usr/local/Ice/lib/arm-linux-gnueabihf/libIce.so.37
    #4  0xb68a1454 in IceInternal::OutgoingAsyncBase::_waitForResponse() ()
       from /usr/local/Ice/lib/arm-linux-gnueabihf/libIce.so.37
    #5  0xb68e2c64 in IceProxy::Ice::Object::end_ice_isA(IceInternal::Handle<Ice::AsyncResult> const&) () from /usr/local/Ice/lib/arm-linux-gnueabihf/libIce.so.37
    #6  0xb0b659c6 in IceProxy::Ice::Object::ice_isA (
        context=std::map with 0 elements, typeId=..., this=0xb0600d50)
        at /usr/local/Ice/include/Ice/Proxy.h:1758
    #7  IceInternal::checkedCastImpl<IceInternal::ProxyHandle<IceProxy::snowstar::Modules> > (b=..., context=std::map with 0 elements)
        at /usr/local/Ice/include/Ice/Proxy.h:3264
    #8  0xb0b62a74 in IceInternal::checkedCastHelper<IceProxy::snowstar::Modules, IceProxy::Ice::Object> (ctx=..., b=...)
        at /usr/local/Ice/include/Ice/ProxyHandle.h:103
    #9  IceInternal::ProxyHandle<IceProxy::snowstar::Modules>::checkedCast<IceProxy::Ice::Object> (ctx=..., r=...) at /usr/local/Ice/include/Ice/ProxyHandle.h:268
    #10 astro::camera::nice::NiceLocator::getModules ([email protected]=0xb0600808,
        key=...) at NiceLocator.cpp:148
    #11 0xb0b63a5c in astro::camera::nice::NiceLocator::getDevicelist[abi:cxx11](astro::DeviceName::device_type, astro::discover::ServiceKey const&) (
        [email protected]=0xb0600808, device=astro::DeviceName::Camera, key=...)
        at NiceLocator.cpp:276
    #12 0xb0b63f24 in astro::camera::nice::NiceLocator::getDevicelist[abi:cxx11](astro::DeviceName::device_type) (this=0xb0600808,
        device=astro::DeviceName::Camera) at NiceLocator.cpp:327
    #13 0x00061100 in snowstar::DeviceLocatorI::getDevicelist[abi:cxx11](snowstar::devicetype, Ice::Current const&) (this=<optimized out>,
        type=snowstar::DevCAMERA) at DeviceLocatorI.cpp:22
    

    stack frame 11 is the getDevicelist implementation, #10 is the getModules method and #9 is the checkedCast call that is stalling.

    It should also be noted that although the thread dump suggests that the Ice library is waiting for a response from the server, there actually is no connection to the server at this point (netstat shows no TCP connection). One would also expect for such a connection to show up in the ICE debug output.

  • afmafm Member Andreas MüllerOrganization: Hochschule RapperswilProject: Distributed objects for Astrophotography

    Here are the files I promised to attach

  • afmafm Member Andreas MüllerOrganization: Hochschule RapperswilProject: Distributed objects for Astrophotography

    Here are the properties I was using:

    Ice.ThreadPool.Server.SizeMax 30
    Ice.ThreadPool.Client.SizeMax 30
    Ice.ThreadPool.Trace 1
    Ice.Trace.Network 3
    Ice.Trace.Locator 1
    Ice.Trace.Protocol 1
    
  • xdmxdm La Coruña, SpainAdministrators, ZeroC Staff Jose Gutierrez de la ConchaOrganization: ZeroC, Inc.Project: Ice Developer ZeroC Staff

    Hi Andreas,

    Did you build Ice for C++ 98 mapping with -std=c++11, we think this could be caused by build your application in C++11 mode and use the Ice run-time build in C++98 mode, see https://doc.zeroc.com/ice/3.7/release-notes/known-issues-and-platform-notes#id-.KnownIssuesandPlatformNotesv3.7-C++98Mappingand-stdoptiononLinuxARM

    A better long term solution will be to write your application using the C++11 mapping.

  • benoitbenoit Rennes, FranceAdministrators, ZeroC Staff Benoit FoucherOrganization: ZeroC, Inc.Project: Ice ZeroC Staff

    Also, according to your stack traces, tracing and given that there is no trace about the connection attempt to margaret.local, it would indicate that Ice is waiting for the resolution of the margaret.local name.

    Ice uses a dedicated thread to resolve hostnames but it looks like this thread vanished because it's not in the thread dump you provided (another thread is missing as well, the Ice timer thread). In theory, this can't happen even if an unexpected error occurs in a thread because in such a case the whole program should be terminated.

    The first thing you should check is the build options like José mentioned above. If this isn't the problem, we'll need to figure out why these threads disappear (something in your program kills the threads? you could check on server startup if these threads are present).

  • afmafm Member Andreas MüllerOrganization: Hochschule RapperswilProject: Distributed objects for Astrophotography

    Thanks for your suggestions, I'll be back at the telescope in about a week (we currently have bad weather and the full moon doesn't help either) and will then try to implement them.

  • afmafm Member Andreas MüllerOrganization: Hochschule RapperswilProject: Distributed objects for Astrophotography

    Ok, I recompiled Ice after manually cleaning the source tree, and now it works perfectly. Thanks very much for your help.

Sign In or Register to comment.