
BadMagicException on single server

Hi all!

1. We are using Ice 3.4.2 binaries on a cluster of Windows Server 2008 R2 Enterprise machines. We have one server App with 16 worker threads, each transferring files synchronously to Clients.
This produces 100% load on the 1 Gigabit adapter (as we see in Task Manager). Every time we run our scenario, we see these errors in the log:

-! 10/26/11 11:20:42.139 DfsService.server.1: warning: connection exception:
ConnectionI.cpp:1250: Ice::BadMagicException:
unknown magic number: 0x05, 0x83, 0x11, 0x4e
local address = 10.65.60.91:50505
remote address = 10.65.60.91:49239

Notice that the BadMagicException occurs on a single server - both the local and the remote address are on the same host.

We also regularly see other warnings from Ice:

-! 10/26/11 11:23:45.734 DfsService.server.1: warning: connection exception:
TcpTransceiver.cpp:414: Ice::ConnectionLostException:
connection lost: The specified network name is no longer available.
local address = 10.65.60.91:50509
remote address = 10.65.60.92:59634

and

-! 10/26/11 11:38:02.272 DfsService.server.1: warning: connection exception:
Outgoing.cpp:226: Ice::TimeoutException:
timeout while sending or receiving data
local address = 10.65.60.91:51182
remote address = 10.65.60.92:59634

(We overrode the timeout for the server App to 2 minutes via Ice.Override.Timeout.)
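
For reference, that override is a single property in the server's configuration, something like this (the value is in milliseconds):

    Ice.Override.Timeout=120000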

We talked to our admins - they say the hardware is fine, though we think it is not.

Do you have any ideas what could cause these errors? Especially the BadMagicException, which seems very, very strange to us.

2. Another question - we downloaded Ice 3.4.2 binaries for MSVC 2010, while our apps are built with MSVC 2010 SP1 - is this acceptable?

Thanks.

Comments

  • mes
    mes California
    Hi Andy,

    I agree that the BadMagicException is the most serious issue. Does this only happen when the system is running at 100% CPU?

    This exception indicates that Ice has received an invalid packet on one of its connections. The most likely reason is a non-Ice program mistakenly attempting to connect and communicate with an Ice endpoint. It could also be caused by a concurrency bug in Ice.
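
    For illustration only (a simplified sketch, not the actual Ice source): every Ice protocol message begins with a four-byte magic value, the ASCII bytes 'I', 'c', 'e', 'P', and a connection that reads anything else where it expects a message header reports this exception.

        // Simplified illustration of the magic check (not the real Ice code).
        // The "unknown magic number" bytes in the warning are just the first
        // four bytes Ice read where it expected a new message header.
        bool hasValidIceMagic(const unsigned char* header)
        {
            static const unsigned char magic[4] = { 0x49, 0x63, 0x65, 0x50 }; // "IceP"
            return header[0] == magic[0] && header[1] == magic[1] &&
                   header[2] == magic[2] && header[3] == magic[3];
        }
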
    -! 10/26/11 11:20:42.139 DfsService.server.1: warning: connection exception:
    ConnectionI.cpp:1250: Ice::BadMagicException:
    unknown magic number: 0x05, 0x83, 0x11, 0x4e
    local address = 10.65.60.91:50505
    remote address = 10.65.60.91:49239
    The key piece of information here is the remote address. The first step is to verify whether it is an Ice program or a non-Ice program at the remote end. Assuming it's an Ice program, your next step should be to enable the following properties in both programs:

    Ice.Trace.Network=3
    Ice.Trace.Protocol=1

    Review the logs and see if anything grabs your attention, and feel free to post them here and we'll take a look. Also, if you can create a small test case that reproduces the exception, we'll gladly try that too.
    2. Another question - we downloaded Ice 3.4.2 binaries for MSVC 2010, while our apps are built with MSVC 2010 SP1 - is this acceptable?
    I built our C++ test suite with VS 2010 SP1 while linking with the VS 2010 libraries from the 3.4.2 binary distribution. All the tests ran successfully, so I don't expect there will be any compatibility issues.

    Regards,
    Mark
  • Hi Mark, thanks for the answer!
    Does this only happen when the system is running at 100% CPU?
    Not the CPU, but 100% network load. We ran our scenario approx. 30 times - high load was the first symptom we noticed.

    Let me tell more details about our system.
    We are using an IceGrid deployment with 20 nodes. The first node contains two servers, let's call them
    FileSource and DfsService. The other nodes contain one DfsAgent each.
    As I mentioned, DfsService has 16 threads, and each does a very simple thing - it obtains a proxy to a file from FileSource, then reads bytes through this proxy and writes the chunks to some DfsAgent.
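
    In rough, illustrative C++ (the proxy types and signatures are made up; only the operation names 'read' and 'openWrite' match our real traces), each worker does something like this:

        // Hypothetical sketch of one DfsService worker thread; FileProxyPrx and
        // DfsAgentPrx stand in for our real generated proxies.
        void transferOneFile(const FileProxyPrx& source, const DfsAgentPrx& agent)
        {
            agent->openWrite();                      // synchronous call to the agent
            Ice::Long offset = 0;
            while(true)
            {
                // read the next chunk from FileSource through the file proxy
                Ice::ByteSeq chunk = source->read(offset, 512 * 1024);
                if(chunk.empty())
                {
                    break;                           // end of file
                }
                agent->write(chunk);                 // forward the chunk to the agent
                offset += static_cast<Ice::Long>(chunk.size());
            }
        }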

    It turned out that the BadMagicException appears when DfsService reads bytes from FileSource through the file proxy.
    The other exceptions from my first post happen between DfsService and the DfsAgents when writing bytes.

    Another thing - we deployed our system on two independent clusters, and both exhibit this behaviour.

    I'll try to write a small test case ASAP, and will also turn on the Ice traces and review the logs.

    Upd.
    I've found a similar post here: http://www.zeroc.com/forums/help-center/5456-badmagicexception-crash-tcp-lost-data.html, but with no answers :(

    Regards,
    andy
  • It's me!

    I posted "BadMagicException and Crash, Is TCP lost data?" 5 months ago.
    I have used CommView (like Wireshark, but usable on a single computer) to capture some network packets and found more details; if you want the packet capture file, I can upload it sometime.
  • Hi!
    Caught again, here are traces.

    DfsService:
    -- 10/27/11 14:03:29.480 DfsService.server.1: Network: sent 65536 of 65536 bytes via tcp
    local address = 10.65.60.91:57249
    remote address = 10.65.60.71:51148
    -- 10/27/11 14:03:29.480 DfsService.server.1: Network: received 1326 of 1326 bytes via tcp
    local address = 10.65.60.91:57609
    remote address = 10.65.60.91:57130
    -- 10/27/11 14:03:29.480 DfsService.server.1: Network: sent 65536 of 65536 bytes via tcp
    local address = 10.65.60.91:57256
    remote address = 10.65.60.78:65416
    -- 10/27/11 14:03:29.480 DfsService.server.1: Protocol: received reply
    message type = 2 (reply)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 524318
    request id = 199018
    reply status = 0 (ok)
    -- 10/27/11 14:03:29.480 DfsService.server.1: Network: sent 65536 of 65536 bytes via tcp
    local address = 10.65.60.91:57244
    remote address = 10.65.60.87:65036
    -- 10/27/11 14:03:29.481 DfsService.server.1: Network: sent 65536 of 65536 bytes via tcp
    local address = 10.65.60.91:57256
    remote address = 10.65.60.78:65416
    -- 10/27/11 14:03:29.481 DfsService.server.1: Network: received 14 of 14 bytes via tcp
    local address = 10.65.60.91:57609
    remote address = 10.65.60.91:57130
    -! 10/27/11 14:03:29.481 DfsService.server.1: warning: connection exception:
    ConnectionI.cpp:1250: Ice::BadMagicException:
    unknown magic number: 0x10, 0x4f, 0x00, 0x0f
    local address = 10.65.60.91:57609
    remote address = 10.65.60.91:57130

    -- 10/27/11 14:03:29.481 DfsService.server.1: Network: closing tcp connection
    local address = 10.65.60.91:57609
    remote address = 10.65.60.91:57130
    -- 10/27/11 14:03:29.482 DfsService.server.1: Network: sent 65536 of 65536 bytes via tcp
    local address = 10.65.60.91:57249
    remote address = 10.65.60.71:51148
    -- 10/27/11 14:03:29.482 DfsService.server.1: Network: sent 84 of 84 bytes via tcp
    local address = 10.65.60.91:57249
    remote address = 10.65.60.71:51148
    -- 10/27/11 14:03:29.482 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 87
    request id = 23901
    identity = DfsAgent.13
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =


    FileSource:
    -- 10/27/11 14:03:29.479 Coordinator.server.1: Protocol: received request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 199019
    identity = 304d6087-311a-4cd2-9761-4bb97849dd5c
    facet =
    operation = read
    mode = 0 (normal)
    context =
    -- 10/27/11 14:03:29.479 Coordinator.server.1: Network: sent 65536 of 65536 bytes via tcp
    local address = 10.65.60.91:57130
    remote address = 10.65.60.91:57609
    -- 10/27/11 14:03:29.480 Coordinator.server.1: Protocol: sending reply
    message type = 2 (reply)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 524318
    request id = 199010
    reply status = 0 (ok)
    -! 10/27/11 14:03:29.485 Coordinator.server.1: warning: connection exception:
    TcpTransceiver.cpp:341: Ice::ConnectionLostException:
    connection lost: The specified network name is no longer available.
    local address = 10.65.60.91:57130
    remote address = 10.65.60.91:57609

    -- 10/27/11 14:03:29.485 Coordinator.server.1: Network: closing tcp connection
    local address = 10.65.60.91:57130
    remote address = 10.65.60.91:57609
    -- 10/27/11 14:03:29.485 Coordinator.server.1: Network: accepted tcp connection
    local address = 10.65.60.91:57130
    remote address = 10.65.60.91:58377
    -- 10/27/11 14:03:29.485 Coordinator.server.1: Protocol: sending validate connection
    message type = 3 (validate connection)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 14
    -- 10/27/11 14:03:29.485 Coordinator.server.1: Network: sent 14 of 14 bytes via tcp
    local address = 10.65.60.91:57130
    remote address = 10.65.60.91:58377
    -- 10/27/11 14:03:29.486 Coordinator.server.1: Network: received 14 of 14 bytes via tcp
    local address = 10.65.60.91:57130
    remote address = 10.65.60.91:58377
    -- 10/27/11 14:03:29.486 Coordinator.server.1: Network: received 72 of 72 bytes via tcp
    local address = 10.65.60.91:57130
    remote address = 10.65.60.91:58377
    -- 10/27/11 14:03:29.486 Coordinator.server.1: Protocol: received request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1
    identity = 14a071e1-cb2b-45d3-b7f6-bb4b209b3b06
    facet =
    operation = read
    mode = 0 (normal)
    context =


    I'm still trying to make a small test program that reproduces the BadMagicException.

    Regards, andy
  • mes
    mes California
    Andy,

    Thanks for the log output.

    The ConnectionLostException is definitely expected in this situation. The sender receives this exception because the receiver ungracefully closes its end of the connection after detecting the "bad magic" condition.

    Can you check the Windows event log on this host to see whether any network-related events were logged around the time of the BadMagicException?

    Regards,
    Mark
  • mes
    mes California
    Andy,

    One other request: please review the discussion in this thread. Your situation sounds quite similar to the one described there.

    Regards,
    Mark
  • mes wrote: »
    Andy,

    One other request: please review the discussion in this thread. Your situation sounds quite similar to the one described there.

    Regards,
    Mark

    Hi, Mark!
    A funny thing - that post was created by my colleague when we started to develop the first version of our system :) Unfortunately, he has left our team, but we still turn off all TCP offload options on the new servers we configure - so this is not the case :(

    Regards, andy
  • You can try to limit the TCP packet size to less than 65535 bytes.

    Just as an experiment, split one large Ice invocation into multiple smaller invocations.
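
    For example (a rough sketch - the proxy type and the 'write' operation are placeholders for the real ones):

        // Hypothetical: send the payload in pieces below 64 KB instead of in one
        // large Ice invocation. Assumes <algorithm> for std::min and a generated
        // DfsAgentPrx with a write(sequence<byte>) operation.
        void writeInPieces(const DfsAgentPrx& agent, const Ice::ByteSeq& data)
        {
            const std::size_t maxChunk = 60 * 1024; // stay below 65535 bytes
            for(std::size_t pos = 0; pos < data.size(); pos += maxChunk)
            {
                const std::size_t len = std::min(maxChunk, data.size() - pos);
                Ice::ByteSeq piece(data.begin() + pos, data.begin() + pos + len);
                agent->write(piece);
            }
        }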
  • Hi, Mark!
    mes wrote: »
    Can you check the Windows event log on this host to see whether any network-related events were logged around the time of the BadMagicException?

    No, nothing strange in the Event log. The system was running the whole night, and a new error, WSAECONNRESET, appeared in the log:

    -! 10/28/11 04:55:29.600 DfsService.server.1: warning: connection exception:
    ConnectionI.cpp:1250: Ice::BadMagicException:
    unknown magic number: 0x80, 0xfa, 0x00, 0x00
    local address = 10.65.60.91:51352
    remote address = 10.65.60.91:59922
    -! 10/28/11 05:00:53.428 DfsService.server.1: warning: connection exception:
    TcpTransceiver.cpp:414: Ice::ConnectionLostException:
    connection lost: The specified network name is no longer available.
    local address = 10.65.60.91:54077
    remote address = 10.65.60.91:59922
    -! 10/28/11 06:17:34.248 DfsService.server.1: warning: connection exception:
    ConnectionI.cpp:1250: Ice::BadMagicException:
    unknown magic number: 0x85, 0xc0, 0x74, 0x6a
    local address = 10.65.60.91:54165
    remote address = 10.65.60.91:59922
    -! 10/28/11 06:35:00.397 DfsService.server.1: warning: connection exception:
    Outgoing.cpp:226: Ice::TimeoutException:
    timeout while sending or receiving data
    local address = 10.65.60.91:60950
    remote address = 10.65.60.74:53596
    -! 10/28/11 06:35:00.404 DfsService.server.1: warning: connection exception:
    TcpTransceiver.cpp:163: Ice::ConnectionLostException:
    connection lost: WSAECONNRESET
    local address = 10.65.60.91:59972
    remote address = 10.65.81.6:49208
    -! 10/28/11 06:35:00.410 DfsService.server.1: warning: connection exception:
    TcpTransceiver.cpp:163: Ice::ConnectionLostException:
    connection lost: WSAECONNRESET
    0.0.0.0:59972
    -! 10/28/11 07:02:50.869 DfsService.server.1: warning: connection exception:
    ConnectionI.cpp:1250: Ice::BadMagicException:
    unknown magic number: 0xb0, 0x6e, 0x42, 0x48
    local address = 10.65.60.91:55559
    remote address = 10.65.60.91:59922
    -! 10/28/11 07:20:37.572 DfsService.server.1: warning: connection exception:
    Outgoing.cpp:226: Ice::TimeoutException:
    timeout while sending or receiving data
    local address = 10.65.60.91:55553
    remote address = 10.65.60.76:51668


    Maybe it's time for me to start commenting out source code and finally get a small test that reproduces these errors.

    Regards, andy
  • Hi, all!

    Today we turned off all communication between DfsService and the DfsAgents and left only DfsService and FileSource (aka Coordinator); they are on the same host. And we got another error in the logs!

    FileSource (aka Coordinator):

    -! 10/28/11 15:32:16.786 Coordinator.server.1: warning: connection exception:
    TcpTransceiver.cpp:169: Ice::SocketException:
    socket exception: WSAEFAULT
    local address = 10.65.60.91:65228
    remote address = 10.65.60.91:50843


    DfsService:

    -! 10/28/11 15:32:16.786 DfsService.server.1: warning: connection exception:
    TcpTransceiver.cpp:414: Ice::ConnectionLostException:
    connection lost: The specified network name is no longer available.
    local address = 10.65.60.91:50843
    remote address = 10.65.60.91:65228


    Can anyone explain what's happening?

    Regards, andy
  • bernard
    bernard Jupiter, FL
    Hi Andy,

    It would be great if you could reproduce this problem in a test case that we can try out.

    The BadMagicException and WSAEFAULT sound like a memory corruption, probably in the FileSource/Coordinator server (i.e. the faulty FileSource sends an incorrect magic to the DfsService, and occasionally fails with WSAEFAULT).

    Did you try to run your servers with a memory checker like Purify?

    Best regards,
    Bernard
  • Hi, Bernard!

    Thanks for the idea, we'll try to do this on Monday.

    Another thing we noticed about the Coordinator - it warns several times a second that its thread pool is running low on threads. The adapter thread pool size is set to 32. Can this intensive workload cause any errors? No such warnings could be found in the DfsService log.

    Regards, andy
  • Hi, all!

    Unfortunately, I haven't examined the BadMagicException in the Coordinator yet :(

    Instead, I collected Ice traces from a case where all 16 worker threads in DfsService got stuck in synchronous calls to a single DfsAgent and then failed with a TimeoutException. What surprised me is that I couldn't find any sign in the DfsAgent's traces that it had received these calls! Please explain how this can happen!

    DfsService (requests sent to DfsAgent.8):

    -- 11/08/11 01:55:17.546 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324610
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:17.546 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:17.591 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324611
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:17.591 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:17.675 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324612
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:17.675 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:17.934 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324613
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:17.934 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:18.152 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324614
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:18.152 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:20.733 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324615
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:20.733 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:20.750 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324616
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:20.750 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:20.975 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324617
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:20.975 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:21.113 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324618
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:21.113 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:21.440 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324619
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:21.440 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:21.441 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324620
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:21.441 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:21.897 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324621
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:21.897 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:22.352 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324622
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:22.352 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:22.605 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324623
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:22.605 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:24.115 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324624
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:24.115 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:40.932 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 53
    request id = 1324625
    identity = DfsAgent.8
    facet =
    operation = getAgentState
    mode = 2 (idempotent)
    context =
    -- 11/08/11 01:55:40.932 DfsService.server.1: Network: sent 53 of 53 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092


    The very moment when we got the TimeoutException:

    -! 11/08/11 01:57:17.437 DfsService.server.1: warning: connection exception:
    Outgoing.cpp:226: Ice::TimeoutException:
    timeout while sending or receiving data
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092
    -- 11/08/11 01:57:17.437 DfsService.server.1: Network: closing tcp connection
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092


    None of the request ids above appear as received requests in the Ice traces of the DfsAgent.8 server.

    Regards, andy
  • mes
    mes California
    Hi Andy,

    Your traces don't show any successful request/reply exchanges with DfsAgent.8. Was it working properly before the timeout occurred?

    In a situation like this where multiple pending requests to the same server result in a timeout, the reason is often that the server has deadlocked. Have you tried attaching to the server in a debugger to examine what all of its threads are doing?

    Regarding the missing trace output in the server: Is the server creating multiple communicators? If so, you may not be enabling the tracing properties in the correct communicator. Is the server managed by IceGrid? If so, have you confirmed that the property changes have been deployed to the server's configuration?

    Regards,
    Mark
  • Hi, Mark!

    Of course, DfsAgent.8 works properly most of the time - it receives and replies to DfsService's requests successfully, the traces show it, I just haven't pasted them here. Such timeouts appear every two to three hours or so. No, I haven't tried to debug it, since I can't predict when it happens - the timestamps show it was in the middle of the night yesterday :)

    Yes, all of the servers - FileSource (Coordinator), DfsService and the DfsAgents - are managed by IceGrid, all of them have exactly one communicator, and all properties are set correctly.

    Just one thing that occurred to me now - all servers have a single adapter, and we set Adapter.ThreadPool.Size and SizeMax to 16, leaving the communicators' server and client thread pools with only one thread. Maybe this is the problem?

    To clarify - when one of our worker threads makes a synchronous call to a DfsAgent, who receives the result of the operation: some thread from the DfsService client thread pool (which then passes the result to the worker thread and wakes it), or the worker thread itself?

    Another question - if all threads in the server thread pool are busy and one more request arrives, should we see something like 'received request' in the traces immediately? Or will the trace be delayed until one of the server threads becomes idle and starts to process this request?

    Regards, andy
  • mes
    mes California
    andy_84 wrote: »
    Yes, all of the servers - FileSource (Coordinator), DfsService and the DfsAgents - are managed by IceGrid, all of them have exactly one communicator, and all properties are set correctly.
    In that case I would need to see the configuration file for DfsAgent (and possibly the code) to determine why the tracing properties are not having any effect. Other possibilities that come to mind: the properties are spelled incorrectly, or the log output is being redirected.
    Just one thing that occurred to me now - all servers have a single adapter, and we set Adapter.ThreadPool.Size and SizeMax to 16, leaving the communicators' server and client thread pools with only one thread. Maybe this is the problem?

    To clarify - when one of our worker threads makes a synchronous call to a DfsAgent, who receives the result of the operation: some thread from the DfsService client thread pool (which then passes the result to the worker thread and wakes it), or the worker thread itself?
    For an outgoing connection (such as from DfsService to DfsAgent), threads in the communicator's client thread pool are responsible for reading and processing reply messages. For a synchronous invocation, the thread pool thread wakes up the application thread that is blocked waiting for the invocation to complete.

    Normally, it is sufficient to use the default configuration of one thread for the client thread pool. However, if your client is making asynchronous invocations, and the client's asynchronous callbacks are making nested invocations, then it may be necessary to increase the size of the client thread pool.
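
    For example (the sizes here are only placeholders):

    Ice.ThreadPool.Client.Size=4
    Ice.ThreadPool.Client.SizeMax=8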
    Another question - if all threads in the server thread pool are busy and one more request arrives, should we see something like 'received request' in the traces immediately? Or will the trace be delayed until one of the server threads becomes idle and starts to process this request?
    The trace will be delayed until an idle thread processes the request.

    Regards,
    Mark
  • In that case I would need to see the configuration file for DfsAgent (and possibly the code) to determine why the tracing properties are not having any effect
    Mark, the trace properties on the DfsAgents do have an effect :) We observe all requests, replies and sent/received bytes. These are the properties set for the Coordinator, DfsService and the DfsAgents, as we see them in the IceGrid Admin GUI:

    Ice.Warn.Connections=1
    Ice.Trace.Protocol=1
    Ice.Trace.Network=3

    I feel our team is very close to finding a deadlock in our system :) That seems to be easier than resolving the BadMagicException.

    Thanks for the answers, I'll post the results of the investigation.

    Regards, andy
  • Hi, all!

    Today we've removed Ice.Override.Timeout from DfsService (it was previously set to 2 minutes), and got the following situation.

    All 16 worker threads in DfsService again got stuck in calls to one DfsAgent. WinDbg shows they are waiting in ice34!IceInternal::Outgoing::invoke+0x205 - if we are right, that's this code:

    else
    {
        _monitor.wait();
    }

    These calls didn't reach the DfsAgent - we have our own logging in the servant. But the DfsAgent successfully processes incoming requests from other apps in our system, so its server thread pool isn't busy! This is also confirmed by Ice.Trace.ThreadPool=1 - the traces show that the DfsAgent has enough idle threads.

    How can this happen?

    Regards, andy
  • mes
    mes California
    I still suspect a deadlock (or thread starvation) situation. If this was happening in my application, I would be examining the stack trace of every thread in both DfsService and DfsAgent. If you can provide these stack traces, I'd be happy to review them.

    Regards,
    Mark
  • Hi, all!

    At last we created a simple multithreaded client/server test application based on the provided demo/Ice/hello example.

    The client and server run on two computers, A and B, connected by a 1 Gb network.

    The client connects to both servers - local and remote - and calls a simple interface that simulates copying chunks of data (roughly as sketched below).

    The sample was built with Visual Studio 2010 Ultimate SP1 and Ice 3.4.2 (using the binaries from zeroc.com) and executed on Windows Server 2008 R2.

    (Prebuilt server.exe and client.exe are also provided.)
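
    Each client thread does roughly the following (the names and signature are illustrative, not our exact code):

        // Hypothetical sketch of one client thread in the test: it repeatedly
        // "copies" a 64 KB chunk to the given server via a synchronous call.
        // HelloPrx comes from the (modified) demo Slice; the fill value is just
        // a payload pattern.
        void clientThread(const Demo::HelloPrx& server)
        {
            Ice::ByteSeq chunk(64 * 1024, 0);
            while(true)
            {
                server->write(chunk);
            }
        }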

    Here's a log from the server with the BadMagicException:

    -! 12/14/11 11:29:02.734 server: warning: connection exception:
    ConnectionI.cpp:1250: Ice::BadMagicException:
    unknown magic number: 0x00, 0x00, 0x00, 0x00
    local address = 10.65.62.247:10000
    remote address = 10.65.62.247:64947
    -- 12/14/11 11:29:02.734 server: Network: closing tcp connection
    local address = 10.65.62.247:10000
    remote address = 10.65.62.247:64947
    -- 12/14/11 11:29:02.766 server: Network: accepted tcp connection
    local address = 10.65.62.247:10000
    remote address = 10.65.62.247:65021
    -- 12/14/11 11:29:41.283 server: ThreadPool: shrinking Ice.ThreadPool.Server: Size = 11


    The result is even more confusing because, according to the log, the BadMagicException occurs when the client calls its local server.
    Note: we couldn't reproduce this using only one instance of the client and server.

    Please take a look at the attached sample project.

    Regards, andy
  • Hi, all!

    Another thing I want to mention - we ran this test on our workstation PCs (2-core, running Windows 7), and after half a day we saw a BadMagicException!
    But on the 16-core servers this exception occurs much more quickly - as if some concurrency issue exists in Ice...

    Regards, andy
  • mes
    mes California
    Hi Andy,

    Thanks for the sample project, we'll take a look at it.

    Regards,
    Mark
  • benoit
    benoit Rennes, France
    Hi Andrew,

    So far I have been unable to reproduce it between 2 quad-core Windows 7 machines over a 1 Gb Ethernet connection. I didn't run the tests that long, however... I will let them run and hopefully I can reproduce it.

    Cheers,
    Benoit.
  • Hi all! More news!

    If we call the 'write' function with a buffer of zeros, we get a BadMagicException with zeros, and if we fill the buffer with 0xAA bytes, we get this:
    -! 12/19/11 14:19:50.729 server: warning: connection exception:
    ConnectionI.cpp:1250: Ice::BadMagicException:
    unknown magic number: 0xaa, 0xaa, 0xaa, 0xaa
    local address = 10.65.62.247:10000
    remote address = 10.65.62.247:58116
    -- 12/19/11 14:19:50.729 server: Network: closing tcp connection
    local address = 10.65.62.247:10000
    remote address = 10.65.62.247:58116
    -- 12/19/11 14:19:50.729 server: Network: accepted tcp connection
    local address = 10.65.62.247:10000
    remote address = 10.65.62.247:58263


    Maybe Ice has a bug?

    Regards, andy
  • benoit
    benoit Rennes, France
    Hi Andy,

    So far I've been unable to reproduce the problem with your test case. Is there something specific in your network environment? Did you try with 2 PCs connected through an Ethernet cable?

    Cheers,
    Benoit.
  • Hello.

    Thanks for trying our test.

    How long did you let it run?

    According to our tests, the 'bad magic' occurs on average once an hour on our server nodes (2x quad-core CPUs), and we saw it only once on our workstations (Core2Duo CPU).

    This difference might be due to a synchronization issue in the current version of Ice.

    We didn't try the test with 2 PCs connected directly by an Ethernet cable, but we get these errors in several different networks (different datacenters).

    Also, as my previous log suggests, the 'bad magic' occurs even between a client and a server running on the same machine.

    Yes, and we've just got this error in our local-only test (2 servers and 2 clients on a single machine):

    -! 12/20/11 13:29:24.767 server: warning: connection exception:
    ConnectionI.cpp:1250: Ice::BadMagicException:
    unknown magic number: 0xaa, 0xaa, 0xaa, 0xaa
    local address = 10.65.62.247:10001
    remote address = 10.65.62.247:65416


    Besides, the fact that the 'magic' always contains the same bytes that we transmit in the payload confirms that this doesn't look like network packet corruption.

    Regards, andy
  • benoit
    benoit Rennes, France
    Hi Andy,

    So far I was trying to reproduce with a debug version; I'm now going to try with a release version. Did you try with a debug version? Perhaps you might get some asserts, which could be useful for figuring out the problem.

    Otherwise, here are a few things that might be worth trying and that would help narrow down the origin of the problem:
    • Does it still fail if you remove the creation of the Ice objects for transferring the data but instead simply use a single Ice object for the writes?
    • Does it still fail if you reduce the server thread pool size to 1 thread (the default if you don't set Ice.ThreadPool.Server.Size)?
    • Does it still fail if the client uses separate connections for each of the threads? (You can give each thread a proxy that uses its own connection with the proxy's ice_connectionId method; see the sketch after this list.)

    This would help to figure out whether or not the problem comes from the client or server side of the Ice runtime.
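
    A minimal sketch of the last suggestion (HelloPrx and the id format are just examples):

        // Give each client thread its own connection by tagging its proxy with a
        // per-thread connection id: proxies with different connection ids do not
        // share a connection. Assumes <sstream> for std::ostringstream.
        Demo::HelloPrx proxyForThread(const Demo::HelloPrx& base, int threadIndex)
        {
            std::ostringstream id;
            id << "worker-" << threadIndex;
            return Demo::HelloPrx::uncheckedCast(base->ice_connectionId(id.str()));
        }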

    Cheers,
    Benoit.
  • Hi, Benoit!

    We've tried all of the test cases you suggested. Here are the results:

    1. Debug build: BadMagic reproduced
    2. Excluding creation of ice-objects (added testWrite method to IHello): BadMagic reproduced
    3. One server thread: still running without any errors (not reproduced)
    4. Separate connection for each thread (with ice_connectionId): BadMagic reproduced

    Your turn :rolleyes:

    Regards, andy
  • benoit
    benoit Rennes, France
    Hi Andy,

    Unfortunately, I have not been able to reproduce the issue. The fact that it doesn't show up with a single-threaded server indicates it's probably a server-side issue. I've attached a patch to this email that adds some additional asserts to the code. Could you patch your Ice distribution with this patch and see if the problem still occurs with a debug build and if one of those asserts is hit? To apply the patch to a fresh Ice 3.4.2 source distribution:
       > cd Ice-3.4.2
       > patch -p1 < patch.txt
    

    If an assert is triggered, it would be helpful if you could provide the values of the connection attributes (the Ice::ConnectionI object and its IceInternal::EventHandler base class) at the time of the assert.

    You could also try to see if the problem still shows up when using multiple threads on the server side and specifying the Ice.ThreadPool.Server.Serialize=1 property.

    Cheers,
    Benoit.
  • Hi, Benoit!
    You could also try to see if the problem still shows up when using multiple threads on the server side and specifying the Ice.ThreadPool.Server.Serialize=1 property.

    BadMagicException reproduced.
    Could you patch your Ice distribution with this patch and see if the problem still occurs with a debug build and if one of those asserts is hit?

    We applied your patch and built the Debug configuration - the BadMagicException was reproduced without hitting any asserts. :confused:

    Regards, andy