
BadMagicException on single server

Hi all!

1. We are using Ice 3.4.2 binaries on a cluster of Windows Server 2008 R2 Enterprise machines. We have one server App with 16 worker threads, each transferring files synchronously to Clients.
This produces 100% load on the 1 Gigabit adapter (as we see in Task Manager). Every time we run our scenario, we see these errors in the log:

-! 10/26/11 11:20:42.139 DfsService.server.1: warning: connection exception:
ConnectionI.cpp:1250: Ice::BadMagicException:
unknown magic number: 0x05, 0x83, 0x11, 0x4e
local address = 10.65.60.91:50505
remote address = 10.65.60.91:49239

Notice that the BadMagicException occurs on a single server - both the local and the remote address are on the same host.

We also regularly see other warnings from Ice:

-! 10/26/11 11:23:45.734 DfsService.server.1: warning: connection exception:
TcpTransceiver.cpp:414: Ice::ConnectionLostException:
connection lost: The specified network name is no longer available.
local address = 10.65.60.91:50509
remote address = 10.65.60.92:59634

and

-! 10/26/11 11:38:02.272 DfsService.server.1: warning: connection exception:
Outgoing.cpp:226: Ice::TimeoutException:
timeout while sending or receiving data
local address = 10.65.60.91:51182
remote address = 10.65.60.92:59634

(We overrode the timeout for the server App to 2 minutes via Ice.Override.Timeout.)
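
For reference, that override is a single property in the server's configuration, something like this (the value is in milliseconds):

    Ice.Override.Timeout=120000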

We talked to our admins - they say the hardware is fine, though we think it is not.

Do you have any ideas what could cause these errors? Especially the BadMagicException, which seems very, very strange to us.

2. Another question - we downloaded Ice 3.4.2 binaries for MSVC 2010, while our apps are built with MSVC 2010 SP1 - is this acceptable?

Thanks.

Comments

  • mes
    mes California
    Hi Andy,

    I agree that the BadMagicException is the most serious issue. Does this only happen when the system is running at 100% CPU?

    This exception indicates that Ice has received an invalid packet on one of its connections. The most likely reason is a non-Ice program mistakenly attempting to connect and communicate with an Ice endpoint. It could also be caused by a concurrency bug in Ice.
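
    For illustration only (a simplified sketch, not the actual Ice source): every Ice protocol message begins with a four-byte magic value, the ASCII bytes 'I', 'c', 'e', 'P', and a connection that reads anything else where it expects a message header reports this exception.

        // Simplified illustration of the magic check (not the real Ice code).
        // The "unknown magic number" bytes in the warning are just the first
        // four bytes Ice read where it expected a new message header.
        bool hasValidIceMagic(const unsigned char* header)
        {
            static const unsigned char magic[4] = { 0x49, 0x63, 0x65, 0x50 }; // "IceP"
            return header[0] == magic[0] && header[1] == magic[1] &&
                   header[2] == magic[2] && header[3] == magic[3];
        }
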
    -! 10/26/11 11:20:42.139 DfsService.server.1: warning: connection exception:
    ConnectionI.cpp:1250: Ice::BadMagicException:
    unknown magic number: 0x05, 0x83, 0x11, 0x4e
    local address = 10.65.60.91:50505
    remote address = 10.65.60.91:49239
    The key piece of information here is the remote address. The first step is to verify whether it is an Ice program or a non-Ice program at the remote end. Assuming it's an Ice program, your next step should be to enable the following properties in both programs:

    Ice.Trace.Network=3
    Ice.Trace.Protocol=1

    Review the logs and see if anything grabs your attention, and feel free to post them here and we'll take a look. Also, if you can create a small test case that reproduces the exception, we'll gladly try that too.
    2. Another question - we downloaded Ice 3.4.2 binaries for MSVC 2010, while our apps are built with MSVC 2010 SP1 - is this acceptable?
    I built our C++ test suite with VS 2010 SP1 while linking with the VS 2010 libraries from the 3.4.2 binary distribution. All the tests ran successfully, so I don't expect there will be any compatibility issues.

    Regards,
    Mark
  • Hi Mark, thanks for the answer!
    Does this only happen when the system is running at 100% CPU?
    Not the CPU, but 100% network load. We ran our scenario approx. 30 times - high load was the first symptom we noticed.

    Let me tell more details about our system.
    We are using an IceGrid deployment with 20 nodes. The first node contains two servers, let's call them
    FileSource and DfsService. The other nodes contain one DfsAgent each.
    As I mentioned, DfsService has 16 threads, and each does a very simple thing - it obtains a proxy to a file from FileSource, then reads bytes through this proxy and writes the chunks to some DfsAgent.
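
    In rough, illustrative C++ (the proxy types and signatures are made up; only the operation names 'read' and 'openWrite' match our real traces), each worker does something like this:

        // Hypothetical sketch of one DfsService worker thread; FileProxyPrx and
        // DfsAgentPrx stand in for our real generated proxies.
        void transferOneFile(const FileProxyPrx& source, const DfsAgentPrx& agent)
        {
            agent->openWrite();                      // synchronous call to the agent
            Ice::Long offset = 0;
            while(true)
            {
                // read the next chunk from FileSource through the file proxy
                Ice::ByteSeq chunk = source->read(offset, 512 * 1024);
                if(chunk.empty())
                {
                    break;                           // end of file
                }
                agent->write(chunk);                 // forward the chunk to the agent
                offset += static_cast<Ice::Long>(chunk.size());
            }
        }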

    It turned out that the BadMagicException appears when DfsService reads bytes from FileSource through the file proxy.
    The other exceptions from my first post happen between DfsService and the DfsAgents when writing bytes.

    Another thing - we deployed our system on two independent clusters, and both exhibit this behaviour.

    I'll try to write a small test case ASAP, and will also turn on the Ice traces and review the logs.

    Upd.
    I've found a similar post here: http://www.zeroc.com/forums/help-center/5456-badmagicexception-crash-tcp-lost-data.html, but with no answers :(

    Regards,
    andy
  • It's me!

    I posted "BadMagicException and Crash, Is TCP lost data?" 5 months ago.
    I have used CommView (like Wireshark, but usable on a single computer) to capture some network packets and found more details; if you want the packet capture file, I can upload it sometime.
  • Hi!
    Caught again, here are traces.

    DfsService:
    -- 10/27/11 14:03:29.480 DfsService.server.1: Network: sent 65536 of 65536 bytes via tcp
    local address = 10.65.60.91:57249
    remote address = 10.65.60.71:51148
    -- 10/27/11 14:03:29.480 DfsService.server.1: Network: received 1326 of 1326 bytes via tcp
    local address = 10.65.60.91:57609
    remote address = 10.65.60.91:57130
    -- 10/27/11 14:03:29.480 DfsService.server.1: Network: sent 65536 of 65536 bytes via tcp
    local address = 10.65.60.91:57256
    remote address = 10.65.60.78:65416
    -- 10/27/11 14:03:29.480 DfsService.server.1: Protocol: received reply
    message type = 2 (reply)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 524318
    request id = 199018
    reply status = 0 (ok)
    -- 10/27/11 14:03:29.480 DfsService.server.1: Network: sent 65536 of 65536 bytes via tcp
    local address = 10.65.60.91:57244
    remote address = 10.65.60.87:65036
    -- 10/27/11 14:03:29.481 DfsService.server.1: Network: sent 65536 of 65536 bytes via tcp
    local address = 10.65.60.91:57256
    remote address = 10.65.60.78:65416
    -- 10/27/11 14:03:29.481 DfsService.server.1: Network: received 14 of 14 bytes via tcp
    local address = 10.65.60.91:57609
    remote address = 10.65.60.91:57130
    -! 10/27/11 14:03:29.481 DfsService.server.1: warning: connection exception:
    ConnectionI.cpp:1250: Ice::BadMagicException:
    unknown magic number: 0x10, 0x4f, 0x00, 0x0f
    local address = 10.65.60.91:57609
    remote address = 10.65.60.91:57130

    -- 10/27/11 14:03:29.481 DfsService.server.1: Network: closing tcp connection
    local address = 10.65.60.91:57609
    remote address = 10.65.60.91:57130
    -- 10/27/11 14:03:29.482 DfsService.server.1: Network: sent 65536 of 65536 bytes via tcp
    local address = 10.65.60.91:57249
    remote address = 10.65.60.71:51148
    -- 10/27/11 14:03:29.482 DfsService.server.1: Network: sent 84 of 84 bytes via tcp
    local address = 10.65.60.91:57249
    remote address = 10.65.60.71:51148
    -- 10/27/11 14:03:29.482 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 87
    request id = 23901
    identity = DfsAgent.13
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =


    FileSource:
    -- 10/27/11 14:03:29.479 Coordinator.server.1: Protocol: received request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 199019
    identity = 304d6087-311a-4cd2-9761-4bb97849dd5c
    facet =
    operation = read
    mode = 0 (normal)
    context =
    -- 10/27/11 14:03:29.479 Coordinator.server.1: Network: sent 65536 of 65536 bytes via tcp
    local address = 10.65.60.91:57130
    remote address = 10.65.60.91:57609
    -- 10/27/11 14:03:29.480 Coordinator.server.1: Protocol: sending reply
    message type = 2 (reply)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 524318
    request id = 199010
    reply status = 0 (ok)
    -! 10/27/11 14:03:29.485 Coordinator.server.1: warning: connection exception:
    TcpTransceiver.cpp:341: Ice::ConnectionLostException:
    connection lost: The specified network name is no longer available.
    local address = 10.65.60.91:57130
    remote address = 10.65.60.91:57609

    -- 10/27/11 14:03:29.485 Coordinator.server.1: Network: closing tcp connection
    local address = 10.65.60.91:57130
    remote address = 10.65.60.91:57609
    -- 10/27/11 14:03:29.485 Coordinator.server.1: Network: accepted tcp connection
    local address = 10.65.60.91:57130
    remote address = 10.65.60.91:58377
    -- 10/27/11 14:03:29.485 Coordinator.server.1: Protocol: sending validate connection
    message type = 3 (validate connection)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 14
    -- 10/27/11 14:03:29.485 Coordinator.server.1: Network: sent 14 of 14 bytes via tcp
    local address = 10.65.60.91:57130
    remote address = 10.65.60.91:58377
    -- 10/27/11 14:03:29.486 Coordinator.server.1: Network: received 14 of 14 bytes via tcp
    local address = 10.65.60.91:57130
    remote address = 10.65.60.91:58377
    -- 10/27/11 14:03:29.486 Coordinator.server.1: Network: received 72 of 72 bytes via tcp
    local address = 10.65.60.91:57130
    remote address = 10.65.60.91:58377
    -- 10/27/11 14:03:29.486 Coordinator.server.1: Protocol: received request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1
    identity = 14a071e1-cb2b-45d3-b7f6-bb4b209b3b06
    facet =
    operation = read
    mode = 0 (normal)
    context =


    I'm still trying to make a small test program that reproduces the BadMagicException.

    Regards, andy
  • mes
    mes California
    Andy,

    Thanks for the log output.

    The ConnectionLostException is definitely expected in this situation. The sender receives this exception because the receiver ungracefully closes its end of the connection after detecting the "bad magic" condition.

    Can you check the Windows event log on this host to see whether any network-related events were logged around the time of the BadMagicException?

    Regards,
    Mark
  • mes
    mes California
    Andy,

    One other request: please review the discussion in this thread. Your situation sounds quite similar to the one described there.

    Regards,
    Mark
  • mes wrote: »
    Andy,

    One other request: please review the discussion in this thread. Your situation sounds quite similar to the one described there.

    Regards,
    Mark

    Hi, Mark!
    A funny thing - that post was created by my colleague when we started to develop the first version of our system :) Unfortunately, he has left our team, but we still turn off all TCP offload options on the new servers we configure - so this is not the case :(

    Regards, andy
  • You can try to limit the TCP packet size to less than 65535 bytes.

    Just as an experiment, split one large Ice invocation into multiple smaller invocations.
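
    For example (a rough sketch - the proxy type and the 'write' operation are placeholders for the real ones):

        // Hypothetical: send the payload in pieces below 64 KB instead of in one
        // large Ice invocation. Assumes <algorithm> for std::min and a generated
        // DfsAgentPrx with a write(sequence<byte>) operation.
        void writeInPieces(const DfsAgentPrx& agent, const Ice::ByteSeq& data)
        {
            const std::size_t maxChunk = 60 * 1024; // stay below 65535 bytes
            for(std::size_t pos = 0; pos < data.size(); pos += maxChunk)
            {
                const std::size_t len = std::min(maxChunk, data.size() - pos);
                Ice::ByteSeq piece(data.begin() + pos, data.begin() + pos + len);
                agent->write(piece);
            }
        }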
  • Hi, Mark!
    mes wrote: »
    Can you check the Windows event log on this host to see whether any network-related events were logged around the time of the BadMagicException?

    No, nothing strange in the Event log. The system was running the whole night, and a new error, WSAECONNRESET, appeared in the log:

    -! 10/28/11 04:55:29.600 DfsService.server.1: warning: connection exception:
    ConnectionI.cpp:1250: Ice::BadMagicException:
    unknown magic number: 0x80, 0xfa, 0x00, 0x00
    local address = 10.65.60.91:51352
    remote address = 10.65.60.91:59922
    -! 10/28/11 05:00:53.428 DfsService.server.1: warning: connection exception:
    TcpTransceiver.cpp:414: Ice::ConnectionLostException:
    connection lost: The specified network name is no longer available.
    local address = 10.65.60.91:54077
    remote address = 10.65.60.91:59922
    -! 10/28/11 06:17:34.248 DfsService.server.1: warning: connection exception:
    ConnectionI.cpp:1250: Ice::BadMagicException:
    unknown magic number: 0x85, 0xc0, 0x74, 0x6a
    local address = 10.65.60.91:54165
    remote address = 10.65.60.91:59922
    -! 10/28/11 06:35:00.397 DfsService.server.1: warning: connection exception:
    Outgoing.cpp:226: Ice::TimeoutException:
    timeout while sending or receiving data
    local address = 10.65.60.91:60950
    remote address = 10.65.60.74:53596
    -! 10/28/11 06:35:00.404 DfsService.server.1: warning: connection exception:
    TcpTransceiver.cpp:163: Ice::ConnectionLostException:
    connection lost: WSAECONNRESET
    local address = 10.65.60.91:59972
    remote address = 10.65.81.6:49208
    -! 10/28/11 06:35:00.410 DfsService.server.1: warning: connection exception:
    TcpTransceiver.cpp:163: Ice::ConnectionLostException:
    connection lost: WSAECONNRESET
    0.0.0.0:59972
    -! 10/28/11 07:02:50.869 DfsService.server.1: warning: connection exception:
    ConnectionI.cpp:1250: Ice::BadMagicException:
    unknown magic number: 0xb0, 0x6e, 0x42, 0x48
    local address = 10.65.60.91:55559
    remote address = 10.65.60.91:59922
    -! 10/28/11 07:20:37.572 DfsService.server.1: warning: connection exception:
    Outgoing.cpp:226: Ice::TimeoutException:
    timeout while sending or receiving data
    local address = 10.65.60.91:55553
    remote address = 10.65.60.76:51668


    Maybe it's time for me to start commenting out source code and finally get a small test that reproduces these errors.

    Regards, andy
  • Hi, all!

    Today we turned off all communication between DfsService and the DfsAgents and left only DfsService and FileSource (aka Coordinator); they are on the same host. And we got another error in the logs!

    FileSource (aka Coordinator):

    -! 10/28/11 15:32:16.786 Coordinator.server.1: warning: connection exception:
    TcpTransceiver.cpp:169: Ice::SocketException:
    socket exception: WSAEFAULT
    local address = 10.65.60.91:65228
    remote address = 10.65.60.91:50843


    DfsService:

    -! 10/28/11 15:32:16.786 DfsService.server.1: warning: connection exception:
    TcpTransceiver.cpp:414: Ice::ConnectionLostException:
    connection lost: The specified network name is no longer available.
    local address = 10.65.60.91:50843
    remote address = 10.65.60.91:65228


    Can anyone explain what's happening?

    Regards, andy
  • bernard
    bernard Jupiter, FL
    Hi Andy,

    It would be great if you could reproduce this problem in a test case that we can try out.

    The BadMagicException and WSAEFAULT sound like a memory corruption, probably in the FileSource/Coordinator server (i.e. the faulty FileSource sends an incorrect magic to the DfsService, and occasionally fails with WSAEFAULT).

    Did you try to run your servers with a memory checker like Purify?

    Best regards,
    Bernard
  • Hi, Bernard!

    Thanks for the idea, we'll try to do this on Monday.

    Another thing we noticed about the Coordinator - it warns several times a second that its thread pool is running low on threads. The adapter thread pool size is set to 32. Can this intensive workload cause any errors? No such warnings could be found in the DfsService log.

    Regards, andy
  • Hi, all!

    Unfortunately, I haven't examined the BadMagicException in the Coordinator yet :(

    Instead, I collected Ice traces from a case where all 16 worker threads in DfsService got stuck in synchronous calls to a single DfsAgent and then failed with a TimeoutException. What surprised me is that I couldn't find any sign in the DfsAgent's traces that it had received these calls! Please explain how this can happen!

    DfsService (requests sent to DfsAgent.8):

    -- 11/08/11 01:55:17.546 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324610
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:17.546 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:17.591 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324611
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:17.591 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:17.675 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324612
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:17.675 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:17.934 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324613
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:17.934 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:18.152 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324614
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:18.152 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:20.733 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324615
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:20.733 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:20.750 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324616
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:20.750 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:20.975 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324617
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:20.975 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:21.113 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324618
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:21.113 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:21.440 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324619
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:21.440 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:21.441 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324620
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:21.441 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:21.897 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324621
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:21.897 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:22.352 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324622
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:22.352 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:22.605 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324623
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:22.605 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:24.115 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 86
    request id = 1324624
    identity = DfsAgent.8
    facet =
    operation = openWrite
    mode = 0 (normal)
    context =
    -- 11/08/11 01:55:24.115 DfsService.server.1: Network: sent 86 of 86 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092

    -- 11/08/11 01:55:40.932 DfsService.server.1: Protocol: sending request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 53
    request id = 1324625
    identity = DfsAgent.8
    facet =
    operation = getAgentState
    mode = 2 (idempotent)
    context =
    -- 11/08/11 01:55:40.932 DfsService.server.1: Network: sent 53 of 53 bytes via tcp
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092


    The very moment when we got the TimeoutException:

    -! 11/08/11 01:57:17.437 DfsService.server.1: warning: connection exception:
    Outgoing.cpp:226: Ice::TimeoutException:
    timeout while sending or receiving data
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092
    -- 11/08/11 01:57:17.437 DfsService.server.1: Network: closing tcp connection
    local address = 10.65.60.91:64591
    remote address = 10.65.60.86:61092


    None of the request ids above appear as received requests in the Ice traces of the DfsAgent.8 server.

    Regards, andy
  • mes
    mes California
    Hi Andy,

    Your traces don't show any successful request/reply exchanges with DfsAgent.8. Was it working properly before the timeout occurred?

    In a situation like this where multiple pending requests to the same server result in a timeout, the reason is often that the server has deadlocked. Have you tried attaching to the server in a debugger to examine what all of its threads are doing?

    Regarding the missing trace output in the server: Is the server creating multiple communicators? If so, you may not be enabling the tracing properties in the correct communicator. Is the server managed by IceGrid? If so, have you confirmed that the property changes have been deployed to the server's configuration?

    Regards,
    Mark
  • Hi, Mark!

    Of course, DfsAgent.8 works properly most of the time - it receives and replies to DfsService's requests successfully, the traces show it, I just haven't pasted them here. Such timeouts appear every two to three hours or so. No, I haven't tried to debug it, since I can't predict when it happens - the timestamps show it was in the middle of the night yesterday :)

    Yes, all of the servers - FileSource (Coordinator), DfsService and the DfsAgents - are managed by IceGrid, all of them have exactly one communicator, and all properties are set correctly.

    Just one thing that occurred to me now - all servers have a single adapter, and we set Adapter.ThreadPool.Size and SizeMax to 16, leaving the communicators' server and client thread pools with only one thread. Maybe this is the problem?

    To clarify - when one of our worker threads makes a synchronous call to a DfsAgent, who receives the result of the operation: some thread from the DfsService client thread pool (which then passes the result to the worker thread and wakes it), or the worker thread itself?

    Another question - if all threads in the server thread pool are busy and one more request arrives, should we see something like 'received request' in the traces immediately? Or will the trace be delayed until one of the server threads becomes idle and starts to process this request?

    Regards, andy
  • mes
    mes California
    andy_84 wrote: »
    Yes, all of the servers - FileSource (Coordinator), DfsService and the DfsAgents - are managed by IceGrid, all of them have exactly one communicator, and all properties are set correctly.
    In that case I would need to see the configuration file for DfsAgent (and possibly the code) to determine why the tracing properties are not having any effect. Other possibilities that come to mind: the properties are spelled incorrectly, or the log output is being redirected.
    Just one thing that occurred to me now - all servers have a single adapter, and we set Adapter.ThreadPool.Size and SizeMax to 16, leaving the communicators' server and client thread pools with only one thread. Maybe this is the problem?

    To clarify - when one of our worker threads makes a synchronous call to a DfsAgent, who receives the result of the operation: some thread from the DfsService client thread pool (which then passes the result to the worker thread and wakes it), or the worker thread itself?
    For an outgoing connection (such as from DfsService to DfsAgent), threads in the communicator's client thread pool are responsible for reading and processing reply messages. For a synchronous invocation, the thread pool thread wakes up the application thread that is blocked waiting for the invocation to complete.

    Normally, it is sufficient to use the default configuration of one thread for the client thread pool. However, if your client is making asynchronous invocations, and the client's asynchronous callbacks are making nested invocations, then it may be necessary to increase the size of the client thread pool.
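
    For example (the sizes here are only placeholders):

    Ice.ThreadPool.Client.Size=4
    Ice.ThreadPool.Client.SizeMax=8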
    Another question - if all threads in the server thread pool are busy and one more request arrives, should we see something like 'received request' in the traces immediately? Or will the trace be delayed until one of the server threads becomes idle and starts to process this request?
    The trace will be delayed until an idle thread processes the request.

    Regards,
    Mark
  • In that case I would need to see the configuration file for DfsAgent (and possibly the code) to determine why the tracing properties are not having any effect
    Mark, the trace properties on the DfsAgents do have an effect :) We observe all requests, replies and sent/received bytes. These are the properties set for the Coordinator, DfsService and the DfsAgents, as we see them in the IceGrid Admin GUI:

    Ice.Warn.Connections=1
    Ice.Trace.Protocol=1
    Ice.Trace.Network=3

    I feel our team is very close to finding a deadlock in our system :) That seems to be easier than resolving the BadMagicException.

    Thanks for the answers, I'll post the results of the investigation.

    Regards, andy
  • Hi, all!

    Today we've removed Ice.Override.Timeout from DfsService (it was previously set to 2 minutes), and got the following situation.

    All 16 worker threads in DfsService again got stuck in calls to one DfsAgent. WinDbg shows they are waiting in ice34!IceInternal::Outgoing::invoke+0x205 - if we are right, that's this code:

    else
    {
        _monitor.wait();
    }

    These calls didn't reach the DfsAgent - we have our own logging in the servant. But the DfsAgent successfully processes incoming requests from other apps in our system, so its server thread pool isn't busy! This is also confirmed by Ice.Trace.ThreadPool=1 - the traces show that the DfsAgent has enough idle threads.

    How can this happen?

    Regards, andy
  • mes
    mes California
    I still suspect a deadlock (or thread starvation) situation. If this was happening in my application, I would be examining the stack trace of every thread in both DfsService and DfsAgent. If you can provide these stack traces, I'd be happy to review them.

    Regards,
    Mark
  • Hi, all!

    At last we created a simple multithreaded client/server test application based on the provided demo/Ice/hello example.

    The client and server run on two computers, A and B, connected by a 1 Gb network.

    The client connects to both servers - local and remote - and calls a simple interface that simulates copying chunks of data (roughly as sketched below).

    The sample was built with Visual Studio 2010 Ultimate SP1 and Ice 3.4.2 (using the binaries from zeroc.com) and executed on Windows Server 2008 R2.

    (Prebuilt server.exe and client.exe are also provided.)
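
    Each client thread does roughly the following (the names and signature are illustrative, not our exact code):

        // Hypothetical sketch of one client thread in the test: it repeatedly
        // "copies" a 64 KB chunk to the given server via a synchronous call.
        // HelloPrx comes from the (modified) demo Slice; the fill value is just
        // a payload pattern.
        void clientThread(const Demo::HelloPrx& server)
        {
            Ice::ByteSeq chunk(64 * 1024, 0);
            while(true)
            {
                server->write(chunk);
            }
        }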

    Here's a log from the server with the BadMagicException:

    -! 12/14/11 11:29:02.734 server: warning: connection exception:
    ConnectionI.cpp:1250: Ice::BadMagicException:
    unknown magic number: 0x00, 0x00, 0x00, 0x00
    local address = 10.65.62.247:10000
    remote address = 10.65.62.247:64947
    -- 12/14/11 11:29:02.734 server: Network: closing tcp connection
    local address = 10.65.62.247:10000
    remote address = 10.65.62.247:64947
    -- 12/14/11 11:29:02.766 server: Network: accepted tcp connection
    local address = 10.65.62.247:10000
    remote address = 10.65.62.247:65021
    -- 12/14/11 11:29:41.283 server: ThreadPool: shrinking Ice.ThreadPool.Server: Size = 11


    The result is even more confusing because, according to the log, the BadMagicException occurs when the client calls its local server.
    Note: we couldn't reproduce this using only one instance of the client and server.

    Please take a look at the attached sample project.

    Regards, andy
  • Hi, all!

    Another thing I want to mention - we ran this test on our workstation PCs (2-core, running Windows 7), and after half a day we saw a BadMagicException!
    But on the 16-core servers this exception occurs much more quickly - as if some concurrency issue exists in Ice...

    Regards, andy
  • mes
    mes California
    Hi Andy,

    Thanks for the sample project, we'll take a look at it.

    Regards,
    Mark
  • benoit
    benoit Rennes, France
    Hi Andrew,

    So far I have been unable to reproduce it between 2 quad-core Windows 7 machines over a 1 Gb Ethernet connection. I didn't run the tests that long, however... I will let them run and hopefully I can reproduce it.

    Cheers,
    Benoit.
  • Hi all! More news!

    If we call the 'write' function with a buffer of zeros, we get a BadMagicException with zeros, and if we fill the buffer with 0xAA bytes, we get this:
    -! 12/19/11 14:19:50.729 server: warning: connection exception:
    ConnectionI.cpp:1250: Ice::BadMagicException:
    unknown magic number: 0xaa, 0xaa, 0xaa, 0xaa
    local address = 10.65.62.247:10000
    remote address = 10.65.62.247:58116
    -- 12/19/11 14:19:50.729 server: Network: closing tcp connection
    local address = 10.65.62.247:10000
    remote address = 10.65.62.247:58116
    -- 12/19/11 14:19:50.729 server: Network: accepted tcp connection
    local address = 10.65.62.247:10000
    remote address = 10.65.62.247:58263


    Maybe Ice has a bug?

    Regards, andy
  • benoit
    benoit Rennes, France
    Hi Andy,

    So far I've been unable to reproduce the problem with your test case. Is there something specific in your network environment? Did you try with 2 PCs connected through an Ethernet cable?

    Cheers,
    Benoit.
  • Hello.

    Thanks for trying our test.

    How long did you let it run?

    According to our tests, the 'bad magic' occurs on average once an hour on our server nodes (2x quad-core CPUs), and we saw it only once on our workstations (Core2Duo CPU).

    This difference might be due to a synchronization issue in the current version of Ice.

    We didn't try the test with 2 PCs connected directly by an Ethernet cable, but we get these errors in several different networks (different datacenters).

    Also, as my previous log suggests, the 'bad magic' occurs even between a client and a server running on the same machine.

    Yes, and we've just got this error in our local-only test (2 servers and 2 clients on a single machine):

    -! 12/20/11 13:29:24.767 server: warning: connection exception:
    ConnectionI.cpp:1250: Ice::BadMagicException:
    unknown magic number: 0xaa, 0xaa, 0xaa, 0xaa
    local address = 10.65.62.247:10001
    remote address = 10.65.62.247:65416


    Besides, the fact that the 'magic' always contains the same bytes that we transmit in the payload confirms that this doesn't look like network packet corruption.

    Regards, andy
  • benoit
    benoit Rennes, France
    Hi Andy,

    So far I was trying to reproduce with a debug version; I'm now going to try with a release version. Did you try with a debug version? Perhaps you might get some asserts, which could be useful for figuring out the problem.

    Otherwise, here are a few things that might be worth trying and that would help narrow down the origin of the problem:
    • Does it still fail if you remove the creation of the Ice objects for transferring the data but instead simply use a single Ice object for the writes?
    • Does it still fail if you reduce the server thread pool size to 1 thread (the default if you don't set Ice.ThreadPool.Server.Size)?
    • Does it still fail if the client uses separate connections for each of the threads? (You can give each thread a proxy that uses its own connection with the proxy's ice_connectionId method; see the sketch after this list.)

    This would help to figure out whether or not the problem comes from the client or server side of the Ice runtime.
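
    A minimal sketch of the last suggestion (HelloPrx and the id format are just examples):

        // Give each client thread its own connection by tagging its proxy with a
        // per-thread connection id: proxies with different connection ids do not
        // share a connection. Assumes <sstream> for std::ostringstream.
        Demo::HelloPrx proxyForThread(const Demo::HelloPrx& base, int threadIndex)
        {
            std::ostringstream id;
            id << "worker-" << threadIndex;
            return Demo::HelloPrx::uncheckedCast(base->ice_connectionId(id.str()));
        }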

    Cheers,
    Benoit.
  • Hi, Benoit!

    We've tried all of the test cases you suggested. Here are the results:

    1. Debug build: BadMagic reproduced
    2. Excluding creation of ice-objects (added testWrite method to IHello): BadMagic reproduced
    3. One server thread: still running without any errors (not reproduced)
    4. Separate connection for each thread (with ice_connectionId): BadMagic reproduced

    Your turn :rolleyes:

    Regards, andy
  • benoit
    benoit Rennes, France
    Hi Andy,

    Unfortunately, I have not been able to reproduce the issue. The fact that it doesn't show up with a single-threaded server indicates it's probably a server-side issue. I've attached a patch to this email that adds some additional asserts to the code. Could you patch your Ice distribution with this patch and see if the problem still occurs with a debug build and if one of those asserts is hit? To apply the patch to a fresh Ice 3.4.2 source distribution:
       > cd Ice-3.4.2
       > patch -p1 < patch.txt
    

    If an assert is triggered, it would be helpful if you could provide the values of the connection attributes (the Ice::ConnectionI object and its IceInternal::EventHandler base class) at the time of the assert.

    You could also try to see if the problem still shows up when using multiple threads on the server side and specifying the Ice.ThreadPool.Server.Serialize=1 property.

    Cheers,
    Benoit.
  • Hi, Benoit!
    You could also try to see if the problem still shows up when using multiple threads on the server side and specifying the Ice.ThreadPool.Server.Serialize=1 property.

    BadMagicException reproduced.
    Could you patch your Ice distribution with this patch and see if the problem still occurs with a debug build and if one of those asserts is hit?

    We applied your patch and built the Debug configuration - the BadMagicException was reproduced without hitting any asserts. :confused:

    Regards, andy