
BadMagicException in the middle of time-intensive runs

Hi!

We have a heavily loaded server that uses inverted logic: it connects out to the remote clients, so the remote clients are actually servers. Connections are bidirectional, so each client and server pair uses the same connection.
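
Roughly, the setup follows the standard Ice bidirectional connection pattern. Here is a minimal sketch of what I mean (Ice 3.3, C++); identifiers such as Agent, CallbackI, addCallback and the endpoint are illustrative, not our actual Slice or code:

  // Side that dials out (our "server"); needs <Ice/Ice.h> plus the Slice-generated headers.
  Ice::ObjectAdapterPtr adapter = communicator->createObjectAdapter(""); // nameless adapter, no endpoints
  Ice::Identity ident;
  ident.name = IceUtil::generateUUID();
  adapter->add(new CallbackI, ident);        // servant that receives the callbacks
  adapter->activate();

  AgentPrx agent = AgentPrx::uncheckedCast(
      communicator->stringToProxy("agent:tcp -h remote-host -p 10000"));
  agent->ice_getConnection()->setAdapter(adapter); // callbacks arrive over this same outgoing connection
  agent->addCallback(ident);                       // tell the remote side which identity to call back on

  // Remote side, inside a dispatched operation (Ice::Current current):
  CallbackPrx cb = CallbackPrx::uncheckedCast(current.con->createProxy(ident));
  cb->OnItemComplete(/* application data */); // goes back over the existing TCP connection; no new connection is opened
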
Sometimes we see this Ice exception in our logs:
ThreadPool.cpp:664: Ice::BadMagicException:
unknown magic number: 0x00, 0x1b, 0x0d, 0x5f
This happens _after_ we have already made some calls on the connection, and right after this message we see another one:
Network.trying to establish tcp connection to _address_of_remote_server
The numbers are always the same (e.g. 0x00, 0x1b, 0x0d, 0x5f) for a given binary build; with a different build the numbers can change too.
After this exception the remote host can enter a down state; it looks like something is wrong with the TCP stack.

We are using Ice 3.3 on Windows Server 2008, with 8-core hosts and 8 remote clients per host.

No other application can connect to the remote or local endpoints.
I'm not sure, but it is possible that when we used unidirectional connections there were no such exceptions.

Can you guess what might be happening? I'm ready for a deep investigation :) Thanks!

--
Andrew Solodovnikov

Comments

  • benoit (Rennes, France)
    Hi Andrew,

    Does this exception show up on the remote clients to which the server is connecting, or does it show up on your server? Could you detail the configuration of your clients and server (thread pool configuration, endpoints, etc.)?

    Once the Ice runtime receives a message with a bad magic number, it closes the connection. This explains why you're seeing a trace indicating that the Ice client runtime (this trace comes from your server, right?) tries to re-establish the connection.

    By "down state", do you mean that the server or client doesn't respond anymore to new connection requests? Did you try attaching with the debugger to the process to see if there were any suspicious stack traces (you can post the stack traces here, I'll be happy to take a look)?

    The best way to investigate this would be to provide us with a small, self-contained test case that we can compile and that reproduces the issue.

    Cheers,
    Benoit.
  • Hi Benoit, thanks for the answer.
    benoit wrote: »
    Hi Andrew,

    Does this exception show up on the remote clients to which the server is connecting, or does it show up on your server? Could you detail the configuration of your clients and server (thread pool configuration, endpoints, etc.)?
    This happens on the server side, but the server is actually an Ice client (it uses inverted logic and connects to the remote hosts). So from Ice's point of view this is the client side, and we have one client and 62 servers :)

    Ice configs (we use our own config system):
    server
      <IceProperties>
        <Item>--Ice.ThreadPool.Server.Size=32</Item>
        <Item>--Ice.ThreadPool.Client.Size=8</Item>
        <Item>--Ice.ACM.Client=0</Item>
        <Item>--Ice.MessageSizeMax=56000</Item>
      </IceProperties>
    

    client
      <IceProperties>
        <Item>--Ice.ThreadPool.Server.Size=4</Item>
        <Item>--Ice.ThreadPool.Client.Size=4</Item>
        <Item>--Ice.ACM.Client=0</Item>
        <Item>--Ice.MessageSizeMax=56000</Item>
      </IceProperties>
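
    For clarity, this is roughly what those properties amount to if set programmatically in C++ (values from the server config above; note that Ice.MessageSizeMax is in kilobytes, so 56000 allows messages up to about 56 MB):

      // Sketch only; in our application the properties actually come from our XML config system.
      Ice::InitializationData initData;
      initData.properties = Ice::createProperties();
      initData.properties->setProperty("Ice.ThreadPool.Server.Size", "32");
      initData.properties->setProperty("Ice.ThreadPool.Client.Size", "8");
      initData.properties->setProperty("Ice.ACM.Client", "0");        // disable active connection management for outgoing connections
      initData.properties->setProperty("Ice.MessageSizeMax", "56000"); // in KB
      Ice::CommunicatorPtr communicator = Ice::initialize(initData);
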
    
    benoit wrote: »
    Once the Ice runtime receives a message with a bad magic number, it closes the connection. This explains why you're seeing a trace indicating that the Ice client runtime (this trace comes from your server, right?) tries to re-establish the connection.

    Yes, that is clear from the Ice internals.
    benoit wrote: »
    By "down state", do you mean that the server or client doesn't respond anymore to new connection requests? Did you try attaching with the debugger to the process to see if there were any suspicious stack traces (you can post the stack traces here, I'll be happy to take a look)?

    Sometimes it responds fine, sometimes it responds very slowly, and sometimes it just doesn't respond at all, and not only on the Ice connection but for all network activity, so we need to reboot the server to restore its functionality. When we use test patterns with small response messages from the clients, it runs well. But in the real-world application, where responses can be huge (see our configs), we get the problems I mentioned before.
    benoit wrote: »
    The best way to investigate this would be to provide us with a small, self-contained test case that we can compile and that reproduces the issue.

    I'm sorry, but we can't reproduce this behaviour in test cases; moreover, we can't reproduce it on test hosts, only on the real cluster (8 x 8-core Windows 2008 servers, 62 clients) with real tasks... Test tasks run well; maybe there is a correlation with response size...

    Thanks for the answers!

    --
    Andrew Solodovnikov
  • benoit (Rennes, France)
    Hi Andrew,

    I suggest that we keep using the term "server" for the process which is calling on the 62 clients. I think I understand your application: the server establishes connections to the 62 clients and the 62 clients use these connections to call back on the server (with potentially large requests).

    Which protocol do you use, tcp or ssl? Do you use compression? Are connections forcefully closed in your application (either explicitly with the Ice connection close method or because of timeouts)? Are the clients using oneway or twoway calls and do they serialize the calls to the server or do they potentially send many concurrent calls?

    When you say that you sometimes need to reboot the server, do you mean that you need to reboot the machine or just kill and restart the server process?

    I'm afraid I don't know why you get these bad magic exceptions. Typically, these exceptions show up when a non-Ice client connects to the Ice server port and sends invalid data. But it sounds like this isn't possible in your case, right?
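
    For reference, every Ice protocol message starts with a fixed 14-byte header whose first four bytes are the magic 'I', 'c', 'e', 'P' (0x49, 0x63, 0x65, 0x50); a BadMagicException means the first four bytes read from the connection were something else. A rough sketch of the header layout (shown as a struct for clarity; on the wire there is no padding, and the field names are descriptive rather than the exact identifiers from the Ice sources):

      struct IceMessageHeader
      {
          Ice::Byte magic[4];          // 'I', 'c', 'e', 'P'; anything else raises Ice::BadMagicException
          Ice::Byte protocolMajor;     // 1
          Ice::Byte protocolMinor;     // 0
          Ice::Byte encodingMajor;     // 1
          Ice::Byte encodingMinor;     // 0
          Ice::Byte messageType;       // 0 = request, 1 = batch request, 2 = reply, 3 = validate connection, 4 = close connection
          Ice::Byte compressionStatus; // 0/1 = not compressed, 2 = compressed
          Ice::Int  messageSize;       // total size including this header, little-endian
      };

    This is also why validate connection messages show message size = 14 in a protocol trace: they consist of the header only.
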

    You could try enabling protocol and network tracing on both the clients and server (with --Ice.Trace.Network=2 and --Ice.Trace.Protocol=2) and analyse the tracing when the bad magic exceptions show up to see if there's anything suspicious. You can also post the traces here and I'll be happy to take a look.
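
    In the XML config format from your earlier post, that would be something along these lines, added to both the client and server property sets:

        <Item>--Ice.Trace.Network=2</Item>
        <Item>--Ice.Trace.Protocol=2</Item>
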

    Cheers,
    Benoit.
  • Hi, Benoit.
    benoit wrote: »
    Hi Andrew,

    I suggest that we keep using the term "server" for the process which is calling on the 62 clients. I think I understand your application: the server establishes connections to the 62 clients and the 62 clients use these connections to call back on the server (with potentially large requests).
    Ok, let it be so :)
    benoit wrote: »
    Which protocol do you use, tcp or ssl? Do you use compression? Are connections forcefully closed in your application (either explicitly with the Ice connection close method or because of timeouts)? Are the clients using oneway or twoway calls and do they serialize the calls to the server or do they potentially send many concurrent calls?

    We are using TCP, no compression, and no oneway calls. We do not close connections manually in this pattern; a connection can be closed only on a network error, for example a client session timeout. We often use asynchronous calls.
    Clients normally try to batch their data, so they can send many concurrent large blocks of data (1 KB to 50 MB, with 1 KB much more common), about 4-6 MB per second in total. Clients don't serialize their calls in any sense (on either the server or the client side), so we have many concurrent callbacks from clients over the bidirectional connections.
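
    The asynchronous calls (for example the ProcessItems requests the server sends to each client) use the standard Ice 3.3 AMI mapping. A rough sketch of the pattern; the Slice names Agent and ItemSeq are simplified stand-ins for our real interfaces, and the operation really is called ProcessItems:

      // Slice, with the ["ami"] metadata on the operation:
      //   ["ami"] void ProcessItems(ItemSeq items);

      class ProcessItemsCB : public AMI_Agent_ProcessItems
      {
      public:
          virtual void ice_response()
          {
              // the remote client accepted the batch
          }

          virtual void ice_exception(const Ice::Exception& ex)
          {
              // network errors such as the BadMagicException mentioned above end up here,
              // which is roughly where our "Can't process items on agent ..." error line comes from
          }
      };

      // Each batch is sent without waiting for the previous one to complete,
      // so many requests can be in flight on the same bidirectional connection:
      agent->ProcessItems_async(new ProcessItemsCB, items); // 'items' is the batch to send
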
    benoit wrote: »
    When you say that you sometimes need to reboot the server, do you mean that you need to reboot the machine or just kill and restart the server process?

    We often need to reboot the machine after such a situation; the network stack seems to be in a "down state" afterwards, with very slow network operations or sometimes no network operations possible at all.
    benoit wrote: »
    I'm afraid I don't know why you get these bad magic exceptions. Typically, these exceptions show up when a non-Ice client connects to the Ice server port and sends invalid data. But it sounds like this isn't possible in your case, right?

    Yes, I know the usual sources of such problems, but that is not possible in my case. Moreover, we tried changing the port numbers, with the same effect. And one more thing: when we change the binary, the magic numbers can also change. I think this behaviour really points to internal trouble or even to trouble in the TCP network stack... I just hope that you have already seen something like this, otherwise it will be really difficult to find the cause or a solution :(
    benoit wrote: »
    You could try enabling protocol and network tracing on both the clients and server (with --Ice.Trace.Network=2 and --Ice.Trace.Protocol=2) and analyse the tracing when the bad magic exceptions show up to see if there's anything suspicious. You can also post the traces here and I'll be happy to take a look.

    OK, I will try to do it. I remember that I already did this and there was nothing unexpected in the trace... but I will try it again. Thanks for the help! I know that it is really difficult to sort out such problems, especially with an elusive, "sliding" bug like this one...

    Many thanks!

    --
    Andrew.
  • Hi!

    I was able to reproduce the BadMagic exception with this pattern:
    1. The server sends a ProcessItems request to the client.
    2. The server raises a BadMagic exception.
    3. The connection is re-established.

    Here are the relevant parts of the client and server logs, with network and protocol traces enabled:
    Server:
    19 Jan 2009 08:08:30.355 [Debug] Protocol.sending asynchronous request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 1903
    request id = 31
    identity = 73e3ea89-a22b-4be7-a555-165bead43c7f
    facet = 
    operation = ProcessItems
    mode = 0 (normal)
    context = 
    19 Jan 2009 08:08:30.355 [Debug] Protocol.sending reply 
    message type = 2 (reply)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 25
    request id = 87
    reply status = 0 (ok)
    19 Jan 2009 08:08:30.355 [Debug] Protocol.received reply 
    message type = 2 (reply)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 45
    request id = 10
    reply status = 0 (ok)
    19 Jan 2009 08:08:30.370 [Debug] Protocol.sending reply 
    message type = 2 (reply)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 25
    request id = 50
    reply status = 0 (ok)
    19 Jan 2009 08:08:30.370 [Debug] Protocol.received reply 
    message type = 2 (reply)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 45
    request id = 23
    reply status = 0 (ok)
    19 Jan 2009 08:08:30.370 [Debug] Protocol.sending reply 
    message type = 2 (reply)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 25
    request id = 11
    reply status = 0 (ok)
    19 Jan 2009 08:08:30.370 [Error] Can't process items on agent: agent_id = 15, err = ThreadPool.cpp:664: Ice::BadMagicException:
    unknown magic number: 0x00, 0x30, 0x48, 0x65
    19 Jan 2009 08:08:30.370 [Debug] Protocol.received reply 
    message type = 2 (reply)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 45
    request id = 15
    reply status = 0 (ok)
    19 Jan 2009 08:08:30.370 [Debug] Network.trying to establish tcp connection to 172.16.128.5:56793
    19 Jan 2009 08:08:30.370 [Debug] Network.closing tcp connection
    local address = 172.16.129.177:60021
    remote address = 172.16.128.5:56793
    19 Jan 2009 08:08:30.370 [Debug] Network.tcp connection established
    local address = 172.16.129.177:60235
    remote address = 172.16.128.5:56793
    19 Jan 2009 08:08:30.370 [Debug] Protocol.received reply 
    message type = 2 (reply)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 45
    request id = 13
    reply status = 0 (ok)
    19 Jan 2009 08:08:30.370 [Debug] Protocol.received validate connection 
    message type = 3 (validate connection)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 14
    

    Client:
    19 Jan 2009 08:08:29.247 [Debug] Protocol.received request 
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 1903
    request id = 10
    identity = 8f49b15e-3da5-4376-8934-44408839f42e
    facet = 
    operation = ProcessItems
    mode = 0 (normal)
    context = 
    19 Jan 2009 08:08:29.247 [Debug] Protocol.sending reply 
    message type = 2 (reply)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 45
    request id = 10
    reply status = 0 (ok)
    19 Jan 2009 08:08:29.262 [Debug] Protocol.received reply 
    message type = 2 (reply)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 25
    request id = 38
    reply status = 0 (ok)
    19 Jan 2009 08:08:29.403 [Debug] Protocol.sending asynchronous request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 1405
    request id = 39
    identity = bc64dc47-d894-4829-8e95-594544624091
    facet = 
    operation = OnItemComplete
    mode = 0 (normal)
    context = 
    19 Jan 2009 08:08:29.793 [Debug] Protocol.sending asynchronous request
    message type = 0 (request)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 3544
    request id = 40
    identity = bc64dc47-d894-4829-8e95-594544624091
    facet = 
    operation = OnItemComplete
    mode = 0 (normal)
    context = 
    19 Jan 2009 08:08:30.293 [Debug] Network.accepted tcp connection
    local address = 172.16.128.5:56793
    remote address = 172.16.129.177:60235
    19 Jan 2009 08:08:30.293 [Debug] Protocol.sending validate connection 
    message type = 3 (validate connection)
    compression status = 0 (not compressed; do not compress response, if any)
    message size = 14
    

    I don't know on which ProcessItems request the server receives the bad-magic response, but the last request received by the client has id 10, while the last request id on the server is 31...

    After the BadMagic exception, the node with the remote client entered the "network down" state: network connections became slow, and we were able to restore this node to a normal state only by rebooting the machine.

    I've attached log files: http://www.zeroc.com/forums/attachment.php?attachmentid=644&stc=1&d=1232364993

    Many thanks!

    --
    Andrew.
  • Hi!
    Benoit, just FYI.
    It seems I was able to identify the source of the BadMagicException problem. There was trouble with the network adapter configuration. Our network adapters can offload some TCP stack operations, e.g. checksum calculation, large send segmentation, etc. All of these options are enabled by default. When we turned off the large send offload option, everything became fine.

    Maybe there is a bug in the network drivers or hardware (Intel PRO/1000 and Intel 82575EB adapters), or our network configuration doesn't allow large packets... In any case, this solved the problem.

    Thanks for support!

    --
    Andrew Solodovnikov
  • benoit (Rennes, France)
    Hi Andrew,

    This did indeed sound like a network problem, especially since you had to reboot the machine to get it back into a functional state. Anyway, I'm glad you found a solution and thanks for letting us know!

    Cheers,
    Benoit.