
What are reasons for BadMagicException

rhochmuth (Roland Hochmuth, HP; Project: RGS)
Hi ZeroC, I'm trying to debug BadMagicException and other protocol errors in my application. The application uses bidirectional connections and sends lots of data in both directions. The exceptions usually occur after a relatively long time and, depending on the system, can take anywhere from 20 minutes to 20 hours to hit. The error locations have been moving around. For example, my latest exception occurred on line 751 in ThreadPool.cpp. Previously, I've hit line 169 in SslClientTransceiver.cpp, and other locations.

I'm currently using Ice 2.1. Do you have any ideas on what the issue might be or what I could do to debug this?

Regards --Roland

Comments

  • benoit (Benoit Foucher, ZeroC Staff, Rennes, France)
    Hi Roland,

    I can't see any exception thrown from ThreadPool.cpp line 751 in Ice 2.1.0, perhaps you meant line 728? (that's where Ice::BadMagicException can be thrown). On which platform does this occur?

    It sounds like your Ice application receives unexpected data over the connection (the Ice::BadMagicException indicates that the first 4 bytes received for a request don't match the expected bytes 'IceP'). It's strange that it only occurs randomly after some time. We need more information to figure out where this could come from, so here are a few questions... :) (ideally, a small test case would be best).
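    For reference, a minimal sketch of that check (the 'IceP' magic bytes come from the Ice protocol; the helper function name here is made up for illustration):

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstring>

    // Hypothetical helper illustrating the check behind BadMagicException:
    // the first four bytes of every Ice protocol message must be 'I' 'c' 'e' 'P'.
    bool hasIceMagic(const char* header, std::size_t len)
    {
        static const char magic[4] = { 'I', 'c', 'e', 'P' };
        return len >= 4 && std::memcmp(header, magic, 4) == 0;
    }

    int main()
    {
        const char good[] = { 'I', 'c', 'e', 'P', 1, 0 };
        const char bad[]  = { 'X', 'Y', 'Z', 'W', 1, 0 };
        assert(hasIceMagic(good, sizeof(good)));
        assert(!hasIceMagic(bad, sizeof(bad)));  // this is the case that raises BadMagicException
        return 0;
    }
    ```

    If anything other than those four bytes arrives at the start of a message (stale data from a forcefully closed connection, a mismatched SSL library corrupting the stream, etc.), this is the exception you'd see.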

    Does this occur with both the tcp and ssl endpoints? Or is this only for ssl? Are you sure you're not picking up an OpenSSL library which doesn't match the one used to build the IceSSL plugin? Is your client or server forcefully closing connections with Ice::Connection::close(true) and if that's the case could this by any chance occur when the connection is forcefully closed by the peer?

    Also, could you perhaps try with a recent Ice version?

    Cheers,
    Benoit.
  • rhochmuth (Roland Hochmuth, HP)
    Hi Benoit, I guess I reported the wrong line number; not sure how I got that wrong. The issue is occurring on Windows for both tcp and ssl. I do force-close connections and I'll try running without that. It is a little difficult to trace which side is closing down first, but it's possible the other end is force-closing. I'm positive I'm picking up the correct SSL libraries. I'll also try upgrading to Ice 3.2. I was also going to attempt a small test case, but I don't have one yet.

    Regards --Roland
  • rhochmuth (Roland Hochmuth, HP)
    Hi Benoit, I've created a program derived from the bidir demo program that seems to show a problem. What I'm seeing is that the program ends in deadlock after running for a rather long time, around thirty minutes on my system. The deadlock occurs as follows:

    1. TcpTransceiver.cpp, line 132. Waiting for select(...) to return. It looks like on line 97 ::send(...) returned SOCKET_ERROR, and then when select(...) is invoked it never returns.

    2. CallbackI.cpp, line 58. Trying to lock a mutex, but the same mutex is already held in the callstack that is waiting on select above.

    This isn't the same issue that I'm seeing in my program, but I think that is because the timeouts are set to infinite here so it just hangs on select in this case.

    The main program modification is that both a client and server thread are created that just continually write 64K buffers over the bidir connection using oneway invocations. I've hard-coded a bunch of properties in the client and server apps. I just run the program locally on my system. I was using Ice 2.1 when running this test. I haven't tried 3.2 yet.

    Regards --Roland
  • benoit (Benoit Foucher, ZeroC Staff, Rennes, France)
    Hi Roland,

    Could you post the stack traces of the client & server when it deadlocks? I'm not sure this is the same problem. From looking at your code, I think the deadlock might be expected:

    - the client calls CallbackSender::callback() with a oneway call from the CallbackClientI::send() method and with the m_mutex locked. The client callback method implementation also locks the m_mutex.

    - the server calls CallbackReceiver::callback() with a oneway call from the CallbackSenderI::send() method and with the m_mutex locked. The server callback method implementation also locks the m_mutex.

    If the oneway call in the client blocks (because the TCP/IP buffer is full, for example), the m_mutex won't be released. As a result, the server will also end up blocking, because the client can't dispatch the callback() call from the server while m_mutex is locked.

    At this point, I believe you'll get a deadlock. The client oneway remains blocked because the server thread that is reading over the connection to dispatch a callback call from the client is stuck waiting to acquire the m_mutex mutex from CallbackSenderI.

    The best would be to release the m_mutex mutex before calling callback() (either on the client or server).
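    A minimal sketch of that fix, with a made-up sendData() standing in for the oneway callback() invocation: the shared state is copied under the lock, and m_mutex is released before the remote call is made, so an incoming dispatch can still acquire it.

    ```cpp
    #include <cassert>
    #include <mutex>
    #include <string>
    #include <vector>

    std::mutex m_mutex;
    std::vector<std::string> m_queue;   // state protected by m_mutex

    // Stand-in for the potentially blocking oneway callback() invocation.
    int sendData(const std::vector<std::string>& batch) { return (int)batch.size(); }

    // Copy the pending work under the lock, then release m_mutex *before*
    // making the remote call, so a dispatch thread can still lock it.
    int flush()
    {
        std::vector<std::string> batch;
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            batch.swap(m_queue);          // grab the work while locked
        }                                 // m_mutex released here
        return sendData(batch);           // remote call made without the lock
    }

    int main()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_queue.push_back("msg1");
            m_queue.push_back("msg2");
        }
        assert(flush() == 2);
        assert(m_queue.empty());
        return 0;
    }
    ```

    The key point is only that no remote invocation is made while m_mutex is held; the queue/swap shape is just one common way to arrange that.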

    Cheers,
    Benoit.
  • rhochmuth (Roland Hochmuth, HP)
    Hi Benoit, Thanks for the very nice explanation! You are probably right. One thing that I failed to model in this demo is that in my application the invocations are supposedly paced: there should be at most six outstanding oneways before waiting for incoming messages. That is the theory, anyway; I'll need to validate it. In this sample there is no pacing, so the invocations can get buffered up until space runs out, and that is when send(...) and then select(...) don't return while holding m_mutex, if I understood what you are saying.
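    That kind of pacing could be sketched roughly as follows (this is an illustrative gate, not code from the application; the class and method names are invented):

    ```cpp
    #include <cassert>
    #include <condition_variable>
    #include <mutex>

    // Hypothetical pacing gate: allow at most `limit` outstanding oneway
    // invocations; the sender blocks until an acknowledgement frees a slot.
    class Pacer
    {
    public:
        explicit Pacer(int limit) : _limit(limit), _outstanding(0) {}

        void beforeSend()                 // call before each oneway invocation
        {
            std::unique_lock<std::mutex> lock(_mutex);
            _cond.wait(lock, [this] { return _outstanding < _limit; });
            ++_outstanding;
        }

        void onAck()                      // call when the peer acknowledges
        {
            std::lock_guard<std::mutex> lock(_mutex);
            --_outstanding;
            _cond.notify_one();
        }

        int outstanding() const
        {
            std::lock_guard<std::mutex> lock(_mutex);
            return _outstanding;
        }

    private:
        mutable std::mutex _mutex;
        std::condition_variable _cond;
        const int _limit;
        int _outstanding;
    };

    int main()
    {
        Pacer pacer(6);
        for (int i = 0; i < 6; ++i)
            pacer.beforeSend();           // fill all six slots without blocking
        assert(pacer.outstanding() == 6);
        pacer.onAck();                    // an ack frees one slot
        assert(pacer.outstanding() == 5);
        return 0;
    }
    ```

    Note that beforeSend() must not be called while holding the mutex used by the callback dispatch path, or the same deadlock reappears.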

    I'll keep plugging away and get you the call stack.

    Regards --Roland
  • rhochmuth (Roland Hochmuth, HP)
    Hi Benoit, Sorry about the delay. I've been going down a few other paths in an attempt to resolve this issue on my own, but no such luck yet. What I ended up doing is starting with my product, and carving out and simplifying as much as I could, while still replicating the issue. I've attached two simple programs.

    1. pusb.
    2. vusb.

    pusb is the "client" side. vusb is the "server" side.

    If you are going to build and run this, the solutions are located in ...\DisconnectDefect\pusb and ...\DisconnectDefect\vusb respectively. If you have Ice 2.1.2 installed in c:\Ice-2.1.2 it should be painless to build and run both. Just start vusb on a system first, then start pusb; pusb will prompt for the hostname to connect to. Eventually the program will fail, sometimes very quickly (< 30 secs), but sometimes longer, possibly 30 minutes.

    Each end has a thread class SendMessageThread that sends messages to the other end, like the earlier example that started with the bidir demo. Hopefully, I've been smarter about pacing so I don't have the same issue as above where it appeared one end couldn't keep up and eventually some buffer just filled up.

    With this program I've seen a variety of failures, such as BadMagicException, IceSSL::ProtocolException, and others. So it seems to be replicating what I'm seeing in my program.

    I've also attached a pdf of the Class diagram which might help out describing the inheritance.

    I'll still be plugging away on this, but if you have any ideas please let me know. I don't see any obvious functional differences between the bidir demo program that I sent last week and this demo program. The inheritance is different and I was wondering if the double diamond that you'll see in the class diagram was the cause of the problem. I'll be removing that later today.

    Regards --Roland
  • rhochmuth (Roland Hochmuth, HP)
    Hi Benoit, Removing the multiple inheritance from the attached demo program had no impact.

    Regards --Roland
  • benoit (Benoit Foucher, ZeroC Staff, Rennes, France)
    Hi Roland,

    Thanks for this new test case.

    I haven't looked closely at the code yet, but I was able to compile it and it's currently running. So far, it works fine (it has been running for more than 30 minutes). I'm running it on a Windows XP machine, compiled with VC71 using the Ice 2.1.2 binary distribution.

    Are you running the client & server on the same machine or different machines? Which Windows version do you use exactly?

    Thanks,

    Benoit.
  • rhochmuth (Roland Hochmuth, HP)
    Hi Benoit, Thanks for taking a look at this so quickly. It will fail when running on the same machine, but it usually takes much longer. My development system is a dual-core laptop. Also, when running one end, usually the server, in a virtual machine such as VMware Workstation, it can take a while to fail. I've had the best luck when running it over a network between two separate systems; in that case I've seen failure occur within 30 secs. Usually my laptop is the client and the other end is a dual-processor workstation.

    Regards --Roland
  • benoit (Benoit Foucher, ZeroC Staff, Rennes, France)
    Hi Roland,

    I was finally able to reproduce the Ice::BadMagicException by running your test case on two Windows XP machines on a 100 Mbps Ethernet network. Thanks again for the test case.

    I then tried the test case with Ice 3.2.0 and so far it works just fine. I suspect a problem with the IceSSL plugin from Ice 2.1.2. Could you try with Ice 3.2.0 to see if it works fine for you as well? Note that the IceSSL plugin was completely re-written and improved in Ice 3.1.0 so it would be a good idea to upgrade to the latest Ice 3.2.0 version. Of course, we can help you with the upgrade if needed.

    I'm attaching the test case ported to Ice 3.2.0. You should be able to build it with VC71 and the Ice-3.2.0-VC71.msi package from our download page.

    Cheers,
    Benoit.
  • rhochmuth (Roland Hochmuth, HP)
    Hi Benoit, Thanks again for the quick response! I was able to get my sample programs working with Ice 3.2.0. At 3.2.0 I was unable to reproduce the disconnect. It looks like 3.2.0 resolves the problem. I'll be looking into porting our application to Ice 3.2.0 next. I think I have a good handle on how to do that and that it will go relatively smoothly.

    Thanks again --Roland
  • rhochmuth (Roland Hochmuth, HP)
    Hi Benoit, The port to Ice 3.2 for our product has gone smoothly and all seems to be working so far. I haven't tested for the disconnect issue above.

    There are a couple of properties that have been removed and I'm wondering what to do about them. These are the properties and values that we set.

    IceSSL.Client.Handshake.Retries=2
    Ice.ConnectionIdleTime=0
    Ice.Warn.Leaks=0

    Do these properties map to anything in Ice 3.2?

    Thanks --Roland
  • benoit (Benoit Foucher, ZeroC Staff, Rennes, France)
    Hi Roland,

    I'm glad to hear the transition went smoothly!

    The IceSSL.Client.Handshake.Retries property was removed from the new IceSSL plugin. Retries are now handled only at the Ice runtime level.

    The Ice.ConnectionIdleTime property was deprecated in 2.1.1 and replaced (and later removed) by the Ice.ACM.Client and Ice.ACM.Server properties. Server-side ACM is disabled by default, but you should disable client-side ACM with Ice.ACM.Client=0 if you previously set Ice.ConnectionIdleTime to 0.

    The Ice.Warn.Leaks property and the associated leak warning were removed.
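    In summary, a possible mapping of your old settings (assuming you set Ice.ConnectionIdleTime=0 to keep idle connections open):

    ```
    # Ice 2.1 property                     Ice 3.2 equivalent
    # IceSSL.Client.Handshake.Retries=2 -> no replacement; retries handled by the Ice runtime
    # Ice.ConnectionIdleTime=0          -> disable client-side ACM instead:
    Ice.ACM.Client=0
    # Ice.Warn.Leaks=0                  -> no replacement; the leak warning was removed
    ```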

    Cheers,
    Benoit.