Archived

This forum has been archived. Please start a new discussion on GitHub.

Problem with .checkedCast

Hi, I'm working with Ice and have noticed the following problem when the primary registry's interface is down. When its interface is up, a task takes 0.10 seconds.
When the interface is down, the task generally takes 3.10 seconds. We think this is because of the extra time needed to connect to the secondary registry. We have noticed that the error occurs when calling ExecServantPrx.checkedCast(communicator.stringToProxy());
In the forum we have seen that other people have had similar issues, but in their case the cause was the stringToProxy call. We have verified that this is not our case: the actual delay is in the checkedCast.
Any help would be greatly appreciated.
Thank you

Comments

  • bernard
    bernard Jupiter, FL
    Hi Jose,

    Welcome to our forums!

    stringToProxy is a local call: it just manufactures a local proxy from the provided string. As a result, it's very fast.

    So most of the work (in particular the indirect-proxy resolution) occurs during the first remote invocation - in your case, the checkedCast call.
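    To make the distinction concrete, here is a toy model in Python (this is not the Ice API; the class and helper names are invented for illustration). The one accurate detail is that checkedCast is built on the ice_isA operation, which asks the server whether it implements the given interface:

```python
# Toy model of proxy creation and casting; NOT the Ice API.
class ToyProxy:
    def __init__(self, proxy_string):
        # stringToProxy only parses the string -- no network activity.
        self.proxy_string = proxy_string
        self.remote_calls = 0

    def ice_isA(self, type_id):
        # The first remote invocation on a proxy is where the locator
        # lookup and connection establishment actually happen.
        self.remote_calls += 1
        return True  # pretend the server implements the interface

def string_to_proxy(s):
    return ToyProxy(s)  # purely local, hence fast

def checked_cast(proxy, type_id):
    # checkedCast verifies the type with the server via ice_isA,
    # so it triggers the first (expensive) remote invocation.
    return proxy if proxy.ice_isA(type_id) else None

def unchecked_cast(proxy, type_id):
    # uncheckedCast skips the remote type check entirely.
    return proxy

p = string_to_proxy("ExecServant:tcp -h primary -p 4061")
print(p.remote_calls)   # 0: nothing remote has happened yet
checked_cast(p, "::ExecServant")
print(p.remote_calls)   # 1: the cast went over the network
```

    In the real API, uncheckedCast is the way to defer that first remote invocation if you do not need the server-side type check at cast time.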

    You should be able to find more information on what's happening in your client by enabling tracing. For example:
    Ice.Trace.Network=2
    Ice.Trace.Locator=2
    Ice.Trace.Retry=1

    (See http://www.zeroc.com/doc/Ice-3.4.1-IceTouch/manual/PropRef.50.2.html for details).

    Best regards,
    Bernard
  • Hi Bernard, thank you for answering.

    First of all I would like to quickly explain the situation:

    I have a Primary and Secondary registry.
    And two clients (clientA -> Primary, clientB -> Secondary), one on the primary registry node and the other on the secondary, for HA.

    For some unknown reason, the primary registry randomly stops its network interfaces, making any connection impossible.
    My original configuration for both clients used the following endpoints:
    Ice.Default.Locator=Grid/Locator:tcp -h primary -p 4061:tcp -h secondary -p 4061

    The problem was that clientA would obviously not be able to connect because its network was down.
    And clientB would connect, but since it tries endpoints in random order by default, it would
    sometimes spend unnecessary time trying to connect to the primary registry, and only after a random delay (generally 3 seconds, although sometimes much more) would it end
    up connecting to the secondary registry.

    The solution I have used was to set the clientB locator to:

    Ice.Default.Locator=Grid/Locator:tcp -h secondary -p 4061:tcp -h primary -p 4061
    &
    Ice.Default.EndpointSelection=Ordered

    With this configuration, when everything is up, the load is spread over the primary and secondary registries (clientA talks to the primary, clientB to the secondary).
    If clientA is down, as is the primary registry because of the network interface failure on that node,
    clientB connects to the secondary registry. There is no point in balancing load when there is only one node :D
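    The effect of Ordered versus the default Random endpoint selection can be sketched with a small, self-contained simulation (pure Python, not Ice; the 3-second penalty per dead endpoint is an assumption based on the timings above):

```python
import random

def connect_time(endpoints, alive, ordered, penalty=3.0, connect=0.1, rng=None):
    """Simulate the time to reach the first live endpoint.

    endpoints: list of host names in configuration order.
    alive:     set of hosts that accept connections.
    ordered:   True  -> try endpoints in configuration order (Ordered)
               False -> try them in random order (Ice's default, Random)
    Each dead endpoint costs `penalty` seconds before the next is tried.
    """
    order = list(endpoints)
    if not ordered:
        (rng or random).shuffle(order)
    total = 0.0
    for host in order:
        if host in alive:
            return total + connect
        total += penalty  # connection attempt times out
    raise RuntimeError("no endpoint reachable")

# Primary registry's interface is down; only 'secondary' answers.
alive = {"secondary"}

# clientB with secondary listed first and Ordered selection:
print(connect_time(["secondary", "primary"], alive, ordered=True))   # 0.1

# With Random selection, roughly half the attempts hit the dead primary first:
rng = random.Random(1)
times = [connect_time(["primary", "secondary"], alive, False, rng=rng)
         for _ in range(1000)]
print(max(times))   # 3.1 whenever the dead primary is tried first
```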

    So far so good, but I have noticed the following problem when executing from clientB (which should only connect to the secondary registry):
    when I stop the network interface on the primary registry, after approximately one minute (the TCP timeout on Linux is 60 seconds), the node that executes takes a LONG
    time to respond:

    -- 11/03/10 14:09:21.133 Locator: searching for object by id
    object = HPCBanestoV2Servant
    -- 11/03/10 14:09:21.137 Network: trying to establish tcp connection to 192.10.0.230:4061
    -- 11/03/10 14:09:21.137 Network: tcp connection established
    local address = 192.10.0.230:52030
    remote address = 192.10.0.230:4061
    -- 11/03/10 14:09:21.137 Network: received 14 of 14 bytes via tcp
    local address = 192.10.0.230:52030
    remote address = 192.10.0.230:4061
    -- 11/03/10 14:09:21.137 Network: sent 86 of 86 bytes via tcp
    local address = 192.10.0.230:52030
    remote address = 192.10.0.230:4061
    -- 11/03/10 14:09:21.137 Network: received 14 of 14 bytes via tcp
    local address = 192.10.0.230:52030
    remote address = 192.10.0.230:4061
    -- 11/03/10 14:09:21.138 Network: received 64 of 64 bytes via tcp
    local address = 192.10.0.230:52030
    remote address = 192.10.0.230:4061
    -- 11/03/10 14:09:21.138 Locator: retrieved endpoints from locator, adding to locator table
    object = HPCBanestoV2Servant
    endpoints = tcp -h 192.10.0.6 -p 35731
    -- 11/03/10 14:09:21.138 Network: trying to establish tcp connection to 192.10.0.6:35731
    -- 11/03/10 14:09:21.138 Network: tcp connection established
    local address = 192.10.0.230:49816
    remote address = 192.10.0.6:35731
    -- 11/03/10 14:09:21.138 Network: received 14 of 14 bytes via tcp
    local address = 192.10.0.230:49816
    remote address = 192.10.0.6:35731
    -- 11/03/10 14:09:21.138 Network: sent 84 of 84 bytes via tcp
    local address = 192.10.0.230:49816
    remote address = 192.10.0.6:35731
    -- 11/03/10 14:09:21.138 Network: received 14 of 14 bytes via tcp
    local address = 192.10.0.230:49816
    remote address = 192.10.0.6:35731
    -- 11/03/10 14:09:21.138 Network: received 12 of 12 bytes via tcp
    local address = 192.10.0.230:49816
    remote address = 192.10.0.6:35731
    -- 11/03/10 14:09:21.139 Network: sent 1413 of 1413 bytes via tcp
    local address = 192.10.0.230:49816
    remote address = 192.10.0.6:35731

    Exactly one minute after (I think it's the 60-second TCP timeout on Linux):
  • bernard
    bernard Jupiter, FL
    Hi Jose,

    I don't see any issue with the trace you posted:
    192.10.0.230 = secondary registry?
    192.10.0.6 = target object adapter
    and your entire trace covers just 6 ms (14:09:21.133 to 14:09:21.139)

    Could you clarify where you see a 60s delay?

    Best regards,
    Bernard
  • I am sorry, the message seems to have been cut off.
    I have pasted the rest of the message:


    Exactly one minute after (I think it's the 60-second TCP timeout on Linux):

    -- 11/03/10 14:10:21.192 Network: sent 14 of 14 bytes via tcp
    local address = 192.10.0.230:52030
    remote address = 192.10.0.230:4061
    -- 11/03/10 14:10:21.193 Network: closing tcp connection
    local address = 192.10.0.230:52030
    remote address = 192.10.0.230:4061

    This is the LONG time I was talking about!!!

    -- 11/03/10 14:24:42.250 Network: received 14 of 14 bytes via tcp
    local address = 192.10.0.230:49816
    remote address = 192.10.0.6:35731
    -- 11/03/10 14:24:42.250 Network: received 406 of 406 bytes via tcp
    local address = 192.10.0.230:49816
    remote address = 192.10.0.6:35731

    The end result is that a task that should take 0.13 seconds takes 921 seconds.
    After this, everything works perfectly.
    Any help would be greatly appreciated.
  • bernard
    bernard Jupiter, FL
    Hi Jose,
    -- 11/03/10 14:10:21.192 Network: sent 14 of 14 bytes via tcp
    local address = 192.10.0.230:52030
    remote address = 192.10.0.230:4061
    -- 11/03/10 14:10:21.193 Network: closing tcp connection
    local address = 192.10.0.230:52030
    remote address = 192.10.0.230:4061

    This just indicates that the Ice client runtime is closing the connection to the registry, most likely because it has been idle for over a minute. That's the default behavior; while you troubleshoot these issues, you may want to disable client-side active connection management (ACM): see http://www.zeroc.com/doc/Ice-3.4.1-IceTouch/manual/PropRef.50.11.html
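    For example, disabling client-side ACM in clientB's configuration could look like this (Ice.ACM.Client takes a value in seconds; 0 disables it):
    Ice.ACM.Client=0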

    The rest of your trace is 14 minutes later. Perhaps another pasting issue? It may be easier to attach your trace log than paste it inline.

    The first thing you want to find out here is what is taking so much time. Is it the connection establishment to the registry? To the target server? Or is it a request sent to the server that takes unusually long? (In that case, you'll want to look more closely at the implementation of this operation).
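    If the operating system's TCP timeout turns out to be the culprit, you can also cap connection establishment and invocations at the Ice level; the values below are in milliseconds and are only example figures to tune for your environment:
    Ice.Override.ConnectTimeout=5000
    Ice.Override.Timeout=10000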

    Tracing and possibly attaching with a debugger are your best tools here. You can enable protocol tracing to get information about the requests sent/received by Ice:
    Ice.Trace.Protocol=1

    Best regards,
    Bernard