Archived

This forum has been archived. Please start a new discussion on GitHub.

icegridnode-managed program restart problem

I use Ice 3.4.2 with icegridregistry and icegridnode to manage my program. The icegridregistry config is:
IceGrid.InstanceName=substation
IceGrid.Registry.Client.Endpoints=default -h 192.168.3.94 -p 4061 -t 5000
IceGrid.Registry.Server.Endpoints=default -h 192.168.3.94 -t 5000
IceGrid.Registry.Internal.Endpoints=default -h 192.168.3.94 -t 5000
IceGrid.Registry.Data=iceDb/registry
IceGrid.Registry.PermissionsVerifier=substation/NullPermissionsVerifier
IceGrid.Registry.AdminPermissionsVerifier=substation/NullPermissionsVerifier
IceGrid.Registry.NodeSessionTimeout=10
IceGrid.Registry.ReplicaSessionTimeout=10
IceGrid.Registry.Trace.Adapter=0
IceGrid.Registry.Trace.Application=0
IceGrid.Registry.Trace.Locator=0
IceGrid.Registry.Trace.Node=1
IceGrid.Registry.Trace.Object=0
IceGrid.Registry.Trace.Patch=0
IceGrid.Registry.Trace.Replica=0
IceGrid.Registry.Trace.Server=1
IceGrid.Registry.Trace.Session=0
Ice.Default.Locator=substation/Locator:tcp -h 192.168.3.94 -p 4061

The icegridnode config is:
Ice.Default.Locator=substation/Locator: default -h 192.168.3.94 -p 4061 -t 5000
IceGrid.Node.Name=datasvrNode
IceGrid.Node.Endpoints=default -t 5000
IceGrid.Node.Data=iceDb/datasvr
The deployment XML is:
<icegrid>
<application  name="datasvr">
<node name="datasvrNode">
<server id="datasvrServer" exe="sudo" activation="always">
<option>/gridnt/bin/datasvr/datasvr</option>
<option>-flagfile=/gridnt/bin/datasvr/datasvr.cfg</option>
<adapter name="datasvrAdapter" id="datasvrAdapter" endpoints="tcp -h 192.168.3.94"/>
<property name="Subscriber.Endpoints" value="tcp -h 192.168.3.94"/>
<property name="Ice.MessageSizeMax" value="10240"/>
<property name="TopicManager.Proxy" value="substation/TopicManager:tcp -h 192.168.3.94 -p 10000:tcp -h 192.168.3.94 -p 10010:tcp -h 192.168.3.94 -p 10020"/>
</server>
</node>
</application>
</icegrid>

In general this works well: when I kill the datasvr program with "pkill -9 datasvr", it restarts. But when I run the pkill command in quick succession, about 10 or more times, the program no longer restarts. The datasvrServer status is inactive, and I have to run "server enable" before it can be restarted.

My question is: why does this happen, and how can I resolve it? Thanks.

Comments

  • benoit
    benoit Rennes, France
    Hi,

    Can you add IceGrid.Node.Trace.Server=2 and IceGrid.Node.Trace.Activator=2 to your IceGrid node configuration file, try again and post the traces here?
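    In the node configuration posted above, that means adding:

    IceGrid.Node.Trace.Server=2
    IceGrid.Node.Trace.Activator=2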

    Cheers,
    Benoit.
  • Thank you for the reply. Below is the trace log:
    -- 06/15/12 17:28:50.067 icegridnode: Server: changed server `gooseServer' state to `Active'
    -- 06/15/12 17:28:50.126 icegridnode: Activator: detected termination of server `gooseServer'
       exit code = 137
    -- 06/15/12 17:28:50.126 icegridnode: Server: changed server `gooseServer' state to `Inactive'
    -- 06/15/12 17:28:50.626 icegridnode: Activator: activating server `gooseServer'
       path = sudo
       pwd = /gridnt/bin/collector
       uid/gid = 99/99
       args = sudo /gridnt/bin/collector/goose -flagfile=/gridnt/bin/collector/goose.cfg --Ice.Config=/gridnt/bin/collector/iceDb/goose/servers/gooseServer/config/config
    -- 06/15/12 17:28:50.681 icegridnode: Server: changed server `gooseServer' state to `Active'
    -- 06/15/12 17:28:51.308 icegridnode: Activator: detected termination of server `gooseServer'
       exit code = 137
    -- 06/15/12 17:28:51.309 icegridnode: Server: changed server `gooseServer' state to `Inactive'
    -- 06/15/12 17:28:51.809 icegridnode: Activator: activating server `gooseServer'
       path = sudo
       pwd = /gridnt/bin/collector
       uid/gid = 99/99
       args = sudo /gridnt/bin/collector/goose -flagfile=/gridnt/bin/collector/goose.cfg --Ice.Config=/gridnt/bin/collector/iceDb/goose/servers/gooseServer/config/config
    -- 06/15/12 17:28:51.838 icegridnode: Activator: detected termination of server `gooseServer'
       exit code = 137
    -- 06/15/12 17:28:51.843 icegridnode: Server: changed server `gooseServer' state to `Inactive'
    
    
  • benoit
    benoit Rennes, France
    The traces don't show anything wrong. So at this point the server doesn't restart on demand? What does the icegridadmin "server state <server id>" command return? Are you able to reproduce this with the IceGrid simple demo? Can you also specify on which platform this is occurring?

    Cheers,
    Benoit.
  • I use a CentOS 6.2 x86 64-bit system. Checking the state with icegridadmin gives:
    bash-4.1# icegridadmin -uadmin -padmin -e "server state gooseServer" --Ice.Config=/gridnt/bin/collector/icegridcnf/goose.node
    inactive (disabled)
    
    I will try the IceGrid simple demo later. Thanks for the reply.
  • Hi,
    I tried the IceGrid simple demo and it works correctly. Previously the server ran as the nobody user; today I changed the Ice config so the server runs as root, and now the server restarts automatically. Do you think this is caused by our program or by the system?
  • benoit
    benoit Rennes, France
    Hi,

    Are you using the "always" activation mode for your server? IceGrid will automatically disable a server that uses the "always" activation mode if it keeps dying before reaching the "Active" state. This ensures that misbehaving servers don't keep spawning and dying indefinitely.

    Cheers,
    Benoit.
  • Yes, I use the "always" activation mode.
  • benoit
    benoit Rennes, France
    Hi,

    OK, that explains it. If you kill the process before it becomes active and its activation mode is "always", the IceGrid node will automatically disable it to prevent the process from spawning indefinitely in case it is misbehaving during startup.

    Cheers,
    Benoit.
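    For reference, a server disabled this way can be re-enabled from icegridadmin; since its activation mode is "always", the node should then activate it again. A minimal example, reusing the credentials and config file from the earlier command:

    icegridadmin -uadmin -padmin -e "server enable gooseServer" --Ice.Config=/gridnt/bin/collector/icegridcnf/goose.node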
  • Hi Benoit,
    thank you for the reply, I understand this issue now. But recently I came across another question: as a test, when I kill the icegridnode process, the icebox server process automatically exits as well. Why is that?
    My icebox config file is:
    <icegrid>
    <application name="icebox">
    
    <server-template id="IceStormTemplate">
    <parameter name="index"/>
    <parameter name="node-endpoints"/>
    <parameter name="nodes-0"/>
    <parameter name="nodes-1"/>
    <parameter name="nodes-2"/>
    <parameter name="topic-manager-endpoints"/>
    <parameter name="replicated-topic-manager-endpoints"/>
    <parameter name="instance-name"/>
    <parameter name="publish-endpoints"/>
    <parameter name="replicated-publish-endpoints"/>
    
    <icebox id="${instance-name}-${index}" exe="icebox" activation="always">
    
    <service name="IceStorm" entry="IceStormService,34:createIceStorm">
    
    <dbenv name="${service}"/>
    
    <properties>
    <property name="${service}.NodeId" value="${index}"/>
    <property name="${service}.Node.Endpoints" value="${node-endpoints}"/>
    <property name="${service}.Nodes.0" value="${nodes-0}"/>
    <property name="${service}.Nodes.1" value="${nodes-1}"/>
    <property name="${service}.Nodes.2" value="${nodes-2}"/>
    <property name="${service}.TopicManager.Endpoints" value="${topic-manager-endpoints}"/>
    <property name="${service}.ReplicatedTopicManagerEndpoints" value="${replicated-topic-manager-endpoints}"/>
    <property name="${service}.InstanceName" value="${instance-name}"/>
    <property name="${service}.Publish.Endpoints" value="${publish-endpoints}"/>
    <property name="${service}.ReplicatedPublishEndpoints" value="${replicated-publish-endpoints}"/>
    
    <property name="${service}.Trace.TopicManager" value="2"/>
    <property name="${service}.Trace.Topic" value="1"/>
    <property name="${service}.Trace.Subscriber" value="1"/>
    <property name="${service}.Trace.Election" value="1"/>
    </properties>
    </service>
    </icebox>
    </server-template>
    
    <replica-group id="PublishReplicaGroup">
    </replica-group>
    
    <replica-group id="TopicManagerReplicaGroup">
    <object identity="substation/TopicManager" type="::IceStorm::TopicManager"/>
    </replica-group>
    
    <node name="iceboxNode">
    <server-instance template="IceStormTemplate" index="0" 
    node-endpoints="default  -h 192.168.3.94 -p 13000" 
    nodes-0="substation/node0:default  -h 192.168.3.94 -p 13000"
    nodes-1="substation/node1:default  -h 192.168.3.94 -p 13010"
    nodes-2="substation/node2:default  -h 192.168.3.94 -p 13020"
    topic-manager-endpoints="tcp -h 192.168.3.94 -p 10000" 
    replicated-topic-manager-endpoints="tcp -h 192.168.3.94 -p 10000:tcp -h 192.168.3.94 -p 10010:tcp -h 192.168.3.94 -p 10020" 
    instance-name="substation" 
    publish-endpoints="tcp -h 192.168.3.94 -p 10001" 
    replicated-publish-endpoints="tcp -h 192.168.3.94 -p 10001:tcp -h 192.168.3.94 -p 10011:tcp -h 192.168.3.94 -p 10021" 
    />
    <server-instance template="IceStormTemplate" index="1" 
    node-endpoints="default  -h 192.168.3.94 -p 13010" 
    nodes-0="substation/node0:default  -h 192.168.3.94 -p 13000"
    nodes-1="substation/node1:default  -h 192.168.3.94 -p 13010"
    nodes-2="substation/node2:default  -h 192.168.3.94 -p 13020"
    topic-manager-endpoints="tcp -h 192.168.3.94 -p 10010" 
    replicated-topic-manager-endpoints="tcp -h 192.168.3.94 -p 10000:tcp -h 192.168.3.94 -p 10010:tcp -h 192.168.3.94 -p 10020" 
    instance-name="substation" 
    publish-endpoints="tcp -h 192.168.3.94 -p 10011" 
    replicated-publish-endpoints="tcp -h 192.168.3.94 -p 10001:tcp -h 192.168.3.94 -p 10011:tcp -h 192.168.3.94 -p 10021" 
    />
    <server-instance template="IceStormTemplate" index="2" 
    node-endpoints="default  -h 192.168.3.94 -p 13020" 
    nodes-0="substation/node0:default  -h 192.168.3.94 -p 13000"
    nodes-1="substation/node1:default  -h 192.168.3.94 -p 13010"
    nodes-2="substation/node2:default  -h 192.168.3.94 -p 13020"
    topic-manager-endpoints="tcp -h 192.168.3.94 -p 10020" 
    replicated-topic-manager-endpoints="tcp -h 192.168.3.94 -p 10000:tcp -h 192.168.3.94 -p 10010:tcp -h 192.168.3.94 -p 10020" 
    instance-name="substation" 
    publish-endpoints="tcp -h 192.168.3.94 -p 10021" 
    replicated-publish-endpoints="tcp -h 192.168.3.94 -p 10001:tcp -h 192.168.3.94 -p 10011:tcp -h 192.168.3.94 -p 10021" 
    />
    </node>
    </application>
    </icegrid>
    
    
  • benoit
    benoit Rennes, France
    Hi,

    This is the expected behavior: the IceGrid node deactivates all the servers it manages when it is shut down.

    Cheers,
    Benoit.
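    For reference, the same deactivation happens on a clean shutdown as well, for example stopping the node through icegridadmin instead of killing the process (assuming icegridadmin is pointed at the registry as in the earlier commands):

    icegridadmin -uadmin -padmin -e "node shutdown iceboxNode"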
  • Hi Benoit,
    thank you for the reply, I understand it now. Thanks. :)
  • Hi Benoit,
    my icebox config XML is the one shown above in #10. Normally the icebox works well. As a test, I wrote a shell script:
    #!/bin/sh

    while true
    do
    kill `ps ax|grep substation-2|grep icebox |awk '{print $1}'`
    sleep 1
    kill `ps ax|grep substation-1|grep icebox |awk '{print $1}'`
    sleep 1
    kill `ps ax|grep substation-0|grep icebox |awk '{print $1}'`
    sleep 1
    done

    This ran for 10 minutes; then I stopped the script. The icebox log is:
    -- 07/06/12 15:22:07.120 substation-2-IceStorm: Election: node 2: I have the latest database state.
    -- 07/06/12 15:22:07.125 substation-0-IceStorm:substation/topic.ThrugoutList subscribers: 9F244485-DA7A-4AD8-9E23-202C1D3A091B endpoints: "tcp -h 192.168.3.109 -p 52710"
    -- 07/06/12 15:22:07.157 substation-1-IceStorm:substation/topic.ThrugoutList subscribers: 9F244485-DA7A-4AD8-9E23-202C1D3A091B endpoints: "tcp -h 192.168.3.109 -p 52710"
    
    
    -- 07/06/12 15:24:07.176 substation-1-IceStorm: Subscriber: 0xaf0ec0 9F244485-DA7A-4AD8-9E23-202C1D3A091B subscriber errored out: ConnectionI.cpp:1661: Ice::ConnectTimeoutException:
       timeout while establishing a connection retry: 0/0
    
    
    -- 07/06/12 15:24:08.060 substation-2-IceStorm: Topic: ThrugoutList: reap 9F244485-DA7A-4AD8-9E23-202C1D3A091B
    -- 07/06/12 15:24:08.078 substation-0-IceStorm: Topic: ThrugoutList: remove replica observer: 9F244485-DA7A-4AD8-9E23-202C1D3A091B llu: 1/12
    -- 07/06/12 15:24:08.078 substation-1-IceStorm: Topic: ThrugoutList: remove replica observer: 9F244485-DA7A-4AD8-9E23-202C1D3A091B llu: 1/12
    
    From the above it looks like the ThrugoutList subscriber is lost because of the ConnectTimeoutException, but when I check with the icestormadmin command the topic ThrugoutList still exists. How can I resolve this problem? Please help me, thanks.
  • The replica state is:
    [root@demo109 appsvr]# icestormadmin --Ice.Config=./config.sub
    Ice 3.4.2  Copyright 2003-2011 ZeroC, Inc.
    >>> replica
    replica count: 3
    0: id:         0
    0: coord:      2
    0: group name: 2:D811F919-7C5C-423E-807D-2846E2208B7E
    0: state:      normal
    0: group:
    0: max:        3
    1: id:         1
    1: coord:      2
    1: group name: 2:D811F919-7C5C-423E-807D-2846E2208B7E
    1: state:      normal
    1: group:
    1: max:        3
    2: id:         2
    2: coord:      2
    2: group name: 2:D811F919-7C5C-423E-807D-2846E2208B7E
    2: state:      normal
    2: group:      0,1
    2: max:        3
    
  • Hi,
    I am now reading the Ice 3.4.2 manual (Ice-Manual.pdf), which says:
    A retry count of -1 adds some resiliency to your IceStorm application by ignoring intermittent network failures such as ConnectionRefusedException. However, there is also some risk inherent in using a retry count of -1 because an improperly configured subscriber may never be removed. For example, consider what happens when a subscriber registers using a transient endpoint: if that subscriber happens to terminate and resubscribe with a different endpoint, IceStorm will continue trying to deliver events to the subscriber at its old endpoint. IceStorm can only remove the subscriber if it receives a hard error, and that is only possible when the subscriber is reachable.
    
    Can I use this configuration? Thanks.
  • benoit
    benoit Rennes, France
    Hi,

    You should first figure out what is causing this connection timeout exception and see whether or not it is expected. If it is expected and can happen from time to time, you can indeed use the retry count IceStorm QoS to get IceStorm to retry the connection establishment to your subscriber.

    Cheers,
    Benoit.
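    For illustration, a minimal C++ sketch of setting that retry count when the subscriber registers; the helper name and parameters are placeholders, not the application's actual code:

    #include <Ice/Ice.h>
    #include <IceStorm/IceStorm.h>

    // Hypothetical helper: subscribe an existing subscriber proxy to the
    // ThrugoutList topic with an unlimited retry count.
    void subscribeWithRetry(const IceStorm::TopicManagerPrx& topicManager,
                            const Ice::ObjectPrx& subscriber)
    {
        // Look up the topic (throws IceStorm::NoSuchTopic if it does not exist).
        IceStorm::TopicPrx topic = topicManager->retrieve("ThrugoutList");

        IceStorm::QoS qos;            // dictionary<string, string>
        qos["retryCount"] = "-1";     // keep the subscription despite connection failures

        topic->subscribeAndGetPublisher(qos, subscriber);
    }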
  • Hi Benoit,
    thanks for the help; I will look into the cause. Thanks. :)
  • Hi Benoit,
    I think I have found the reason: connections stuck in CLOSE_WAIT are causing the Ice::ConnectTimeoutException.
    substation/topic.ThrugoutList subscribers: B069E7B8-E95B-4C95-A46C-A412F64E1A0E endpoints: "tcp -h 192.168.3.109 -p 59636"
    
    [root@demo109 appsvr]# netstat -an|grep 59636
    tcp        0      0 192.168.3.109:59636         0.0.0.0:*                   LISTEN
    tcp     3436      0 192.168.3.109:59636         192.168.3.109:53884         CLOSE_WAIT
    tcp        1      0 192.168.3.109:59636         192.168.3.109:53972         CLOSE_WAIT
    tcp        1      0 192.168.3.109:59636         192.168.3.109:54023         CLOSE_WAIT
    tcp        1      0 192.168.3.109:59636         192.168.3.109:54019         CLOSE_WAIT
    tcp        0      0 192.168.3.109:54023         192.168.3.109:59636         FIN_WAIT2
    tcp        1      0 192.168.3.109:59636         192.168.3.109:54017         CLOSE_WAIT
    tcp        1      0 192.168.3.109:59636         192.168.3.109:54018         CLOSE_WAIT
    tcp        1      0 192.168.3.109:59636         192.168.3.109:54020         CLOSE_WAIT
    tcp        1      0 192.168.3.109:59636         192.168.3.109:54021         CLOSE_WAIT
    tcp        1      0 192.168.3.109:59636         192.168.3.109:53896         CLOSE_WAIT
    tcp        1      0 192.168.3.109:59636         192.168.3.109:53975         CLOSE_WAIT
    tcp        1      0 192.168.3.109:59636         192.168.3.109:53887         CLOSE_WAIT
    
    

    My question is: should our Ice code handle this situation, or do I need to modify my Linux system configuration (CentOS 6.2)?
    Thanks.
  • benoit
    benoit Rennes, France
    Hi,

    The CLOSE_WAIT indicates that your subscriber isn't closing its connections. The most likely reason is that your subscriber is somehow "hanging" or running into a deadlock that prevents it from dispatching or reading incoming messages. The best would be to attach the debugger to the subscriber and get a thread dump of all its threads to see what the threads are doing.

    Cheers,
    Benoit.
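    For reference, one way to take such a thread dump on CentOS is with gdb (the pid is a placeholder for the subscriber's process id):

    gdb -p <subscriber-pid>
    (gdb) thread apply all bt
    (gdb) detach
    (gdb) quit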
  • Hi Benoit,
    I have now found the reason: a deadlock in my code. I have resolved it. Thank you for the help. :)