Archived
This forum has been archived. Please start a new discussion on GitHub.
Problem managing a program with icegridnode
I use Ice 3.4.2's icegridregistry and icegridnode to manage my program. The icegridregistry config is:
IceGrid.InstanceName=substation
IceGrid.Registry.Client.Endpoints=default -h 192.168.3.94 -p 4061 -t 5000
IceGrid.Registry.Server.Endpoints=default -h 192.168.3.94 -t 5000
IceGrid.Registry.Internal.Endpoints=default -h 192.168.3.94 -t 5000
IceGrid.Registry.Data=iceDb/registry
IceGrid.Registry.PermissionsVerifier=substation/NullPermissionsVerifier
IceGrid.Registry.AdminPermissionsVerifier=substation/NullPermissionsVerifier
IceGrid.Registry.NodeSessionTimeout=10
IceGrid.Registry.ReplicaSessionTimeout=10
IceGrid.Registry.Trace.Adapter=0
IceGrid.Registry.Trace.Application=0
IceGrid.Registry.Trace.Locator=0
IceGrid.Registry.Trace.Node=1
IceGrid.Registry.Trace.Object=0
IceGrid.Registry.Trace.Patch=0
IceGrid.Registry.Trace.Replica=0
IceGrid.Registry.Trace.Server=1
IceGrid.Registry.Trace.Session=0
Ice.Default.Locator=substation/Locator:tcp -h 192.168.3.94 -p 4061
icegridnode config is:
Ice.Default.Locator=substation/Locator: default -h 192.168.3.94 -p 4061 -t 5000
IceGrid.Node.Name=datasvrNode
IceGrid.Node.Endpoints=default -t 5000
IceGrid.Node.Data=iceDb/datasvr

The XML config is:
<icegrid>
  <application name="datasvr">
    <node name="datasvrNode">
      <server id="datasvrServer" exe="sudo" activation="always">
        <option>/gridnt/bin/datasvr/datasvr</option>
        <option>-flagfile=/gridnt/bin/datasvr/datasvr.cfg</option>
        <adapter name="datasvrAdapter" id="datasvrAdapter" endpoints="tcp -h 192.168.3.94"/>
        <property name="Subscriber.Endpoints" value="tcp -h 192.168.3.94"/>
        <property name="Ice.MessageSizeMax" value="10240"/>
        <property name="TopicManager.Proxy" value="substation/TopicManager:tcp -h 192.168.3.94 -p 10000:tcp -h 192.168.3.94 -p 10010:tcp -h 192.168.3.94 -p 10020"/>
      </server>
    </node>
  </application>
</icegrid>
In general this works fine: when I kill the datasvr program with "pkill -9 datasvr", it restarts. But when I run the pkill command quickly, 10 or more times, the program no longer restarts. The datasvrServer status is inactive, and I must enable the server before it can restart.
My question is: what causes this, and how can I resolve it? Thanks.
Comments
Hi,
Can you add IceGrid.Node.Trace.Server=2 and IceGrid.Node.Trace.Activator=2 to your IceGrid node configuration file, try again and post the traces here?
Cheers,
Benoit.
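For reference, a sketch of what the icegridnode configuration from this thread looks like with the suggested trace properties added (the existing property values are taken verbatim from the earlier post):

```
Ice.Default.Locator=substation/Locator: default -h 192.168.3.94 -p 4061 -t 5000
IceGrid.Node.Name=datasvrNode
IceGrid.Node.Endpoints=default -t 5000
IceGrid.Node.Data=iceDb/datasvr
# Trace settings suggested above, to log server state changes and activation
IceGrid.Node.Trace.Server=2
IceGrid.Node.Trace.Activator=2
```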
Thank you for the reply. Here is the trace log:
-- 06/15/12 17:28:50.067 icegridnode: Server: changed server `gooseServer' state to `Active'
-- 06/15/12 17:28:50.126 icegridnode: Activator: detected termination of server `gooseServer' exit code = 137
-- 06/15/12 17:28:50.126 icegridnode: Server: changed server `gooseServer' state to `Inactive'
-- 06/15/12 17:28:50.626 icegridnode: Activator: activating server `gooseServer' path = sudo pwd = /gridnt/bin/collector uid/gid = 99/99 args = sudo /gridnt/bin/collector/goose -flagfile=/gridnt/bin/collector/goose.cfg --Ice.Config=/gridnt/bin/collector/iceDb/goose/servers/gooseServer/config/config
-- 06/15/12 17:28:50.681 icegridnode: Server: changed server `gooseServer' state to `Active'
-- 06/15/12 17:28:51.308 icegridnode: Activator: detected termination of server `gooseServer' exit code = 137
-- 06/15/12 17:28:51.309 icegridnode: Server: changed server `gooseServer' state to `Inactive'
-- 06/15/12 17:28:51.809 icegridnode: Activator: activating server `gooseServer' path = sudo pwd = /gridnt/bin/collector uid/gid = 99/99 args = sudo /gridnt/bin/collector/goose -flagfile=/gridnt/bin/collector/goose.cfg --Ice.Config=/gridnt/bin/collector/iceDb/goose/servers/gooseServer/config/config
-- 06/15/12 17:28:51.838 icegridnode: Activator: detected termination of server `gooseServer' exit code = 137
-- 06/15/12 17:28:51.843 icegridnode: Server: changed server `gooseServer' state to `Inactive'
The traces don't show anything wrong. So at this point the server doesn't restart on demand? What does the icegridadmin "server state <server id>" command return? Are you able to reproduce this with the IceGrid simple demo? Can you also specify which platform this is occurring on?
Cheers,
Benoit.
I am using CentOS 6.2 x86 64-bit. Checking the state with icegridadmin gives:
bash-4.1# icegridadmin -uadmin -padmin -e "server state gooseServer" --Ice.Config=/gridnt/bin/collector/icegridcnf/goose.node
inactive (disabled)
I will try the IceGrid simple demo later. Thanks for the reply.
Hi,
I tried the IceGrid simple demo and it works correctly. Previously the server ran as the nobody user; today I modified the Ice config so that the server runs as root, and now it restarts automatically. Could this situation be caused by our program or by the system?
Hi,
Are you using the "always" activation mode for your server? IceGrid will automatically disable a server using the "always" activation mode if the server keeps dying before reaching the "Active" state. This ensures that misbehaving servers don't keep spawning and dying indefinitely.
Cheers,
Benoit.
Yes, I use the "always" activation mode.
Hi,
Ok, that explains it. If you kill the process before it becomes active and its activation mode is "always", the IceGrid node will automatically disable it to prevent the process from spawning indefinitely in case it misbehaves during startup.
Cheers,
Benoit.
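As a side note, a server that IceGrid has disabled this way can be re-enabled from the command line. A sketch using the icegridadmin invocation already shown earlier in this thread (the server id and config path are taken from the earlier posts; adjust them for your deployment):

```shell
# Re-enable the server that the IceGrid node disabled after repeated early deaths.
icegridadmin -uadmin -padmin --Ice.Config=/gridnt/bin/collector/icegridcnf/goose.node -e "server enable gooseServer"
# Start it again and verify its state.
icegridadmin -uadmin -padmin --Ice.Config=/gridnt/bin/collector/icegridcnf/goose.node -e "server start gooseServer"
icegridadmin -uadmin -padmin --Ice.Config=/gridnt/bin/collector/icegridcnf/goose.node -e "server state gooseServer"
```

With "always" activation, enabling the server is normally enough for the node to start it again on its own; the explicit "server start" just makes the restart immediate.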
Hi Benoit,
Thank you for the reply, I understand this now. But recently I came across another question: as a test, when I kill the icegridnode process, the icebox server processes exit automatically. Why is that?
My icebox config file is:
<icegrid>
  <application name="icebox">
    <server-template id="IceStormTemplate">
      <parameter name="index"/>
      <parameter name="node-endpoints"/>
      <parameter name="nodes-0"/>
      <parameter name="nodes-1"/>
      <parameter name="nodes-2"/>
      <parameter name="topic-manager-endpoints"/>
      <parameter name="replicated-topic-manager-endpoints"/>
      <parameter name="instance-name"/>
      <parameter name="publish-endpoints"/>
      <parameter name="replicated-publish-endpoints"/>
      <icebox id="${instance-name}-${index}" exe="icebox" activation="always">
        <service name="IceStorm" entry="IceStormService,34:createIceStorm">
          <dbenv name="${service}"/>
          <properties>
            <property name="${service}.NodeId" value="${index}"/>
            <property name="${service}.Node.Endpoints" value="${node-endpoints}"/>
            <property name="${service}.Nodes.0" value="${nodes-0}"/>
            <property name="${service}.Nodes.1" value="${nodes-1}"/>
            <property name="${service}.Nodes.2" value="${nodes-2}"/>
            <property name="${service}.TopicManager.Endpoints" value="${topic-manager-endpoints}"/>
            <property name="${service}.ReplicatedTopicManagerEndpoints" value="${replicated-topic-manager-endpoints}"/>
            <property name="${service}.InstanceName" value="${instance-name}"/>
            <property name="${service}.Publish.Endpoints" value="${publish-endpoints}"/>
            <property name="${service}.ReplicatedPublishEndpoints" value="${replicated-publish-endpoints}"/>
            <property name="${service}.Trace.TopicManager" value="2"/>
            <property name="${service}.Trace.Topic" value="1"/>
            <property name="${service}.Trace.Subscriber" value="1"/>
            <property name="${service}.Trace.Election" value="1"/>
          </properties>
        </service>
      </icebox>
    </server-template>
    <replica-group id="PublishReplicaGroup">
    </replica-group>
    <replica-group id="TopicManagerReplicaGroup">
      <object identity="substation/TopicManager" type="::IceStorm::TopicManager"/>
    </replica-group>
    <node name="iceboxNode">
      <server-instance template="IceStormTemplate" index="0"
                       node-endpoints="default -h 192.168.3.94 -p 13000"
                       nodes-0="substation/node0:default -h 192.168.3.94 -p 13000"
                       nodes-1="substation/node1:default -h 192.168.3.94 -p 13010"
                       nodes-2="substation/node2:default -h 192.168.3.94 -p 13020"
                       topic-manager-endpoints="tcp -h 192.168.3.94 -p 10000"
                       replicated-topic-manager-endpoints="tcp -h 192.168.3.94 -p 10000:tcp -h 192.168.3.94 -p 10010:tcp -h 192.168.3.94 -p 10020"
                       instance-name="substation"
                       publish-endpoints="tcp -h 192.168.3.94 -p 10001"
                       replicated-publish-endpoints="tcp -h 192.168.3.94 -p 10001:tcp -h 192.168.3.94 -p 10011:tcp -h 192.168.3.94 -p 10021"/>
      <server-instance template="IceStormTemplate" index="1"
                       node-endpoints="default -h 192.168.3.94 -p 13010"
                       nodes-0="substation/node0:default -h 192.168.3.94 -p 13000"
                       nodes-1="substation/node1:default -h 192.168.3.94 -p 13010"
                       nodes-2="substation/node2:default -h 192.168.3.94 -p 13020"
                       topic-manager-endpoints="tcp -h 192.168.3.94 -p 10010"
                       replicated-topic-manager-endpoints="tcp -h 192.168.3.94 -p 10000:tcp -h 192.168.3.94 -p 10010:tcp -h 192.168.3.94 -p 10020"
                       instance-name="substation"
                       publish-endpoints="tcp -h 192.168.3.94 -p 10011"
                       replicated-publish-endpoints="tcp -h 192.168.3.94 -p 10001:tcp -h 192.168.3.94 -p 10011:tcp -h 192.168.3.94 -p 10021"/>
      <server-instance template="IceStormTemplate" index="2"
                       node-endpoints="default -h 192.168.3.94 -p 13020"
                       nodes-0="substation/node0:default -h 192.168.3.94 -p 13000"
                       nodes-1="substation/node1:default -h 192.168.3.94 -p 13010"
                       nodes-2="substation/node2:default -h 192.168.3.94 -p 13020"
                       topic-manager-endpoints="tcp -h 192.168.3.94 -p 10020"
                       replicated-topic-manager-endpoints="tcp -h 192.168.3.94 -p 10000:tcp -h 192.168.3.94 -p 10010:tcp -h 192.168.3.94 -p 10020"
                       instance-name="substation"
                       publish-endpoints="tcp -h 192.168.3.94 -p 10021"
                       replicated-publish-endpoints="tcp -h 192.168.3.94 -p 10001:tcp -h 192.168.3.94 -p 10011:tcp -h 192.168.3.94 -p 10021"/>
    </node>
  </application>
</icegrid>
Hi,
This is the expected behavior: the IceGrid node deactivates all the servers it manages when it is shut down.
Cheers,
Benoit.
Hi Benoit,
Thank you for the reply, I understand it now. Thanks. :)
Hi Benoit,
My icebox config XML is in #10 above. Normally the icebox works fine. As a test, I wrote a shell script:
#!/bin/sh
while true
do
kill `ps ax|grep substation-2|grep icebox |awk '{print $1}'`
sleep 1
kill `ps ax|grep substation-1|grep icebox |awk '{print $1}'`
sleep 1
kill `ps ax|grep substation-0|grep icebox |awk '{print $1}'`
sleep 1
done
This ran for 10 minutes before I stopped the script. The icebox log is:
-- 07/06/12 15:22:07.120 substation-2-IceStorm: Election: node 2: I have the latest database state.
-- 07/06/12 15:22:07.125 substation-0-IceStorm:substation/topic.ThrugoutList subscribers: 9F244485-DA7A-4AD8-9E23-202C1D3A091B endpoints: "tcp -h 192.168.3.109 -p 52710"
-- 07/06/12 15:22:07.157 substation-1-IceStorm:substation/topic.ThrugoutList subscribers: 9F244485-DA7A-4AD8-9E23-202C1D3A091B endpoints: "tcp -h 192.168.3.109 -p 52710"
-- 07/06/12 15:24:07.176 substation-1-IceStorm: Subscriber: 0xaf0ec0 9F244485-DA7A-4AD8-9E23-202C1D3A091B subscriber errored out: ConnectionI.cpp:1661: Ice::ConnectTimeoutException: timeout while establishing a connection retry: 0/0
-- 07/06/12 15:24:08.060 substation-2-IceStorm: Topic: ThrugoutList: reap 9F244485-DA7A-4AD8-9E23-202C1D3A091B
-- 07/06/12 15:24:08.078 substation-0-IceStorm: Topic: ThrugoutList: remove replica observer: 9F244485-DA7A-4AD8-9E23-202C1D3A091B llu: 1/12
-- 07/06/12 15:24:08.078 substation-1-IceStorm: Topic: ThrugoutList: remove replica observer: 9F244485-DA7A-4AD8-9E23-202C1D3A091B llu: 1/12
From the log above, it looks like the ThrugoutList subscriber was lost because of the ConnectTimeoutException, but when I check with the icestormadmin command, the ThrugoutList topic still exists. How can I resolve this problem? Please help, thanks.
the replica state is:
[root@demo109 appsvr]# icestormadmin --Ice.Config=./config.sub
Ice 3.4.2
Copyright 2003-2011 ZeroC, Inc.
>>> replica
replica count: 3
0: id: 0
0: coord: 2
0: group name: 2:D811F919-7C5C-423E-807D-2846E2208B7E
0: state: normal
0: group:
0: max: 3
1: id: 1
1: coord: 2
1: group name: 2:D811F919-7C5C-423E-807D-2846E2208B7E
1: state: normal
1: group:
1: max: 3
2: id: 2
2: coord: 2
2: group name: 2:D811F919-7C5C-423E-807D-2846E2208B7E
2: state: normal
2: group: 0,1
2: max: 3
Hi,
I am now reading the Ice 3.4.2 manual (Ice-Manual.pdf). It says: "A retry count of -1 adds some resiliency to your IceStorm application by ignoring intermittent network failures such as ConnectionRefusedException. However, there is also some risk inherent in using a retry count of -1 because an improperly configured subscriber may never be removed. For example, consider what happens when a subscriber registers using a transient endpoint: if that subscriber happens to terminate and resubscribe with a different endpoint, IceStorm will continue trying to deliver events to the subscriber at its old endpoint. IceStorm can only remove the subscriber if it receives a hard error, and that is only possible when the subscriber is reachable."
Can I use this setting? Thanks.
Hi,
You should first figure out what is causing this connection timeout exception and see whether or not it is expected. If it is expected and can happen from time to time, you can indeed use the retry count IceStorm QoS to get IceStorm to retry connection establishment to your subscriber.
Cheers,
Benoit.
Hi Benoit,
Thanks for the help, I will look into the cause. Thanks. :)
Hi Benoit,
I think I found the reason: connections stuck in CLOSE_WAIT cause the Ice::ConnectTimeoutException.

substation/topic.ThrugoutList subscribers: B069E7B8-E95B-4C95-A46C-A412F64E1A0E endpoints: "tcp -h 192.168.3.109 -p 59636"

[root@demo109 appsvr]# netstat -an|grep 59636
tcp 0 0 192.168.3.109:59636 0.0.0.0:* LISTEN
tcp 3436 0 192.168.3.109:59636 192.168.3.109:53884 CLOSE_WAIT
tcp 1 0 192.168.3.109:59636 192.168.3.109:53972 CLOSE_WAIT
tcp 1 0 192.168.3.109:59636 192.168.3.109:54023 CLOSE_WAIT
tcp 1 0 192.168.3.109:59636 192.168.3.109:54019 CLOSE_WAIT
tcp 0 0 192.168.3.109:54023 192.168.3.109:59636 FIN_WAIT2
tcp 1 0 192.168.3.109:59636 192.168.3.109:54017 CLOSE_WAIT
tcp 1 0 192.168.3.109:59636 192.168.3.109:54018 CLOSE_WAIT
tcp 1 0 192.168.3.109:59636 192.168.3.109:54020 CLOSE_WAIT
tcp 1 0 192.168.3.109:59636 192.168.3.109:54021 CLOSE_WAIT
tcp 1 0 192.168.3.109:59636 192.168.3.109:53896 CLOSE_WAIT
tcp 1 0 192.168.3.109:59636 192.168.3.109:53975 CLOSE_WAIT
tcp 1 0 192.168.3.109:59636 192.168.3.109:53887 CLOSE_WAIT
My question is: should our Ice code handle this situation, or do I need to modify my Linux system configuration (CentOS 6.2)?
Thanks.
Hi,
The CLOSE_WAIT indicates that your subscriber isn't closing its connections. The most likely reason is that your subscriber is somehow "hanging" or has run into a deadlock that prevents it from dispatching or reading incoming messages. The best approach would be to attach a debugger to the subscriber and get a dump of all its threads to see what they are doing.
Cheers,
Benoit.
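A sketch of capturing such a thread dump with gdb on Linux (the pid lookup follows the ps/grep pattern used in the test script earlier in this thread; the process name and output file name are placeholders to adapt to your subscriber):

```shell
# Find the subscriber's pid (adjust the grep pattern to your process name).
pid=$(ps ax | grep datasvr | grep -v grep | awk '{print $1}')
# Attach non-interactively and dump a backtrace of every thread,
# then detach, leaving the process running.
gdb -p "$pid" -batch -ex "thread apply all bt" > subscriber-threads.txt
```

Looking for several threads blocked on the same mutex in the resulting backtraces is usually the quickest way to spot a deadlock.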
Hi Benoit,
I found the reason: a deadlock caused by my own code. I have now resolved it. Thank you for the help. :)