Archived

This forum has been archived. Please start a new discussion on GitHub.

icegridnode-managed program restart problem

I use Ice 3.4.2 with icegridregistry and icegridnode to manage my program. The icegridregistry config is:
IceGrid.InstanceName=substation
IceGrid.Registry.Client.Endpoints=default -h 192.168.3.94 -p 4061 -t 5000
IceGrid.Registry.Server.Endpoints=default -h 192.168.3.94 -t 5000
IceGrid.Registry.Internal.Endpoints=default -h 192.168.3.94 -t 5000
IceGrid.Registry.Data=iceDb/registry
IceGrid.Registry.PermissionsVerifier=substation/NullPermissionsVerifier
IceGrid.Registry.AdminPermissionsVerifier=substation/NullPermissionsVerifier
IceGrid.Registry.NodeSessionTimeout=10
IceGrid.Registry.ReplicaSessionTimeout=10
IceGrid.Registry.Trace.Adapter=0
IceGrid.Registry.Trace.Application=0
IceGrid.Registry.Trace.Locator=0
IceGrid.Registry.Trace.Node=1
IceGrid.Registry.Trace.Object=0
IceGrid.Registry.Trace.Patch=0
IceGrid.Registry.Trace.Replica=0
IceGrid.Registry.Trace.Server=1
IceGrid.Registry.Trace.Session=0
Ice.Default.Locator=substation/Locator:tcp -h 192.168.3.94 -p 4061

The icegridnode config is:
Ice.Default.Locator=substation/Locator: default -h 192.168.3.94 -p 4061 -t 5000
IceGrid.Node.Name=datasvrNode
IceGrid.Node.Endpoints=default -t 5000
IceGrid.Node.Data=iceDb/datasvr
The deployment XML is:
<icegrid>
<application  name="datasvr">
<node name="datasvrNode">
<server id="datasvrServer" exe="sudo" activation="always">
<option>/gridnt/bin/datasvr/datasvr</option>
<option>-flagfile=/gridnt/bin/datasvr/datasvr.cfg</option>
<adapter name="datasvrAdapter" id="datasvrAdapter" endpoints="tcp -h 192.168.3.94"/>
<property name="Subscriber.Endpoints" value="tcp -h 192.168.3.94"/>
<property name="Ice.MessageSizeMax" value="10240"/>
<property name="TopicManager.Proxy" value="substation/TopicManager:tcp -h 192.168.3.94 -p 10000:tcp -h 192.168.3.94 -p 10010:tcp -h 192.168.3.94 -p 10020"/>
</server>
</node>
</application>
</icegrid>

In general this works well: when I kill the datasvr program with "pkill -9 datasvr", it restarts. But when I run the pkill command in quick succession, about 10 or more times, the program no longer restarts. The datasvrServer status is inactive, and I have to run "server enable" before it can be restarted.

My question is: why does this happen, and how can I resolve it? Thanks.

Comments

  • benoit
    benoit Rennes, France
    Hi,

    Can you add IceGrid.Node.Trace.Server=2 and IceGrid.Node.Trace.Activator=2 to your IceGrid node configuration file, try again and post the traces here?
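    In the node configuration posted above, that means adding:

    IceGrid.Node.Trace.Server=2
    IceGrid.Node.Trace.Activator=2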

    Cheers,
    Benoit.
  • Thank you for the reply. Below is the trace log:
    -- 06/15/12 17:28:50.067 icegridnode: Server: changed server `gooseServer' state to `Active'
    -- 06/15/12 17:28:50.126 icegridnode: Activator: detected termination of server `gooseServer'
       exit code = 137
    -- 06/15/12 17:28:50.126 icegridnode: Server: changed server `gooseServer' state to `Inactive'
    -- 06/15/12 17:28:50.626 icegridnode: Activator: activating server `gooseServer'
       path = sudo
       pwd = /gridnt/bin/collector
       uid/gid = 99/99
       args = sudo /gridnt/bin/collector/goose -flagfile=/gridnt/bin/collector/goose.cfg --Ice.Config=/gridnt/bin/collector/iceDb/goose/servers/gooseServer/config/config
    -- 06/15/12 17:28:50.681 icegridnode: Server: changed server `gooseServer' state to `Active'
    -- 06/15/12 17:28:51.308 icegridnode: Activator: detected termination of server `gooseServer'
       exit code = 137
    -- 06/15/12 17:28:51.309 icegridnode: Server: changed server `gooseServer' state to `Inactive'
    -- 06/15/12 17:28:51.809 icegridnode: Activator: activating server `gooseServer'
       path = sudo
       pwd = /gridnt/bin/collector
       uid/gid = 99/99
       args = sudo /gridnt/bin/collector/goose -flagfile=/gridnt/bin/collector/goose.cfg --Ice.Config=/gridnt/bin/collector/iceDb/goose/servers/gooseServer/config/config
    -- 06/15/12 17:28:51.838 icegridnode: Activator: detected termination of server `gooseServer'
       exit code = 137
    -- 06/15/12 17:28:51.843 icegridnode: Server: changed server `gooseServer' state to `Inactive'
    
    
  • benoit
    benoit Rennes, France
    The traces don't show anything wrong. So at this point the server doesn't restart on demand? What does the icegridadmin "server state <server id>" command return? Are you able to reproduce this with the IceGrid simple demo? Can you also specify on which platform this is occurring?

    Cheers,
    Benoit.
  • I use a CentOS 6.2 x86 64-bit system. Checking the state with icegridadmin gives:
    bash-4.1# icegridadmin -uadmin -padmin -e "server state gooseServer" --Ice.Config=/gridnt/bin/collector/icegridcnf/goose.node
    inactive (disabled)
    
    I will try the IceGrid simple demo later. Thanks for the reply.
  • Hi,
    I tried the IceGrid simple demo and it works correctly. Previously the server ran as the nobody user; today I changed the Ice config so the server runs as root, and now the server restarts automatically. Do you think this is caused by our program or by the system?
  • benoit
    benoit Rennes, France
    Hi,

    Are you using the "always" activation mode for your server? IceGrid will automatically disable a server that uses the "always" activation mode if it keeps dying before reaching the "Active" state. This ensures that misbehaving servers don't keep spawning and dying indefinitely.

    Cheers,
    Benoit.
  • Yes, I use the "always" activation mode.
  • benoit
    benoit Rennes, France
    Hi,

    OK, that explains it. If you kill the process before it becomes active and its activation mode is "always", the IceGrid node will automatically disable it to prevent the process from spawning indefinitely in case it is misbehaving during startup.

    Cheers,
    Benoit.
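    For reference, a server disabled this way can be re-enabled from icegridadmin; since its activation mode is "always", the node should then activate it again. A minimal example, reusing the credentials and config file from the earlier command:

    icegridadmin -uadmin -padmin -e "server enable gooseServer" --Ice.Config=/gridnt/bin/collector/icegridcnf/goose.node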
  • Hi Benoit,
    thank you for the reply, I understand this issue now. But recently I came across another question: as a test, when I kill the icegridnode process, the icebox server process automatically exits as well. Why is that?
    My icebox config file is:
    <icegrid>
    <application name="icebox">
    
    <server-template id="IceStormTemplate">
    <parameter name="index"/>
    <parameter name="node-endpoints"/>
    <parameter name="nodes-0"/>
    <parameter name="nodes-1"/>
    <parameter name="nodes-2"/>
    <parameter name="topic-manager-endpoints"/>
    <parameter name="replicated-topic-manager-endpoints"/>
    <parameter name="instance-name"/>
    <parameter name="publish-endpoints"/>
    <parameter name="replicated-publish-endpoints"/>
    
    <icebox id="${instance-name}-${index}" exe="icebox" activation="always">
    
    <service name="IceStorm" entry="IceStormService,34:createIceStorm">
    
    <dbenv name="${service}"/>
    
    <properties>
    <property name="${service}.NodeId" value="${index}"/>
    <property name="${service}.Node.Endpoints" value="${node-endpoints}"/>
    <property name="${service}.Nodes.0" value="${nodes-0}"/>
    <property name="${service}.Nodes.1" value="${nodes-1}"/>
    <property name="${service}.Nodes.2" value="${nodes-2}"/>
    <property name="${service}.TopicManager.Endpoints" value="${topic-manager-endpoints}"/>
    <property name="${service}.ReplicatedTopicManagerEndpoints" value="${replicated-topic-manager-endpoints}"/>
    <property name="${service}.InstanceName" value="${instance-name}"/>
    <property name="${service}.Publish.Endpoints" value="${publish-endpoints}"/>
    <property name="${service}.ReplicatedPublishEndpoints" value="${replicated-publish-endpoints}"/>
    
    <property name="${service}.Trace.TopicManager" value="2"/>
    <property name="${service}.Trace.Topic" value="1"/>
    <property name="${service}.Trace.Subscriber" value="1"/>
    <property name="${service}.Trace.Election" value="1"/>
    </properties>
    </service>
    </icebox>
    </server-template>
    
    <replica-group id="PublishReplicaGroup">
    </replica-group>
    
    <replica-group id="TopicManagerReplicaGroup">
    <object identity="substation/TopicManager" type="::IceStorm::TopicManager"/>
    </replica-group>
    
    <node name="iceboxNode">
    <server-instance template="IceStormTemplate" index="0" 
    node-endpoints="default  -h 192.168.3.94 -p 13000" 
    nodes-0="substation/node0:default  -h 192.168.3.94 -p 13000"
    nodes-1="substation/node1:default  -h 192.168.3.94 -p 13010"
    nodes-2="substation/node2:default  -h 192.168.3.94 -p 13020"
    topic-manager-endpoints="tcp -h 192.168.3.94 -p 10000" 
    replicated-topic-manager-endpoints="tcp -h 192.168.3.94 -p 10000:tcp -h 192.168.3.94 -p 10010:tcp -h 192.168.3.94 -p 10020" 
    instance-name="substation" 
    publish-endpoints="tcp -h 192.168.3.94 -p 10001" 
    replicated-publish-endpoints="tcp -h 192.168.3.94 -p 10001:tcp -h 192.168.3.94 -p 10011:tcp -h 192.168.3.94 -p 10021" 
    />
    <server-instance template="IceStormTemplate" index="1" 
    node-endpoints="default  -h 192.168.3.94 -p 13010" 
    nodes-0="substation/node0:default  -h 192.168.3.94 -p 13000"
    nodes-1="substation/node1:default  -h 192.168.3.94 -p 13010"
    nodes-2="substation/node2:default  -h 192.168.3.94 -p 13020"
    topic-manager-endpoints="tcp -h 192.168.3.94 -p 10010" 
    replicated-topic-manager-endpoints="tcp -h 192.168.3.94 -p 10000:tcp -h 192.168.3.94 -p 10010:tcp -h 192.168.3.94 -p 10020" 
    instance-name="substation" 
    publish-endpoints="tcp -h 192.168.3.94 -p 10011" 
    replicated-publish-endpoints="tcp -h 192.168.3.94 -p 10001:tcp -h 192.168.3.94 -p 10011:tcp -h 192.168.3.94 -p 10021" 
    />
    <server-instance template="IceStormTemplate" index="2" 
    node-endpoints="default  -h 192.168.3.94 -p 13020" 
    nodes-0="substation/node0:default  -h 192.168.3.94 -p 13000"
    nodes-1="substation/node1:default  -h 192.168.3.94 -p 13010"
    nodes-2="substation/node2:default  -h 192.168.3.94 -p 13020"
    topic-manager-endpoints="tcp -h 192.168.3.94 -p 10020" 
    replicated-topic-manager-endpoints="tcp -h 192.168.3.94 -p 10000:tcp -h 192.168.3.94 -p 10010:tcp -h 192.168.3.94 -p 10020" 
    instance-name="substation" 
    publish-endpoints="tcp -h 192.168.3.94 -p 10021" 
    replicated-publish-endpoints="tcp -h 192.168.3.94 -p 10001:tcp -h 192.168.3.94 -p 10011:tcp -h 192.168.3.94 -p 10021" 
    />
    </node>
    </application>
    </icegrid>
    
    
  • benoit
    benoit Rennes, France
    Hi,

    This is the expected behavior: the IceGrid node deactivates all the servers it manages when it is shut down.

    Cheers,
    Benoit.
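    For reference, the same deactivation happens on a clean shutdown as well, for example stopping the node through icegridadmin instead of killing the process (assuming icegridadmin is pointed at the registry as in the earlier commands):

    icegridadmin -uadmin -padmin -e "node shutdown iceboxNode"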
  • Hi Benoit,
    thank you for the reply, I understand it now. Thanks. :)
  • Hi Benoit,
    my icebox config XML is the one shown above in #10. Normally the icebox works well. As a test, I wrote a shell script:
    #!/bin/sh

    while true
    do
    kill `ps ax|grep substation-2|grep icebox |awk '{print $1}'`
    sleep 1
    kill `ps ax|grep substation-1|grep icebox |awk '{print $1}'`
    sleep 1
    kill `ps ax|grep substation-0|grep icebox |awk '{print $1}'`
    sleep 1
    done

    This ran for 10 minutes; then I stopped the script. The icebox log is:
    -- 07/06/12 15:22:07.120 substation-2-IceStorm: Election: node 2: I have the latest database state.
    -- 07/06/12 15:22:07.125 substation-0-IceStorm:substation/topic.ThrugoutList subscribers: 9F244485-DA7A-4AD8-9E23-202C1D3A091B endpoints: "tcp -h 192.168.3.109 -p 52710"
    -- 07/06/12 15:22:07.157 substation-1-IceStorm:substation/topic.ThrugoutList subscribers: 9F244485-DA7A-4AD8-9E23-202C1D3A091B endpoints: "tcp -h 192.168.3.109 -p 52710"
    
    
    -- 07/06/12 15:24:07.176 substation-1-IceStorm: Subscriber: 0xaf0ec0 9F244485-DA7A-4AD8-9E23-202C1D3A091B subscriber errored out: ConnectionI.cpp:1661: Ice::ConnectTimeoutException:
       timeout while establishing a connection retry: 0/0
    
    
    -- 07/06/12 15:24:08.060 substation-2-IceStorm: Topic: ThrugoutList: reap 9F244485-DA7A-4AD8-9E23-202C1D3A091B
    -- 07/06/12 15:24:08.078 substation-0-IceStorm: Topic: ThrugoutList: remove replica observer: 9F244485-DA7A-4AD8-9E23-202C1D3A091B llu: 1/12
    -- 07/06/12 15:24:08.078 substation-1-IceStorm: Topic: ThrugoutList: remove replica observer: 9F244485-DA7A-4AD8-9E23-202C1D3A091B llu: 1/12
    
    From the above it looks like the ThrugoutList subscriber is lost because of the ConnectTimeoutException, but when I check with the icestormadmin command the topic ThrugoutList still exists. How can I resolve this problem? Please help me, thanks.
  • The replica state is:
    [root@demo109 appsvr]# icestormadmin --Ice.Config=./config.sub
    Ice 3.4.2  Copyright 2003-2011 ZeroC, Inc.
    >>> replica
    replica count: 3
    0: id:         0
    0: coord:      2
    0: group name: 2:D811F919-7C5C-423E-807D-2846E2208B7E
    0: state:      normal
    0: group:
    0: max:        3
    1: id:         1
    1: coord:      2
    1: group name: 2:D811F919-7C5C-423E-807D-2846E2208B7E
    1: state:      normal
    1: group:
    1: max:        3
    2: id:         2
    2: coord:      2
    2: group name: 2:D811F919-7C5C-423E-807D-2846E2208B7E
    2: state:      normal
    2: group:      0,1
    2: max:        3
    
  • Hi,
    I am now reading the Ice 3.4.2 manual (Ice-Manual.pdf), which says:
    A retry count of -1 adds some resiliency to your IceStorm application by ignoring intermittent network failures such as ConnectionRefusedException. However, there is also some risk inherent in using a retry count of -1 because an improperly configured subscriber may never be removed. For example, consider what happens when a subscriber registers using a transient endpoint: if that subscriber happens to terminate and resubscribe with a different endpoint, IceStorm will continue trying to deliver events to the subscriber at its old endpoint. IceStorm can only remove the subscriber if it receives a hard error, and that is only possible when the subscriber is reachable.
    
    Can I use this configuration? Thanks.
  • benoit
    benoit Rennes, France
    Hi,

    You should first figure out what is causing this connection timeout exception and see whether or not it is expected. If it is expected and can happen from time to time, you can indeed use the retry count IceStorm QoS to get IceStorm to retry the connection establishment to your subscriber.

    Cheers,
    Benoit.
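    For illustration, a minimal C++ sketch of setting that retry count when the subscriber registers; the helper name and parameters are placeholders, not the application's actual code:

    #include <Ice/Ice.h>
    #include <IceStorm/IceStorm.h>

    // Hypothetical helper: subscribe an existing subscriber proxy to the
    // ThrugoutList topic with an unlimited retry count.
    void subscribeWithRetry(const IceStorm::TopicManagerPrx& topicManager,
                            const Ice::ObjectPrx& subscriber)
    {
        // Look up the topic (throws IceStorm::NoSuchTopic if it does not exist).
        IceStorm::TopicPrx topic = topicManager->retrieve("ThrugoutList");

        IceStorm::QoS qos;            // dictionary<string, string>
        qos["retryCount"] = "-1";     // keep the subscription despite connection failures

        topic->subscribeAndGetPublisher(qos, subscriber);
    }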
  • Hi Benoit,
    thanks for the help; I will look into the cause. Thanks. :)
  • Hi Benoit,
    I think I have found the reason: connections stuck in CLOSE_WAIT are causing the Ice::ConnectTimeoutException.
    substation/topic.ThrugoutList subscribers: B069E7B8-E95B-4C95-A46C-A412F64E1A0E endpoints: "tcp -h 192.168.3.109 -p 59636"
    
    [root@demo109 appsvr]# netstat -an|grep 59636
    tcp        0      0 192.168.3.109:59636         0.0.0.0:*                   LISTEN
    tcp     3436      0 192.168.3.109:59636         192.168.3.109:53884         CLOSE_WAIT
    tcp        1      0 192.168.3.109:59636         192.168.3.109:53972         CLOSE_WAIT
    tcp        1      0 192.168.3.109:59636         192.168.3.109:54023         CLOSE_WAIT
    tcp        1      0 192.168.3.109:59636         192.168.3.109:54019         CLOSE_WAIT
    tcp        0      0 192.168.3.109:54023         192.168.3.109:59636         FIN_WAIT2
    tcp        1      0 192.168.3.109:59636         192.168.3.109:54017         CLOSE_WAIT
    tcp        1      0 192.168.3.109:59636         192.168.3.109:54018         CLOSE_WAIT
    tcp        1      0 192.168.3.109:59636         192.168.3.109:54020         CLOSE_WAIT
    tcp        1      0 192.168.3.109:59636         192.168.3.109:54021         CLOSE_WAIT
    tcp        1      0 192.168.3.109:59636         192.168.3.109:53896         CLOSE_WAIT
    tcp        1      0 192.168.3.109:59636         192.168.3.109:53975         CLOSE_WAIT
    tcp        1      0 192.168.3.109:59636         192.168.3.109:53887         CLOSE_WAIT
    
    

    My question is: should our Ice code handle this situation, or do I need to modify my Linux system configuration (CentOS 6.2)?
    Thanks.
  • benoit
    benoit Rennes, France
    Hi,

    The CLOSE_WAIT indicates that your subscriber isn't closing its connections. The most likely reason is that your subscriber is somehow "hanging" or running into a deadlock that prevents it from dispatching or reading incoming messages. The best would be to attach the debugger to the subscriber and get a thread dump of all its threads to see what the threads are doing.

    Cheers,
    Benoit.
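    For reference, one way to take such a thread dump on CentOS is with gdb (the pid is a placeholder for the subscriber's process id):

    gdb -p <subscriber-pid>
    (gdb) thread apply all bt
    (gdb) detach
    (gdb) quit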
  • Hi Benoit,
    I have now found the reason: a deadlock in my code. I have resolved it. Thank you for the help. :)