Master registry locks at startup

I have an IceGrid configuration with 2 registry nodes, a master and a replica.

I had to turn off the computer on which the master registry was running because it was unresponsive.

Since the reboot, I am unable to restart the icegridnode process (my master registry is collocated with this node). It doesn't actually fail with an error; instead it hangs. A simple kill does not affect the state of the process.

My questions:
- Can I remove the registry files? If I delete them, will the master registry be able to synchronize from the replica to get the current configuration?
- Is there some way to avoid this problem so that it doesn't happen again?

I didn't succeed in producing a gdb trace, but below is the output of running the program with strace:

mprotect(0x7fe1241f9000, 4096, PROT_READ|PROT_WRITE) = 0
mprotect(0x7fe1241fa000, 4096, PROT_READ|PROT_WRITE) = 0
mprotect(0x7fe1241fb000, 4096, PROT_READ|PROT_WRITE) = 0
mprotect(0x7fe1241fc000, 4096, PROT_READ|PROT_WRITE) = 0
mprotect(0x7fe1241fd000, 4096, PROT_READ|PROT_WRITE) = 0
mprotect(0x7fe1241fe000, 4096, PROT_READ|PROT_WRITE) = 0
mprotect(0x7fe1241ff000, 4096, PROT_READ|PROT_WRITE) = 0
sendto(15, "IceP\1\0\1\0\0\0M\0\0\0\2\0\0\0\24RegistryTopicM"..., 77, 0, NULL, 0) = 77
futex(0x7fff2e3e956c, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
futex(0x7fff2e3e9598, FUTEX_WAKE_PRIVATE, 1) = 0
open("/dev/urandom", O_RDONLY) = 26
read(26, "\36@\27\334\270\35/\307\375i{\365K\30Bu"..., 16) = 16
sendto(15, "IceP\1\0\1\0\0\0Q\0\0\0\3\0\0\0\24RegistryTopicM"..., 81, 0, NULL, 0) = 81
futex(0x7fff2e3e95cc, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
futex(0x7fff2e3e95f8, FUTEX_WAKE_PRIVATE, 1) = 0
sendto(15, "IceP\1\0\1\0\0\0T\0\0\0\4\0\0\0\24RegistryTopicM"..., 84, 0, NULL, 0) = 84
futex(0x7fff2e3e928c, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
futex(0x7fff2e3e92b8, FUTEX_WAKE_PRIVATE, 1) = 0
sendto(15, "IceP\1\0\1\0\0\0P\0\0\0\5\0\0\0\24RegistryTopicM"..., 80, 0, NULL, 0) = 80
futex(0x7fff2e3e952c, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
futex(0x7fff2e3e9558, FUTEX_WAKE_PRIVATE, 1) = 0
sendto(15, "IceP\1\0\1\0\0\0O\0\0\0\6\0\0\0\24RegistryTopicM"..., 79, 0, NULL, 0) = 79
futex(0x7fff2e3e952c, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
futex(0x7fff2e3e9558, FUTEX_WAKE_PRIVATE, 1) = 0
pread(22, "\2\0\0\0|l\16\0\1\0\0\0\0\0\0\0\0\0\0\0\f\0\344\f\1\5\354\17\204\17p\17\10"..., 4096, 4096) = 4096
sendto(15, "IceP\1\0\1\0\0\0W\0\0\0\7\0\0\0\30RegistryObserv"..., 87, 0, NULL, 0) = 87
futex(0x7fff2e3e94cc, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
futex(0x7fff2e3e94f8, FUTEX_WAKE_PRIVATE, 1) = 0
pread(24, "\2\0\0\0#\v\227\0\1\0\0\0\0\0\0\0\0\0\0\0\"\0\204\10\1\5\34\r\324\f\274\ft"..., 4096, 4096) = 4096
socket(PF_NETLINK, SOCK_RAW, 0) = 27
bind(27, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
getsockname(27, {sa_family=AF_NETLINK, pid=4417, groups=00000000}, [12]) = 0
sendto(27, "\24\0\0\0\26\0\1\3\20$XK\0\0\0\0\0\0\0\0"..., 20, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 20
recvmsg(27, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"0\0\0\0\24\0\2\0\20$XKA\21\0\0\2\10\200\376\1\0\0\0\10\0\1\0\177\0\0\1\10"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 108
recvmsg(27, {msg_name(12)={sa_family=AF_NETLINK, pid=0, groups=00000000}, msg_iov(1)=[{"\24\0\0\0\3\0\2\0\20$XKA\21\0\0\0\0\0\0\1\0\0\0\10\0\1\0\177\0\0\1\10"..., 4096}], msg_controllen=0, msg_flags=0}, 0) = 20
close(27) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 27
setsockopt(27, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(27, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
fcntl(27, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(27, F_SETFL, O_RDWR|O_NONBLOCK) = 0
connect(27, {sa_family=AF_INET, sin_port=htons(39984), sin_addr=inet_addr("172.19.64.4")}, 16) = -1 EINPROGRESS (Operation now in progress)
getsockname(27, {sa_family=AF_INET, sin_port=htons(51209), sin_addr=inet_addr("172.19.0.4")}, [16]) = 0
getpeername(27, 0x7fff2e3e9200, [115964117120]) = -1 ENOTCONN (Transport endpoint is not connected)
epoll_ctl(11, EPOLL_CTL_ADD, 27, {EPOLLOUT, {u32=605314336, u64=140604949683488}}) = 0
futex(0x7fe1241f618c, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
futex(0x7fe1241f61b8, FUTEX_WAKE_PRIVATE, 1) = 0
sendto(27, "IceP\1\0\1\0\0\0\233\0\0\0\1\0\0\0!InternalRegist"..., 155, 0, NULL, 0) = 155
futex(0x7fff2e3e9c3c, FUTEX_WAIT_PRIVATE, 1, NULL^C <unfinished ...>

Comments

  • benoit (Rennes, France)
    Hi Julien

    You shouldn't remove the master registry files (at least not all of them): the master is the authority, so it is the slaves that retrieve the data from the master. If you clear the master database, the slave databases will be cleared as well. Instead, you should promote the slave to become the master and restart the old master as a slave. The manual has instructions on how to do this.

    In any case, before deleting anything, I recommend trying to figure out why it hangs. I suspect it's simply a problem with the registry or node trying to connect to an endpoint on which no server is listening anymore (in theory this fails immediately, but perhaps you have a firewall setup that instead causes connection establishment to hang in this scenario?).

    Did you try to enable tracing? I recommend trying with the following tracing properties: Ice.Trace.Network=2, IceGrid.Registry.Trace.Node=2, IceGrid.Registry.Trace.Replica=2, IceGrid.Node.Trace.Replica=2.
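
    Written as a properties-file fragment, the tracing settings above might look like the sketch below. The property names and values are the ones listed above; putting them in the configuration file passed to icegridnode is an assumption about this setup.

    ```
    # Tracing properties suggested above (a sketch, not a complete configuration)
    Ice.Trace.Network=2
    IceGrid.Registry.Trace.Node=2
    IceGrid.Registry.Trace.Replica=2
    IceGrid.Node.Trace.Replica=2
    ```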

    You should also check your configuration and make sure you are using timeouts on the registry/node endpoints and proxy endpoints. Without timeouts, network connections can, under some circumstances, take a very long time to fail and cause hangs. See the demo/IceGrid/replication registry and node configuration files for an example of how to set timeouts (look for the "-t" options in the endpoints). You could also simply try starting the registry with --Ice.Override.Timeout=10000 (the value is in milliseconds) and see if it helps.
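
    For illustration, endpoint timeouts set with "-t" might look like the fragment below. This is a hypothetical sketch, not a copy of the demo configuration: the port 4061, the host name master-host, the IceGrid/Locator identity, and the 10000 ms values are all assumed placeholders to adapt to the actual deployment.

    ```
    # Hypothetical excerpt; ports, host name, and timeout values are placeholders.
    # "-t 10000" sets a 10-second timeout (milliseconds) on each endpoint.
    IceGrid.Registry.Client.Endpoints=tcp -p 4061 -t 10000
    IceGrid.Registry.Server.Endpoints=tcp -t 10000
    IceGrid.Registry.Internal.Endpoints=tcp -t 10000
    IceGrid.Node.Endpoints=tcp -t 10000
    Ice.Default.Locator=IceGrid/Locator:tcp -h master-host -p 4061 -t 10000
    ```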

    Cheers,
    Benoit.
  • Gosh,

    fortunately I did not try to remove the master files.

    The master registry trying to poll adapters registered before the shutdown was probably the issue.

    With Ice.Override.Timeout set to 1ms, the registry node started successfully.

    Thank you,

    Julien
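
For anyone hitting a similar hang: the diagnosis above rests on whether a connection attempt to a stale endpoint fails immediately or stalls (for example behind a firewall that silently drops packets). The following is a minimal Python sketch for checking that from the registry host. It is independent of Ice; the function name probe_endpoint and the default 5-second timeout are assumptions made for illustration.

```python
import socket


def probe_endpoint(host: str, port: int, timeout_s: float = 5.0) -> str:
    """Try one TCP connection and report how the attempt ended.

    Returns "connected", "refused", "timed out", or "error: <reason>".
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout_s)  # bound the wait instead of blocking indefinitely
    try:
        sock.connect((host, port))
        return "connected"
    except (socket.timeout, TimeoutError):
        # No answer within timeout_s: the kind of stall that can hang a
        # client configured without timeouts.
        return "timed out"
    except ConnectionRefusedError:
        # The host answered but nothing listens on the port: fails fast.
        return "refused"
    except OSError as exc:
        return f"error: {exc.strerror or exc}"
    finally:
        sock.close()
```

Calling probe_endpoint("172.19.64.4", 39984), the address from the connect call in the strace output, would show which case applies: "refused" means the attempt fails fast, while "timed out" points to the kind of stall that timeouts on the Ice endpoints are meant to bound.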