Master/Slave Replica Endpoint config

Hi,

Configuration:
Ice: version 3.3.1
OS: Win XP, RHEL 5.2

Query 1:
I have configured 7 PCs with one set as MASTER and the others as REPLICAs.

Master = 192.168.1.10
Replica1 = 192.168.1.20
Replica2 = 192.168.1.30
...
Replica6 = 192.168.1.70

I have set each IceGrid Node config file Default Locator with all 7 TCP endpoints.

e.g. Ice.Default.Locator=DemoIceGrid/Locator:tcp -h 192.168.1.10 -p 4061:tcp -h 192.168.1.20 -p 4061 ....etc

I guess I should also add timeouts for each endpoint?

I'm not quite sure what values I should specify for:
IceGrid.Registry.Client.Endpoints
IceGrid.Registry.Server.Endpoints
IceGrid.Registry.Internal.Endpoints

Should I only specify the endpoint for the registry host, or should I include endpoints for the other registry hosts?

eg. For the Master:

IceGrid.Registry.Client.Endpoints=tcp -p 12000 -h 192.168.1.10
IceGrid.Registry.Server.Endpoints=tcp -h 192.168.1.10
IceGrid.Registry.Internal.Endpoints=tcp -h 192.168.1.10

Should I include timeouts for all endpoints?

I assume a similar approach is used for IceGrid.Node.Endpoints to only use the specified network interface.

N.B. I wish to minimise any network delays by excluding unused network interfaces.

Query 2:
If I run my system with only the MASTER node active, my client application on the MASTER node is very inconsistent in startup delay, i.e. anywhere from instantaneous up to 30 sec. If I reduce the Default Locator to a single endpoint (i.e. the Master) the application starts up very responsively. Can you explain what might be causing this behaviour? i.e. when there is a list of locator endpoints, does Ice randomly attempt to use a Registry from the list which may not be active? What trace attributes could help to diagnose the problem?

Query 3:
In a properly functioning system, is it detrimental to start a REPLICA before the MASTER? I assume the first REPLICA node that starts will assume the role of MASTER until the system is restarted? Or, is it best practice to always start the MASTER first? Ideally we would like our operators to use any combination of PCs for a given session/day, and not be required to always use the MASTER.

Cheers John

Comments

  • benoit (Rennes, France)
    Hi John,
    jharriot wrote: »
    Hi,

    Configuration:
    Ice: version 3.3.1
    OS: Win XP, RHEL 5.2

    Query 1:
    I have configured 7 PCs with one set as MASTER and the others as REPLICAs.

    Master = 192.168.1.10
    Replica1 = 192.168.1.20
    Replica2 = 192.168.1.30
    ...
    Replica6 = 192.168.1.70

    I have set each IceGrid Node config file Default Locator with all 7 TCP endpoints.

    e.g. Ice.Default.Locator=DemoIceGrid/Locator:tcp -h 192.168.1.10 -p 4061:tcp -h 192.168.1.20 -p 4061 ....etc

    I guess I should also add timeouts for each endpoint?

    Yes, otherwise without timeouts the Ice invocations might block for quite some time before failing (depending on the network and the reason why the server endpoint doesn't respond).
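    For example, each endpoint in the locator proxy can carry a -t timeout in milliseconds; the 5000 ms value below is only illustrative:

    Ice.Default.Locator=DemoIceGrid/Locator:tcp -h 192.168.1.10 -p 4061 -t 5000:tcp -h 192.168.1.20 -p 4061 -t 5000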
    I'm not quite sure what values I should specify for:
    IceGrid.Registry.Client.Endpoints
    IceGrid.Registry.Server.Endpoints
    IceGrid.Registry.Internal.Endpoints

    Should I only specify the endpoint for the registry host, or should I include endpoints for the other registry hosts?

    These properties specify the endpoints the registry will listen on. They are not used to connect to other services. So yes, you should only specify endpoints for the host where the registry is running. You can specify multiple endpoints if the host has multiple network interfaces.
    eg. For the Master:

    IceGrid.Registry.Client.Endpoints=tcp -p 12000 -h 192.168.1.10
    IceGrid.Registry.Server.Endpoints=tcp -h 192.168.1.10
    IceGrid.Registry.Internal.Endpoints=tcp -h 192.168.1.10

    Should I include timeouts for all endpoints?

    These endpoints look fine, and yes, you should also specify timeouts for them (see the C++ demo/IceGrid/replication demo included with your Ice distribution for an example where timeouts are set).
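    As a rough sketch for the master (the -t values here are illustrative, not taken from the demo):

    IceGrid.Registry.Client.Endpoints=tcp -p 12000 -h 192.168.1.10 -t 10000
    IceGrid.Registry.Server.Endpoints=tcp -h 192.168.1.10 -t 10000
    IceGrid.Registry.Internal.Endpoints=tcp -h 192.168.1.10 -t 10000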
    I assume a similar approach is used for IceGrid.Node.Endpoints to only use the specified network interface.

    Correct.
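    For example, on Replica1 the node could be limited to its own interface with something like this (the timeout value is again just an illustration):

    IceGrid.Node.Endpoints=tcp -h 192.168.1.20 -t 10000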
    N.B. I wish to minimise any network delays by excluding unused network interfaces.

    Specifying endpoints with a -h <interface> option ensures that Ice will only listen on this interface and it will only "generate" proxies containing those endpoints.
    Query 2:
    If I run my system with only the MASTER node active, my client application on the MASTER node is very inconsistent in startup delay, i.e. anywhere from instantaneous up to 30 sec. If I reduce the Default Locator to a single endpoint (i.e. the Master) the application starts up very responsively. Can you explain what might be causing this behaviour? i.e. when there is a list of locator endpoints, does Ice randomly attempt to use a Registry from the list which may not be active? What trace attributes could help to diagnose the problem?

    Yes, your Ice client most likely tries to connect to unreachable endpoints before trying one that works. You can set Ice.Trace.Network=2 to trace connection establishment. By default, the Ice runtime randomly selects an endpoint for connection establishment. This can be changed with the <proxy>.EndpointSelection property. For example, you could set the following for the locator proxy:
    Ice.Default.Locator.EndpointSelection=Ordered
    

    See 36.3.1 Endpoint Selection for more information.

    You could also try to reduce the connection establishment timeout to minimize the problem (using the Ice.Override.ConnectTimeout property). You could also consider reducing the number of endpoints, since it seems quite unlikely that 6 of the 7 replicas would be down at the same time.
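    For example, a global connect timeout of a few seconds (the value is an assumption, tune it to your network; it is expressed in milliseconds):

    Ice.Override.ConnectTimeout=5000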
    Query 3:
    In a properly functioning system, is it detrimental to start a REPLICA before the MASTER? I assume the first REPLICA node that starts will assume the role of MASTER until the system is restarted? Or, is it best practice to always start the MASTER first? Ideally we would like our operators to use any combination of PCs for a given session/day, and not be required to always use the MASTER.

    Starting slaves before the master is fine. Note however that such a slave won't assume the role of master. The role of master is only assumed by the IceGrid registry that is configured to be the master (i.e., the one whose IceGrid.Registry.ReplicaName property is NOT set, or is set to "Master").
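    For example (any replica name other than "Master" makes a registry a slave; the name below is just for illustration):

    # On the master registry
    IceGrid.Registry.ReplicaName=Master
    # On a slave registry
    IceGrid.Registry.ReplicaName=Replica1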

    I recommend reading the following two sections in the IceGrid chapter for more information on the replication of the registry and slave promotion: 38.12 Registry Replication and 38.21.5 Slave Promotion.

    Cheers,
    Benoit.
  • Hi,

    I have applied the recommended changes and I am still puzzled. I have attached a couple of documents to help describe my configuration. The PDF file contains an example of the Windows and Linux IceGrid config files and a screenshot of a ProcessManager tool used to examine node activity (similar to the IceGridManager Java app). The Excel spreadsheet summarises the key properties across all nodes. Unfortunately, the attachment upload facility keeps rejecting my upload attempts. Perhaps send me an email and I will reply with the zip file.

    To summarise my configuration:
    7 Stations. Each station has a Windows and Linux node.

    To summarise the IceGrid locator:
    Windows Nodes: 1 Node is the Master and 6 others are replicas.
    Linux Nodes: Each node looks to its companion Windows node as the default locator.

    The intention with this configuration is to allow one station (Windows + Linux nodes) to function standalone. The Master node is used to deploy the IceGrid configuration.

    I have observed the following behaviours:

    Scenario 1:
    If I power on all stations, wait for a minute or so and then launch the ProcessManager on the Master station, all nodes are visible in the ProcessManager node list. :D

    Scenario 2:
    If I power on only Station 1, it takes about 5 minutes for its nodes to appear in the ProcessManager. If I then power on Station 2 and its ProcessManager, I have to wait another few minutes before the Station 2 nodes appear. It is also noticeable that the node lists are not identical in the Station 1 and Station 2 ProcessManagers. :confused:

    Why is it taking so long for each IceGrid registry to discover the nodes?
    During the few minutes of delay I can't even connect the ProcessManager to the IceGrid registry. It's as if it's blocking any session connections while it's doing something.

    Due to the nature of the system, the number of stations powered on will vary from 1 to 7. So it doesn't seem appropriate to power on 7 stations to only use one or two.

    Any thoughts on how I can achieve a flexible solution would be appreciated.

    Cheers John
  • Hi,

    Using Scenario 2 from the previous posting, I examined the Ice.Trace.Network=2 output on the replica node and noticed that it repeatedly failed to connect to its own collocated registry service. It also attempted to connect to other endpoints in the list; however, because those were inactive, connection failures were expected.

    I then reduced the default locator endpoint list to contain a single host address (its own local registry). The trace log still showed failures to connect to itself, but more puzzling were attempts to contact other registries not listed in the default locator property. It's as if the change to the default locator in the config file has been ignored to some extent.

    Cheers John
  • benoit (Rennes, France)
    Hi John,

    When a slave registry starts up, it first tries to connect to the master before accepting requests. When the master starts up, it first tries to connect with all its "known" slaves before accepting requests (known slaves are slaves that previously connected with it; they are saved in an internal database). Finally, when a node starts up, it tries to connect to registry replicas before accepting requests (it finds those registry replicas by querying the locator defined with the Ice.Default.Locator property).

    The way those connections are attempted is not really optimal right now and could be improved (for example, the master will try connecting to each slave one after the other... if many slaves are down it can take a significant amount of time to try connecting to all of them). We will look into improving this for the next release; if you are interested in testing those changes, I can send you a source patch.

    In the meantime, there are a few things you can try to help minimize those delays (a combined config sketch follows the list):
    • set Ice.Override.ConnectTimeout to a low value. If all your machines are on the same local network, a timeout of 2s to 5s should be good enough.
    • set a lower node/registry session timeout, for example: IceGrid.Registry.NodeSessionTimeout=15 and IceGrid.Registry.ReplicaSessionTimeout=15
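    A rough sketch of those settings together in a registry configuration (the values are only suggestions, adjust them for your network):

    # Fail quickly when a peer is unreachable (milliseconds)
    Ice.Override.ConnectTimeout=5000
    # Shorter node/registry session timeouts (seconds)
    IceGrid.Registry.NodeSessionTimeout=15
    IceGrid.Registry.ReplicaSessionTimeout=15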

    Perhaps you should also consider simplifying your deployment. If a station is supposed to be standalone, do all stations need to share the same IceGrid registry? Would it perhaps be simpler if each station had its own master registry?

    Cheers,
    Benoit.
  • Hi,

    Thanks for reply.

    As the stations are not always standalone, I will look at adopting a single master and slave combination to avoid a single point of failure.

    Cheers John