What is the best setup for minimizing the impact when a grid node is down?
When a grid node is down (crash, maintenance), what is the best strategy to minimize the effect on clients and avoid useless connection attempts? I should mention that we would like to distribute the load across the replicated nodes as evenly as possible, so we would prefer not to cache the connection on the proxy and to use the random endpoint selection strategy (as described in section 37.3 of the manual). I am aware that we can configure frequent registry lookups to refresh the endpoints, but I wonder how quickly the registry becomes aware that a node is unavailable, and whether it then stops including that node's endpoints in its replies. I also wonder whether there is a concept of a "bad endpoint", i.e. an endpoint that is temporarily skipped after several consecutive failures. Any other suggestions?
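
To make the "bad endpoint" idea concrete, here is a minimal sketch of the behaviour in question, written in plain Python rather than against the middleware's API. The class `EndpointPool` and the parameters `max_failures` and `cooldown` are hypothetical names chosen for illustration; nothing here is taken from the manual.

```python
import random
import time


class EndpointPool:
    """Random endpoint selection that temporarily skips failing endpoints.

    Hypothetical illustration only: not part of any real middleware API.
    """

    def __init__(self, endpoints, max_failures=3, cooldown=30.0,
                 clock=time.monotonic):
        self.endpoints = list(endpoints)
        self.max_failures = max_failures  # consecutive failures before skipping
        self.cooldown = cooldown          # seconds an endpoint stays skipped
        self.clock = clock                # injectable so it can be tested
        self.failures = {e: 0 for e in self.endpoints}
        self.bad_until = {}               # endpoint -> time it becomes usable

    def healthy(self):
        """Return the endpoints that are not currently marked as bad."""
        now = self.clock()
        return [e for e in self.endpoints
                if self.bad_until.get(e, float("-inf")) <= now]

    def pick(self):
        """Pick a random healthy endpoint; fall back to all if none are."""
        candidates = self.healthy() or self.endpoints
        return random.choice(candidates)

    def report_failure(self, endpoint):
        """Count a failed attempt and mark the endpoint bad at the threshold."""
        self.failures[endpoint] += 1
        if self.failures[endpoint] >= self.max_failures:
            self.bad_until[endpoint] = self.clock() + self.cooldown
            self.failures[endpoint] = 0

    def report_success(self, endpoint):
        """A successful call clears the failure count and any bad mark."""
        self.failures[endpoint] = 0
        self.bad_until.pop(endpoint, None)
```

With this kind of bookkeeping the client would still pick endpoints at random, so load stays spread across replicas, but it would stop trying a node that has just failed `max_failures` times in a row until `cooldown` seconds have passed.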