An Evaluation of ICE/IceGrid

Hi,

I have been evaluating other frameworks such as Globus and Condor, and I like ICE so far. I have a few questions that I need answered before I commit to using ICE for my project, which involves distributing work-units to nodes and collecting the results at the end.

(1) Failure Handling.
What happens if the process or machine executing a work-unit crashes? Can the system recover and arrange for that work-unit to be re-executed? What happens if a work-unit returns an incorrect result - is there any redundancy in the system to detect this?

(2) Inter-task communication.
Can work units communicate with one another during a distributed computation, or is inter-work-unit communication not allowed?

Comments

  • matthew
    NL, Canada
    (1) Failure Handling.
    What happens if the process or machine executing a work-unit crashes? Can the system recover and arrange for that work-unit to be re-executed? What happens if a work-unit returns an incorrect result - is there any redundancy in the system to detect this?

    If the server crashes, the IceGrid daemon will automatically restart it upon the next request (assuming you are using automatic activation). If you have multiple replicas, the crashed server may not be reactivated immediately, because the request may go to one of the other replicas. Depending on exactly what you are doing, you may need to take additional action in the client; for example, the client may need to restart the work-flow when it detects the error.
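As a rough illustration of the client-side handling mentioned above, here is a minimal sketch in Python of resubmitting a work-unit when a failure is detected. It deliberately does not use the Ice API: `WorkerUnavailable`, `run_with_retry`, and the `submit` callable are hypothetical names standing in for whatever exception and remote call your client actually uses.

```python
import time


class WorkerUnavailable(Exception):
    """Hypothetical stand-in for the failure the client sees when a server crashes."""


def run_with_retry(submit, work_unit, max_attempts=3, delay_seconds=0.0):
    """Call submit(work_unit), resubmitting the work-unit if the worker fails.

    `submit` is an assumed callable that performs the remote invocation and
    raises WorkerUnavailable when the server or connection is lost.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return submit(work_unit)
        except WorkerUnavailable as error:
            last_error = error
            # Give the server time to be reactivated before trying again.
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(
        f"work unit {work_unit!r} failed after {max_attempts} attempts"
    ) from last_error
```

Because a crashed worker may have partly executed the work-unit before failing, a scheme like this is only safe if re-executing a work-unit is harmless.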

    If you want redundancy to detect incorrect results, you have to implement this yourself.
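One common way to implement such redundancy yourself is to run the same work-unit on several workers and accept the majority answer. The sketch below assumes this voting approach; it is not something Ice or IceGrid provides, and `run_redundantly` and the `workers` callables are hypothetical names. It also assumes that results are hashable and compare equal when correct.

```python
from collections import Counter


def run_redundantly(workers, work_unit, quorum=None):
    """Run the same work-unit on every worker and return the majority result.

    `workers` is an assumed list of callables, each executing the work-unit
    on a different node. By default a strict majority must agree.
    """
    if not workers:
        raise ValueError("at least one worker is required")
    results = [worker(work_unit) for worker in workers]
    needed = quorum if quorum is not None else len(workers) // 2 + 1
    # most_common(1) yields the result reported by the most workers.
    value, count = Counter(results).most_common(1)[0]
    if count < needed:
        raise ValueError(f"no result reached the quorum of {needed}: {results!r}")
    return value
```

With three workers, for example, a single faulty node is outvoted by the other two, at the cost of tripling the computation.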
    (2) Inter-task communication.
    Can work units communicate with one another during a distributed computation, or is inter-work-unit communication not allowed?

    This is the whole idea of Ice :) Inter-work-unit communication is not only allowed but, I feel, an essential part of any next-generation grid.