...failed due to memory allocation errors. The failure caused it to stop passing data but did not properly trigger a graceful fail over to the redundant system as the memory allocation errors were present on the failover system as well. Clearly...
http://www.computerweekly.com/blogs/stuart_king/2009/01/cloud-outage.html
...Negotiator and Collector, via HAD, and the Schedd, via Schedd Fail-over, can have their state replicated to allow for graceful fail-over upon service disruption. Database Support: All data about jobs and resources can be stored in a database...
http://www.redhat.com/mrg/grid/features/