iSGTW - International Science Grid This Week

iSGTW 27 June 2007


Feature - Adaptive fault tolerance for improved reliability

Image generated using cosmology modeling software Enzo. FT-Pro is designed to help programs like Enzo run for longer by adaptively selecting a best‐fit action in response to failure prediction.
Image courtesy of Mike Norman at University of California, San Diego and Greg Bryan at Columbia University, New York, U.S. 
So there’s the good news. And then there’s the bad news.

The good news is that high performance computing systems are getting bigger.

And the bad news? As system size increases, Mean Time Before Failure is dramatically reduced: the number of hours you can run your application before everything grinds to a halt just keeps getting smaller.
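The scaling can be illustrated with a back-of-the-envelope sketch. It assumes every node fails independently at the same constant rate, which is an assumption of this example rather than a claim made in the article; under that assumption the system-wide mean time before failure is the per-node figure divided by the node count. The function name `system_mtbf_hours` and the per-node value are hypothetical.

```python
def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Estimate system MTBF, assuming independent nodes with identical,
    constant failure rates (a simplifying assumption for illustration)."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Failure rates add across nodes, so the mean time to the first failure
    # anywhere in the system shrinks in proportion to the node count.
    return node_mtbf_hours / node_count


if __name__ == "__main__":
    node_mtbf = 50_000.0  # hypothetical per-node MTBF in hours
    for nodes in (1, 96, 1_024, 10_000):
        hours = system_mtbf_hours(node_mtbf, nodes)
        print(f"{nodes:>6} nodes: about {hours:,.1f} hours before a failure")
```

Real clusters rarely meet the independence assumption exactly, so the numbers are only a rough guide to the trend.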

Yawei Li and Zhiling Lan of the Illinois Institute of Technology, U.S., want to change all that.

They have developed an adaptive fault management scheme, called FT-Pro, which has already improved the robustness of several real-world applications run on the TeraGrid, including Enzo, a software package for simulating cosmological structures, and GROMACS, a molecular dynamics package for studying molecular interactions.

“Applications like these are getting larger, running for longer, and using more processors,” Lan says. “But, since just one process failure can crash your entire application, these applications are extremely vulnerable to failure.”

The usual solution, says Lan, is either to undertake regular reactive checkpointing, or to be proactive and predict potential failures before they occur. Both options are fraught with complications.

“Regular checkpointing results in substantial performance overhead, while predicting failure can be very hit-and-miss. We wanted something that could combine the best of both these approaches.”
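To give a feel for the overhead Lan describes, the sketch below uses Young's first-order approximation for the checkpoint interval, sqrt(2 × checkpoint cost × MTBF). This is a standard textbook estimate, not something the article attributes to FT-Pro, and the function names and numbers are hypothetical.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order estimate of the compute time between checkpoints.
    Both arguments must use the same time unit."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def checkpoint_overhead_fraction(checkpoint_cost: float, interval: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints when one
    checkpoint follows every `interval` units of useful computation."""
    return checkpoint_cost / (interval + checkpoint_cost)


if __name__ == "__main__":
    cost_minutes = 5.0           # hypothetical time to write one checkpoint
    mtbf_minutes = 8.0 * 60.0    # hypothetical system MTBF of eight hours
    interval = young_interval(cost_minutes, mtbf_minutes)
    overhead = checkpoint_overhead_fraction(cost_minutes, interval)
    print(f"checkpoint every {interval:.1f} min, overhead {overhead:.1%}")
```

As the MTBF shrinks, the estimated interval shrinks with it, so a larger share of the run goes to writing checkpoints rather than computing.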

FT-Pro is Lan’s solution. The program works in conjunction with regular failure management tools, but introduces the flexibility of adaptive decision making: FT-Pro can make runtime decisions based on a user’s fault tolerance requests. 

Zhiling Lan is working to increase Mean Time Before Failure by introducing adaptive fault tolerance.
Image courtesy of Zhiling Lan
“We would like to see FT-Pro used to help avoid anticipated failures, and to help applications tolerate unforeseeable failures, so that the impact of any failure is kept to a minimum,” explains Lan.

The system works by allocating a couple of spare nodes, used as an extra hand to juggle jobs on and off nodes where failure is predicted.

Usually kept idle, these spare nodes make it possible to migrate work away from failing nodes, buying time for those nodes to recover or restart, and thus reducing application execution times.
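The idea of picking a best-fit action at runtime can be sketched as a simple expected-cost comparison. This is a minimal illustration only, not FT-Pro's actual decision logic: the `Action`, `Costs`, and `choose_action` names, the cost model, and the numbers are all hypothetical, and the sketch assumes that a checkpoint or a migration fully protects the work in progress.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SKIP = "skip"              # take no fault tolerance action this round
    CHECKPOINT = "checkpoint"  # save state so work can be resumed
    MIGRATE = "migrate"        # move processes to a spare node


@dataclass
class Costs:
    migrate: float     # seconds needed to move processes to a spare node
    checkpoint: float  # seconds needed to write a checkpoint
    rework: float      # seconds of work lost if an unprotected failure hits


def choose_action(failure_probability: float, spare_nodes_free: int,
                  costs: Costs) -> Action:
    """Return the action with the lowest expected cost (hypothetical model)."""
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must be between 0 and 1")
    expected = {
        # Skipping is free unless the predicted failure actually happens.
        Action.SKIP: failure_probability * costs.rework,
        # A checkpoint always costs its write time.
        Action.CHECKPOINT: costs.checkpoint,
    }
    # Migration is only an option while a spare node is still idle.
    if spare_nodes_free > 0:
        expected[Action.MIGRATE] = costs.migrate
    return min(expected, key=expected.get)


if __name__ == "__main__":
    costs = Costs(migrate=30.0, checkpoint=60.0, rework=600.0)
    print(choose_action(0.01, 2, costs))  # failure unlikely
    print(choose_action(0.50, 2, costs))  # failure likely, spare available
    print(choose_action(0.50, 0, costs))  # failure likely, no spare left
```

A real scheme would also have to account for how accurate the failure predictor is, which this sketch ignores.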

Trace-based experiments on the IA32 Linux cluster at Argonne National Laboratory (part of TeraGrid) have indicated that FT-Pro can effectively improve the performance of parallel applications in the presence of failures by avoiding anticipated failures and skipping unnecessary fault tolerance overhead.

For example, when running Enzo on the 96-node IA32 TeraGrid/ANL cluster, using FT-Pro reduced application completion time by up to 43% compared with relying purely on periodic checkpointing.

FT-Pro is supported in part by the U.S. National Science Foundation, an IIT startup fund, and a TeraGrid Wide-Roaming Allocation.

- Cristy Burne, iSGTW


