Document Type
Article
Publication Date
3-29-2012
Abstract
As computational clusters rapidly grow in both size and complexity, system reliability and, in particular, application resilience have become increasingly important factors in maintaining efficiency and delivering improved compute performance over predecessor systems. One commonly used mechanism for providing application fault tolerance in parallel systems is checkpointing.
We demonstrate the impact of sub-optimal checkpoint intervals on application efficiency via simulation with real workload data. We find that application efficiency is relatively insensitive to error in estimation of an application's mean time to interrupt (AMTTI), a parameter central to calculating the optimal checkpoint interval. This result corroborates the trends predicted by previous analytical models. We also find that erring on the side of overestimation may be preferable to underestimation.
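The insensitivity described above can be illustrated with a first-order analytical model rather than the paper's full simulation. The sketch below uses Young's well-known approximation for the optimal checkpoint interval, τ ≈ √(2δM), where δ is the checkpoint cost and M is the AMTTI, together with a first-order efficiency estimate (useful work minus checkpoint overhead δ/τ and expected rework τ/2M). The checkpoint cost and AMTTI figures are illustrative assumptions, not values from the paper:

```python
import math

def optimal_interval(delta, mtti):
    """Young's first-order approximation of the optimal checkpoint
    interval: tau_opt = sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * delta * mtti)

def efficiency(tau, delta, mtti):
    """First-order application efficiency: 1 minus checkpoint
    overhead (delta/tau) minus expected rework lost to failures
    (tau / (2 * M))."""
    return 1.0 - delta / tau - tau / (2.0 * mtti)

# Illustrative assumptions: 5-minute checkpoint cost, 24-hour true AMTTI.
delta = 5.0 * 60.0           # checkpoint cost in seconds
true_mtti = 24.0 * 3600.0    # true AMTTI in seconds

# Vary the AMTTI *estimate* used to pick the interval, while the true
# AMTTI stays fixed, to see how efficiency degrades with estimation error.
for error_factor in (0.25, 0.5, 1.0, 2.0, 4.0):
    estimate = error_factor * true_mtti
    tau = optimal_interval(delta, estimate)
    eff = efficiency(tau, delta, true_mtti)
    print(f"AMTTI estimate x{error_factor:>4}: "
          f"interval = {tau / 60:6.1f} min, efficiency = {eff:.4f}")
```

Even a 4x error in the AMTTI estimate shifts the interval by only a factor of 2 (the square root) and costs roughly two percentage points of efficiency in this model, which is consistent with the flat efficiency curves the paper reports. Note that this first-order model is symmetric in over- versus under-estimation; the preference for overestimation emerges from the simulations with real workload data.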
We further discuss how application monitoring and resilience frameworks can benefit from this insensitivity to error in AMTTI estimates. Finally, we discuss the importance of application monitoring at exascale and conclude with a discussion of challenges faced in the use of checkpointing at such extreme scales.
Recommended Citation
William M. Jones, John T. Daly, and Nathan DeBardeleben. 2012. Application monitoring and checkpointing in HPC: looking towards exascale systems. In Proceedings of the 50th annual ACM Southeast Conference (ACMSE '12). Association for Computing Machinery, New York, NY, USA, 262–267. https://doi.org/10.1145/2184512.2184574. Available at https://digitalcommons.coastal.edu/computing/3/