Document Type
Article
Publication Date
3-29-2012
Abstract
As computational clusters rapidly grow in both size and complexity, system reliability and, in particular, application resilience have become increasingly important factors in maintaining efficiency and delivering improved compute performance over predecessor systems. One commonly used mechanism for providing application fault tolerance in parallel systems is checkpointing.
We demonstrate the impact of sub-optimal checkpoint intervals on application efficiency via simulation with real workload data. We find that application efficiency is relatively insensitive to error in estimation of an application's mean time to interrupt (AMTTI), a parameter central to calculating the optimal checkpoint interval. This result corroborates the trends predicted by previous analytical models. We also find that erring on the side of overestimation may be preferable to underestimation.
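The insensitivity described above can be illustrated with a first-order analytical model rather than the paper's full simulation. The sketch below uses Young's well-known approximation for the optimal checkpoint interval, τ ≈ √(2δM), where δ is the checkpoint cost and M is the AMTTI, together with a first-order efficiency estimate (useful work minus checkpoint overhead δ/τ and expected rework τ/2M). The checkpoint cost and AMTTI figures are illustrative assumptions, not values from the paper:

```python
import math

def optimal_interval(delta, mtti):
    """Young's first-order approximation of the optimal checkpoint
    interval: tau_opt = sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * delta * mtti)

def efficiency(tau, delta, mtti):
    """First-order application efficiency: 1 minus checkpoint
    overhead (delta/tau) minus expected rework lost to failures
    (tau / (2 * M))."""
    return 1.0 - delta / tau - tau / (2.0 * mtti)

# Illustrative assumptions: 5-minute checkpoint cost, 24-hour true AMTTI.
delta = 5.0 * 60.0           # checkpoint cost in seconds
true_mtti = 24.0 * 3600.0    # true AMTTI in seconds

# Vary the AMTTI *estimate* used to pick the interval, while the true
# AMTTI stays fixed, to see how efficiency degrades with estimation error.
for error_factor in (0.25, 0.5, 1.0, 2.0, 4.0):
    estimate = error_factor * true_mtti
    tau = optimal_interval(delta, estimate)
    eff = efficiency(tau, delta, true_mtti)
    print(f"AMTTI estimate x{error_factor:>4}: "
          f"interval = {tau / 60:6.1f} min, efficiency = {eff:.4f}")
```

Even a 4x error in the AMTTI estimate shifts the interval by only a factor of 2 (the square root) and costs roughly two percentage points of efficiency in this model, which is consistent with the flat efficiency curves the paper reports. Note that this first-order model is symmetric in over- versus under-estimation; the preference for overestimation emerges from the simulations with real workload data.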
We further discuss how application monitoring and resilience frameworks can benefit from this insensitivity to error in AMTTI estimates. Finally, we discuss the importance of application monitoring at exascale and conclude with a discussion of challenges faced in the use of checkpointing at such extreme scales.
Recommended Citation
William M. Jones, John T. Daly, and Nathan DeBardeleben. 2012. Application monitoring and checkpointing in HPC: looking towards exascale systems. In Proceedings of the 50th annual ACM Southeast Conference (ACMSE '12). Association for Computing Machinery, New York, NY, USA, 262–267. https://doi.org/10.1145/2184512.2184574. Available at https://digitalcommons.coastal.edu/computing/3/