Members’ Activities: Australian National University
Algorithm-Based Fault Tolerance Framework
The Australian National University (ANU) is contributing to Open Petascale Libraries (OPL) by developing a framework for algorithm-based fault tolerance (ABFT), in collaboration with Fujitsu Laboratories of Europe and funded by an Australian Research Council (ARC) Linkage grant (LP110200410). The work is motivated by the observation that petascale supercomputers already contain hundreds of thousands of computational cores, and that the transition to exascale is likely to increase this number by 2-3 orders of magnitude. As a result, the mean time to failure of the system as a whole is almost certain to fall below the average run time of a simulation. Applications will therefore need to be resilient to the failure of one or more compute nodes: they must be capable of completing execution and of producing a result whose accuracy is within some specified tolerance.
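The scaling argument above can be made concrete with a small calculation. The sketch below assumes that components fail independently with exponentially distributed lifetimes, so that the system mean time to failure is the component value divided by the number of components. The function names and the figures used (a five-year component mean time to failure, and the component counts) are illustrative assumptions and are not taken from any measured system.

```python
import math


def system_mttf_hours(component_mttf_hours, n_components):
    """System MTTF assuming independent, exponentially distributed failures."""
    return component_mttf_hours / n_components


def prob_run_without_failure(run_hours, component_mttf_hours, n_components):
    """Probability that a run of the given length sees no component failure."""
    system_rate = n_components / component_mttf_hours  # failures per hour
    return math.exp(-system_rate * run_hours)


# Illustrative (assumed) figure: each component fails once every 5 years on average.
COMPONENT_MTTF_HOURS = 5 * 8760

for n in (1_000, 100_000, 10_000_000):
    mttf = system_mttf_hours(COMPONENT_MTTF_HOURS, n)
    p_ok = prob_run_without_failure(24.0, COMPONENT_MTTF_HOURS, n)
    print(f"{n:>10} components: system MTTF {mttf:.4f} h, "
          f"P(24 h run completes without failure) = {p_ok:.3g}")
```

Under these assumptions the system mean time to failure shrinks in direct proportion to the component count, which is why a long simulation on a very large machine should expect to encounter at least one failure.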
The ABFT method under development is based on the sparse grid combination method. This method approximates the solution to a problem on a fine computational mesh by computing solutions on several coarser meshes and then combining them to give a solution of similar accuracy to that expected on the fine mesh. Its advantage for fault tolerance is that the failure of a single compute node affects only one of the coarse meshes, so the solutions on the remaining meshes can still be combined to approximate the overall solution with only a limited loss of accuracy.
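The way a lost coarse mesh can be absorbed is easiest to see in the combination coefficients. The sketch below is a minimal illustration and not the ANU framework itself: it uses the inclusion-exclusion formula commonly given for the generalised combination technique, in which the coefficient of the grid with level index i is the sum over all z in {0,1}^d of (-1)^|z| whenever i + z is also in the index set. The function names, the zero-based levels, and the two-dimensional example are assumptions made for illustration. The formula also assumes the index set is downward closed, which holds here because the lost grid lies on the top layer.

```python
from itertools import product


def classical_index_set(n, dim=2):
    """All level indices whose components sum to at most n (downward closed)."""
    return {idx for idx in product(range(n + 1), repeat=dim) if sum(idx) <= n}


def combination_coefficients(index_set):
    """Inclusion-exclusion coefficients for a downward-closed set of grid levels.

    Only grids with a non-zero coefficient are returned.
    """
    index_set = set(index_set)
    dim = len(next(iter(index_set)))
    coeffs = {}
    for idx in index_set:
        c = 0
        for z in product((0, 1), repeat=dim):
            neighbour = tuple(i + dz for i, dz in zip(idx, z))
            if neighbour in index_set:
                c += (-1) ** sum(z)
        if c != 0:
            coeffs[idx] = c
    return coeffs


# Fault-free case: grids on the two top diagonals are combined.
full = classical_index_set(2)
healthy = combination_coefficients(full)

# Hypothetical failure: the solution on grid (1, 1) is lost, so that grid is
# dropped from the index set and the coefficients are recomputed.
degraded = combination_coefficients(full - {(1, 1)})

print("healthy :", sorted(healthy.items()))
print("degraded:", sorted(degraded.items()))
```

In both cases the coefficients sum to one, so the recombined solution remains a consistent approximation. The degraded combination draws on coarser grids, which is where the limited loss of accuracy comes from.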
ANU initially plans to exploit the ABFT technology in two HPC application software packages: ANUGA and GENE. ANUGA, developed by ANU and Geoscience Australia, is used to simulate tsunamis and storm surges. GENE is used by ANU researchers to run plasma physics simulations, with the aim of furthering the understanding of nuclear fusion in the search for sustainable energy. Initial work on ANUGA within OPL has focused on ensuring that it scales to many thousands of computational cores, so that it can run large-scale simulations of tsunami inundation on petascale supercomputers. GENE already scales much better, so it may be the more suitable application for initial tests of the ABFT framework.
While initial development and testing of fault tolerance strategies will be carried out at the application level, the longer-term plan is to abstract the ABFT methods into an open-source library that can run on petascale-class machines and be called by any application. This library will be made available via OPL. More details on this work can be found here.