## Projects

ADAC Projects

## Applications Working Group

*Tjerk P. Straatsma and Coen de Graaf*

GronOR is a non-orthogonal configuration interaction (NOCI) application developed by the University of Groningen, Oak Ridge National Laboratory, and University Rovira i Virgili. Its target scientific application is the description of the electronic structure of molecular assemblies in terms of basis functions that can be interpreted as particular combinations of molecular electronic states. The electronic states obtained in this basis can be interpreted directly in terms of molecular states and, with appropriate unitary transformations, the canonical molecular orbitals can be transformed into a description resembling the valence bond picture of molecular electronic structure in terms of Lewis structures. The method also allows an accurate description of processes that occur locally, such as the excitation of one molecule in a nanostructure. The basic methodology consists of generating spin-adapted antisymmetrized combinations (SAACs) of molecular wavefunctions, followed by a NOCI calculation that uses these SAACs as many-electron basis functions (MEBFs). The NOCI wavefunction of a cluster or ensemble of molecules is written as a linear combination of MEBFs, each formed as a SAAC of the wavefunctions for a particular state of each molecule in the ensemble. Using fully relaxed molecular electronic states and correlated molecular wavefunctions has the advantage that orbital relaxation and local correlation effects can be properly included in the description of the locally excited states, while avoiding lengthy CI expansions. The implementation of the NOCI method in GronOR is interfaced with OpenMolcas to obtain the CASSCF CI vector, the state-specific CASSCF orbitals, and the required two-electron integrals. The evaluation of the Hamiltonian matrix elements involves many contributions from determinant pairs that can be calculated independently.
For the massively parallel implementation of the algorithm, we adopted a task-based approach with a master/worker model to achieve load balancing and fault-resilient execution. (http://www.gronor.org)
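
Because the MEBFs are not orthogonal, expanding the wavefunction in them leads to a generalized eigenvalue problem, H c = E S c, where S is the overlap matrix between the basis functions. The following is a minimal NumPy sketch of that final step only, using symmetric (Löwdin) orthogonalization. The function name `solve_noci` and the 3x3 matrices are made up for illustration; this is not GronOR code and does not reflect how GronOR evaluates the matrix elements.

```python
import numpy as np

def solve_noci(H, S):
    """Solve H c = E S c for symmetric H and symmetric positive-definite S."""
    # Build S^(-1/2) from the eigendecomposition of the overlap matrix.
    s, U = np.linalg.eigh(S)
    X = U @ np.diag(s ** -0.5) @ U.T
    # In the orthogonalized basis the problem is an ordinary eigenproblem.
    E, C_orth = np.linalg.eigh(X @ H @ X)
    # Transform the eigenvectors back to the non-orthogonal basis.
    C = X @ C_orth
    return E, C

# Illustrative (made-up) Hamiltonian and overlap matrices for three MEBFs.
H = np.array([[-1.00, -0.20, -0.05],
              [-0.20, -0.80, -0.10],
              [-0.05, -0.10, -0.60]])
S = np.array([[1.00, 0.10, 0.02],
              [0.10, 1.00, 0.05],
              [0.02, 0.05, 1.00]])

energies, coeffs = solve_noci(H, S)
```

The columns of `coeffs` are the expansion coefficients of each state in the non-orthogonal basis, normalized so that `coeffs.T @ S @ coeffs` is the identity.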

*Dmitry I. Lyakh*

ExaTENSOR is an advanced software library for numerical tensor algebra workloads on large-scale heterogeneous HPC platforms, including HPC clusters and leadership HPC systems, with applications in electronic structure simulations, quantum circuit simulations, and generic data analytics. ExaTENSOR provides a set of user-level API functions as well as an internal programming language (TAProL) that can be used to perform basic tensor algebra operations, e.g., tensor contractions and tensor additions, on distributed HPC architectures equipped with accelerators. Although the immediate focus was specifically on NVIDIA GPU accelerators, the ExaTENSOR design is based on hardware virtualization and on separating the algorithm expression from hardware and system specifics, an approach inspired by prior work (CLUSTER, ACES III/IV). Essentially, the ExaTENSOR parallel runtime is a domain-specific virtual machine (virtual processor) capable of directly interpreting and executing basic tensor algebra operations in a platform-independent way. The ExaTENSOR hardware virtualization mechanism encapsulates the complexity of the node architecture and the system scale, thus, in principle, making it possible to run the same numerical tensor algebra workload efficiently on many different HPC platforms. Internally, the ExaTENSOR parallel runtime implements hierarchical task parallelism, adjusting the task granularity for each computing unit. ExaTENSOR supports accelerators in a plug-and-play way: a new hardware accelerator only requires a single-node library that implements the required tensor algebra primitives. This driver library is then integrated under the hardware-agnostic interface called TAL-SH (shared-memory tensor algebra layer).
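
To make the notion of a tensor contraction primitive concrete, here is a small NumPy sketch that contracts two rank-4 tensors over two shared indices by reshaping them into matrices and performing one matrix multiply. It uses only NumPy; the function name `contract_via_gemm` and the tensor shapes are chosen for illustration and are not part of the ExaTENSOR or TAL-SH API.

```python
import numpy as np

def contract_via_gemm(A, B):
    """Compute C[a,b,i,j] = sum over c,d of A[a,b,c,d] * B[c,d,i,j]."""
    a, b, c, d = A.shape
    c2, d2, i, j = B.shape
    if (c, d) != (c2, d2):
        raise ValueError("contracted dimensions do not match")
    # Fold the free and contracted index pairs into matrix rows and columns.
    A_mat = A.reshape(a * b, c * d)
    B_mat = B.reshape(c * d, i * j)
    # One matrix multiply performs the whole contraction.
    return (A_mat @ B_mat).reshape(a, b, i, j)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4, 5, 6))
B = rng.standard_normal((5, 6, 2, 7))

C = contract_via_gemm(A, B)
# Reference result from NumPy's general-purpose contraction routine.
C_ref = np.einsum("abcd,cdij->abij", A, B)
```

In this example the contracted indices are already adjacent, so a reshape is enough; a general contraction would first need an index permutation.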

*Rio Yokota*

In classical molecular dynamics simulations, the electrostatic (Coulomb) potential induces a global interaction between atoms. When calculated directly, this requires a computational cost of O(N^2) for N atoms. A common fast algorithm for calculating electrostatic forces is the particle-mesh Ewald (PME) method, which derives its speed from the efficiency of the FFT for problems with high uniformity. The recent trend in hardware architectures toward increasing parallelism poses a challenge for these FFT-based algorithms. Therefore, alternative algorithms such as the fast multipole method (FMM) and the multilevel summation method (MSM) are being considered. If we are to transition to such alternatives, a common interface between them must be developed. The developers of NAMD and GROMACS are interested in this approach.
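
The O(N^2) cost of the direct calculation comes from visiting every pair of atoms once. A minimal sketch, assuming open (non-periodic) boundaries and units in which the Coulomb constant is 1; the function name and the three-charge test system are illustrative only:

```python
import numpy as np

def coulomb_energy_direct(positions, charges):
    """Sum q_i * q_j / r_ij over all N*(N-1)/2 distinct pairs."""
    n = len(charges)
    energy = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(positions[i] - positions[j])
            energy += charges[i] * charges[j] / r
    return energy

# Three point charges at made-up positions.
positions = np.array([[0.0, 0.0, 0.0],
                      [1.0, 0.0, 0.0],
                      [0.0, 2.0, 0.0]])
charges = np.array([1.0, -1.0, 1.0])

energy = coulomb_energy_direct(positions, charges)
```

Doubling the number of atoms roughly quadruples the number of pair terms, which is the scaling that PME, FMM, and MSM are designed to avoid.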

*Arnold Tharrington*

We are developing a long-range electrostatic solver that is performance portable and targets HPC centers with hybrid CPU-GPU architectures. The solver uses the Multilevel Summation Method (MSM), which is a local (nearest-neighbor communication), hierarchical, grid-based algorithm. The current MSM development activities can be grouped into two broad categories. The first category is disentangling the MSM algorithm from the underlying HPC hardware architecture. This is primarily accomplished by software abstraction layers between the MSM algorithm and the CPU and GPU compute devices. This design feature helps performance portability by minimizing the number of code modifications needed for various HPC CPU/GPU architectures and for the ongoing improvements in GPU hardware and the CUDA API. The second category is the implementation of a CUDA direct kernel for the direct stencil calculation on the grid hierarchies. This direct kernel uses large stencils with minimum dimensions of 13x13x13 and will explore the use of unified memory. Importantly, the direct kernel is not a simple function but a C++ class that is designed by composition and derivation from an abstract base class. This design structure helps attain performance portability and permits rapid implementation of other types of large stencil calculations.
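
As a rough illustration of what a direct stencil calculation on one grid level involves, the following NumPy sketch applies a cubic stencil to a 3-D grid of charges, treating values outside the grid as zero. It is a plain CPU reference written for this page, not the CUDA kernel or C++ class described above; the function name is hypothetical and the stencil weights are random placeholders rather than MSM kernel values. Only the 13-point side length is taken from the text.

```python
import numpy as np

def apply_stencil_direct(grid_charge, stencil):
    """potential[i,j,k] = sum over a,b,c of stencil[a+r,b+r,c+r] * q[i+a,j+b,k+c]."""
    r = stencil.shape[0] // 2                 # stencil half-width
    nx, ny, nz = grid_charge.shape
    padded = np.pad(grid_charge, r)           # zero padding on every side
    potential = np.zeros(grid_charge.shape, dtype=float)
    for a in range(-r, r + 1):
        for b in range(-r, r + 1):
            for c in range(-r, r + 1):
                # Shifted view of the charges weighted by one stencil entry.
                shifted = padded[r + a:r + a + nx,
                                 r + b:r + b + ny,
                                 r + c:r + c + nz]
                potential += stencil[a + r, b + r, c + r] * shifted
    return potential

rng = np.random.default_rng(0)
stencil = rng.standard_normal((13, 13, 13))   # placeholder weights
grid_charge = np.zeros((16, 16, 16))
grid_charge[8, 8, 8] = 1.0                    # a single unit charge

potential = apply_stencil_direct(grid_charge, stencil)
```

With a single unit charge on the grid, the resulting potential is a mirrored copy of the stencil centered on that charge, which gives a quick correctness check.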

*Ying Wai Li*

OWL is a scientific software package for performing large-scale Monte Carlo simulations to study finite-temperature properties of materials. Originally developed to implement a special Monte Carlo method called Wang-Landau sampling (hence its name OWL: Oak-Ridge Wang-Landau), OWL now provides a collection of commonly used parallel, classical Monte Carlo algorithms suitable for running on high performance computers (HPC). OWL is written in C++ with an object-oriented, modular software architecture that disentangles the implementation of the physical systems from the algorithms. This design not only allows for extension to various modern and parallel Monte Carlo algorithms; more importantly, it provides two modes for calculating the physical observables of the systems in question. OWL can be run in stand-alone mode for user-implemented model Hamiltonians, or in “driver” mode, in which it drives an external package as a library for energy calculations. This encourages reuse of community codes, and is particularly useful when the energies are calculated by first-principles methods such as density functional theory. OWL adopts the heterogeneous “MPI+X” programming model. It has an MPI task manager to arrange computer resources for different tasks as well as for the external library. While the Monte Carlo algorithms reside on the MPI level and scalability is achieved by employing multiple walkers, energy calculations are parallelized using both the MPI and the “X” (X = OpenMP, CUDA, etc.) levels. As of today, OWL provides interfaces to Quantum ESPRESSO and to an ORNL-developed density functional theory code, Locally Self-Consistent Multiple Scattering (LSMS), to perform first-principles-based statistical mechanics simulations. OWL is under active development; support for and interfaces to other software packages are on the way.
We intend to make OWL available to the community on GitHub, with a website that provides detailed build instructions and documentation.
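
For readers unfamiliar with Wang-Landau sampling, the sketch below estimates the logarithm of the density of states, ln g(E), for a deliberately trivial toy model: n independent two-state spins whose "energy" is the number of up spins, so the exact answer is a binomial coefficient. It is a serial, pure-Python illustration with made-up parameter choices (flatness threshold, final modification factor, batch size) and is not taken from OWL.

```python
import math
import random

def wang_landau(n_spins=8, flatness=0.8, ln_f_final=1e-3, seed=1):
    """Estimate ln g(E) for n_spins independent spins, E = number of up spins."""
    rng = random.Random(seed)
    spins = [0] * n_spins
    energy = 0
    ln_g = [0.0] * (n_spins + 1)
    ln_f = 1.0                                  # modification factor
    while ln_f > ln_f_final:
        hist = [0] * (n_spins + 1)
        flat = False
        while not flat:
            for _ in range(1000):
                i = rng.randrange(n_spins)
                new_energy = energy + (1 - 2 * spins[i])
                # Accept the flip with probability min(1, g(E_old) / g(E_new)).
                delta = ln_g[energy] - ln_g[new_energy]
                if delta >= 0 or rng.random() < math.exp(delta):
                    spins[i] = 1 - spins[i]
                    energy = new_energy
                # Update the estimate and histogram at the current energy.
                ln_g[energy] += ln_f
                hist[energy] += 1
            flat = min(hist) >= flatness * sum(hist) / len(hist)
        ln_f /= 2.0                             # refine once the histogram is flat
    # Normalize so that ln g(0) = 0, since exactly one state has E = 0.
    return [x - ln_g[0] for x in ln_g]

ln_g_est = wang_landau()
ln_g_exact = [math.log(math.comb(8, k)) for k in range(9)]
```

Comparing `ln_g_est` with `ln_g_exact` shows how closely the random walk recovers the known density of states. In a realistic application the energy would come from a model Hamiltonian or an external code.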

*Markus Eisenbach*

LSMS is a first-principles, density functional theory based electronic structure code targeted mainly at materials applications. LSMS calculates the local spin density approximation to the diagonal part of the electron Green’s function. The electron/spin density and energy are easily determined once the Green’s function is known. Linear scaling with system size is achieved in LSMS by using several unique properties of the real-space multiple scattering approach to the Green’s function: 1) the Green’s function is “nearsighted”; therefore, each domain, i.e., atom, requires only information from nearby atoms in order to calculate the local value of the Green’s function; 2) the Green’s function is analytic; therefore, the required integral over electron energy levels can be analytically continued onto a contour in the complex plane, where the imaginary part of the energy further restricts its range; and 3) to generate the local electron/spin density, an atom needs only a small amount of information (phase shifts) from those atoms within the range of the Green’s function. The very compact nature of the information that needs to be passed between processors and the high efficiency of the dense linear algebra algorithms employed to calculate the Green’s function are responsible for the superior performance of the LSMS code. In addition to nonrelativistic and scalar relativistic calculations, LSMS allows the solution of the fully relativistic Dirac equation for electron scattering. Thus, all relativistic effects, including spin-orbit interactions, are accounted for, which allows the calculation of magnetocrystalline anisotropy energies and Dzyaloshinskii-Moriya antisymmetric exchange interactions. The energies for arbitrary non-collinear magnetic spin configurations can be calculated using self-consistently determined Lagrange multipliers that constrain the local magnetic order.
LSMS utilizes multiple levels of parallelism: 1) distributed-memory parallelism via MPI to parallelize over the atoms in the system; 2) on-node shared-memory parallelism, both over atoms and over energy points on the integration contour; and 3) GPU acceleration, when available, for the calculation of the multiple scattering matrix. An additional level of parallelism is provided by the capability to perform Wang-Landau Monte Carlo sampling of magnetic and chemical order. This allows first-principles statistical physics calculations of magnetic and ordering phase transitions. By utilizing multiple Monte Carlo walkers, the LSMS scalability is extended by multiple orders of magnitude. To provide better scalability of the recently developed full-potential version of LSMS, new approaches to solving the Poisson equation are being explored to obtain electrostatic potentials from space-filling charge densities. Efforts are underway to release LSMS under an open source license and to make it available to the wider scientific community.
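
The idea of moving the energy integral onto a contour in the complex plane can be illustrated with a toy Green’s function whose trace is a sum of simple poles, G(z) = Σ 1/(z − ε). The sketch below counts the levels between the bottom of the band and the Fermi energy by integrating along a semicircle in the upper half-plane with Gauss-Legendre quadrature. The level values, the function name, and the number of quadrature points are assumptions for illustration only; none of this is taken from LSMS.

```python
import numpy as np

def count_states_below(levels, e_bottom, e_fermi, n_points=32):
    """Return -(1/pi) * Im of the contour integral of Tr G(z) from e_bottom to e_fermi."""
    center = 0.5 * (e_bottom + e_fermi)
    radius = 0.5 * (e_fermi - e_bottom)
    # Gauss-Legendre nodes and weights on [-1, 1].
    x, w = np.polynomial.legendre.leggauss(n_points)
    # Map x to the angle theta, running from pi (at e_bottom) to 0 (at e_fermi).
    theta = 0.5 * np.pi * (1.0 - x)
    z = center + radius * np.exp(1j * theta)
    dz_dx = 1j * radius * np.exp(1j * theta) * (-0.5 * np.pi)
    # Trace of the toy Green's function at each contour point.
    g = np.array([np.sum(1.0 / (zk - levels)) for zk in z])
    integral = np.sum(w * g * dz_dx)
    return -integral.imag / np.pi

# Made-up energy levels; three of them lie between e_bottom and e_fermi.
levels = np.array([-1.5, -0.9, -0.4, 0.5, 1.2])
n_occ = count_states_below(levels, e_bottom=-2.0, e_fermi=0.0)
```

Away from the real axis the integrand is smooth, so a few dozen quadrature points suffice here, whereas integrating along the real axis would have to resolve every pole.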

**Applications Working Group Monthly Seminars**

| Date | Host | Attendance | Speaker | Title |
| --- | --- | --- | --- | --- |
| Oct 23, 2020 | CSC | 20 | Georgios Markomanolis | Towards a Benchmark Methodology |
| | CSCS | | Anton Kozhevnikov | HPC libraries for the electronic structure domain |
| Nov 27, 2020 | JFZ | 27 | Andreas Herten | Enabling Applications for one of Europe’s Largest GPU Systems: The JUWELS Booster Early Access Program |
| | | | Jaro Hokkanen | Performance portability on HPC accelerator architectures with modern techniques: The ParFlow blueprint |
| Jan 21, 2021 | AIST | 16 | Peng Chen | High-resolution Image Reconstruction on ABCI supercomputer |
| | | | Hiroki Kanazashi | Optimizing Data Allocations to Optane DCPMM and DRAM for Billion-Scale Graphs |
| Feb 26, 2021 | RIKEN | 19 | Naoki Yoshioka | Simulation of quantum circuits in classical supercomputers |
| | | | Makoto Tsubokura | Viral Droplet/Aerosol Dispersion Simulation on the Supercomputer “Fugaku” and Fight Back against COVID-19 |
| Mar 26, 2021 | ORNL | 20 | Norbert Podhorszki | Codesign of I/O for Exascale data: Enhancing ADIOS for extreme-scale data movement |
| | | | Scott Klasky | MGARD: Hierarchical Compression of Scientific Data |
| Apr 22, 2021 | NCI | 10 | Rui Yang | Jupyter-Dask based python framework |
| | | | Matthew Downton | Experience scaling bioinformatics pipelines at NCI |
| Jun 25, 2021 | Univ Tokyo | 22 | Yohei Miki | Gravitational octree code performance on NVIDIA A100 |
| | | | Naoyuki Onodera | GPU acceleration of tracer dispersion simulation using the locally mesh-refined lattice Boltzmann method |
| Jul 23, 2021 | ANL | 24 | Ye Luo | Enabling Performance Portable QMCPACK on Exascale Supercomputers |
| | | | JaeHyuk Kwack | Porting GAMESS RI-MP2 mini-app from CPU to GPU with roofline performance analysis |
| Aug 26, 2021 | LLNL | 7 | Sam Jacobs, Brian Van Essen | A Scalable Deep Learning Toolkit for Leadership-class Large Scale Scientific Machine |
| | | | Stephanie Brink, Olga Pearce | Pinpointing Performance Bottlenecks with Hatchet |
| Oct 22, 2021 | CEA | 13 | Luigi Genovese | Broadening the scope of large-scale DFT calculations |
| Dec 3, 2021 | KTH | 22 | Niclas Jansson | Refactoring legacy Fortran applications to leverage modern heterogeneous architectures in extreme-scale CFD |
| | | | Szilárd Páll | Bringing GROMACS to exascale-generation heterogeneous architectures |
| Feb 24, 2022 | Tokyo Tech | | | |
| Mar 25, 2022 | RIKEN | 8 | Takami Tohyama | Applications of time-dependent density-matrix renormalization group to strongly correlated electron systems |
| | | | Kenta Sueki | Ensemble Kalman Filter-Based Parameter Estimation for Atmospheric Models: Parameter Uncertainty for Reliable and Efficient Estimation |
| Apr 22, 2022 | LLNL | 19 | Brian Van Essen | Second-order Optimization with Sub-graph Parallelism on LBANN for COVID-19 Small Molecule Drug Design |
| May 26, 2022 | AIST | | Erik Deumens | High-Performance Computing in the World of Data and Artificial Intelligence |
| | | | Ricardo Macedo | Building User-level Storage Data Planes with PAIO |
| Jun 24, 2022 | Univ Tokyo | 12 | Kohei Fujita | GPU porting of implicit solver with Green’s function-based neural networks |
| | | | Yohei Miki | Porting N-body code to AMD GPU and performance evaluation |
| Jul 22, 2022 | JFZ | 18 | Dennis Willsch | Jülich Universal Quantum Computer Simulator |
| | | | Bartosz Kostrzewa | Enabling state-of-the-art lattice QCD simulations using a legacy code on life support and the QUDA library |
| Sep 1, 2022 | ORNL | 20 | Bronson Messer | An Introduction to the Frontier: Building and Using the World’s First Exascale Computer |
| Sep 23, 2022 | CSCS | | Sebastian Keller | Cornerstone Octree: domain decomposition on GPUs for particle-based simulations |
| Oct 28, 2022 | CSC | 12 | Martti Louhivuori, Cristian-Vasile Achim, Jaro Hokkanen | LUMI GPU porting of three scientific applications |
| Dec 1, 2022 | NCI | 8 | Ben Evans | Data Science and AI/ML activities at ANU |
| | | | Rui Yang | Data science and AI/ML software |
| | | | Maruf Ahmed | FourCastNet, IceNet, and ImageNet |
| Jan 27, 2023 | Tokyo Tech | | Rio Yokota | Training large vision+language models as an INCITE project |
| | | | Qianxiang Ma | Scalable Linear Time Dense Direct Solver for 3-D Problems Without Trailing Sub-Matrix Dependencies |
| Mar 23, 2023 | ORNL | 11 | Woong Shin | Post-Exascale HPC Energy Efficiency – Increasing Energy Awareness |
| Apr 28, 2023 | RIKEN | | Andrès Xavier Rubio Proano | Performance Metrics for Managing Heterogeneous Memory in HPC Applications |
| | | | Eisuke Kawashima | Development of Massively Parallelized Quantum Chemical Software NTChem on Fugaku |
| May 26, 2023 | ANL | 16 | Justin Wozniak | Deep Learning Workflows with CANDLE |
| Jun 22, 2023 | Univ Tokyo | 11 | Kazuya Yamazaki | Porting an atmospheric model to GPUs using OpenACC |
| | | | Ryohei Kobayashi | Accelerating astrophysics simulation with GPUs and FPGAs |
| Jul 28, 2023 | CSCS | | Raffaele Solcà, Mikael Simberg | Experiences with C++ std::execution: DLA-Future |
| Aug 25, 2023 | CSC | | Martti Louhivuori | Header Only Porting: a light-weight header-only library for CUDA/HIP porting |