ADAC Projects

Applications Working Group

Tjerk P. Straatsma and Coen de Graaf
GronOR is a non-orthogonal configuration interaction (NOCI) application, developed by the University of Groningen, Oak Ridge National Laboratory and University Rovira i Virgili. The target scientific application is the description of the electronic structure of molecular assemblies in terms of basis functions that can be interpreted as a particular combination of molecular electronic states. The electronic states obtained in this basis can be interpreted directly in terms of molecular states, and with appropriate unitary transformations, the canonical molecular orbitals can be transformed into a description resembling the valence bond picture for the description of the electronic structure of molecules in terms of Lewis structures. The method can also give an accurate description of processes that occur locally, like excitation of one molecule in the nanostructure. The basic methodology consists of generation of spin-adapted, antisymmetrized combinations (SAACs) of molecular wavefunctions, followed by a non-orthogonal configuration interaction (NOCI) calculation using these SAACs as many-electron basis functions (MEBFs). The NOCI wavefunction of a cluster/ensemble of molecules is written as a linear combination of MEBFs, each formed as a SAAC of the wavefunctions for a particular state for each molecule in the ensemble. Using fully relaxed molecular electronic states and correlated molecular wavefunctions has the advantage that orbital relaxation and local correlation effects can be properly included in the description of the locally excited states, while avoiding lengthy CI expansions. The implementation of the NOCI method in GronOR is interfaced with OpenMolcas to obtain the CASSCF CI vector, the state-specific CASSCF orbitals, and the required two-electron integrals. The evaluation of the Hamiltonian matrix elements involves many contributions in the form of determinant pairs that can be calculated independently.
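The linear combination of MEBFs described above can be written compactly as follows. The notation is an assumption of this sketch, not taken from GronOR: \Phi_I denotes the MEBFs, c_{Ik} the expansion coefficients of state k, and S the overlap matrix that appears because the MEBFs are not orthogonal.

```latex
\Psi_k = \sum_I c_{Ik}\,\Phi_I ,
\qquad
H_{IJ} = \langle \Phi_I | \hat{H} | \Phi_J \rangle ,
\qquad
S_{IJ} = \langle \Phi_I | \Phi_J \rangle ,
\qquad
\mathbf{H}\,\mathbf{c}_k = E_k\,\mathbf{S}\,\mathbf{c}_k
```

Each H_{IJ} and S_{IJ} is a sum of determinant-pair contributions, which is why the matrix construction splits into many independent pieces.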
For the massively parallel implementation of the algorithm we adopted a task-based approach with a master/worker model to achieve load balancing and fault-resilient execution.
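As a rough illustration of the task-based master/worker idea, the sketch below hands out independent tasks from a shared queue and requeues a task whose worker failed. It is not GronOR code: threads stand in for the parallel workers, and `pair_contribution`, `run_master_worker`, and the `fail_once` fault injection are hypothetical names invented for this example.

```python
import queue
import threading


def pair_contribution(pair):
    """Hypothetical stand-in for one determinant-pair contribution."""
    i, j = pair
    return 1.0 / (1 + i + j)


def run_master_worker(tasks, n_workers=4, max_retries=2, fail_once=frozenset()):
    """Hand out independent tasks dynamically and requeue tasks that failed.

    Tasks listed in fail_once raise on their first attempt, which simulates
    a worker fault (an assumption made purely for demonstration).
    """
    todo = queue.Queue()
    for task in tasks:
        todo.put((task, 0))
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task, attempt = todo.get_nowait()
            except queue.Empty:
                return  # nothing left to do for this worker
            try:
                if task in fail_once and attempt == 0:
                    raise RuntimeError("simulated worker fault")
                value = pair_contribution(task)
            except RuntimeError:
                # Fault resilience: put the task back so it is retried.
                if attempt < max_retries:
                    todo.put((task, attempt + 1))
                continue
            with lock:
                results[task] = value

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Sum in a fixed order so the total does not depend on thread timing.
    total = sum(results[task] for task in sorted(results))
    return total, results
```

Because workers pull the next task only when they are free, slow tasks do not stall the others, which is the load-balancing property the text refers to.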

Dmitry I. Lyakh
ExaTENSOR is designed as an advanced software library for large-scale numerical tensor algebra workloads on heterogeneous HPC platforms, including HPC clusters and leadership HPC systems, with applications in electronic structure simulations, quantum circuit simulations, and generic data analytics. ExaTENSOR provides a set of user-level API functions as well as an internal programming language (TAProL) which can be used for performing basic tensor algebra operations, e.g., tensor contractions, tensor additions, etc., on distributed HPC architectures equipped with accelerators. Although the immediate focus was specifically on NVIDIA GPU accelerators, the ExaTENSOR design is based on hardware virtualization and separation of the algorithm expression from the hardware and system specifics, an approach inspired by some prior works (CLUSTER, ACES III/IV). Essentially, the ExaTENSOR parallel runtime is a domain-specific virtual machine (virtual processor) capable of directly interpreting and executing basic tensor algebra operations in a platform-independent way. The ExaTENSOR hardware virtualization mechanism encapsulates the complexity of the node architecture and the system scale, thus, in principle, making it possible to run the same numerical tensor algebra workload efficiently on many different HPC platforms. Internally, the ExaTENSOR parallel runtime implements hierarchical task parallelism, thus properly adjusting the task granularity for each computing unit. ExaTENSOR supports accelerators in a plug-and-play way: a new hardware accelerator only requires a single-node library that implements the required tensor algebra primitives. This driver library is then integrated under the hardware-agnostic interface called TAL-SH (shared-memory tensor algebra layer).
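For readers unfamiliar with the basic operation, the following is a minimal pure-Python sketch of a tensor contraction. It does not use ExaTENSOR, TAProL, or TAL-SH; the `contract` function, its index-label string, and the dictionary storage are assumptions chosen only to show what a contraction computes.

```python
from itertools import product


def contract(spec, a, b, dims):
    """Contract two tensors stored as {index_tuple: value} dictionaries.

    spec uses one letter per index, e.g. "ik,kj->ij"; every label that is
    missing from the output is summed over. dims maps each label to its
    extent. This is a hypothetical helper, not a library API.
    """
    inputs, out_labels = spec.split("->")
    a_labels, b_labels = inputs.split(",")
    labels = sorted(set(a_labels + b_labels))
    result = {}
    # Loop over every combination of index values and accumulate products.
    for values in product(*(range(dims[label]) for label in labels)):
        env = dict(zip(labels, values))
        ia = tuple(env[label] for label in a_labels)
        ib = tuple(env[label] for label in b_labels)
        io = tuple(env[label] for label in out_labels)
        result[io] = result.get(io, 0.0) + a[ia] * b[ib]
    return result


if __name__ == "__main__":
    a = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
    b = {(0, 0): 5.0, (0, 1): 6.0, (1, 0): 7.0, (1, 1): 8.0}
    # A matrix product is the simplest contraction: sum over the shared k.
    print(contract("ik,kj->ij", a, b, {"i": 2, "j": 2, "k": 2}))
```

A production library would block the tensors and dispatch each block contraction to an optimized kernel; the nested loop here only fixes the semantics.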

Rio Yokota
In classical molecular dynamics simulations, the electrostatic (Coulomb) potential induces a global interaction between atoms. When calculated directly, this requires a computational cost of O(N^2) for N atoms. A common fast algorithm for calculating electrostatic forces is the particle-mesh Ewald (PME) method, which derives its speed from the efficiency of FFT for problems with high uniformity. The recent trend in hardware architectures with increasing parallelism poses a challenge for these FFT-based algorithms. Therefore, alternative algorithms such as the fast multipole method (FMM) and multilevel summation (MSM) are being considered. If we are to transition to such alternatives, a common interface between these alternatives must be developed. The developers of NAMD and GROMACS are interested in this approach.
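The O(N^2) cost of the direct calculation is easy to see in code. The sketch below is a minimal, non-periodic direct sum with the Coulomb constant set to 1; the function name `direct_coulomb` and the returned pair counter are assumptions added for illustration and are not part of any of the packages mentioned.

```python
import math


def direct_coulomb(positions, charges):
    """Direct pairwise Coulomb sum (open boundaries, Coulomb constant = 1).

    Returns the total energy, the potential at each particle, and the
    number of pair interactions evaluated, which is N*(N-1)/2.
    """
    n = len(positions)
    potential = [0.0] * n
    pair_count = 0
    for i in range(n):
        for j in range(i + 1, n):
            r = math.dist(positions[i], positions[j])
            potential[i] += charges[j] / r
            potential[j] += charges[i] / r
            pair_count += 1
    # Each pair appears in two potentials, hence the factor 1/2.
    energy = 0.5 * sum(q * phi for q, phi in zip(charges, potential))
    return energy, potential, pair_count
```

Doubling the number of atoms roughly quadruples `pair_count`, which is the scaling that PME, FMM, and MSM are designed to avoid.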

Arnold Tharrington
We are developing a long-range electrostatic solver that is performance portable and targets HPC centers with hybrid CPU-GPU architectures. The solver uses the Multilevel Summation Method (MSM), which is a local (nearest-neighbor communication), hierarchical, grid-based algorithm. The current MSM development activities can be grouped into two broad categories. The first category is disentangling the MSM algorithm from the underlying HPC architecture hardware. This is primarily accomplished by software abstraction layers between the MSM algorithm and the CPU and GPU compute devices. This design feature helps performance portability by minimizing the number of code modifications needed for various HPC CPU/GPU architectures and for the ongoing improvements in the GPU hardware and CUDA API. The second activity is the implementation of a CUDA direct kernel for the direct stencil calculation on the grid hierarchies. This direct kernel uses large stencils with minimum dimensions of 13x13x13 and will explore the use of unified memory. Importantly, the direct kernel is not a simple function but a C++ class that is designed by composition and derivation from an abstract base class. This design structure helps attain performance portability and permits rapid implementation of other types of large stencil calculations.
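The abstract-base-class design can be sketched as follows. The actual solver is a C++/CUDA code; this Python version only mirrors the structure, and the class names `StencilKernel` and `SoftenedCoulombKernel`, the softened 1/r weight, and the zero-padded boundaries are all assumptions made for the example.

```python
import math
from abc import ABC, abstractmethod
from itertools import product


class StencilKernel(ABC):
    """Abstract base: derived classes supply only the stencil weights."""

    def __init__(self, radius):
        self.radius = radius  # radius 6 would give a 13x13x13 stencil

    @abstractmethod
    def weight(self, dx, dy, dz):
        """Weight applied to the grid value at offset (dx, dy, dz)."""

    def apply(self, grid):
        """Apply the stencil to a 3-D nested-list grid with zero padding."""
        nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
        r = self.radius
        weights = {o: self.weight(*o) for o in product(range(-r, r + 1), repeat=3)}
        out = [[[0.0] * nz for _ in range(ny)] for _ in range(nx)]
        for x, y, z in product(range(nx), range(ny), range(nz)):
            acc = 0.0
            for (dx, dy, dz), w in weights.items():
                i, j, k = x + dx, y + dy, z + dz
                if 0 <= i < nx and 0 <= j < ny and 0 <= k < nz:
                    acc += w * grid[i][j][k]
            out[x][y][z] = acc
        return out


class SoftenedCoulombKernel(StencilKernel):
    """Hypothetical example kernel: 1/sqrt(r^2 + softening^2) on a grid."""

    def __init__(self, radius, spacing=1.0, softening=1.0):
        super().__init__(radius)
        self.spacing = spacing
        self.softening = softening

    def weight(self, dx, dy, dz):
        r2 = (dx * dx + dy * dy + dz * dz) * self.spacing ** 2
        return 1.0 / math.sqrt(r2 + self.softening ** 2)
```

A different large-stencil calculation would then need only a new subclass with its own `weight`, while the traversal in `apply` stays shared.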

Ying Wai Li
OWL is scientific software for performing large-scale Monte Carlo simulations for the study of finite-temperature properties of materials. Originally developed to implement a special Monte Carlo method called Wang-Landau sampling (hence its name OWL: Oak-Ridge Wang-Landau), OWL now provides a collection of commonly used parallel, classical Monte Carlo algorithms suitable for running on high performance computing (HPC) systems. OWL is written in C++ with an object-oriented, modular software architecture that disentangles the implementation of the physical systems from the algorithms. This design not only allows for the extension to various modern and parallel Monte Carlo algorithms; more importantly, it provides two modes for the calculation of physical observables of the systems in question: OWL can be run in the stand-alone mode for user-implemented model Hamiltonians, or in the “driver” mode, which drives an external package as a library for energy calculations. This encourages reuse of community codes, and is particularly useful when the energies are calculated by first-principles methods such as density functional theory. OWL adopts the heterogeneous “MPI+X” programming model. It has an MPI task manager to arrange computer resources for different tasks as well as for the external library. While Monte Carlo algorithms reside on the MPI level and scalability is achieved by employing multiple walkers, energy calculations are parallelized using both the MPI and the “X” (X = OpenMP, CUDA, etc.) levels. As of today, OWL provides interfaces to Quantum ESPRESSO and an ORNL-developed density functional theory code, Locally Self-Consistent Multiple Scattering (LSMS), to perform first-principles based statistical mechanics simulations. OWL is under active development; support for and interfaces to other software packages are on the way.
We intend to make OWL available to the community on GitHub, with a website that provides detailed build instructions and documentation.
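The separation between algorithm and physical system can be illustrated with a small Wang-Landau sketch in which the sampler only ever sees an energy callback. This is not OWL code: the function names are hypothetical, the toy model (energy = number of up spins) is chosen for brevity, and the usual histogram-flatness check is replaced by a fixed number of steps per stage to keep the example short.

```python
import math
import random


def wang_landau(energy_fn, initial_state, propose_fn,
                n_stages=15, steps_per_stage=5000, seed=0):
    """Estimate ln g(E) with a simplified Wang-Landau random walk.

    energy_fn is the only link to the physical system; in a driver-style
    setup it could wrap a call into an external energy code (assumption).
    """
    rng = random.Random(seed)
    state = initial_state
    e = energy_fn(state)
    ln_g = {e: 0.0}
    ln_f = 1.0  # modification factor, halved after every stage
    for _ in range(n_stages):
        for _ in range(steps_per_stage):
            trial = propose_fn(state, rng)
            e_trial = energy_fn(trial)
            # Accept with probability min(1, g(E) / g(E_trial)).
            delta = ln_g.get(e, 0.0) - ln_g.get(e_trial, 0.0)
            if delta >= 0 or rng.random() < math.exp(delta):
                state, e = trial, e_trial
            ln_g[e] = ln_g.get(e, 0.0) + ln_f
        ln_f /= 2.0
    shift = min(ln_g.values())
    return {energy: value - shift for energy, value in ln_g.items()}


def count_up(state):
    """Toy model Hamiltonian: energy equals the number of up spins."""
    return sum(state)


def flip_one(state, rng):
    """Propose a move by flipping one randomly chosen spin."""
    i = rng.randrange(len(state))
    return state[:i] + (1 - state[i],) + state[i + 1:]
```

For this toy model the exact density of states is the binomial coefficient, so the estimate from `wang_landau(count_up, (0,) * 6, flip_one)` can be compared against known values; swapping `count_up` for another callback changes the physics without touching the sampler.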

Markus Eisenbach
LSMS is a first-principles, density functional theory based electronic structure code targeted mainly at materials applications. LSMS calculates the local spin density approximation to the diagonal part of the electron Green’s function. The electron/spin density and energy are easily determined once the Green’s function is known. Linear scaling with system size is achieved in LSMS by using several unique properties of the real-space multiple scattering approach to the Green’s function: 1) the Green’s function is “nearsighted”, therefore each domain, i.e. atom, requires only information from nearby atoms in order to calculate the local value of the Green’s function; 2) the Green’s function is analytic, therefore the required integral over electron energy levels can be analytically continued onto a contour in the complex plane where the imaginary part of the energy further restricts its range; and 3) to generate the local electron/spin density an atom needs only a small amount of information, phase shifts, from those atoms within the range of the Green’s function. The very compact nature of the information that needs to be passed between processors and the high efficiency of the dense linear algebra algorithms employed to calculate the Green’s function are responsible for the superior performance of the LSMS code. In addition to non-relativistic and scalar relativistic calculations, LSMS allows the solution of the fully relativistic Dirac equation for electron scattering. Thus, all relativistic effects including spin-orbit interactions are accounted for, which allows the calculation of magnetocrystalline anisotropy energies and Dzyaloshinskii-Moriya antisymmetric exchange interactions. The energies for arbitrary non-collinear magnetic spin configurations can be calculated using self-consistently determined Lagrange multipliers that constrain the local magnetic order.
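The relation between the Green’s function and the density, and the contour deformation mentioned in point 2, can be summarized by the standard textbook expressions below. The symbols (E_B for the bottom of the valence band, E_F for the Fermi energy, C for a contour in the upper half plane joining them) are assumptions of this sketch, spin indices are omitted, and the exact form used inside LSMS may differ.

```latex
n(\mathbf{r})
  = -\frac{1}{\pi}\,\operatorname{Im} \int_{E_B}^{E_F} G(\mathbf{r},\mathbf{r};E)\,\mathrm{d}E
  = -\frac{1}{\pi}\,\operatorname{Im} \int_{C} G(\mathbf{r},\mathbf{r};z)\,\mathrm{d}z
```

The second equality holds because G is analytic in the upper half plane; away from the real axis G is a smoother and shorter-ranged function, so fewer energy points and fewer neighboring atoms are needed.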
LSMS utilizes multiple levels of parallelism: 1) distributed memory parallelism via MPI to parallelize over the atoms in the system; 2) on-node shared memory parallelism, both over atoms and over energy points on the integration contour; and 3) GPU acceleration of the calculation of the multiple scattering matrix when available. An additional level of parallelism is provided by the capability to perform Wang-Landau Monte Carlo sampling of magnetic and chemical order. This allows the first-principles statistical physics calculation of magnetic and ordering phase transitions. By utilizing multiple Monte Carlo walkers, the LSMS scalability is extended by multiple orders of magnitude. To provide better scalability of the recently developed full-potential version of LSMS, new approaches to solving the Poisson equation are being explored to obtain electrostatic potentials from space-filling charge densities. Efforts are underway to release LSMS under an open source license and to make it available to the wider scientific community.

Applications Working Group Monthly Seminars

Oct 23, 2020
Georgios Markomanolis: Towards a Benchmark Methodology
Anton Kozhevnikov: HPC libraries for the electronic structure domain

Nov 27, 2020
Andreas Herten: Enabling Applications for one of Europe’s Largest GPU Systems: The JUWELS Booster Early Access Program
Jaro Hokkanen: Performance portability on HPC accelerator architectures with modern techniques: The ParFlow blueprint

Jan 21, 2021
Peng Chen: High-resolution Image Reconstruction on ABCI supercomputer
Hiroki Kanazashi: Optimizing Data Allocations to Optane DCPMM and DRAM for Billion-Scale Graphs

Feb 26, 2021
Naoki Yoshioka: Simulation of quantum circuits in classical supercomputers
Makoto Tsubokura: Viral Droplet/Aerosol Dispersion Simulation on the Supercomputer “Fugaku” and Fight Back against COVID-19

Mar 26, 2021
Norbert Podhorszki: Codesign of I/O for Exascale data: Enhancing ADIOS for extreme-scale data movement
Scott Klasky: MGARD: Hierarchical Compression of Scientific Data

Apr 22, 2021
Rui Yang: Jupyter-Dask based python framework
Matthew Downton: Experience scaling bioinformatics pipelines at NCI

Jun 25, 2021 (Univ Tokyo)
Yohei Miki: Gravitational octree code performance on NVIDIA A100
Naoyuki Onodera: GPU acceleration of tracer dispersion simulation using the locally mesh-refined lattice Boltzmann method

Jul 23, 2021
Ye Luo: Enabling Performance Portable QMCPACK on Exascale Supercomputers
JaeHyuk Kwack: Porting GAMESS RI-MP2 mini-app from CPU to GPU with roofline performance analysis

Aug 26, 2021
Sam Jacobs, Brian Van Essen: A Scalable Deep Learning Toolkit for Leadership-class Large Scale Scientific Machine
Stephanie Brink, Olga Pearce: Pinpointing Performance Bottlenecks with Hatchet

Oct 22, 2021
Luigi Genovese: Broadening the scope of large-scale DFT calculations

Dec 3, 2021
Niclas Jansson: Refactoring legacy Fortran applications to leverage modern heterogeneous architectures in extreme-scale CFD
Szilárd Páll: Bringing GROMACS to exascale-generation heterogeneous architectures

Feb 24, 2022 (Tokyo Tech)

Mar 25, 2022
Takami Tohyama: Applications of time-dependent density-matrix renormalization group to strongly correlated electron systems
Kenta Sueki: Ensemble Kalman Filter-Based Parameter Estimation for Atmospheric Models: Parameter Uncertainty for Reliable and Efficient Estimation

Apr 22, 2022
Brian Van Essen: Second-order Optimization with Sub-graph Parallelism on LBANN for COVID-19 Small Molecule Drug Design

May 26, 2022
Erik Deumens: High-Performance Computing in the World of Data and Artificial Intelligence
Ricardo Macedo: Building User-level Storage Data Planes with PAIO

Jun 24, 2022 (Univ Tokyo)
Kohei Fujita: GPU porting of implicit solver with Green’s function-based neural networks
Yohei Miki: Porting N-body code to AMD GPU and performance evaluation

Jul 22, 2022
Dennis Willsch: Jülich Universal Quantum Computer Simulator
Bartosz Kostrzewa: Enabling state-of-the-art lattice QCD simulations using a legacy code on life support and the QUDA library

Sep 1, 2022
Bronson Messer: An Introduction to the Frontier: Building and Using the World’s First Exascale Computer

Sep 23, 2022
Sebastian Keller: Cornerstone Octree: domain decomposition on GPUs for particle-based simulations

Oct 28, 2022
Martti Louhivuori, Cristian-Vasile Achim, Jaro Hokkanen: LUMI GPU porting of three scientific applications

Dec 1, 2022
Ben Evans: Data Science and AI/ML activities at ANU
Rui Yang: Data science and AI/ML software
Maruf Ahmed: FourCastNet, IceNet, and ImageNet

Jan 27, 2023 (Tokyo Tech)
Rio Yokota: Training large vision+language models as an INCITE project
Qianxiang Ma: Scalable Linear Time Dense Direct Solver for 3-D Problems Without Trailing Sub-Matrix Dependencies

Mar 23, 2023
Woong Shin: Post-Exascale HPC Energy Efficiency – Increasing Energy Awareness

Apr 28, 2023
Andrès Xavier Rubio Proano: Performance Metrics for Managing Heterogeneous Memory in HPC Applications
Eisuke Kawashima: Development of Massively Parallelized Quantum Chemical Software NTChem on Fugaku

May 26, 2023
Justin Wozniak: Deep Learning Workflows with CANDLE

Jun 22, 2023 (Univ Tokyo)
Kazuya Yamazaki: Porting an atmospheric model to GPUs using OpenACC
Ryohei Kobayashi: Accelerating astrophysics simulation with GPUs and FPGAs

Jul 28, 2023
Raffaele Solcà, Mikael Simberg: Experiences with C++ std::execution: DLA-Future

Aug 25, 2023
Martti Louhivuori: Header Only Porting: a light-weight header-only library for CUDA/HIP porting