Projects
ADAC Projects
Applications Working Group
Tjerk P. Straatsma and Coen de Graaf
GronOR is a nonorthogonal configuration interaction (NOCI) application, developed by the University of Groningen, Oak Ridge National Laboratory and University Rovira i Virgili. The target scientific application is the description of the electronic structure of molecular assemblies in terms of basis functions that can be interpreted as a particular combination of molecular electronic states. The electronic states obtained in this basis can be interpreted directly in terms of molecular states, and with appropriate unitary transformations, the canonical molecular orbitals can be transformed into a description resembling the valence bond picture, in which the electronic structure of molecules is described in terms of Lewis structures. The method also allows an accurate description of processes that occur locally, such as the excitation of one molecule in the nanostructure. The basic methodology consists of the generation of spin-adapted antisymmetrized combinations (SAACs) of molecular wavefunctions, followed by a NOCI calculation using these SAACs as many-electron basis functions (MEBFs). The NOCI wavefunction of a cluster/ensemble of molecules is written as a linear combination of MEBFs, each formed as a SAAC of the wavefunctions for a particular state of each molecule in the ensemble. Using fully relaxed molecular electronic states and correlated molecular wavefunctions has the advantage that orbital relaxation and local correlation effects can be properly included in the description of the locally excited states, while avoiding lengthy CI expansions. The implementation of the NOCI method in GronOR is interfaced with OpenMolcas to obtain the CASSCF CI vector, the state-specific CASSCF orbitals, and the required two-electron integrals. The evaluation of the Hamiltonian matrix elements involves many contributions in the form of determinant pairs that can be calculated independently.
For the massively parallel implementation of the algorithm we adopted a task-based approach with a master/worker model to achieve load balancing and fault-resilient execution. (http://www.gronor.org)
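The task-based master/worker idea can be illustrated with a minimal Python sketch. It is not GronOR code: the function names, the batch size, the retry policy and the toy `pair_contribution` are all assumptions made for illustration, and a thread pool stands in for the distributed workers. The master splits the independent determinant-pair contributions into batches, hands them to whichever worker is free, and resubmits a batch if a worker fails.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def pair_contribution(i, j):
    """Hypothetical stand-in for one determinant-pair contribution."""
    return 1.0 / (1 + i + j)


def run_tasks(n_dets, batch_size=4, n_workers=3, max_retries=2,
              worker=pair_contribution):
    """Master loop: batch the pairs, dispatch them, retry failed batches."""
    pairs = [(i, j) for i in range(n_dets) for j in range(i + 1)]
    batches = [pairs[k:k + batch_size] for k in range(0, len(pairs), batch_size)]

    def do_batch(batch):
        # Executed by a worker: sum the contributions of one batch.
        return sum(worker(i, j) for i, j in batch)

    total = 0.0
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Map each in-flight future to (batch, number of retries so far).
        pending = {pool.submit(do_batch, b): (b, 0) for b in batches}
        while pending:
            for fut in as_completed(list(pending)):
                batch, tries = pending.pop(fut)
                try:
                    total += fut.result()
                except Exception:
                    if tries >= max_retries:
                        raise
                    # Simple fault resilience: resubmit the failed batch.
                    pending[pool.submit(do_batch, batch)] = (batch, tries + 1)
    return total


if __name__ == "__main__":
    print(run_tasks(6))
```

Because idle workers pick up the next available batch, uneven batch costs are balanced dynamically, and a failed batch costs only its own recomputation.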
Dmitry I. Lyakh
ExaTENSOR is designed as an advanced software library for large-scale numerical tensor algebra workloads on heterogeneous HPC platforms, including HPC clusters and leadership HPC systems, with applications in electronic structure simulations, quantum circuit simulations, and generic data analytics. ExaTENSOR provides a set of user-level API functions as well as an internal programming language (TAProL), which can be used for performing basic tensor algebra operations, e.g., tensor contractions and tensor additions, on distributed HPC architectures equipped with accelerators. Although the immediate focus was specifically on NVIDIA GPU accelerators, the ExaTENSOR design is based on hardware virtualization and on separating the algorithm expression from hardware and system specifics, an approach inspired by prior work (CLUSTER, ACES III/IV). Essentially, the ExaTENSOR parallel runtime is a domain-specific virtual machine (virtual processor) capable of directly interpreting and executing basic tensor algebra operations in a platform-independent way. The ExaTENSOR hardware virtualization mechanism encapsulates the complexity of the node architecture and the system scale, thus, in principle, making it possible to run the same numerical tensor algebra workload efficiently on many different HPC platforms. Internally, the ExaTENSOR parallel runtime implements hierarchical task parallelism, adjusting the task granularity to suit each computing unit. ExaTENSOR supports accelerators in a plug-and-play way: a new hardware accelerator only requires a single-node library that implements the required tensor algebra primitives. This driver library is then integrated under the hardware-agnostic interface called TALSH (shared-memory tensor algebra layer).
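As a rough illustration of what a basic tensor contraction and its decomposition into tasks look like, the following plain-Python sketch contracts two small tensors and then repeats the contraction tile by tile. It does not use the ExaTENSOR or TALSH APIs; the function names, the index pattern and the tile size are assumptions chosen for the example.

```python
from itertools import product


def contract(A, B):
    """C[i][j] = sum over k, l of A[i][k][l] * B[k][l][j] (nested lists)."""
    ni, nk, nl = len(A), len(A[0]), len(A[0][0])
    nj = len(B[0][0])
    C = [[0.0] * nj for _ in range(ni)]
    for i, j in product(range(ni), range(nj)):
        C[i][j] = sum(A[i][k][l] * B[k][l][j]
                      for k in range(nk) for l in range(nl))
    return C


def contract_tiled(A, B, tile=2):
    """Same contraction, split into independent tasks over tiles of index i."""
    C = []
    for start in range(0, len(A), tile):
        # Each tile is a self-contained task that could run on its own device.
        C.extend(contract(A[start:start + tile], B))
    return C


if __name__ == "__main__":
    A = [[[i + k + l for l in range(2)] for k in range(3)] for i in range(5)]
    B = [[[k * l + j for j in range(4)] for l in range(2)] for k in range(3)]
    print(contract(A, B) == contract_tiled(A, B))
```

Choosing the tile size is the granularity decision: larger tiles suit a GPU, smaller ones a CPU core, while the result stays the same.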
Rio Yokota
In classical molecular dynamics simulations, the electrostatic (Coulomb) potential induces a global interaction between atoms. When calculated directly, this requires a computational cost of O(N^2) for N atoms. A common fast algorithm for calculating electrostatic forces is the particle-mesh Ewald (PME) method, which derives its speed from the efficiency of the FFT for problems with high uniformity. The recent trend in hardware architectures toward increasing parallelism poses a challenge for these FFT-based algorithms. Therefore, alternative algorithms such as the fast multipole method (FMM) and the multilevel summation method (MSM) are being considered. If we are to transition to such alternatives, a common interface between them must be developed. The developers of NAMD and GROMACS are interested in this approach.
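The O(N^2) cost of the direct calculation, and the idea of a common interface behind which different solvers could sit, can be sketched as follows. The class and method names are hypothetical and do not correspond to any NAMD or GROMACS API; the Coulomb constant is set to 1 for simplicity.

```python
from abc import ABC, abstractmethod
import math


class ElectrostaticSolver(ABC):
    """Hypothetical common interface; an FMM or MSM solver would subclass it."""

    @abstractmethod
    def potential_energy(self, positions, charges):
        """Return the total electrostatic energy of the point charges."""


class DirectSum(ElectrostaticSolver):
    """Reference solver: explicit double loop over all pairs, O(N^2)."""

    def potential_energy(self, positions, charges):
        energy = 0.0
        n = len(charges)
        for i in range(n):
            for j in range(i + 1, n):       # N(N-1)/2 pair interactions
                r = math.dist(positions[i], positions[j])
                energy += charges[i] * charges[j] / r
        return energy


if __name__ == "__main__":
    solver = DirectSum()
    print(solver.potential_energy([(0, 0, 0), (2, 0, 0)], [1.0, -1.0]))
```

Calling code that depends only on `ElectrostaticSolver` would not need to change if the direct sum were replaced by a faster method.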
Arnold Tharrington
We are developing a long-range electrostatic solver that is performance portable and targets HPC centers with hybrid CPU-GPU architectures. The solver uses the Multilevel Summation Method (MSM), which is a local (nearest-neighbor communication), hierarchical, grid-based algorithm. The current MSM development activities can be grouped into two broad categories. The first is disentangling the MSM algorithm from the underlying HPC hardware architecture. This is primarily accomplished by software abstraction layers between the MSM algorithm and the CPU and GPU compute devices. This design feature helps performance portability by minimizing the code modifications needed for various HPC CPU/GPU architectures and for the ongoing improvements in GPU hardware and the CUDA API. The second activity is the implementation of a CUDA direct kernel for the direct stencil calculation on the grid hierarchies. This direct kernel uses large stencils with minimum dimensions of 13x13x13 and will explore the use of unified memory. Importantly, the direct kernel is not a simple function but a C++ class that is designed by composition and derivation from an abstract base class. This design structure helps attain performance portability and permits rapid implementation of other types of large stencil calculations.
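The design idea of deriving a stencil kernel from an abstract base class can be sketched in Python, although the solver itself is described as C++ and CUDA. Everything here is hypothetical: the class names, the zero-padding at the grid boundary and the softened 1/r weight are placeholders, not the MSM smoothing kernel, and a radius of 1 is used instead of the radius of 6 that a 13x13x13 stencil implies.

```python
from abc import ABC, abstractmethod
from itertools import product


class StencilKernel(ABC):
    """Hypothetical base class: subclasses supply only the stencil weights."""

    def __init__(self, radius):
        self.radius = radius   # stencil spans (2 * radius + 1) points per axis

    @abstractmethod
    def weight(self, dx, dy, dz):
        """Weight applied to the grid value at offset (dx, dy, dz)."""

    def apply(self, grid):
        """Apply the stencil to a cubic n x n x n grid given as nested lists."""
        n = len(grid)
        r = self.radius
        offsets = list(product(range(-r, r + 1), repeat=3))
        out = [[[0.0] * n for _ in range(n)] for _ in range(n)]
        for x, y, z in product(range(n), repeat=3):
            acc = 0.0
            for dx, dy, dz in offsets:
                i, j, k = x + dx, y + dy, z + dz
                # Values outside the grid are treated as zero.
                if 0 <= i < n and 0 <= j < n and 0 <= k < n:
                    acc += self.weight(dx, dy, dz) * grid[i][j][k]
            out[x][y][z] = acc
        return out


class SoftenedCoulombStencil(StencilKernel):
    """Placeholder weights: 1 / sqrt(1 + distance^2) in grid units."""

    def weight(self, dx, dy, dz):
        return 1.0 / (1.0 + dx * dx + dy * dy + dz * dz) ** 0.5


if __name__ == "__main__":
    grid = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
    grid[1][1][1] = 1.0                     # single unit charge in the center
    print(SoftenedCoulombStencil(radius=1).apply(grid)[0][0][0])
```

A different large-stencil calculation then only needs a new subclass with its own `weight`, while the traversal in `apply` is reused.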
Ying Wai Li
OWL is a scientific software package for performing large-scale Monte Carlo simulations for the study of finite-temperature properties of materials. Originally developed to implement a special Monte Carlo method called Wang-Landau sampling (hence its name OWL: Oak-Ridge Wang-Landau), OWL now provides a collection of commonly used parallel, classical Monte Carlo algorithms suitable for running on high performance computers (HPC). OWL is written in C++ with an object-oriented, modular software architecture that disentangles the implementation of the physical systems from the algorithms. This design not only allows for the extension to various modern and parallel Monte Carlo algorithms; more importantly, it provides two modes for the calculation of physical observables of the systems in question: OWL can be run in the standalone mode for user-implemented model Hamiltonians, or in the “driver” mode, which drives an external package as a library for energy calculations. This encourages reuse of community codes, and is particularly useful when the energies are calculated by first-principles methods such as density functional theory. OWL adopts the heterogeneous “MPI+X” programming model. It has an MPI task manager to arrange computer resources for different tasks as well as for the external library. While Monte Carlo algorithms reside on the MPI level and scalability is achieved by employing multiple walkers, energy calculations are parallelized using both the MPI and the “X” (X = OpenMP, CUDA, etc.) levels. As of today, OWL provides interfaces to Quantum Espresso and an ORNL-developed density functional theory code, Locally Self-Consistent Multiple Scattering (LSMS), to perform first-principles based statistical mechanics simulations. OWL is under active development; support for and interfaces to other software packages are on the way. We intend to make OWL available to the community on GitHub, with a website that provides detailed building instructions and documentation.
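For readers unfamiliar with Wang-Landau sampling, the following single-walker Python sketch estimates the logarithm of the density of states, ln g(E), for a toy model. It is not OWL code: the model (free spins whose energy is simply the number of up spins), the parameter values and the function name are assumptions made so that the example stays small.

```python
import math
import random


def wang_landau(n_spins=8, f_final=1e-4, flatness=0.8, check_every=1000,
                max_steps=2_000_000, seed=0):
    """Estimate ln g(E) for E = number of up spins among n_spins free spins."""
    rng = random.Random(seed)
    spins = [0] * n_spins
    energy = 0
    ln_g = [0.0] * (n_spins + 1)     # running estimate of ln g(E)
    hist = [0] * (n_spins + 1)       # visit histogram for the flatness test
    f = 1.0                          # modification factor, in log form
    for step in range(1, max_steps + 1):
        site = int(rng.random() * n_spins)
        new_energy = energy + (1 - 2 * spins[site])
        # Accept the flip with probability min(1, g(E_old) / g(E_new)).
        if math.log(1.0 - rng.random()) < ln_g[energy] - ln_g[new_energy]:
            spins[site] ^= 1
            energy = new_energy
        ln_g[energy] += f
        hist[energy] += 1
        # When the histogram is flat enough, refine f and start a new stage.
        if step % check_every == 0 and min(hist) >= flatness * sum(hist) / len(hist):
            f /= 2.0
            hist = [0] * len(hist)
            if f < f_final:
                break
    return [v - ln_g[0] for v in ln_g]   # normalize so that ln g(0) = 0


if __name__ == "__main__":
    for e, value in enumerate(wang_landau()):
        print(e, round(value, 2), round(math.log(math.comb(8, e)), 2))
```

For this toy model the exact answer is the binomial coefficient, which the script prints next to the estimate. In a driver-mode setting the energy update would instead come from an external energy calculation.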
Markus Eisenbach
LSMS is a first-principles, density functional theory based electronic structure code targeted mainly at materials applications. LSMS calculates the local spin density approximation to the diagonal part of the electron Green’s function. The electron/spin density and energy are easily determined once the Green’s function is known. Linear scaling with system size is achieved in LSMS by using several unique properties of the real-space multiple scattering approach to the Green’s function: 1) the Green’s function is “nearsighted”, so each domain, i.e. atom, requires only information from nearby atoms in order to calculate the local value of the Green’s function; 2) the Green’s function is analytic, so the required integral over electron energy levels can be analytically continued onto a contour in the complex plane, where the imaginary part of the energy further restricts its range; and 3) to generate the local electron/spin density, an atom needs only a small amount of information, namely phase shifts, from those atoms within the range of the Green’s function. The very compact nature of the information that needs to be passed between processors and the high efficiency of the dense linear algebra algorithms employed to calculate the Green’s function are responsible for the superior performance of the LSMS code. In addition to nonrelativistic and scalar relativistic calculations, LSMS allows the solution of the fully relativistic Dirac equation for electron scattering. Thus, all relativistic effects including spin-orbit interactions are accounted for, which allows the calculation of magnetocrystalline anisotropy energies and Dzyaloshinskii-Moriya antisymmetric exchange interactions. The energies for arbitrary noncollinear magnetic spin configurations can be calculated using self-consistently determined Lagrange multipliers that constrain the local magnetic order.
LSMS utilizes multiple levels of parallelism: 1) distributed-memory parallelism via MPI over the atoms in the system; 2) on-node, shared-memory parallelism over atoms as well as over energy points on the integration contour; and 3) GPU acceleration, when available, for the calculation of the multiple scattering matrix. An additional level of parallelism is provided by the capability to perform Wang-Landau Monte Carlo sampling of magnetic and chemical order. This allows the first-principles statistical physics calculation of magnetic and ordering phase transitions. By utilizing multiple Monte Carlo walkers, the LSMS scalability is extended by multiple orders of magnitude. To provide better scalability of the recently developed full-potential version of LSMS, new approaches to solving the Poisson equation are being explored to obtain electrostatic potentials from space-filling charge densities. Efforts are underway to release LSMS under an open source license and to make it available to the wider scientific community.
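The analyticity argument can be made concrete with a toy example. The sketch below is not LSMS code: the single-pole Green’s function, the semicircular contour and the midpoint quadrature are assumptions chosen for illustration. It counts the states between the bottom of the band and the Fermi energy by integrating along a contour in the upper half of the complex plane instead of along the real axis, where the integrand would be singular.

```python
import cmath
import math


def green(z, level=0.2):
    """Toy Green's function with a single pole (one state) at `level`."""
    return 1.0 / (z - level)


def states_below(e_bottom, e_fermi, n_points=200, g=green):
    """Return -(1/pi) Im of the contour integral of g from e_bottom to e_fermi."""
    center = 0.5 * (e_bottom + e_fermi)
    radius = 0.5 * (e_fermi - e_bottom)
    h = math.pi / n_points
    total = 0j
    for m in range(n_points):
        theta = math.pi - (m + 0.5) * h     # midpoint rule, theta from pi to 0
        phase = cmath.exp(1j * theta)
        z = center + radius * phase         # energy point on the semicircle
        dz = 1j * radius * phase * (-h)     # dz = i r exp(i theta) d(theta)
        total += g(z) * dz
    return -total.imag / math.pi


if __name__ == "__main__":
    print(states_below(-1.0, 1.0))
```

Away from the real axis the integrand is smooth, so a modest number of energy points suffices, and each point is an independent evaluation that can be computed in parallel.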
Applications Working Group Monthly Seminars
| Date | Host | Attendance | Speaker | Title |
|---|---|---|---|---|
| Oct 23, 2020 | CSCS, CSC | 20 | Georgios Markomanolis | Towards a Benchmark Methodology |
| | | | Anton Kozhevnikov | HPC libraries for the electronic structure domain |
| Nov 27, 2020 | JFZ | 27 | Andreas Herten | Enabling Applications for one of Europe’s Largest GPU Systems: The JUWELS Booster Early Access Program |
| | | | Jaro Hokkanen | Performance portability on HPC accelerator architectures with modern techniques: The ParFlow blueprint |
| Jan 21, 2021 | AIST | 16 | Peng Chen | High-resolution Image Reconstruction on ABCI supercomputer |
| | | | Hiroki Kanazashi | Optimizing Data Allocations to Optane DCPMM and DRAM for Billion-Scale Graphs |
| Feb 26, 2021 | RIKEN | 19 | Naoki Yoshioka | Simulation of quantum circuits in classical supercomputers |
| | | | Makoto Tsubokura | Viral Droplet/Aerosol Dispersion Simulation on the Supercomputer “Fugaku” and Fight Back against COVID-19 |
| Mar 26, 2021 | ORNL | 20 | Norbert Podhorszki | Co-design of I/O for Exascale data: Enhancing ADIOS for extreme-scale data movement |
| | | | Scott Klasky | MGARD: Hierarchical Compression of Scientific Data |
| Apr 22, 2021 | NCI | 10 | Rui Yang | Jupyter-Dask based python framework |
| | | | Matthew Downton | Experience scaling bioinformatics pipelines at NCI |
| Jun 25, 2021 | U Tokyo | 22 | Yohei Miki | Gravitational octree code performance on NVIDIA A100 |
| | | | Naoyuki Onodera | GPU acceleration of tracer dispersion simulation using the locally mesh-refined lattice Boltzmann method |
| Jul 23, 2021 | ANL | 24 | Ye Luo | Enabling Performance Portable QMCPACK on Exascale Supercomputers |
| | | | JaeHyuk Kwack | Porting GAMESS RI-MP2 mini-app from CPU to GPU with roofline performance analysis |
| Aug 26, 2021 | LLNL | 7 | Sam Jacobs & Brian Van Essen | A Scalable Deep Learning Toolkit for Leadership-class Large Scale Scientific Machine |
| | | | Stephanie Brink & Olga Pearce | Pinpointing Performance Bottlenecks with Hatchet |
| Oct 22, 2021 | CEA | 13 | Luigi Genovese | Broadening the scope of large-scale DFT calculations |
| Dec 3, 2021 | KTH | 22 | Niclas Jansson | Refactoring legacy Fortran applications to leverage modern heterogeneous architectures in extreme-scale CFD |
| | | | Szilárd Páll | Bringing GROMACS to exascale-generation heterogeneous architectures |
| Feb 24, 2022 | Tokyo Tech | | | |
| Mar 25, 2022 | RIKEN | 8 | Takami Tohyama | Applications of time-dependent density-matrix renormalization group to strongly correlated electron systems |
| | | | Kenta Sueki | Ensemble Kalman Filter-Based Parameter Estimation for Atmospheric Models: Parameter Uncertainty for Reliable and Efficient Estimation |
| Apr 22, 2022 | LLNL | 19 | Brian Van Essen | Second-order Optimization with Subgraph Parallelism on LBANN for COVID-19 Small Molecule Drug Design |
| May 26, 2022 | AIST | | Erik Deumens | High-Performance Computing in the World of Data and Artificial Intelligence |
| | | | Ricardo Macedo | Building User-level Storage Data Planes with PAIO |
| Jun 24, 2022 | U Tokyo | 12 | Kohei Fujita | GPU porting of implicit solver with Green’s function-based neural networks |
| | | | Yohei Miki | Porting N-body code to AMD GPU and performance evaluation |
| Jul 22, 2022 | JFZ | 18 | Dennis Willsch | Jülich Universal Quantum Computer Simulator |
| | | | Bartosz Kostrzewa | Enabling state-of-the-art lattice QCD simulations using a legacy code on life support and the QUDA library |
| Sep 1, 2022 | ORNL | 20 | Bronson Messer | An Introduction to the Frontier: Building and Using the World’s First Exascale Computer |
| Sep 23, 2022 | CSCS | | Sebastian Keller | Cornerstone Octree: domain decomposition on GPUs for particle-based simulations |
| Oct 28, 2022 | CSC | 12 | Martti Louhivuori, Cristian-Vasile Achim & Jaro Hokkanen | LUMI GPU porting of three scientific applications |
| Dec 1, 2022 | NCI | | Ben Evans | Data Science and AI/ML activities at ANU |
| | | | Rui Yang | Data science and AI/ML software |
| | | | Maruf Ahmed | FourCastNet, IceNet, and ImageNet |