Scalable Computing on Peregrine
A module tree containing basic toolchains as well as low-level libraries useful for scalable computing is located on Peregrine at /nopt/nrel/apps/modules/hpc/modulefiles and contains the following modules, described below:
---------------------------------------- /nopt/nrel/apps/modules/hpc/modulefiles ----------------------------------------
adios/1.3.1 mvapich-gcc/2.0a-4.8.1-dbg mvapich-intel/2.0a-13.1.1-shared
adios/1.5.0(default) mvapich-gcc/2.0a-4.8.1-shared mxml/2.7(default)
mvapich-gcc/2.0a-4.8.1(default) mvapich-intel/2.0a-13.1.1(default) petsc/3.4.3(default)
For building high-performance scalable applications, my recommendation is to use the mvapich-intel/2.0a-13.1.1 toolchain if static linking is sufficient, or the mvapich-intel/2.0a-13.1.1-shared toolchain if shared object support is necessary.
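As a quick sketch of using the recommended toolchain (assuming the module tree above is not already on your MODULEPATH, and that hello.c and the rank count are illustrative):

    # Make the HPC module tree visible, then load the recommended toolchain
    module use /nopt/nrel/apps/modules/hpc/modulefiles
    module load mvapich-intel/2.0a-13.1.1

    # Compile a simple MPI program with the toolchain's compiler wrapper
    mpicc -o hello hello.c

    # Launch on 16 ranks with the default (hydra) launcher
    mpirun -np 16 ./hello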
The source distributions, builds, and installations of these modules are located in the regular tree (e.g., /nopt/nrel/apps/mvapich2). The toolchains have been built to enable thread safety, which mvapich2 supports, and to leverage the InfiniBand interconnect. The configure options for the mvapich2 builds are:
./configure --prefix=$destdir --with-hwloc --enable-threads=runtime --enable-rdma-cm --enable-nemesis-shm-collectives
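To confirm how an installed build was configured, the mpichversion utility that ships with MPICH-derived MPIs (including mvapich2) prints the configure options; this sketch assumes it is on your PATH after loading one of the modules above:

    module load mvapich-intel/2.0a-13.1.1

    # Print the library version and the configure options it was built with
    mpichversion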
Both the gcc and intel builds have had their performance verified with IMB (the Intel MPI Benchmarks) at 4096 cores. The PingPong latency test performance is:
| PingPong Message Size | 0B |
The bandwidth, as measured by the SendRecv test for various process group sizes, is shown in the table below. The test was run over a range of message sizes (0-4 MB), and the message size corresponding to the highest bandwidth achieved is recorded:
| Process group size | Peak bandwidth (MB/sec) | Message size (MB) | Peak bandwidth (MB/sec) | Message size (MB) |
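The numbers above were gathered with IMB. As a rough sketch (the rank count and the path to the benchmark binary are illustrative), the PingPong and Sendrecv benchmarks can be selected by name on the IMB-MPI1 command line:

    # Run only the PingPong and Sendrecv benchmarks from the IMB-MPI1 suite
    mpirun -np 4096 ./IMB-MPI1 PingPong Sendrecv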
As an additional test of the scalability of the toolchains and system, and of the ability to run hybrid MPI+OpenMP codes, the miniMG benchmark developed by LBNL was used to verify performance on up to 500 nodes using the mvapich2-intel toolchain with 12 threads per MPI rank. The miniMG benchmark is a multithreaded (MPI+OpenMP) solver that implements a geometric multigrid solve in parallel, and as such it exercises on-node performance (flops as well as memory bandwidth) as well as communication between nodes. For comparison, the time spent in the core multigrid solve is approximately 0.5-0.8 s on Edison (a NERSC resource with similar hardware) across a similar range of problem sizes, and 0.4-1.0 s on Titan.
| Number of MPI ranks | MGSolve Time (s) |
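As a sketch of how such a hybrid run might be launched (the binary name, rank count, and node geometry are illustrative; MV2_ENABLE_AFFINITY is an mvapich2 setting that may need to be relaxed so that OpenMP threads are not pinned to a single core):

    module load mvapich-intel/2.0a-13.1.1

    # 12 OpenMP threads per MPI rank, matching the configuration described above
    export OMP_NUM_THREADS=12

    # Let the OpenMP runtime spread threads instead of mvapich2 pinning each rank to one core
    export MV2_ENABLE_AFFINITY=0

    # Illustrative geometry: two ranks per node across 500 nodes
    mpirun -np 1000 ./miniMG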
In my experience it is valuable for application users and developers to be able to run lightweight interactive debugging sessions. The modern job launcher for mvapich2 (mpirun/mpiexec using the hydra process manager) does not yet include a useful capability where jobs can be run under gdb in parallel with the output consolidated to the login shell. To restore this capability, a 'special' debug build of mvapich2 is available, exposed in the module mvapich-gcc/2.0a-4.8.1-dbg, where the 'old' process manager is used. Jobs launched using the '-gdb' flag will be run under the control of gdb.
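A minimal interactive debugging session, assuming the application (myapp.c here is illustrative) is rebuilt against the debug toolchain and a small rank count is used, might look like:

    # Load the debug build that retains the older process manager
    module load mvapich-gcc/2.0a-4.8.1-dbg

    # Rebuild the application with debug symbols against this toolchain
    mpicc -g -O0 -o myapp myapp.c

    # Launch 4 ranks under gdb; output is consolidated back to the login shell
    mpirun -gdb -np 4 ./myapp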
The module adios/1.5.0 exposes a build of adios, described in detail at https://www.olcf.ornl.gov/center-projects/adios/. I/O can often consume a significant portion of simulation time at large scales, especially when the amount of data written per process is relatively small. ADIOS is I/O middleware that has a lightweight interface and a modular structure where various transformations can be performed on the data before it is eventually written out using a variety of legacy backends (HDF5, netCDF) or ADIOS’s own ‘bp’ format. A key capability provided by ADIOS is that the in-transit transformations can aggregate data from multiple MPI ranks before it is written out.
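As a sketch of pulling ADIOS into a build (assuming the adios_config helper that ships with ADIOS 1.x is on the PATH after loading the module, and that mysim.c is an illustrative source file):

    module load adios/1.5.0

    # adios_config reports the compile (-c) and link (-l) flags for the ADIOS write API
    mpicc $(adios_config -c) -o mysim mysim.c $(adios_config -l)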