How to Run LAMMPS on Phi Nodes
LAMMPS is a classical molecular dynamics code with potentials for solid-state materials, soft matter, and coarse-grained systems. LAMMPS can be run on a single processor or in parallel using MPI. For more information about LAMMPS, please refer to its official webpage. In addition to parallelization via MPI, LAMMPS can accelerate simulations by offloading neighbor-list builds and non-bonded force calculations to a Phi card. This document provides instructions for running LAMMPS on Peregrine using Phi nodes. For information on how to request a Phi node, please refer to this page.
Change the LAMMPS input
The following four lines should be added to the beginning of the LAMMPS input file:
package intel $p mode mixed balance $b
package omp 0
suffix hybrid intel omp
processors * * * grid numa
In the first line, "package intel" invokes the USER-INTEL package of LAMMPS to offload work to the Phi cards. The number of Phi cards used per node is $p, which can be set to 0, 1, or 2; setting it to 0 turns off offloading. The offloading "mode" here is "mixed" precision, the default: forces between a pair of atoms are computed in single precision but accumulated and stored in double precision, including the storage of forces, torques, energies, and virial quantities. The mode can also be set to "single" or "double" precision. The "balance" keyword sets the fraction of work offloaded to the Phi coprocessor ($b), which can be set between 0.0 and 1.0. The optimal $b can be determined by running multiple short test runs. Alternatively, setting $b to -1 lets LAMMPS adjust the offloaded fraction dynamically and automatically throughout the run; with $b = -1 a job typically runs 5-10% slower than with the optimal fixed offloading fraction. The $p and $b variables can be defined when running LAMMPS, as shown in the following example.
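If you want the input file to also run when $p and $b are not passed on the command line, you can give them defaults with "variable ... index" commands placed before the "package intel" line; LAMMPS skips an index-style variable definition in the input script when that variable was already set via the "-v" command-line switch. A minimal sketch (the default values 1 and 0.5 are illustrative, not recommendations):
variable p index 1     # default: 1 Phi card per node (illustrative value)
variable b index 0.5   # default: offload half of the pair work (illustrative value)
These lines are ignored whenever "-v p" and "-v b" are supplied on the command line, as in the sample script below.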
Sample PBS script
#!/bin/bash
#PBS -l walltime=2:00:00 # WALLTIME
#PBS -l nodes=2:ppn=16 # Number of nodes and processes per node
#PBS -N lmpPhi_test
#PBS -o std.out
#PBS -e std.err
#PBS -A [Your Project Account]
#PBS -q phi
module use /nopt/nrel/apps/modules/candidate/modulefiles
module load impi-intel/4.1.3-14.0.2 mkl/14.2.144 lammps/10Feb2015-phi
mpirun -np 32 lmp_intel_phi -in lmp.in -l lmp.out -v p 2 -v b -1
# Print out the time used in the most recent run
grep Loop lmp.out | tail -1 > time.log
In the "mpirun" line, the value after "-v p" sets the $p variable (which is 2 in this example) and the value after "-v b" sets the $b (which is -1 in this example). This sample PBS script requests 2 Phi nodes ("nodes=2") and sets LAMMPS to use MPI with 32 host cores (16 on each node, "ppn=16") with the offloading to the 4 Phi cards (2 Phi cards per node by using "-v p 2"), and using the automatically offloading ("-v b -1"). Please also note the executable "lmp_intel_phi" is used instead of "lmp" in the typical LAMMPS runs.
Sample LAMMPS Run
You can download files for a sample Phi run from /nopt/nrel/apps/lammps/examples/phi-test on Peregrine. The only change you need to make is to replace "hpc-apps" on line 7 of the PBS script with your project handle. The computational time reported in the "time.log" file should be ~14 seconds. The performance of this job using different numbers of nodes and Phi cards is reported here.
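A possible sequence for staging and submitting the sample is sketched below (the script file name "lmp.pbs" is illustrative; use the name of the PBS script you find in the example directory):
cp -r /nopt/nrel/apps/lammps/examples/phi-test $HOME
cd $HOME/phi-test
# Edit the "#PBS -A" line to your project handle, then submit:
qsub lmp.pbs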
Run LAMMPS on non-Phi nodes
The Intel package will speed up your simulation even without any Phi offloading. You can use the "lmp_intel_phi" binary on non-Phi nodes (such as the Ivy Bridge nodes) by setting $b = 0. Please read this page for more information.
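For example, a run on a non-Phi node might look like the following sketch, which assumes the same input file as above and a 24-core node ("-np 24" is an assumption about the node, not a requirement). Setting both $p and $b to 0 ensures no offloading is attempted, since $p = 0 turns offloading off entirely:
mpirun -np 24 lmp_intel_phi -in lmp.in -l lmp.out -v p 0 -v b 0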