LAMMPS Benchmarking on Phi Nodes
For information on how to run LAMMPS on Phi nodes, please refer to this page.
The LAMMPS files used in this research can be downloaded on Peregrine from this directory:
The PBS script used is the following:
#PBS -l walltime=2:00:00 # WALLTIME
#PBS -l nodes=2:ppn=16 # Number of nodes and processes per node
#PBS -N lmpPhi_test
#PBS -o std.out
#PBS -e std.err
#PBS -A hpc-apps
#PBS -q phi
module use /nopt/nrel/apps/modules/candidate/modulefiles
module load impi-intel/4.1.3-14.0.2 mkl/14.2.144 lammps/10Feb2015-phi
#For debugging: record which nodes the job is using
cat $PBS_NODEFILE > node.log
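# Run on 32 MPI ranks (2 nodes x 16 ranks per node); -v passes variables to
# lmp.in (here presumably p = Phi cards per node and b = offload balance,
# with -1 requesting dynamic balancing; their exact meaning is defined by lmp.in)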
mpirun -np 32 lmp_intel_phi -in lmp.in -l lmp.out -v p 2 -v b -1
#Record the loop time from the most recent run
grep Loop lmp.out | tail -1 > time.log
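For reference, the line captured by grep has the standard LAMMPS form shown below (the numbers are illustrative, not measured results):

Loop time of 123.4 on 32 procs for 500 steps with 32000 atoms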
All LAMMPS runs here used automatic offloading. For short experimental runs that measure offload performance, it is recommended to add a short warm-up run before the production run, so that start-up penalties are not included in the run time reported by LAMMPS. For example:
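Assuming the input simply issues two consecutive run commands, the relevant portion of lmp.in would look like this (a minimal sketch; the actual input file is not reproduced here):

run 1000    # warm-up run to absorb offload start-up costs
run 500     # production run used for the performance measurement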
The first 1000-step run is for warm-up only; performance is measured using the time LAMMPS reports for the 500-step run.
We performed LAMMPS runs with the following configurations:
| Job number | Node type | # of nodes | Total Phi cards used | Intel package used | # of MPI tasks | # of OpenMP threads per MPI task |
Please note that for multi-node runs we use 2 OpenMP threads per MPI task to achieve better performance. The results are shown in the graph below. Performance is reported as the number of nanoseconds of simulated time completed per day (ns/day) for this MD simulation.
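As a sanity check on the plotted numbers, ns/day can be recovered from the recorded loop time with the standard conversion (assuming the timestep dt is given in femtoseconds):

ns/day = (steps x dt x 1e-6) / (loop time in seconds) x 86400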
Even without Phi offloading, using the Intel package alone yields a speedup of as much as 1.7x.
On one Phi node, offloading to two Phi cards reduces the computation time by a factor of 2.4.