Will my job run twice as fast if I use the Haswell nodes?
Haswell is the code name for the latest generation of Intel Xeon chips. Peregrine has 1152 new nodes, each with two 12-core Haswell processors. The peak performance of this processor is twice that of the Ivy Bridge Xeon processors. Will your code run twice as fast?
Haswell adds a new instruction set extension, called AVX2, along with FMA (fused multiply-add) support, which computes a*b + c in one step rather than first computing a*b and then adding c. This doubles the peak number of floating-point operations that can be done on each clock cycle relative to the older Sandy Bridge and Ivy Bridge Xeon chips. Integer vector operations have been extended from 128 bits to 256 bits, so twice as much integer work (for example, calculating absolute values of integers in an array) can be done per cycle. AVX2 also adds “gather” support, so that vector elements can be loaded from non-contiguous memory locations.
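The doubling of peak floating-point throughput can be checked with simple arithmetic. The sketch below is illustrative only: it assumes 256-bit vectors holding 64-bit doubles and two vector execution units per core on both chip generations (an assumption not stated above), and the function name is hypothetical.

```python
# Rough peak-throughput arithmetic; all constants are stated assumptions.
VECTOR_BITS = 256      # width of one AVX/AVX2 vector
DOUBLE_BITS = 64       # size of one double-precision value
UNITS_PER_CORE = 2     # assumed number of vector execution units per core


def peak_flops_per_cycle(fma: bool) -> int:
    """Peak double-precision operations per core per clock cycle."""
    lanes = VECTOR_BITS // DOUBLE_BITS    # doubles processed per vector
    flops_per_lane = 2 if fma else 1      # FMA counts as multiply + add
    return UNITS_PER_CORE * lanes * flops_per_lane


without_fma = peak_flops_per_cycle(fma=False)
with_fma = peak_flops_per_cycle(fma=True)
print(f"without FMA: {without_fma} flops/cycle")
print(f"with FMA:    {with_fma} flops/cycle")
print(f"ratio:       {with_fma / without_fma:.1f}x")
```

This is a peak figure only. A code reaches it only when every cycle issues full-width FMA instructions, which real applications rarely do.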
-> If your code is well vectorized, or spends most of its time in well-vectorized math library functions that can use the new AVX2 instructions, you may see substantially increased performance on Haswell relative to Ivy Bridge. Linear algebra operations, such as dot products or matrix-matrix multiplication, can often use FMA instructions. Taking advantage of the new instructions requires an application binary or math library that has been built to use them.
-> If your code is not vectorized, you won’t see any performance increase due to the new vector instruction set.
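Before rebuilding an application, it can help to confirm that the node you are on actually advertises the new instructions. The following is a minimal sketch, assuming a Linux node where /proc/cpuinfo lists the `avx2` and `fma` feature flags; the function names are hypothetical.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def supports_avx2_and_fma(cpuinfo_text):
    """True if both the avx2 and fma flags are present."""
    return {"avx2", "fma"} <= cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not on Linux, or file unreadable
    print("AVX2 and FMA available:", supports_avx2_and_fma(text))
```

A positive result only means the hardware supports the instructions. The binary or math library must still be built to use them.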
Whether or not your code is vectorized, you are likely to see a performance increase from other improvements Intel made in the chip. Haswell executes about 10% more instructions per clock cycle due to improvements in branch prediction, larger and deeper buffers, higher-bandwidth L1 and L2 caches, and higher memory bandwidth. No changes to the application binary are needed to benefit from these improvements.
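To get a feel for how these effects combine, the points above can be turned into a rough Amdahl-style estimate. This is a sketch under stated assumptions, not a prediction: it assumes the vectorized fraction of run time gets the full 2x peak gain, that the roughly 10% instructions-per-cycle gain applies to the whole run, and that the two gains multiply. The function name and the sample fractions are hypothetical.

```python
def estimated_speedup(vector_fraction, vector_gain=2.0, ipc_gain=1.10):
    """Optimistic Haswell vs. Ivy Bridge speedup estimate.

    vector_fraction: share of run time spent in code that can use AVX2/FMA
    vector_gain:     assumed speedup of that share (2.0 is the peak figure)
    ipc_gain:        assumed general gain from the ~10% IPC improvement
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    # Remaining run time relative to the original, before the IPC gain.
    relative_time = (1.0 - vector_fraction) + vector_fraction / vector_gain
    return ipc_gain / relative_time


for fraction in (0.0, 0.25, 0.5, 0.9):
    print(f"{fraction:.0%} vectorized: ~{estimated_speedup(fraction):.2f}x")
```

Under these assumptions, a code approaches a 2x speedup only when nearly all of its run time is in well-vectorized code. Treat the numbers as an upper bound, since they ignore limits such as memory bandwidth.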
A new feature, called “uncore frequency scaling”, allows chip components that are outside the processor cores to scale their frequency up or down depending on the nature of the code being executed. These components include the L3 cache and the on-chip network that connects the cores to each other and to the memory controllers. As a result, applications that are bound by memory and cache latency rather than by available arithmetic units can drive the “uncore” faster (without increasing the core frequency), which leads to better performance.
An NREL technical report with additional information about Haswell is available at http://www.nrel.gov/docs/fy16osti/64268.pdf.