2016 Peregrine User Satisfaction Survey Results
The results of the 2016 Peregrine User Satisfaction Survey are now available. To maintain anonymity, we have removed all demographic information and anything that might identify respondents. Thank you to everyone who responded.
As a result of the survey, we will address several of the concerns in our FY17 planning activities. The major themes that came up, and our plans to address them, include the following:
Running more jobs that use more cores:
We received feedback that people would like to be able to run more jobs that use more than 10,000 cores. Unfortunately, the topology of the network that connects nodes on Peregrine limits the number of nodes per job. The system is composed of collections of nodes that are connected to each other in a non-blocking fat-tree topology. Each of these groups of nodes is called a “scalable unit” or SU. Because the network that connects the SUs to each other and to the file systems has fewer connections than the network within an SU, communication between nodes in different scalable units is relatively slow. As a result, we restrict each job to nodes that are all in the same scalable unit.
The initial Peregrine system was composed of scalable units that contain 288 nodes each. To allow jobs that use more than 288 nodes, when we expanded Peregrine in 2015 we acquired two scalable units that each have 576 nodes. The “batch-h” queue contains one full SU, which allows jobs to use up to 576 nodes (which corresponds to 13,824 cores). The nodes in the other SU are split among the batch-h queue (306 nodes), the large queue (262 nodes), and the Haswell queue (8 nodes). We could change how we distribute these nodes among queues to allow two simultaneous very large jobs, but on Peregrine the maximum number of cores that can be used by a single job remains 13,824. We expect that the system that replaces Peregrine will be able to run single jobs at larger scale.
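The node and core figures above can be checked with a short calculation. The following Python sketch is illustrative only: the per-node core count is derived from the quoted figures (13,824 cores across 576 nodes) rather than stated in this report, and the function names are hypothetical. It assumes a job requests whole nodes of the expansion type.

```python
import math

# Figures quoted in the text above.
ORIGINAL_SU_NODES = 288
EXPANSION_SU_NODES = 576
MAX_JOB_CORES = 13_824
SECOND_SU_SPLIT = {"batch-h": 306, "large": 262, "haswell": 8}

# Derived (an assumption, not stated in the report): cores per expansion node.
CORES_PER_NODE = MAX_JOB_CORES // EXPANSION_SU_NODES


def nodes_needed(cores: int, cores_per_node: int = CORES_PER_NODE) -> int:
    """Whole nodes required to supply the requested number of cores."""
    return math.ceil(cores / cores_per_node)


def fits_in_su(cores: int, su_nodes: int) -> bool:
    """True if the job can be placed entirely inside one scalable unit."""
    return nodes_needed(cores) <= su_nodes


if __name__ == "__main__":
    print("cores per node:", CORES_PER_NODE)
    print("second SU total nodes:", sum(SECOND_SU_SPLIT.values()))
    print("nodes for 10,000 cores:", nodes_needed(10_000))
    print("fits in 288-node SU:", fits_in_su(10_000, ORIGINAL_SU_NODES))
    print("fits in 576-node SU:", fits_in_su(10_000, EXPANSION_SU_NODES))
```

Under these assumptions, a 10,000-core job needs more nodes than a 288-node SU provides, so it can only run in a queue that spans a full 576-node SU.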
New OS Image:
We plan to invest resources in creating an updated OS image for Peregrine during FY17. This represents a fairly large investment of resources to build, configure, and test the OS, interconnect software, toolchains, and all compiled software. We expect this OS image to last the remaining life of Peregrine, which will be phased out starting in 2019 (original hardware) and 2020 (expansion hardware).
The Peregrine software environment is necessarily complex because it must support a wide, changing, and growing diversity of disciplines, packages, workflows, and users. In addition, the connections between system-level software and runtime libraries create a tension between two competing requirements: keeping the system stable and making it simple to enable ever-advancing application software. Our approach has been to ensure system stability by keeping system-level software and default toolchains constant, while making updated toolchains, libraries, and applications available as users need them.
As individual and project needs arise, we are able to work with you to enable your workflows, either through monthly office hours or by special request. An e-mail to firstname.lastname@example.org is often all that’s needed to discover or acquire the tools you need. We will also be working to update the system software and associated default login environment as user needs require and our capacity permits.
To view the full report, please go here: http://cs.hpc.nrel.gov/info/hpc-oversight/hpc-survey-results/2016-nrel-historical-detailed-summary-report-2016.pdf/view