Troubleshooting Peregrine job failures: Part 1
If your Peregrine jobs are failing and you would like help troubleshooting them, gather as much information as you can about the job, particularly the job ID number. A common cause of job failure is a node running out of memory, at which point the Linux out-of-memory (OOM) killer starts killing processes. Processes are not killed in a predictable order, and the nodes are configured to reboot on a kernel panic or an out-of-memory condition. A node may leave clues in the system logs about why it rebooted. The exit code recorded in the job's output may also help; however, exit codes do not always accurately reflect the cause of failure.
The following references describe the standard exit codes we have seen on Peregrine:
The Linux Documentation Project
Adaptive Computing (Moab Exit Codes)
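As a rough illustration of how a shell-style exit code can be read, the sketch below applies the common convention that a process killed by signal n is reported as 128 + n. Under that convention, 137 corresponds to SIGKILL, which is the signal the Linux out-of-memory killer sends. The function name describe_exit_code is hypothetical, and the scheduler may report codes of its own that do not follow this convention, so treat the result as a hint rather than a diagnosis.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Give a best-guess reading of a shell-style exit code.

    Assumes the shell convention that 128 + n means "killed by signal n".
    Scheduler-specific codes are not covered by this sketch.
    """
    if code == 0:
        return "success"
    if 128 < code < 256:
        signum = code - 128
        try:
            # Map the signal number to its symbolic name, e.g. 9 -> SIGKILL.
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    # A few codes that commonly show up in failed-job output.
    for code in (0, 1, 137, 139):
        print(code, describe_exit_code(code))
```

A SIGKILL reading is consistent with an out-of-memory kill but does not prove it, since other things (including the scheduler enforcing a limit) can also send SIGKILL.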
For further assistance, open a request at email@example.com and include the job ID number if possible. If you would like help configuring your jobs to gather information, such as placing debug statements in the job script or recording the job name and its corresponding job ID number, include your job scripts with the request.
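One possible way to record the job name and job ID alongside debug messages is sketched below. It assumes a Torque-style scheduler that exports the PBS_JOBID and PBS_JOBNAME environment variables to the job; if your jobs run under a different setup, the variable names will differ. The helper names job_context and debug are hypothetical.

```python
import os
import socket
import sys
from datetime import datetime


def job_context(env=None) -> dict:
    """Collect identifying details about the running job.

    Assumes Torque-style variables (PBS_JOBID, PBS_JOBNAME); falls back
    to "unknown" when they are not set, e.g. when run outside a job.
    """
    env = os.environ if env is None else env
    return {
        "job_id": env.get("PBS_JOBID", "unknown"),
        "job_name": env.get("PBS_JOBNAME", "unknown"),
        "host": socket.gethostname(),
        "time": datetime.now().isoformat(timespec="seconds"),
    }


def debug(message: str, env=None, stream=sys.stderr) -> None:
    """Write one timestamped debug line tagged with the job ID and name."""
    ctx = job_context(env)
    print(
        f"[{ctx['time']}] job={ctx['job_id']} name={ctx['job_name']} "
        f"host={ctx['host']} {message}",
        file=stream,
        flush=True,  # flush so the line survives if the node goes down
    )


if __name__ == "__main__":
    debug("starting main computation")
```

Flushing each line immediately matters here: if a node reboots after running out of memory, buffered output that was never written is lost, and the last debug line that did reach the output file shows roughly how far the job got.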