
CANCELLED: Process Interrogation on Running Jobs

WILL BE RESCHEDULED IN 2024

Presenter: Doug Roberts, SHARCNET

For a job to run efficiently on a cluster, one of the most fundamental requirements is that a single process runs on a single core. To achieve this, Slurm scripts are used to request resources for a specific number of tasks per server, based on a priori knowledge of how the program is expected to run: either directly, through predefined input file values, or indirectly, through Slurm environment variables that are set at runtime. To make this procedure as reliable as possible, the Alliance wiki provides many Slurm template scripts that researchers can use with minimal changes.

In some cases, however, a program may not behave as expected and may start more (or fewer) processes or threads within a given core reservation on a compute node than intended. As a result, the compute node can become overloaded and its overall performance can rapidly deteriorate, potentially impacting other researchers' jobs running on the node (including entire parallel jobs that have processes on the node) through excessive CPU load, memory bandwidth consumption, or file system calls. Eventually the node(s) may become unstable and unresponsive. If many such jobs are launched simultaneously, the operation of the entire cluster can be adversely affected.

Once aware of such a situation, a system administrator will set out to track down the problematic job(s) and their owner. Depending on the severity of the problem, a decision will then need to be made to either suspend or terminate the jobs immediately, or to contact the researcher and request that they fix the problem themselves in due course. In most cases the problem can be quickly resolved by correcting a parameter setting in the Slurm script. In other cases, however, the source of the problem may not be obvious, making it impossible to correct the script without further investigation. These scenarios can occur regardless of whether a researcher has written custom code, downloaded third-party code, or is running commercial code.
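As a sketch of the kind of template the paragraph describes, a minimal Slurm script might tie the program's thread count to the reserved cores so it cannot oversubscribe its allocation. The `--ntasks` and `--cpus-per-task` options and the `SLURM_CPUS_PER_TASK` variable are standard Slurm; the program name and resource values here are hypothetical placeholders:

```shell
#!/bin/bash
#SBATCH --ntasks=4            # number of MPI processes (tasks)
#SBATCH --cpus-per-task=2     # cores reserved per task, e.g. for OpenMP threads
#SBATCH --mem-per-cpu=2G      # memory per reserved core
#SBATCH --time=01:00:00

# Constrain the thread count to the cores actually reserved, so the
# program cannot start more threads than its core reservation allows.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun ./my_hybrid_program      # hypothetical hybrid MPI/OpenMP executable
```

If the program instead reads its parallelism from an input file, the value there must agree with the resources requested above, or the mismatch described in the paragraph can occur.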
The purpose of this presentation is to provide researchers with some basic strategies and tools for interrogating running programs, with the goal of understanding the running process and thread structure so that problems can be identified and fixed.
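As one illustration of the kind of interrogation the talk covers, standard Linux `ps` options can reveal a job's process and thread structure on a compute node. These are portable `procps` output specifiers (NLWP is the thread count, PSR the processor a thread last ran on), not tools specific to this presentation; the current shell `$$` stands in for a real job PID:

```shell
# Thread count (NLWP) and CPU usage for each of your processes;
# a count far above --cpus-per-task suggests oversubscription.
ps -u "$USER" -o pid,nlwp,pcpu,comm

# Individual threads (TID) of one process and the processor (PSR)
# each thread last ran on; here the current shell as a stand-in PID.
ps -L -o pid,tid,psr,pcpu,comm -p $$
```

Tools such as `top -H` or `htop` give the same per-thread view interactively.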
