Queuing Systems & Scheduler – Information Technology Services

Job Scheduler (SLURM):

Slurm (Simple Linux Utility for Resource Management) is an open-source job scheduler that allocates compute resources on clusters to queued, researcher-defined jobs. Slurm is deployed at national and international computing centers and is used by approximately 60% of the TOP500 supercomputers in the world.

You can learn more about SLURM and its commands from the official Slurm website.

Queuing System:

When a job is submitted, it is placed in a queue. Different queues are available for different purposes, and users must select the one from the list below that best fits their computational needs.

Slurm partitions are essentially queues, each pointing to a collection of nodes. On Mario there are five partitions:

test: This partition has 2 compute nodes that have been set aside for post-processing and test jobs. This partition has a time limit of 4 hours.
parallel-short: This partition has 8 compute nodes that have been set aside for running only short parallel jobs. This partition has a time limit of 24 hours (1 day).
serial-short: This partition has 8 compute nodes that have been set aside for running only short serial jobs. This partition has a time limit of 120 hours (5 days).
serial-long: This partition has 16 compute nodes that have been set aside for running only long serial jobs. This partition has a time limit of 720 hours (30 days).
parallel-long: This partition has 30 compute nodes that have been set aside for running only long parallel jobs. This partition has a time limit of 96 hours (4 days).

Queue name	No. of Nodes	Node list	Default walltime (days-hh:mm:ss)	Total No. of Actual CPUs	Total No. of CPUs with Hyper-Threading	About queue
test	2	cn[1-2]	0-04:00:00	64	128	For post-processing and testing; 4 hr time limit
parallel-short	8	cn[3-10]	1-00:00:00	256	512	For parallel short jobs; 24 hrs (1 day) time limit
serial-short	8	cn[11-18]	5-00:00:00	256	512	For serial short jobs; 120 hrs (5 days) time limit
serial-long	16	cn[19-34]	30-00:00:00	512	1024	For serial long jobs; 720 hrs (30 days) time limit
parallel-long	30	cn[35-64]	4-00:00:00	960	1920	For parallel long jobs; 96 hrs (4 days) time limit
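
The walltime column uses Slurm's days-hours:minutes:seconds notation. As a rough sketch of how a request can be checked against a partition before submitting, the following bash helpers convert a walltime string to seconds and compare it with the limits listed in the table. The function names (walltime_to_seconds, partition_limit, fits_partition) are hypothetical and not part of Slurm, and only the full D-HH:MM:SS form is handled.

```shell
#!/usr/bin/env bash
# Convert a Slurm walltime string (D-HH:MM:SS) into whole seconds.
# Only the full D-HH:MM:SS form used in the table is handled here.
walltime_to_seconds() {
    local days hms h m s
    days=${1%%-*}          # text before the first "-"
    hms=${1#*-}            # text after the first "-"
    IFS=: read -r h m s <<< "$hms"
    # 10# forces base 10 so values like "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Time limit per partition, copied from the table above.
partition_limit() {
    case "$1" in
        test)           echo "0-04:00:00" ;;
        parallel-short) echo "1-00:00:00" ;;
        serial-short)   echo "5-00:00:00" ;;
        serial-long)    echo "30-00:00:00" ;;
        parallel-long)  echo "4-00:00:00" ;;
        *)              return 1 ;;   # unknown partition name
    esac
}

# Succeeds when the requested walltime fits within the partition's limit.
fits_partition() {
    local limit
    limit=$(partition_limit "$1") || return 1
    [ "$(walltime_to_seconds "$2")" -le "$(walltime_to_seconds "$limit")" ]
}

fits_partition parallel-short 0-18:00:00 && echo "18 hours fits parallel-short"
fits_partition test 0-06:00:00 || echo "6 hours exceeds the test limit"
```

The limits are hard-coded for illustration; on the cluster itself, sinfo reports the limits actually configured.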

NOTE: devel is the default partition

Useful commands

Slurm Command	Description	Syntax
sbatch	Submit a batch serial or parallel job using a Slurm submit script	sbatch slurm_submit_script.sub
srun	Run a script or application interactively	srun --pty -p test -t 10 --mem 1000 /bin/bash [script or app]
scancel	Kill a job by job ID number	scancel 999999
squeue	View the status of your jobs	squeue -u <username> OR squeue -l
sinfo	View the cluster nodes, partitions and node status information	sinfo OR sinfo -lNe
sacct	View accounting information for a job by job ID number	sacct -j 999999
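
To illustrate what a submit script passed to sbatch might contain, here is a minimal sketch. It writes a small script to slurm_submit_script.sub and submits it only if sbatch is available. The job name, resource requests and output file name are assumptions chosen for illustration; the test partition is taken from the table above.

```shell
#!/usr/bin/env bash
# Write a minimal submit script, then hand it to sbatch if it is available.
cat > slurm_submit_script.sub <<'EOF'
#!/bin/bash
# Hypothetical job name, for illustration only.
#SBATCH --job-name=example_job
# Partition (queue) to run in; "test" is listed in the partition table.
#SBATCH --partition=test
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Requested walltime in D-HH:MM:SS form.
#SBATCH --time=0-00:10:00
# %j in the file name is replaced by the job ID.
#SBATCH --output=example_job.%j.out

echo "Job ${SLURM_JOB_ID:-unknown} running on $(hostname)"
EOF

if command -v sbatch >/dev/null 2>&1; then
    sbatch slurm_submit_script.sub   # prints the assigned job ID
    squeue -u "$USER"                # check the job's state in the queue
else
    echo "sbatch not found: run this on a cluster login node"
fi
```

Once the job is queued, scancel with the reported job ID cancels it, and sacct -j with the same ID shows its accounting record.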

Usage Guidelines

Users must submit jobs only through the scheduler.
Users must not run any job on the master node.
Users must not run a job by logging in directly to a compute node.
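
One way to help follow these guidelines is to have a script refuse to start heavy work unless it is running inside a Slurm job. The sketch below relies on the SLURM_JOB_ID environment variable, which Slurm sets for the jobs it launches. The function names in_slurm_job and run_if_scheduled are hypothetical helpers, not Slurm commands.

```shell
#!/usr/bin/env bash
# Succeeds only when running inside a Slurm job.
# Slurm sets SLURM_JOB_ID in the environment of the jobs it launches.
in_slurm_job() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

# Hypothetical wrapper: run the given command only inside a Slurm job.
run_if_scheduled() {
    if in_slurm_job; then
        "$@"
    else
        echo "Not inside a Slurm job; submit with sbatch or srun instead." >&2
        return 1
    fi
}

# Example: the echo stands in for a real computation.
run_if_scheduled echo "starting computation" || true
```

This is only a convenience check within a user's own scripts; it does not replace the scheduler's own enforcement.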
