Job Scheduler (Slurm):
Slurm (Simple Linux Utility for Resource Management) is an open-source job scheduler that allocates compute resources on clusters to queued, researcher-defined jobs. Slurm is deployed at many national and international computing centers and is used by approximately 60% of the TOP500 supercomputers in the world.
You can learn more about Slurm and its commands from the official Slurm website.
Queuing System:
When a job is submitted, it is placed in a queue. Different queues are available for different purposes. Select the one queue from those listed below that best fits your computation.
Slurm partitions are essentially different queues that point to collections of nodes. On Mario there are five partitions:
- post-proc: This partition has two compute nodes that have been set aside for post-processing and test jobs. This partition has a maximum time limit of 1 hour.
- parallel-short: This partition has 11 compute nodes that have been set aside for running only short parallel jobs. This partition has a maximum time limit of 12 hours.
- serial-short: This partition has 8 compute nodes that have been set aside for running only short serial jobs. This partition has a maximum time limit of 240 hours.
- serial-long: This partition has 7 compute nodes that have been set aside for running only long serial jobs. This partition has no time limit.
- parallel-long: This partition has 36 compute nodes that have been set aside for running only long parallel jobs. This partition has a maximum time limit of 96 hours.
Queue name | No. of Nodes | Node list | Default walltime (days-hh:mm:ss) | Total No. of Physical CPUs | Total No. of CPUs with Hyper-Threading | About queue |
---|---|---|---|---|---|---|
post-proc | 2 | cn[1-2] | 01:00:00 | 64 | 128 | 1 hr time limit, for post-processing, exclusively for parallel jobs |
parallel-short | 11 | cn[3-13] | 12:00:00 | 352 | 704 | 12 hrs time limit, exclusively for parallel jobs |
serial-short | 8 | cn[14-21] | 10-00:00:00 | 256 | 512 | 240 hrs time limit, exclusively for serial jobs |
serial-long | 7 | cn[22-28] | no-limit | 224 | 448 | with no time limit, exclusively for serial jobs |
parallel-long | 36 | cn[29-64] | 4-00:00:00 | 1152 | 2304 | 96 hrs time limit, exclusively for parallel jobs |
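The limits in the table above can be turned into a small selection helper. The following is only a sketch: the function name `choose_partition` and the job-type labels `post`, `parallel`, and `serial` are hypothetical and not part of Slurm or of the Mario configuration. It assumes the walltime estimate is given in whole hours.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a job type and a walltime estimate (whole hours)
# to one of the partition names listed in the table above.
choose_partition() {
    local kind="$1"    # "post", "parallel", or "serial" (labels assumed here)
    local hours="$2"   # estimated walltime in whole hours

    case "$kind" in
        post)
            if [ "$hours" -le 1 ]; then
                echo "post-proc"
            else
                echo "post-proc jobs are limited to 1 hour" >&2
                return 1
            fi
            ;;
        parallel)
            if [ "$hours" -le 12 ]; then
                echo "parallel-short"
            elif [ "$hours" -le 96 ]; then
                echo "parallel-long"
            else
                echo "no parallel partition allows more than 96 hours" >&2
                return 1
            fi
            ;;
        serial)
            if [ "$hours" -le 240 ]; then
                echo "serial-short"
            else
                echo "serial-long"
            fi
            ;;
        *)
            echo "unknown job type: $kind" >&2
            return 1
            ;;
    esac
}

# Example: a parallel job expected to run for about 20 hours.
choose_partition parallel 20
```

The helper only encodes the time limits; it does not check whether the chosen partition currently has free nodes, which `sinfo` reports.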
Useful commands
Slurm Command | Description | Syntax |
---|---|---|
sbatch | Submit a batch serial or parallel job using a Slurm submit script | sbatch slurm_submit_script.sub |
srun | Run a script or application interactively | srun --pty -p post-proc -t 10 --mem 1000 /bin/bash [script or app] |
scancel | Kill a job by job id number | scancel 999999 |
squeue | View status of your jobs | squeue -u $USER |
sinfo | View the cluster nodes, partitions and node status information | sinfo OR sinfo -lNe |
sacct | View accounting information for a job by job id number | sacct -j 999999 |
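As an illustration of how these commands fit together, the sketch below writes a minimal submit script and submits it when Slurm is available. The file name matches the `sbatch` example in the table; the job name, task count, requested time, and the `srun hostname` payload are placeholders chosen for this example, not site defaults.

```shell
#!/usr/bin/env bash
# Write a minimal submit script. All values below are illustrative
# placeholders; adjust them to the job at hand.
cat > slurm_submit_script.sub <<'EOF'
#!/bin/bash
# Placeholder job name.
#SBATCH --job-name=example_job
# One of the partitions listed above.
#SBATCH --partition=parallel-short
# Assumed resource request: 4 tasks on 1 node.
#SBATCH --nodes=1
#SBATCH --ntasks=4
# Requested walltime; must fit within the 12-hour partition limit.
#SBATCH --time=02:00:00
# %j expands to the job ID.
#SBATCH --output=example_job_%j.out

# Replace with the real application.
srun hostname
EOF

# Submit only where Slurm is installed, so the sketch is harmless elsewhere.
if command -v sbatch >/dev/null 2>&1; then
    sbatch slurm_submit_script.sub    # prints "Submitted batch job <id>"
    squeue -u "$USER"                 # list your pending and running jobs
else
    echo "sbatch not found; wrote slurm_submit_script.sub only"
fi
```

Once the job ID is known, `scancel <id>` cancels the job and `sacct -j <id>` reports its accounting record.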
Usage Guidelines
- Users must submit jobs only through the scheduler.
- Users must not run any job on the master node.
- Users must not run a job by logging in directly to a compute node.
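One way to follow these guidelines in practice is to have a job script check that it was started by the scheduler. The sketch below relies on the `SLURM_JOB_ID` environment variable, which Slurm sets inside the jobs it launches; the function name `inside_slurm_job` is hypothetical, and the check is a convenience for users, not an enforcement mechanism.

```shell
#!/usr/bin/env bash
# Slurm sets SLURM_JOB_ID inside the jobs it launches, so an empty value
# suggests the script was started directly on a login or compute node.
inside_slurm_job() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if inside_slurm_job; then
    echo "Running as Slurm job ${SLURM_JOB_ID}"
    # ... start the real computation here ...
else
    echo "Not inside a Slurm job; submit this script with sbatch instead." >&2
fi
```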