Job Scheduler (SLURM):
Slurm (Simple Linux Utility for Resource Management) is an open-source job scheduler that allocates compute resources on clusters for queued, researcher-defined jobs. Slurm has been deployed at various national and international computing centers, and is used by approximately 60% of the TOP500 supercomputers in the world.
You can learn more about SLURM and its commands from the official Slurm website.
Queuing System:
When a job is submitted, it is placed in a queue. Different queues are available for different purposes, and users must select the queue listed below that is appropriate for their computational needs.
Slurm partitions are essentially different queues that point to collections of nodes. On Mario there are four partitions (a sample submission script is shown after the table below):
- devel: this partition has one compute node that has been set aside for testing jobs before they are submitted to the main partitions (i.e. short, medium, or long), essentially to make sure the submission scripts work. This partition has a maximum time limit of 30 minutes.
- short: this partition has 21 compute nodes that have been set aside for running smaller jobs. This queue/partition has a maximum time limit of 36 hours.
- medium: this partition has 26 compute nodes that have been set aside for running medium-length jobs. This queue/partition has a maximum time limit of 7 days.
- long: this partition has 48 compute nodes that have been set aside for running longer jobs. This partition has no time limit.
Queue name | No. of Nodes | Node list | Default walltime (day-hrs:min) | Total No. of Actual CPUs | Total No. of CPUs with Hyper-Threading |
---|---|---|---|---|---|
devel | 1 | cn[1] | 00:30 | 16 | 32 |
short | 21 | cn[2-22] | 1-12:00 | 336 | 672 |
medium | 26 | cn[23-48] | 7-00:00 | 416 | 832 |
long | 48 | cn[49-96] | no limit (inf) | 768 | 1536 |
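The partition and walltime are requested through directives in the script passed to sbatch. The following is a minimal sketch of such a submission script; the job name, task count, module name, and application path are placeholders and should be adapted to your own job.

```bash
#!/bin/bash
# Minimal example submission script (illustrative only; job name, module
# names, and resource values are placeholders, not site defaults).
#SBATCH --job-name=my_test_job      # job name shown in squeue
#SBATCH --partition=short           # one of: devel, short, medium, long
#SBATCH --nodes=1                   # number of compute nodes
#SBATCH --ntasks=16                 # total tasks/cores requested
#SBATCH --time=12:00:00             # walltime; must fit the partition limit
#SBATCH --output=job_%j.out         # stdout file (%j expands to the job ID)

# Load any required environment modules here (names are site specific).
# module load openmpi

# Run the application; srun launches it on the allocated resources.
srun ./my_application
```

Save the script (for example as slurm_submit_script.sub) and submit it with sbatch; the scheduler will route it to the requested partition.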
Useful Commands:
Slurm Command | Description | Syntax |
---|---|---|
sbatch | Submit a batch (serial or parallel) job using a Slurm submission script | sbatch slurm_submit_script.sub |
srun | Run a script or application interactively on a compute node | srun --pty -p devel -t 10 --mem 1000 /bin/bash (or replace /bin/bash with your script or application) |
scancel | Kill a job by job id number | scancel 999999 |
squeue | View status of your jobs | squeue -u <username> |
sinfo | View the cluster nodes, partitions and node status information | sinfo OR sinfo -lNe |
sacct | Check accounting information for a job by its id number | sacct -j 999999 |
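A typical session combines these commands: submit the script, note the job ID that sbatch prints, then monitor and, if necessary, cancel the job. The job ID and username below are placeholders, not real values.

```bash
# Submit the batch script; sbatch prints the assigned job ID.
sbatch slurm_submit_script.sub
# -> Submitted batch job 123456

# Check the state of your queued and running jobs (replace <username>).
squeue -u <username>

# Inspect the job's accounting record after it starts or finishes.
sacct -j 123456

# Cancel the job if it is no longer needed.
scancel 123456
```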
Usage Guidelines:
- Users must submit jobs only through the scheduler.
- Users must not run any jobs on the master (login) node.
- Users are not allowed to run a job by logging in directly to a compute node; request an interactive session through the scheduler instead (see the example below).
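To work interactively on a compute node without logging in to it directly, request a shell through the scheduler. The sketch below assumes the devel partition is acceptable for a short test and that the resource values shown suit your task.

```bash
# Request an interactive bash shell on the devel (testing) partition for
# 30 minutes with one task; the scheduler places the shell on a compute node.
srun --pty -p devel -t 30 --ntasks=1 /bin/bash

# Work on the compute node, then exit the shell to release the allocation.
exit
```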