Job Scheduler (Slurm):
Slurm (originally the Simple Linux Utility for Resource Management) is an open-source job scheduler that allocates compute resources on a cluster to queued, researcher-defined jobs. Slurm is deployed at many national and international computing centers and runs on approximately 60% of the TOP500 supercomputers in the world.
You can learn more about Slurm and its commands on the official Slurm website.
Queuing System:
When a job is submitted, it is placed in a queue. Slurm calls its queues partitions; each partition points to a collection of nodes. Choose the partition that suits your computational needs from those listed below. On Sonic there is one partition:
- long: this partition has 14 compute nodes set aside for running longer jobs. It has no time limit.
| Queue name | No. of nodes | Node list | Default walltime (days-hours:minutes) | Total no. of CPUs (threads) |
|---|---|---|---|---|
| long | 14 | sonic[1-14] | no limit (inf) | 1344 |
The Sonic cluster does not have a "devel" partition; all nodes are in the single partition "long".
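A minimal submit script for the "long" partition might look like the sketch below. The job name, output file pattern, resource values, and the final command are illustrative assumptions, not site recommendations. The figure of 96 threads per node is simply the 1344 CPUs from the partition table divided by the 14 nodes.

```shell
#!/bin/bash
# Hypothetical job name, shown in squeue output.
#SBATCH --job-name=example_job
# "long" is the only partition on Sonic.
#SBATCH --partition=long
# One node, one task (process).
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Assumed value; each node offers 1344 / 14 = 96 threads.
#SBATCH --cpus-per-task=4
# Memory per node in MB (assumed value).
#SBATCH --mem=1000
# %j expands to the job ID.
#SBATCH --output=example_job_%j.out

# Slurm sets these variables inside a job; the fallbacks let the
# script also run outside Slurm for a quick syntax check.
job_id="${SLURM_JOB_ID:-none}"
cpus="${SLURM_CPUS_PER_TASK:-1}"

echo "Job ${job_id} running on $(uname -n) with ${cpus} CPU(s)"
```

Saved as `slurm_submit_script.sub`, this would be submitted with `sbatch slurm_submit_script.sub`. No `--time` directive is included because the partition has no time limit.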
Useful commands
| Slurm command | Description | Syntax |
|---|---|---|
| sbatch | Submit a serial or parallel batch job using a Slurm submit script | sbatch slurm_submit_script.sub |
| srun | Run a script or application interactively | srun --pty -p long -t 10 --mem 1000 /bin/bash [script or app] |
| scancel | Cancel a job by job ID | scancel 999999 |
| squeue | View the status of your jobs | squeue -u $USER |
| sinfo | View cluster nodes, partitions, and node status information | sinfo OR sinfo -lNe |
| sacct | View accounting information for a job by job ID | sacct -j 999999 |
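The commands in the table are often chained: submit a job, capture its ID, then query or cancel it. The sketch below shows one way to do that. The function names `submit_job` and `job_status` are hypothetical helpers, not part of Slurm; the `--parsable` and `--format` options are standard `sbatch` and `sacct` options.

```shell
#!/bin/bash

# Submit a script and print only the numeric job ID.
# With --parsable, sbatch prints "jobid" or "jobid;cluster".
submit_job() {
    local out
    out="$(sbatch --parsable "$1")" || return 1
    echo "${out%%;*}"
}

# Show the queue entry and the accounting record for one job.
job_status() {
    squeue -j "$1"
    sacct -j "$1" --format=JobID,JobName,State,Elapsed
}
```

For example, `job_id="$(submit_job slurm_submit_script.sub)"` followed by `job_status "$job_id"` reports on the job, and `scancel "$job_id"` cancels it.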
Usage Guidelines
- Users must submit jobs only through the scheduler.
- Users must not run jobs on the master node.
- Users must not run jobs by logging in directly to a compute node.
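One way to avoid accidentally running a workload outside the scheduler is to have the script check that it was launched by Slurm. The sketch below relies on the `SLURM_JOB_ID` environment variable, which Slurm sets inside a job; the helper name `in_slurm_job` is hypothetical, and the check is a convenience, not an enforcement mechanism.

```shell
#!/bin/bash

# Succeed only when running inside a Slurm job.
# Slurm exports SLURM_JOB_ID to the processes it launches.
in_slurm_job() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

if ! in_slurm_job; then
    # A real script would typically stop here with "exit 1".
    echo "Not inside a Slurm job; submit this script with sbatch." >&2
fi
```

Placing this guard at the top of a compute script makes it print a warning when the script is started directly on the master node or a compute node.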