How do you want the processes to be distributed?
- All on the same node, to reduce network latency
- Scattered across distinct nodes, to increase overall memory bandwidth
- Even distribution of processes across nodes
- Let scheduler choose
All on the same node, to reduce network latency:
sample script – sample.sub
#!/bin/bash
# Submission script: "tasks are all grouped on same node"
# Job name
#SBATCH --job-name=mpi_mm
# Output file name
#SBATCH --output=mpi_mm_v2.out
#SBATCH --error=mpi_mm_v2.err
#
# Set the required partition [change]
#SBATCH --partition=short
# Number of processes
#SBATCH --ntasks=32
# Number of nodes
#SBATCH --nodes=1
# Memory per process
#SBATCH --mem-per-cpu=500
#
# Total wall-time
#SBATCH --time=00:05:00
#
# The below statement is required if the code is floating-point intensive and CPU-bound [Optional]
#SBATCH --threads-per-core=1
#
# To get email alert [Optional]
# NOTE: Remove one "#" and "write your email ID" (ex: #SBATCH --mail-user=hemanta.kumar@icts.res.in)
##SBATCH --mail-user= email id
##SBATCH --mail-type=ALL
date
mpirun /home/hemanta.kumar/slurm_test/mpi_mm
#srun /home/hemanta.kumar/slurm_test/mpi_mm
date
Scattered across distinct nodes, to increase overall memory bandwidth:
sample script – sample.sub
#!/bin/bash
# Submission script: "tasks are scattered across distinct nodes"
# Job name
#SBATCH --job-name=mpi_mm
# Output file name
#SBATCH --output=mpi_mm_v3.out
#SBATCH --error=mpi_mm_v3.err
#
# Set the required partition [change]
#SBATCH --partition=short
# Number of processes
#SBATCH --ntasks=2
#SBATCH --ntasks-per-node=1
# Memory per process
#SBATCH --mem-per-cpu=500
#
# Total wall-time
#SBATCH --time=00:05:00
#
# The below statement is required if the code is floating-point intensive and CPU-bound [Optional]
#SBATCH --threads-per-core=1
#
# To get email alert [Optional]
# NOTE: Remove one "#" and "write your email ID" (ex: #SBATCH --mail-user=hemanta.kumar@icts.res.in)
##SBATCH --mail-user= email id
##SBATCH --mail-type=ALL
#
date
mpirun /home/hemanta.kumar/slurm_test/mpi_mm
#srun /home/hemanta.kumar/slurm_test/mpi_mm
date
Even distribution of processes across nodes:
sample script – sample.sub
#!/bin/bash
# Submission script: "tasks are evenly distributed across nodes"
# Job name
#SBATCH --job-name=mpi_mm
# Output file name
#SBATCH --output=mpi_mm_v1.out
#SBATCH --error=mpi_mm_v1.err
#
# Set the required partition [change]
#SBATCH --partition=short
# Number of processes
#SBATCH --ntasks=32
# Process distribution per node
#SBATCH --ntasks-per-node=8
# Number of nodes
#SBATCH --nodes=4
# Memory per process
#SBATCH --mem-per-cpu=500
#
# Total wall-time
#SBATCH --time=00:05:00
#
# The below statement is required if the code is floating-point intensive and CPU-bound [Optional]
#SBATCH --threads-per-core=1
#
# To get email alert [Optional]
# NOTE: Remove one "#" and "write your email ID" (ex: #SBATCH --mail-user=hemanta.kumar@icts.res.in)
##SBATCH --mail-user= email id
##SBATCH --mail-type=ALL
#
date
mpirun /home/hemanta.kumar/slurm_test/mpi_mm
#srun /home/hemanta.kumar/slurm_test/mpi_mm
date
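For an even distribution, the three numbers in the script above have to agree: 32 tasks over 4 nodes gives 8 tasks per node. The following is a minimal sketch of that arithmetic; the function name tasks_per_node is hypothetical and not part of Slurm. It can be used to check a combination before editing the #SBATCH lines.

```shell
#!/bin/bash
# Hypothetical helper (not a Slurm command): compute the value for
# --ntasks-per-node given a total task count ($1) and a node count ($2).
# Fails if the tasks cannot be split evenly.
tasks_per_node() {
    if [ "$2" -le 0 ] || [ $(( $1 % $2 )) -ne 0 ]; then
        echo "error: $1 tasks cannot be split evenly over $2 nodes" >&2
        return 1
    fi
    echo $(( $1 / $2 ))
}

# Values from the even-distribution script: --ntasks=32, --nodes=4
tasks_per_node 32 4   # prints 8
```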
Let scheduler choose:
sample script – sample.sub
#!/bin/bash
# Submission script: "no plan"
# Job name
#SBATCH --job-name=mpi_mm
# Output file name
#SBATCH --output=mpi_mm_v4.out
#SBATCH --error=mpi_mm_v4.err
#
# Set the required partition [change]
#SBATCH --partition=short
# Number of processes
#SBATCH --ntasks=64
# Memory per process
#SBATCH --mem-per-cpu=500
#
# Total wall-time
#SBATCH --time=00:05:00
#
# The below statement is required if the code is floating-point intensive and CPU-bound [Optional]
#SBATCH --threads-per-core=1
#
# To get email alert [Optional]
# NOTE: Remove one "#" and "write your email ID" (ex: #SBATCH --mail-user=hemanta.kumar@icts.res.in)
##SBATCH --mail-user= email id
##SBATCH --mail-type=ALL
#
date
mpirun /home/hemanta.kumar/slurm_test/mpi_mm
#srun /home/hemanta.kumar/slurm_test/mpi_mm
date
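When the scheduler chooses the placement, it can be useful to check where the tasks actually landed. One possible approach is to have every task print its hostname and count the results. The sketch below assumes srun is available inside the job; the function name count_tasks_per_host is hypothetical, and the hostnames in the demonstration are made up.

```shell
#!/bin/bash
# Hypothetical helper: read one hostname per line on stdin and print
# "<count> <hostname>" for each distinct host.
count_tasks_per_host() {
    sort | uniq -c | awk '{print $1, $2}'
}

# Inside a job script (assumes srun is available on the cluster):
#   srun hostname | count_tasks_per_host

# Stand-alone demonstration with made-up hostnames:
printf 'node01\nnode02\nnode01\n' | count_tasks_per_host
```

The same check works with the other three scripts, for example to confirm that the "same node" variant reports a single hostname.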
Submit job:
sbatch sample.sub
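On success, sbatch prints a line of the form "Submitted batch job 12345". The job ID in that line is what squeue and scancel expect, so a wrapper script may want to capture it. A minimal sketch follows; the function name job_id_from_sbatch is hypothetical, and the script assumes sample.sub is in the current directory.

```shell
#!/bin/bash
# Hypothetical helper: extract the job ID from the line sbatch prints by
# default, e.g. "Submitted batch job 12345".
job_id_from_sbatch() {
    awk '/^Submitted batch job/ {print $4}'
}

# Only submit when sbatch and the script are actually present.
if command -v sbatch >/dev/null 2>&1 && [ -f sample.sub ]; then
    job_id=$(sbatch sample.sub | job_id_from_sbatch)
    echo "Submitted job ${job_id}"
else
    echo "sbatch or sample.sub not found; run this on a cluster login node" >&2
fi
```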
The job's status in the queue can be monitored with squeue (add -u username to show only a particular user's jobs).
The job can be cancelled with scancel <job_id>.
By default, when a job finishes (whether it succeeded or failed), Slurm writes one file to the submission directory named slurm-NNNN.out, where NNNN is the job ID. The sample scripts above set --output and --error, so their output goes to the named files instead (for example mpi_mm_v1.out and mpi_mm_v1.err).
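The sample scripts set fixed file names with --output and --error, so a rerun will typically overwrite the files from the previous run. One option, shown here as a fragment rather than a full script, is sbatch's %j filename pattern, which is replaced by the job ID. The mpi_mm_ prefix is only an illustration.

```shell
# Fragment for the header of sample.sub: %j expands to the job ID,
# so each run gets its own output and error files.
#SBATCH --output=mpi_mm_%j.out
#SBATCH --error=mpi_mm_%j.err
```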