SLURM basics¶
This page introduces the basic concepts of SLURM (Simple Linux Utility for Resource Management)
and explains how jobs are executed on the HPC cluster.
SLURM is responsible for allocating compute resources efficiently and fairly among users and projects.
Why a scheduler is required¶
The HPC cluster is a shared system:
- Multiple users run jobs concurrently
- Resources (CPU, GPU, memory) are finite
- Direct execution on compute nodes is not permitted
SLURM ensures that:
- Jobs run only when resources are available
- Resource usage follows project policies and priorities
- The system remains stable and fair for all users
Core concepts¶
Job¶
A job is a unit of work submitted to SLURM.
A job defines:
- What command or script to run
- What resources are required
- How long the job is expected to run
Jobs are typically submitted as shell scripts.
Partition¶
A partition is a logical grouping of compute resources.
Different partitions may be configured for:
- Production workloads
- Long-running jobs
- Short or test jobs
- GPU workloads
Each partition may enforce limits on runtime, resources, and priority.
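Partition choice usually follows from the shape of the job. As a sketch (the partition names `short`, `long`, and `gpu` are assumptions; list the partitions actually configured on your cluster with `sinfo`):

```shell
# Sketch: pick a partition from a job's runtime and GPU needs.
# Partition names ("short", "long", "gpu") are assumptions, not real
# cluster configuration; check `sinfo` for the partitions that exist.
choose_partition() {
  local minutes=$1 needs_gpu=$2
  if [ "$needs_gpu" = "yes" ]; then
    echo "gpu"
  elif [ "$minutes" -le 60 ]; then
    echo "short"
  else
    echo "long"
  fi
}

choose_partition 30 no    # a 30-minute CPU-only job fits the short partition
```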
Resources¶
When submitting a job, you request resources such as:
- CPUs (number of cores)
- Memory (RAM)
- GPUs (if required)
- Walltime (maximum runtime)
Always request realistic values: overestimating resource requirements can delay scheduling and waste capacity, while underestimating memory or walltime may cause SLURM to terminate the job before it finishes.
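Memory can be requested either as a per-node total (`--mem`) or per allocated CPU (`--mem-per-cpu`); the implied total is simple arithmetic. A quick sanity check, using illustrative values:

```shell
# Illustrative values only: 4 cores at 2000 MB per CPU.
# This mirrors the arithmetic behind --cpus-per-task and --mem-per-cpu.
cpus=4
mem_per_cpu_mb=2000
total_mb=$((cpus * mem_per_cpu_mb))
echo "requesting ${total_mb} MB in total"
```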
Job lifecycle¶
A typical SLURM job goes through the following states:
- PENDING (PD) – waiting for resources
- RUNNING (R) – currently executing
- COMPLETED (CD) – finished successfully
- FAILED (F) – terminated with errors
- CANCELLED (CA) – stopped by the user or system
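The short codes above are what `squeue` prints in its state (ST) column. A small helper to expand them, covering only the states listed here:

```shell
# Expand the short state codes shown by squeue (only the states
# from the lifecycle list above; SLURM defines more).
state_name() {
  case "$1" in
    PD) echo "PENDING" ;;
    R)  echo "RUNNING" ;;
    CD) echo "COMPLETED" ;;
    F)  echo "FAILED" ;;
    CA) echo "CANCELLED" ;;
    *)  echo "UNKNOWN" ;;
  esac
}

state_name PD    # prints PENDING
```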
Basic SLURM commands¶
Submit a job¶
sbatch myjob.sh
Submits a batch job script to the scheduler.
Check job status¶
squeue -u <username>
Displays jobs currently queued or running for your user.
Cancel a job¶
scancel <job_id>
Cancels a pending or running job.
View job history¶
sacct
Displays accounting information about past jobs, including their final state, exit code, and elapsed time.
SLURM job scripts¶
Jobs are typically defined in a batch script.
Example:
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --output=example.out
#SBATCH --error=example.err
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
echo "Hello from SLURM"
hostname
Lines starting with #SBATCH define job requirements and resource requests.
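Directives can also select a partition or request GPUs. A sketch of a GPU job script (the partition name `gpu` and the `gpu:1` gres string are assumptions; verify them against your cluster's configuration):

```shell
#!/bin/bash
#SBATCH --job-name=gpu-example
#SBATCH --partition=gpu        # assumed partition name; verify with sinfo
#SBATCH --gres=gpu:1           # request one GPU
#SBATCH --time=02:00:00
#SBATCH --mem=16G

# SLURM_JOB_ID is set by the scheduler at runtime; the fallback keeps
# the script runnable outside SLURM for testing.
echo "Job ${SLURM_JOB_ID:-unset} running on $(hostname)"
```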
Interactive jobs¶
For debugging or exploratory work, you can request an interactive shell on a compute node:
srun --pty bash
The same resource options as sbatch apply, for example:
srun --cpus-per-task=2 --mem=4G --time=01:00:00 --pty bash
Interactive jobs consume cluster resources and are subject to the same scheduling policies as batch jobs.
Good practices¶
- Keep job runtimes as short as possible
- Test workflows on small datasets before scaling up
- Use SCRATCH or SCRATCH_LOCAL for I/O-intensive operations
- Copy results to persistent storage before job completion
- Clean up temporary files after execution
- Monitor jobs regularly
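The cleanup advice above can be automated inside the job script itself, for example with a `trap` that removes the working directory when the script exits. Here `$SCRATCH` is assumed to point at your scratch filesystem; the `/tmp` fallback is only for illustration:

```shell
# Create a per-job working directory on scratch and remove it on exit.
# $SCRATCH is assumed to be provided by the cluster; /tmp is a fallback
# so the sketch also runs outside SLURM.
workdir=$(mktemp -d "${SCRATCH:-/tmp}/job.XXXXXX")
trap 'rm -rf "$workdir"' EXIT

echo "working in $workdir"
# ... run the computation, then copy results to persistent storage
# before the EXIT trap removes $workdir.
```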
What NOT to do¶
- Do not run computational workloads on login nodes
- Do not attempt to bypass SLURM to access compute nodes
- Do not submit uncontrolled or infinite jobs
Next steps¶
- Read Running your first job for a complete tutorial
- Learn how to monitor and debug jobs effectively