SLURM basics

This page introduces the basic concepts of SLURM (originally short for Simple Linux Utility for Resource Management)
and explains how jobs are executed on the HPC cluster.

SLURM is responsible for allocating compute resources efficiently and fairly among users and projects.


Why a scheduler is required

The HPC cluster is a shared system:

  • Multiple users run jobs concurrently
  • Resources (CPU, GPU, memory) are finite
  • Direct execution on compute nodes is not permitted

SLURM ensures that:

  • Jobs run only when resources are available
  • Resource usage follows project policies and priorities
  • The system remains stable and fair for all users

Core concepts

Job

A job is a unit of work submitted to SLURM.

A job defines:

  • What command or script to run
  • What resources are required
  • How long the job is expected to run

Jobs are typically submitted as shell scripts.
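
For example, a single command can be submitted without writing a script by using sbatch's --wrap option:

sbatch --wrap="hostname"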


Partition

A partition is a logical grouping of compute resources.

Different partitions may be configured for:

  • Production workloads
  • Long-running jobs
  • Short or test jobs
  • GPU workloads

Each partition may enforce limits on runtime, resources, and priority.
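
For example, the sinfo command lists the partitions available on the cluster, and a job can be sent to a specific partition with --partition (the partition name "short" below is a placeholder; use a name reported by sinfo):

sinfo
sbatch --partition=short myjob.sh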


Resources

When submitting a job, you request resources such as:

  • CPUs (number of cores)
  • Memory (RAM)
  • GPUs (if required)
  • Walltime (maximum runtime)

Always request realistic values: overestimating resource requirements may delay job scheduling.
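
These requests map directly to sbatch options. For example (the values are illustrative, and --gres=gpu:1 is only needed for GPU jobs):

sbatch --cpus-per-task=4 --mem=8G --time=02:00:00 --gres=gpu:1 myjob.sh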


Job lifecycle

A typical SLURM job goes through the following states:

  • PENDING (PD) – waiting for resources
  • RUNNING (R) – currently executing
  • COMPLETED (CD) – finished successfully
  • FAILED (F) – terminated with errors
  • CANCELLED (CA) – stopped by the user or system
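
The state of a queued or running job can be inspected with squeue; for a pending job, the reason field explains what it is waiting for (the job ID below is a placeholder):

squeue -j 12345 -o "%i %T %r"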

Basic SLURM commands

Submit a job

sbatch myjob.sh

Submits a batch job script to the scheduler.
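
On success, sbatch prints the ID assigned to the job (the number will differ):

Submitted batch job 12345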


Check job status

squeue -u <username>

Displays jobs currently queued or running for your user.
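
Typical output looks like this (the values are illustrative):

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
12345     short  example    alice  R       5:12      1 node001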


Cancel a job

scancel <job_id>

Cancels a pending or running job.
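
scancel can also target several jobs at once, for example all jobs belonging to your user:

scancel -u <username>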


View job history

sacct

Displays accounting information about completed (and currently running) jobs.
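
The output can be tailored with --format, for example to show the final state, elapsed time, and peak memory use of a specific job (the job ID is a placeholder):

sacct -j 12345 --format=JobID,JobName,State,Elapsed,MaxRSS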


SLURM job scripts

Jobs are typically defined in a batch script.

Example:

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --output=example.out
#SBATCH --error=example.err
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

echo "Hello from SLURM"
hostname

Lines starting with #SBATCH define job requirements and resource requests.
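
To keep output from different runs separate, the %j pattern substitutes the job ID into file names, for example:

#SBATCH --output=example-%j.out
#SBATCH --error=example-%j.err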


Interactive jobs

For debugging or exploratory work, you can request an interactive session:

srun --pty bash

Interactive jobs consume cluster resources and are subject to the same scheduling policies as batch jobs.
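
As with batch jobs, resources should be requested explicitly rather than relying on defaults. A minimal example (the values are illustrative):

srun --cpus-per-task=2 --mem=4G --time=01:00:00 --pty bash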


Good practices

  • Keep job runtimes as short as possible
  • Test workflows on small datasets before scaling up
  • Use SCRATCH or SCRATCH_LOCAL for I/O-intensive operations (see the staging sketch after this list)
  • Copy results to persistent storage before job completion
  • Clean up temporary files after execution
  • Monitor jobs regularly
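
The staging pattern referred to above can be sketched as a fragment of a job script. The SCRATCH variable, program name, and paths are placeholders; adapt them to your cluster's storage layout:

# Stage input to fast scratch storage, run there, then copy results back
WORKDIR="$SCRATCH/$SLURM_JOB_ID"       # per-job working directory on scratch
mkdir -p "$WORKDIR"
cp input.dat "$WORKDIR"
cd "$WORKDIR"
./my_program input.dat > results.out   # placeholder for the actual workload
cp results.out "$HOME/project/"        # copy results to persistent storage
rm -rf "$WORKDIR"                      # clean up temporary files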

What NOT to do

  • Do not run computational workloads on login nodes
  • Do not attempt to bypass SLURM to access compute nodes
  • Do not submit uncontrolled or infinite jobs

Next steps