SLURM basics¶
This page introduces the basic concepts of SLURM (Simple Linux Utility for Resource Management)
and explains how jobs are executed on the HPC cluster.
SLURM is responsible for allocating compute resources efficiently and fairly among users and projects.
Why a scheduler is required¶
The HPC cluster is a shared system:
- Multiple users run jobs concurrently
- Resources (CPU, GPU, memory) are finite
- Direct execution on compute nodes is not permitted
SLURM ensures that:
- Jobs run only when resources are available
- Resource usage follows project policies and priorities
- The system remains stable and fair for all users
Core concepts¶
Job¶
A job is a unit of work submitted to SLURM.
A job defines:
- What command or script to run
- What resources are required
- How long the job is expected to run
Jobs are typically submitted as shell scripts.
Partition¶
A partition is a logical grouping of compute resources.
Different partitions may be configured for:
- Production workloads
- Long-running jobs
- Short or test jobs
- GPU workloads
Each partition may enforce limits on runtime, resources, and priority.
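Partition choice usually follows from the shape of the job. As a sketch (the partition names `short`, `long`, and `gpu` are assumptions; list the partitions actually configured on your cluster with `sinfo`):

```shell
# Sketch: pick a partition from a job's runtime and GPU needs.
# Partition names ("short", "long", "gpu") are assumptions, not real
# cluster configuration; check `sinfo` for the partitions that exist.
choose_partition() {
  local minutes=$1 needs_gpu=$2
  if [ "$needs_gpu" = "yes" ]; then
    echo "gpu"
  elif [ "$minutes" -le 60 ]; then
    echo "short"
  else
    echo "long"
  fi
}

choose_partition 30 no    # a 30-minute CPU-only job fits the short partition
```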
Resources¶
When submitting a job, you request resources such as:
- CPUs (number of cores)
- Memory (RAM)
- GPUs (if required)
- Walltime (maximum runtime)
Always request realistic values: overestimating resource requirements can delay scheduling and waste capacity, while underestimating memory or walltime may cause SLURM to terminate the job before it finishes.
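Memory can be requested either as a per-node total (`--mem`) or per allocated CPU (`--mem-per-cpu`); the implied total is simple arithmetic. A quick sanity check, using illustrative values:

```shell
# Illustrative values only: 4 cores at 2000 MB per CPU.
# This mirrors the arithmetic behind --cpus-per-task and --mem-per-cpu.
cpus=4
mem_per_cpu_mb=2000
total_mb=$((cpus * mem_per_cpu_mb))
echo "requesting ${total_mb} MB in total"
```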
Job lifecycle¶
A typical SLURM job goes through the following states:
- PENDING (PD) – waiting for resources
- RUNNING (R) – currently executing
- COMPLETED (CD) – finished successfully
- FAILED (F) – terminated with errors
- CANCELLED (CA) – stopped by the user or system
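The short codes above are what `squeue` prints in its state (ST) column. A small helper to expand them, covering only the states listed here:

```shell
# Expand the short state codes shown by squeue (only the states
# from the lifecycle list above; SLURM defines more).
state_name() {
  case "$1" in
    PD) echo "PENDING" ;;
    R)  echo "RUNNING" ;;
    CD) echo "COMPLETED" ;;
    F)  echo "FAILED" ;;
    CA) echo "CANCELLED" ;;
    *)  echo "UNKNOWN" ;;
  esac
}

state_name PD    # prints PENDING
```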
Basic SLURM commands¶
Submit a job¶
sbatch myjob.sh
Submits a batch job script to the scheduler.
Check job status¶
squeue -u <username>
Displays jobs currently queued or running for your user.
Cancel a job¶
scancel <job_id>
Cancels a pending or running job.
View job history¶
sacct
Displays accounting information about past jobs, including their final state, exit code, and elapsed time.
SLURM job scripts¶
Jobs are typically defined in a batch script.
Example:
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --output=example.out
#SBATCH --error=example.err
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
echo "Hello from SLURM"
hostname
Lines starting with #SBATCH define job requirements and resource requests.
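Directives can also select a partition or request GPUs. A sketch of a GPU job script (the partition name `gpu` and the `gpu:1` gres string are assumptions; verify them against your cluster's configuration):

```shell
#!/bin/bash
#SBATCH --job-name=gpu-example
#SBATCH --partition=gpu        # assumed partition name; verify with sinfo
#SBATCH --gres=gpu:1           # request one GPU
#SBATCH --time=02:00:00
#SBATCH --mem=16G

# SLURM_JOB_ID is set by the scheduler at runtime; the fallback keeps
# the script runnable outside SLURM for testing.
echo "Job ${SLURM_JOB_ID:-unset} running on $(hostname)"
```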
Interactive jobs¶
For debugging or exploratory work, you can request an interactive shell on a compute node:
srun --pty bash
The same resource options as sbatch apply, for example:
srun --cpus-per-task=2 --mem=4G --time=01:00:00 --pty bash
Interactive jobs consume cluster resources and are subject to the same scheduling policies as batch jobs.
Good practices¶
- Keep job runtimes as short as possible
- Test workflows on small datasets before scaling up
- Use SCRATCH or SCRATCH_LOCAL for I/O-intensive operations
- Copy results to persistent storage before job completion
- Clean up temporary files after execution
- Monitor jobs regularly
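The cleanup advice above can be automated inside the job script itself, for example with a `trap` that removes the working directory when the script exits. Here `$SCRATCH` is assumed to point at your scratch filesystem; the `/tmp` fallback is only for illustration:

```shell
# Create a per-job working directory on scratch and remove it on exit.
# $SCRATCH is assumed to be provided by the cluster; /tmp is a fallback
# so the sketch also runs outside SLURM.
workdir=$(mktemp -d "${SCRATCH:-/tmp}/job.XXXXXX")
trap 'rm -rf "$workdir"' EXIT

echo "working in $workdir"
# ... run the computation, then copy results to persistent storage
# before the EXIT trap removes $workdir.
```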
What NOT to do¶
- Do not run computational workloads on login nodes
- Do not attempt to bypass SLURM to access compute nodes
- Do not submit uncontrolled or infinite jobs
Next steps¶
- Read Running your first job for a complete tutorial
- Learn how to monitor and debug jobs effectively