HPC service overview¶
The HPC service provides a shared computing environment for running computationally intensive workloads in a secure and controlled manner.
The cluster follows a standard HPC architecture, separating user access nodes from compute resources and using a batch scheduler to manage workloads.
Architecture at a glance¶
The cluster consists of:
- Login node
  - Entry point for users
  - Used for code editing, compilation, and job submission
  - Not intended for intensive or long-running computations
- Compute nodes
  - CPU nodes for general-purpose workloads
  - GPU nodes for accelerated workloads (e.g. AI, deep learning)
  - Accessible only through the scheduler
  - High-performance local scratch space (NVMe) available on compute nodes, intended for fast I/O during job execution (see the sketch after this list)
  - Local scratch storage is temporary and data may be removed after job completion
- Shared storage
  - Home directories for user data
  - Project-level work areas
  - Shared scratch space for temporary data
- Control and management nodes
  - Host system services (e.g. scheduler, monitoring, management)
  - Not accessible to users
  - Ensure operation and coordination of the platform
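A typical way to use the node-local NVMe scratch space is to stage input data onto it at the start of a job, run the workload there, and copy results back to shared storage before the job ends. The sketch below assumes the scratch path is exposed through `$TMPDIR` and uses a hypothetical application name; the actual variable, mount point, and program will differ on your system.

```bash
#!/bin/bash
#SBATCH --job-name=scratch-io
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# Assumption: node-local NVMe scratch is exposed through $TMPDIR;
# the exact variable or mount point may differ on this cluster.
JOB_SCRATCH="${TMPDIR:?local scratch not set}/$SLURM_JOB_ID"
mkdir -p "$JOB_SCRATCH"

# Stage input data from shared storage onto the fast local scratch
cp "$SLURM_SUBMIT_DIR/input.dat" "$JOB_SCRATCH/"

# Run the (hypothetical) application with its I/O on local scratch
srun "$SLURM_SUBMIT_DIR/my_application" \
    --input  "$JOB_SCRATCH/input.dat" \
    --output "$JOB_SCRATCH/output.dat"

# Copy results back to shared storage before the job ends;
# local scratch is temporary and may be purged after the job
cp "$JOB_SCRATCH/output.dat" "$SLURM_SUBMIT_DIR/"
```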
Scheduler¶
All compute resources are managed by SLURM (Simple Linux Utility for Resource Management).
Users submit jobs specifying their resource requirements, and SLURM allocates resources based on availability and scheduling policies.
Direct execution of computational workloads on the login node is not permitted.
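As a minimal illustration, a batch job is described in a job script whose `#SBATCH` directives state the resource requirements, and is then handed to SLURM with `sbatch`. The partition, module, and program names below are placeholders; consult the cluster's own documentation or `sinfo` for the actual values.

```bash
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=cpu        # assumed partition name; check `sinfo`
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00
#SBATCH --output=slurm-%j.out  # %j expands to the job ID

# Load required software (module name is a placeholder)
module load gcc

# Launch the workload under SLURM's control
srun ./my_program
```

The script is submitted with `sbatch job.sh`; queued and running jobs can be listed with `squeue -u $USER` and cancelled with `scancel <jobid>`.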
Typical workflow¶
A typical workflow on the HPC system is:
- Connect to the login node
- Prepare code, scripts, and input data
- Submit jobs using SLURM
- Monitor job execution
- Retrieve results from project storage
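Put together, a session on the cluster might look like the following sketch. The login hostname, project paths, and filenames are illustrative assumptions, not the actual values for this system.

```bash
# 1. Connect to the login node (hostname is a placeholder)
ssh username@hpc-login.example.org

# 2. Prepare code, scripts, and input data (paths are illustrative)
cd /project/myproject
nano job.sh

# 3. Submit the job to SLURM
sbatch job.sh

# 4. Monitor job execution
squeue -u $USER     # queued and running jobs
sacct -j <jobid>    # accounting information after completion

# 5. Retrieve results from project storage
scp username@hpc-login.example.org:/project/myproject/results/output.dat .
```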