Monitoring and debugging jobs¶
This page explains how to monitor running jobs, inspect completed jobs, and debug common issues on the HPC cluster.
Understanding job monitoring is essential for using resources efficiently and troubleshooting problems effectively.
Checking job status¶
View your active jobs¶
To see all your jobs currently queued or running:
squeue -u $USER
Common job states:
- PD – Pending (waiting for resources)
- R – Running
- CD – Completed
- F – Failed
- CA – Cancelled
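As an example, squeue can filter by state; the command below uses the standard -t/--states option to list only your pending jobs:
squeue -u $USER -t PD
If the watch utility is available on the login node, watch -n 60 squeue -u $USER refreshes the view every minute.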
View details for a specific job¶
squeue -j <job_id>
To also see why a job is pending:
squeue -j <job_id> -o "%i %t %r"
Understanding pending jobs¶
A job may remain in PENDING (PD) state for several reasons:
- Requested resources are not currently available
- Walltime request exceeds partition limits
- Requested GPUs or memory are fully occupied
- Fairshare or priority policies are applied
Pending jobs are normal on shared systems.
Avoid cancelling and resubmitting repeatedly, as this does not improve priority.
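If your cluster's scheduler computes start-time estimates, squeue can report them for a pending job (the estimate is approximate and changes as the queue evolves):
squeue -j <job_id> --start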
Cancelling jobs¶
If you need to stop a job:
scancel <job_id>
To cancel all your jobs:
scancel -u $USER
Use job cancellation responsibly: double-check the job ID before cancelling, especially for large or long-running jobs.
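scancel can also filter by job state, which is useful when you want to clear only jobs that have not started yet; a sketch using the standard -t/--state option:
scancel -u $USER -t PENDING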
Inspecting completed jobs¶
View job accounting information¶
After a job completes, use:
sacct -j <job_id>
Useful fields include:
- Job state
- Elapsed time
- Requested vs used resources
- Exit code
To show extended information:
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,ReqMem,MaxRSS,AllocCPUS
Checking resource usage¶
Monitoring resource usage helps you request appropriate resources in future jobs.
Memory usage¶
Compare:
- ReqMem (requested memory)
- MaxRSS (maximum memory actually used)
If MaxRSS is close to ReqMem, the job ran near its memory limit; consider requesting slightly more memory to avoid out-of-memory failures.
If MaxRSS is much lower than ReqMem, you can reduce the memory request in future submissions.
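A minimal sketch of this comparison using standard sacct fields (replace <job_id> with your job ID):
sacct -j <job_id> --format=JobID,ReqMem,MaxRSS,State
Note that MaxRSS is usually reported on the job step lines (such as the .batch step) rather than on the parent job entry.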
CPU usage¶
Check whether the job efficiently used allocated CPUs:
- Short runtimes with many CPUs may indicate over-allocation
- Low CPU usage may suggest I/O-bound workloads
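To estimate CPU efficiency, compare the CPU time actually consumed with the product of elapsed time and allocated CPUs; a sketch using standard sacct fields:
sacct -j <job_id> --format=JobID,AllocCPUS,TotalCPU,Elapsed
If TotalCPU is far below AllocCPUS multiplied by Elapsed, the job did not keep its cores busy. Some sites also provide the seff utility (seff <job_id>), which summarizes CPU and memory efficiency for a completed job; availability varies by cluster.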
Debugging failed jobs¶
When a job fails, follow this checklist:
- Check the job state:
sacct -j <job_id>
- Inspect the standard output and error files:
less <jobname>.out
less <jobname>.err
- Look for common issues (a grep sketch follows this list):
- Out-of-memory errors
- Time limit exceeded
- Missing files or incorrect paths
- Permission errors
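A quick way to scan for these issues is to search the output and error files for typical failure keywords; the file names below are placeholders for whatever your job script produces:
grep -iE "error|killed|out of memory|oom|permission denied" <jobname>.out <jobname>.err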
Common error scenarios¶
Out-of-memory (OOM)¶
Symptoms:
- Job ends unexpectedly or is killed
- Error message indicates memory exhaustion
Solution:
- Increase --mem (see the sketch below)
- Reduce input size or parallelism
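A minimal sketch of raising the memory request in a job script; the value is illustrative and should be guided by the MaxRSS observed in previous runs:
#SBATCH --mem=16G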
Time limit exceeded¶
Symptoms:
- Job is cancelled at the walltime limit
Solution:
- Increase --time (see the sketch below)
- Optimize the workflow
- Split the job into smaller steps
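A sketch of raising the walltime request in a job script; the value is illustrative and must stay within the partition's limits:
#SBATCH --time=24:00:00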
File not found or permission denied¶
Symptoms:
- Errors related to missing files or access rights
Solution:
- Verify paths
- Check filesystem permissions
- Ensure files are accessible from compute nodes
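One defensive pattern is to verify inputs at the top of the job script so that a missing or unreadable file fails fast with a clear message; a sketch, where INPUT_FILE is a hypothetical variable standing in for your own path:
INPUT_FILE=/path/to/input.dat   # hypothetical path, replace with your own
if [ ! -r "$INPUT_FILE" ]; then
    echo "Error: cannot read $INPUT_FILE on this compute node" >&2
    exit 1
fi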
Monitoring GPU jobs¶
For GPU jobs, you can check GPU visibility inside the job:
nvidia-smi
Use this command only inside a GPU job allocation.
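For a quick interactive check, you can request a short GPU allocation and run nvidia-smi inside it; a sketch, where the partition name and GPU count are assumptions that depend on your cluster's configuration:
srun --partition=gpu --gres=gpu:1 --time=00:05:00 nvidia-smi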
Good monitoring practices¶
- Monitor jobs shortly after submission
- Check outputs regularly for long-running jobs
- Adjust resource requests based on actual usage
- Keep records of job performance for reproducibility
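One way to keep such records is to export recent accounting data with sacct's start-time filter; the date and field list below are illustrative:
sacct -u $USER -S 2025-01-01 --format=JobID,JobName,State,Elapsed,AllocCPUS,MaxRSS > job_history.txt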
When to ask for help¶
Contact support if:
- Jobs consistently fail without clear errors
- Jobs remain pending for unusually long times
- You suspect system-level issues
Provide:
- Job ID
- Job script
- Relevant output and error messages