Monitoring and debugging jobs

This page explains how to monitor running jobs, inspect completed jobs, and debug common issues on the HPC cluster.

Understanding job monitoring is essential for using resources efficiently and troubleshooting problems effectively.


Checking job status

View your active jobs

To see all your jobs currently queued or running:

squeue -u $USER

Common job states:

  • PD – Pending (waiting for resources)
  • R – Running
  • CD – Completed
  • F – Failed
  • CA – Cancelled
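
If you only want to see jobs in a particular state, squeue can filter on these state codes; for example, to show just your pending and running jobs:

squeue -u $USER --states=PD,R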

View details for a specific job

squeue -j <job_id>

To also see the reason why a job is pending:

squeue -j <job_id> -o "%i %t %r"

Understanding pending jobs

A job may remain in PENDING (PD) state for several reasons:

  • Requested resources are not currently available
  • Walltime request exceeds partition limits
  • Requested GPUs or memory are fully occupied
  • Fairshare or priority policies are applied

Pending jobs are normal on shared systems.
Avoid cancelling and resubmitting repeatedly, as this does not improve priority.
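
To see the scheduler's reason for every one of your pending jobs at once, you can combine a state filter with a format string similar to the one shown above:

squeue -u $USER -t PD -o "%i %j %r"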


Cancelling jobs

If you need to stop a job:

scancel <job_id>

To cancel all your jobs:

scancel -u $USER

Use job cancellation responsibly, especially for large jobs; as noted above, cancelling and resubmitting does not improve priority.
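
scancel also accepts filters, which is often safer than cancelling everything; for example, to cancel only your pending jobs and leave running ones untouched:

scancel -u $USER --state=PENDING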


Inspecting completed jobs

View job accounting information

After a job completes, use:

sacct -j <job_id>

Useful fields include:

  • Job state
  • Elapsed time
  • Requested vs used resources
  • Exit code

To show extended information:

sacct -j <job_id> --format=JobID,JobName,State,Elapsed,ReqMem,MaxRSS,AllocCPUS

Checking resource usage

Monitoring resource usage helps you request appropriate resources in future jobs.

Memory usage

Compare:

  • ReqMem (requested memory)
  • MaxRSS (maximum memory actually used)

If MaxRSS is close to ReqMem, the job ran near its memory limit and risks being killed; consider requesting somewhat more memory next time.
If MaxRSS is much lower than ReqMem, you can safely reduce the memory request.
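
A minimal sacct query for this comparison might look like the following; note that MaxRSS is reported per job step, so the relevant value usually appears on the .batch line rather than on the parent job line:

sacct -j <job_id> --format=JobID,ReqMem,MaxRSS,State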


CPU usage

Check whether the job used its allocated CPUs efficiently:

  • A short runtime with many allocated CPUs may indicate over-allocation
  • Low CPU utilization may point to an I/O-bound workload
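
One rough measure of CPU efficiency is total CPU time compared with elapsed time multiplied by the number of allocated CPUs. If the seff utility is installed on your cluster it reports this directly; otherwise the underlying fields are available from sacct:

seff <job_id>
sacct -j <job_id> --format=JobID,Elapsed,TotalCPU,AllocCPUS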

Debugging failed jobs

When a job fails, follow this checklist:

  1. Check the job state:
    sacct -j <job_id>
    
  2. Inspect standard output and error files:
    less <jobname>.out
    less <jobname>.err
    
  3. Look for common issues:
    • Out-of-memory errors
    • Time limit exceeded
    • Missing files or incorrect paths
    • Permission errors
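
A quick way to spot several of these at once is to search the error file for typical messages; the exact wording varies by application and runtime, so treat this pattern as a starting point rather than an exhaustive check:

grep -iE "out of memory|oom|killed|no such file|permission denied|time limit" <jobname>.err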

Common error scenarios

Out-of-memory (OOM)

Symptoms:

  • Job ends unexpectedly or is killed
  • Error message indicates memory exhaustion

Solution:

  • Increase --mem
  • Reduce input size or parallelism
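
In practice this means raising the memory request in the job script; the value below is purely illustrative and should be guided by the MaxRSS reported by sacct:

#SBATCH --mem=32G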

Time limit exceeded

Symptoms:

  • Job is cancelled at the walltime limit

Solution:

  • Increase --time
  • Optimize the workflow
  • Split the job into smaller steps
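
For example, raise the walltime request in the job script; the value shown is illustrative and must stay within your partition's limit:

#SBATCH --time=48:00:00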

File not found or permission denied

Symptoms:

  • Errors related to missing files or access rights

Solution:

  • Verify paths
  • Check filesystem permissions
  • Ensure files are accessible from compute nodes
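
If a file is visible on the login node but the job still cannot find it, a quick sanity check is to list it from a compute node in a small interactive step; the path below is a placeholder:

srun -n 1 ls -l /path/to/your/input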

Monitoring GPU jobs

For GPU jobs, you can check GPU visibility inside the job:

nvidia-smi

Use this command only inside a GPU job allocation.
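
To watch GPU utilization while a job is running, you can either attach a step to the job's allocation or call nvidia-smi periodically from within the job script. Both forms below are illustrative; depending on your Slurm version, the attached step may need --overlap to start while the job's main step is running:

srun --jobid=<job_id> --pty nvidia-smi
watch -n 30 nvidia-smi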


Good monitoring practices

  • Monitor jobs shortly after submission
  • Check outputs regularly for long-running jobs
  • Adjust resource requests based on actual usage
  • Keep records of job performance for reproducibility

When to ask for help

Contact support if:

  • Jobs consistently fail without clear errors
  • Jobs remain pending for unusually long times
  • You suspect system-level issues

Provide:

  • Job ID
  • Job script
  • Relevant output and error messages