Monitoring and debugging jobs¶
This page explains how to monitor running jobs, inspect completed jobs, and debug common issues on the HPC cluster.
Understanding job monitoring is essential for using resources efficiently and troubleshooting problems effectively.
Checking job status¶
View your active jobs¶
To see all your jobs currently queued or running:
squeue -u $USER
Common job states:
- PD – Pending (waiting for resources)
- R – Running
- CD – Completed
- F – Failed
- CA – Cancelled
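As an example, squeue can filter by state; the command below uses the standard -t/--states option to list only your pending jobs:
squeue -u $USER -t PD
If the watch utility is available on the login node, watch -n 60 squeue -u $USER refreshes the view every minute.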
View details for a specific job¶
squeue -j <job_id>
To also see why a job is pending:
squeue -j <job_id> -o "%i %t %r"
Understanding pending jobs¶
A job may remain in PENDING (PD) state for several reasons:
- Requested resources are not currently available
- Walltime request exceeds partition limits
- Requested GPUs or memory are fully occupied
- Fairshare or priority policies are applied
Pending jobs are normal on shared systems.
Avoid cancelling and resubmitting repeatedly, as this does not improve priority.
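If your cluster's scheduler computes start-time estimates, squeue can report them for a pending job (the estimate is approximate and changes as the queue evolves):
squeue -j <job_id> --start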
Cancelling jobs¶
If you need to stop a job:
scancel <job_id>
To cancel all your jobs:
scancel -u $USER
Use job cancellation responsibly: double-check the job ID before cancelling, especially for large or long-running jobs.
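scancel can also filter by job state, which is useful when you want to clear only jobs that have not started yet; a sketch using the standard -t/--state option:
scancel -u $USER -t PENDING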
Inspecting completed jobs¶
View job accounting information¶
After a job completes, use:
sacct -j <job_id>
Useful fields include:
- Job state
- Elapsed time
- Requested vs used resources
- Exit code
To show extended information:
sacct -j <job_id> --format=JobID,JobName,State,Elapsed,ReqMem,MaxRSS,AllocCPUS
Checking resource usage¶
Monitoring resource usage helps you request appropriate resources in future jobs.
Memory usage¶
Compare:
- ReqMem (requested memory)
- MaxRSS (maximum memory actually used)
If MaxRSS is close to ReqMem, the job ran near its memory limit; consider requesting slightly more memory to avoid out-of-memory failures.
If MaxRSS is much lower than ReqMem, you can reduce the memory request in future submissions.
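A minimal sketch of this comparison using standard sacct fields (replace <job_id> with your job ID):
sacct -j <job_id> --format=JobID,ReqMem,MaxRSS,State
Note that MaxRSS is usually reported on the job step lines (such as the .batch step) rather than on the parent job entry.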
CPU usage¶
Check whether the job efficiently used allocated CPUs:
- Short runtimes with many CPUs may indicate over-allocation
- Low CPU usage may suggest I/O-bound workloads
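To estimate CPU efficiency, compare the CPU time actually consumed with the product of elapsed time and allocated CPUs; a sketch using standard sacct fields:
sacct -j <job_id> --format=JobID,AllocCPUS,TotalCPU,Elapsed
If TotalCPU is far below AllocCPUS multiplied by Elapsed, the job did not keep its cores busy. Some sites also provide the seff utility (seff <job_id>), which summarizes CPU and memory efficiency for a completed job; availability varies by cluster.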
Debugging failed jobs¶
When a job fails, follow this checklist:
- Check the job state:
sacct -j <job_id>
- Inspect the standard output and error files:
less <jobname>.out
less <jobname>.err
- Look for common issues (a grep sketch follows this list):
- Out-of-memory errors
- Time limit exceeded
- Missing files or incorrect paths
- Permission errors
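A quick way to scan for these issues is to search the output and error files for typical failure keywords; the file names below are placeholders for whatever your job script produces:
grep -iE "error|killed|out of memory|oom|permission denied" <jobname>.out <jobname>.err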
Common error scenarios¶
Out-of-memory (OOM)¶
Symptoms:
- Job ends unexpectedly or is killed
- Error message indicates memory exhaustion
Solution:
- Increase --mem (see the sketch below)
- Reduce input size or parallelism
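A minimal sketch of raising the memory request in a job script; the value is illustrative and should be guided by the MaxRSS observed in previous runs:
#SBATCH --mem=16G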
Time limit exceeded¶
Symptoms:
- Job is cancelled at the walltime limit
Solution:
- Increase --time (see the sketch below)
- Optimize the workflow
- Split the job into smaller steps
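A sketch of raising the walltime request in a job script; the value is illustrative and must stay within the partition's limits:
#SBATCH --time=24:00:00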
File not found or permission denied¶
Symptoms:
- Errors related to missing files or access rights
Solution:
- Verify paths
- Check filesystem permissions
- Ensure files are accessible from compute nodes
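One defensive pattern is to verify inputs at the top of the job script so that a missing or unreadable file fails fast with a clear message; a sketch, where INPUT_FILE is a hypothetical variable standing in for your own path:
INPUT_FILE=/path/to/input.dat   # hypothetical path, replace with your own
if [ ! -r "$INPUT_FILE" ]; then
    echo "Error: cannot read $INPUT_FILE on this compute node" >&2
    exit 1
fi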
Monitoring GPU jobs¶
For GPU jobs, you can check GPU visibility inside the job:
nvidia-smi
Use this command only inside a GPU job allocation.
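For a quick interactive check, you can request a short GPU allocation and run nvidia-smi inside it; a sketch, where the partition name and GPU count are assumptions that depend on your cluster's configuration:
srun --partition=gpu --gres=gpu:1 --time=00:05:00 nvidia-smi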
Good monitoring practices¶
- Monitor jobs shortly after submission
- Check outputs regularly for long-running jobs
- Adjust resource requests based on actual usage
- Keep records of job performance for reproducibility
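One way to keep such records is to export recent accounting data with sacct's start-time filter; the date and field list below are illustrative:
sacct -u $USER -S 2025-01-01 --format=JobID,JobName,State,Elapsed,AllocCPUS,MaxRSS > job_history.txt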
When to ask for help¶
Contact support if:
- Jobs consistently fail without clear errors
- Jobs remain pending for unusually long times
- You suspect system-level issues
Provide:
- Job ID
- Job script
- Relevant output and error messages