S3 storage service overview¶
This page provides an overview of the S3-compatible storage service available on the platform
and explains when and how it should be used.
The storage service is designed to support biomedical and clinical research,
with particular attention to security, scalability, and data governance.
What this storage service is¶
The platform provides a geodistributed, S3-compatible object storage system.
Key characteristics:
- Object storage (not a traditional filesystem)
- S3 API compatibility (standard S3 clients and libraries work unchanged; see the sketch at the end of this section)
- Geodistributed across multiple university sites
- Designed for large datasets and persistent storage
- Suitable for collaboration and data sharing within approved projects
This storage service complements, but does not replace, the HPC filesystems
(HOME, WORK, SCRATCH).
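Because the service exposes the standard S3 API, any S3-compatible client or library can connect to it. Below is a minimal sketch using Python's boto3; the endpoint URL, bucket name, and credential environment variable names are placeholders, so substitute the values issued for your project:

```python
import os

import boto3

# Placeholder endpoint; use the endpoint issued for your project.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example-university.org",
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],      # project credentials
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
)

# List the objects in the project's bucket (bucket name is a placeholder).
response = s3.list_objects_v2(Bucket="my-project-bucket")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```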
Typical use cases¶
The S3 storage service is suitable for:
- Storage of research datasets
- Biomedical and clinical data (subject to project approval)
- Large dataset transfers (hundreds of GB to TB scale)
- Data sharing within approved research projects
- Storage of results beyond HPC job execution
It is not intended for:
- High-frequency small I/O operations
- Temporary files during job execution
- Replacing SCRATCH or WORK filesystems
Object storage vs. filesystem storage¶
It is important to understand the difference:
Object storage (S3)¶
- Data is stored as objects inside buckets
- No directories in the traditional sense (a "/" in an object key is only a naming convention; see the sketch below)
- Accessed via APIs or dedicated tools
- Optimized for scalability and durability
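To make the "no directories" point concrete: object keys may contain slashes, but the bucket itself is flat, and listing tools emulate folders by filtering on key prefixes. A small sketch, reusing the hypothetical boto3 client and placeholder bucket from the earlier example:

```python
# `s3` is the boto3 client from the earlier sketch; bucket and key
# names are placeholders. Both objects live flat in the bucket: the
# "/" in the key is part of the name, not a directory.
s3.put_object(Bucket="my-project-bucket", Key="cohort-a/sample-001.vcf", Body=b"...")
s3.put_object(Bucket="my-project-bucket", Key="cohort-a/sample-002.vcf", Body=b"...")

# Tools emulate folder listings by filtering on a key prefix.
response = s3.list_objects_v2(
    Bucket="my-project-bucket", Prefix="cohort-a/", Delimiter="/"
)
for obj in response.get("Contents", []):
    print(obj["Key"])
```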
HPC filesystems¶
- Traditional POSIX filesystems
- Optimized for high-performance parallel I/O
- Used directly by compute jobs
In practice:
- Use HPC filesystems for computation
- Use S3 storage for dataset storage and controlled data sharing
Buckets and projects¶
Access to S3 storage is project-based.
- Each project is assigned one bucket
- Permissions are restricted to authorized users
- Buckets are logically isolated from each other
Users can only access the buckets explicitly assigned to their project.
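One way to check which buckets your credentials can reach is a HeadBucket request, which succeeds only for buckets you are authorized to use. A sketch with placeholder bucket names, again assuming the boto3 client from the first example:

```python
from botocore.exceptions import ClientError

def can_access(bucket: str) -> bool:
    """Return True if the current credentials can reach the bucket."""
    try:
        s3.head_bucket(Bucket=bucket)  # `s3` is the client from the first sketch
        return True
    except ClientError as err:
        # A 403 means the bucket exists but is not assigned to your
        # project; a 404 means no such bucket.
        print(bucket, "->", err.response["Error"]["Code"])
        return False

print(can_access("my-project-bucket"))    # your own bucket: True
print(can_access("some-other-project"))   # not assigned to you: False
```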
Security and compliance¶
The storage service is operated in accordance with:
- GDPR requirements
- Institutional data protection policies
- ISO/IEC 27001-aligned information security practices
Security measures include:
- Access control based on identity and project authorization
- Logical isolation between projects
- Logging of relevant access events
- Geodistribution for resilience and availability
Users are responsible for handling data in accordance with project approvals
and applicable requirements.
Data lifecycle and retention¶
Data stored in S3 is persistent for the duration of the authorized project.
- Access to storage is granted for a defined period
- At the end of the authorization period, access may be revoked
- Data may be removed unless the authorization is renewed or extended
Users must request any extension of storage usage before the expiration of the authorization period.
Do not assume indefinite storage without prior agreement.
Performance considerations¶
S3 storage is optimized for:
- Large, sequential data transfers
- High-throughput data movement
For best performance:
- Transfer large files rather than many small files
- Use multipart uploads for large datasets
- Avoid frequent overwrites of the same objects
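As an illustration of the multipart recommendation, boto3's transfer manager switches to multipart uploads automatically once a file crosses a size threshold. A sketch with illustrative sizes; the file and bucket names are placeholders, and the threshold and chunk size should be tuned to your data and network:

```python
from boto3.s3.transfer import TransferConfig

# Upload in 64 MiB parts with several parts in flight at once.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=8,
)

# Bundling many small files into one archive before uploading also
# follows the "few large files" guidance above. `s3` is the client
# from the first sketch.
s3.upload_file("dataset.tar", "my-project-bucket", "raw/dataset.tar", Config=config)
```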
How users access S3 storage¶
Users do not access S3 storage via a mounted filesystem.
Instead, access is provided through:
- Command-line tools (e.g. rclone, s3cmd, mc)
- Programmatic access via S3 APIs
- Controlled integrations with the HPC cluster
Details are provided in First steps with S3 storage.
When to combine S3 storage and HPC¶
A common pattern is:
- Store raw datasets in S3 storage
- Stage required data to HPC WORK or SCRATCH
- Run computations on the HPC cluster
- Store final results back to S3
This approach balances performance and data persistence.
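A sketch of this pattern, assuming the boto3 client from the first example, a SCRATCH environment variable exported on the cluster, and a placeholder pipeline command:

```python
import os
import subprocess

scratch = os.environ["SCRATCH"]  # assumes the cluster exports SCRATCH

# 1. Stage raw data from S3 to fast local storage.
s3.download_file("my-project-bucket", "raw/dataset.tar", f"{scratch}/dataset.tar")

# 2. Run the computation against the staged copy (placeholder command).
subprocess.run(["my_pipeline", f"{scratch}/dataset.tar"], check=True)

# 3. Stage final results back to persistent S3 storage.
s3.upload_file(f"{scratch}/results.tar", "my-project-bucket", "results/results.tar")
```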
Next steps¶
- Read Access credentials and buckets.
- Follow First steps with S3 storage to start using the service.