S3 storage service overview¶
This page provides an overview of the S3-compatible storage service available on the platform
and explains when and how it should be used.
The storage service is designed to support biomedical and clinical research,
with particular attention to security, scalability, and data governance.
What this storage service is¶
The platform provides a geodistributed, S3-compatible object storage system.
Key characteristics:
- Object storage (not a traditional filesystem)
- S3 API compatibility (standard S3 clients and libraries work unchanged; see the sketch at the end of this section)
- Geodistributed across multiple university sites
- Designed for large datasets and persistent storage
- Suitable for collaboration and data sharing within approved projects
This storage service complements, but does not replace, the HPC filesystems
(HOME, WORK, SCRATCH).
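Because the service exposes the standard S3 API, any S3-compatible client or library can connect to it. Below is a minimal sketch using Python's boto3; the endpoint URL, bucket name, and credential environment variable names are placeholders, so substitute the values issued for your project:

```python
import os

import boto3

# Placeholder endpoint; use the endpoint issued for your project.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example-university.org",
    aws_access_key_id=os.environ["S3_ACCESS_KEY"],      # project credentials
    aws_secret_access_key=os.environ["S3_SECRET_KEY"],
)

# List the objects in the project's bucket (bucket name is a placeholder).
response = s3.list_objects_v2(Bucket="my-project-bucket")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```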
Typical use cases¶
The S3 storage service is suitable for:
- Storage of research datasets
- Biomedical and clinical data (subject to project approval)
- Large dataset transfers (hundreds of GB to TB scale)
- Data sharing within approved research projects
- Storage of results beyond HPC job execution
It is not intended for:
- High-frequency small I/O operations
- Temporary files during job execution
- Replacing SCRATCH or WORK filesystems
Object storage vs. filesystem storage¶
It is important to understand the difference:
Object storage (S3)¶
- Data is stored as objects inside buckets
- No directories in the traditional sense (a "/" in an object key is only a naming convention; see the sketch below)
- Accessed via APIs or dedicated tools
- Optimized for scalability and durability
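To make the "no directories" point concrete: object keys may contain slashes, but the bucket itself is flat, and listing tools emulate folders by filtering on key prefixes. A small sketch, reusing the hypothetical boto3 client and placeholder bucket from the earlier example:

```python
# `s3` is the boto3 client from the earlier sketch; bucket and key
# names are placeholders. Both objects live flat in the bucket: the
# "/" in the key is part of the name, not a directory.
s3.put_object(Bucket="my-project-bucket", Key="cohort-a/sample-001.vcf", Body=b"...")
s3.put_object(Bucket="my-project-bucket", Key="cohort-a/sample-002.vcf", Body=b"...")

# Tools emulate folder listings by filtering on a key prefix.
response = s3.list_objects_v2(
    Bucket="my-project-bucket", Prefix="cohort-a/", Delimiter="/"
)
for obj in response.get("Contents", []):
    print(obj["Key"])
```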
HPC filesystems¶
- Traditional POSIX filesystems
- Optimized for high-performance parallel I/O
- Used directly by compute jobs
In practice:
- Use HPC filesystems for computation
- Use S3 storage for dataset storage and controlled data sharing
Buckets and projects¶
Access to S3 storage is project-based.
- Each project is assigned one bucket
- Permissions are restricted to authorized users
- Buckets are logically isolated from each other
Users can only access the buckets explicitly assigned to their project.
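One way to check which buckets your credentials can reach is a HeadBucket request, which succeeds only for buckets you are authorized to use. A sketch with placeholder bucket names, again assuming the boto3 client from the first example:

```python
from botocore.exceptions import ClientError

def can_access(bucket: str) -> bool:
    """Return True if the current credentials can reach the bucket."""
    try:
        s3.head_bucket(Bucket=bucket)  # `s3` is the client from the first sketch
        return True
    except ClientError as err:
        # A 403 means the bucket exists but is not assigned to your
        # project; a 404 means no such bucket.
        print(bucket, "->", err.response["Error"]["Code"])
        return False

print(can_access("my-project-bucket"))    # your own bucket: True
print(can_access("some-other-project"))   # not assigned to you: False
```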
Security and compliance¶
The storage service is operated in accordance with:
- GDPR requirements
- Institutional data protection policies
- ISO/IEC 27001-aligned information security practices
Security measures include:
- Access control based on identity and project authorization
- Logical isolation between projects
- Logging of relevant access events
- Geodistribution for resilience and availability
Users are responsible for handling data in accordance with project approvals
and applicable requirements.
Data lifecycle and retention¶
Data stored in S3 is persistent for the duration of the authorized project.
- Access to storage is granted for a defined period
- At the end of the authorization period, access may be revoked
- Data may be removed unless the authorization is renewed or extended
Users must request any extension of storage usage before the expiration of the authorization period.
Do not assume indefinite storage without prior agreement.
Performance considerations¶
S3 storage is optimized for:
- Large, sequential data transfers
- High-throughput data movement
For best performance:
- Transfer large files rather than many small files
- Use multipart uploads for large datasets
- Avoid frequent overwrites of the same objects
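As an illustration of the multipart recommendation, boto3's transfer manager switches to multipart uploads automatically once a file crosses a size threshold. A sketch with illustrative sizes; the file and bucket names are placeholders, and the threshold and chunk size should be tuned to your data and network:

```python
from boto3.s3.transfer import TransferConfig

# Upload in 64 MiB parts with several parts in flight at once.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=8,
)

# Bundling many small files into one archive before uploading also
# follows the "few large files" guidance above. `s3` is the client
# from the first sketch.
s3.upload_file("dataset.tar", "my-project-bucket", "raw/dataset.tar", Config=config)
```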
How users access S3 storage¶
Users do not access S3 storage via a mounted filesystem.
Instead, access is provided through:
- Command-line tools (e.g. rclone, s3cmd, mc)
- Programmatic access via S3 APIs
- Controlled integrations with the HPC cluster
Details are provided in First steps with S3 storage.
When to combine S3 storage and HPC¶
A common pattern is:
- Store raw datasets in S3 storage
- Stage required data to HPC WORK or SCRATCH
- Run computations on the HPC cluster
- Store final results back to S3
This approach balances performance and data persistence.
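A sketch of this pattern, assuming the boto3 client from the first example, a SCRATCH environment variable exported on the cluster, and a placeholder pipeline command:

```python
import os
import subprocess

scratch = os.environ["SCRATCH"]  # assumes the cluster exports SCRATCH

# 1. Stage raw data from S3 to fast local storage.
s3.download_file("my-project-bucket", "raw/dataset.tar", f"{scratch}/dataset.tar")

# 2. Run the computation against the staged copy (placeholder command).
subprocess.run(["my_pipeline", f"{scratch}/dataset.tar"], check=True)

# 3. Stage final results back to persistent S3 storage.
s3.upload_file(f"{scratch}/results.tar", "my-project-bucket", "results/results.tar")
```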
Next steps¶
- Read Access credentials and buckets.
- Follow First steps with S3 storage to start using the service.