HPC Cluster Engineer / Linux Specialist

Zaventem

Posted on June 10

Job description

Role Overview

We are hiring a Linux infrastructure specialist to join a core High-Performance Computing (HPC) team. You will support and evolve simulation and R&D compute platforms used by internal Engineering teams.

This is a technically broad role spanning Linux infrastructure, cluster orchestration, automation, monitoring, and some hands-on hardware support. You'll join a small, senior team with high autonomy and end-to-end responsibility for the HPC platform.

Key Responsibilities

* Administer and optimize Linux-based HPC clusters (Ubuntu, CentOS, RHEL-family)
* Manage workload scheduling with Slurm
* Support containerized workloads using Docker and Singularity
* Implement and manage infrastructure-as-code via Ansible and Terraform
* Support GPU-accelerated workloads (NVIDIA, CUDA)
* Monitor system health and performance using Grafana, Prometheus, and related tools
* Troubleshoot hardware and perform physical support tasks (rack/stack, diagnostics, cabling)
* Collaborate with internal researchers and engineers to support and improve workload performance
* Contribute to documentation and help mature internal platform standards and practices

Requirements

- Operating Systems: Ubuntu, CentOS, RHEL derivatives (Rocky, Alma)

- Schedulers and portals: Slurm (primary), Open OnDemand (optional)

- Containers: Docker, Singularity

- Automation: Ansible, Terraform, Bash, Python

- Monitoring: Grafana, Prometheus, custom metrics

- HPC Filesystems: Lustre (required), GPFS, Ceph (optional)

- Hardware: Server maintenance, rack/stack, troubleshooting

- Collaboration: Git, Jira, CI/CD pipelines

Ideal Candidate Profile

* 5+ years of Linux system administration experience, including in performance-sensitive environments
* Experience supporting or operating HPC clusters (Slurm, Lustre)
* Scripting ability in Bash and Python
* Hands-on automation experience with Ansible and Terraform (or equivalents)
* Familiarity with containerization and job isolation (Docker/Singularity)
* Comfortable with infrastructure observability tools and performance tuning
* Proactive, autonomous, and able to collaborate across teams and functions
* Fluent in English (spoken and written)

Nice to Have

* Experience with Bright Cluster Manager or other cluster deployment tools
* Exposure to distributed file systems (e.g., Ceph)
* Familiarity with Open OnDemand or other HPC frontend tools
* Understanding of GPU scheduling (CUDA/NVIDIA)
* Cloud exposure (AWS, Azure, or GCP)

Benefits

We would prefer to hire a full-time employee (FTE), but for the right candidate we can make an exception and work with them as a freelance contractor.

Here is a list of benefits:

- Meal vouchers
- Pension scheme (2%)
- Hospitalization insurance
- Remote work allowance (€60/month)
