We are looking for a Site Reliability Engineering service for our Engineering chapter team. The goal is to ensure the reliability, scalability, monitoring, and performance of our on-premises services in the ERA product organization. Responsibilities include designing and managing our infrastructure and implementing best practices. The role involves working within cross-functional teams to improve systems and processes and to ensure uptime and efficiency.
* Design and maintain monitoring infrastructure
* Create custom dashboards, alerts, and visualization solutions
* Implement distributed tracing and log aggregation systems
* Establish monitoring best practices and SLI/SLO frameworks
* Maintain security compliance for on-premises monitoring tools
* Automate deployment and configuration management
* Collaborate with development teams on application instrumentation
* Participate in on-duty rotations
Profile
* Core Technologies
o Advanced Grafana
o Prometheus (PromQL)
o OpenTelemetry
o Elasticsearch
* Infrastructure
o Linux administration
o Networking
o On-premises security
* Programming
o Python, Bash, or Go for automation
* Experience
o 3+ years in monitoring/observability
o 2+ years with Grafana/Prometheus in production
o Strong Linux system administration experience
o Proven track record with on-premises infrastructure solutions
* Security
o Enterprise security practices
o Compliance requirements
* Ability to balance technical trade-offs with business needs and prioritize effectively.
* Participation in on-duty rotations (24/7 incident support)
* English (C1).
* Additional languages: German, French, Dutch.
Contractual information
* Location: Brussels (Empereur)
* On-site presence: by default, physical presence on site is required 2 days per week.
* Work regime: full-time