DESCRIPTION OF THE TASKS:
The following tasks will be performed by the external service provider:
* Development and maintenance of a fully open-source data lakehouse.
* Design and development of scalable and reliable data pipelines to transform large volumes of both structured and unstructured data (see the illustrative sketch following this list).
* Data integration from various sources, including databases, APIs, data streaming services, and cloud data platforms.
* Optimisation of queries and workflows for improved performance and efficiency.
* Writing modular, testable, and production-grade code.
* Ensuring data quality through monitoring, validation, and automated quality checks, maintaining accuracy and consistency across the data platform.
* Preparation of test programs.
* Comprehensive documentation of processes to ensure seamless data pipeline management and troubleshooting.
* Assistance with deployment and configuration of the system.
* Participation in meetings with other project teams.
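
By way of illustration only, the sketch below shows the kind of extract-load step with a basic data quality check referred to above. The API endpoint, staging table, and checked columns are hypothetical placeholders and do not form part of this specification.

    """
    Minimal sketch of an extract-load step with a basic data quality check.
    The API endpoint, table name, and checked columns are hypothetical
    placeholders, not part of the tender specification.
    """
    import json
    import sqlite3
    import urllib.request

    API_URL = "https://example.org/api/v1/records"   # hypothetical source endpoint
    DB_PATH = "lakehouse_staging.db"                 # local staging store for the sketch


    def extract(url: str) -> list[dict]:
        """Pull raw JSON records from a source API."""
        with urllib.request.urlopen(url) as response:
            return json.loads(response.read())


    def load(records: list[dict], db_path: str) -> None:
        """Load records into a staging table, replacing any previous run."""
        conn = sqlite3.connect(db_path)
        conn.execute("DROP TABLE IF EXISTS staging_records")
        conn.execute("CREATE TABLE staging_records (id INTEGER, value TEXT)")
        conn.executemany(
            "INSERT INTO staging_records (id, value) VALUES (?, ?)",
            [(r.get("id"), r.get("value")) for r in records],
        )
        conn.commit()
        conn.close()


    def check_quality(db_path: str) -> None:
        """Fail loudly if the staging table is empty or contains NULL ids."""
        conn = sqlite3.connect(db_path)
        row_count = conn.execute("SELECT COUNT(*) FROM staging_records").fetchone()[0]
        null_ids = conn.execute(
            "SELECT COUNT(*) FROM staging_records WHERE id IS NULL"
        ).fetchone()[0]
        conn.close()
        if row_count == 0 or null_ids > 0:
            raise ValueError(f"Data quality check failed: rows={row_count}, null_ids={null_ids}")


    if __name__ == "__main__":
        load(extract(API_URL), DB_PATH)
        check_quality(DB_PATH)

In practice, such steps would be orchestrated by one of the tools listed under KNOWLEDGE AND SKILLS and would target the lakehouse storage layer rather than a local database.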
KNOWLEDGE AND SKILLS:
The following skills and knowledge are required for the performance of the above-listed tasks:
* Extensive hands-on experience as a Data Engineer or Data Architect with modern cloud-based open-source data platform solutions and data analytics tools.
* Excellent knowledge of data warehouse and/or data lakehouse design & architecture.
* Excellent knowledge of open-source, code-based data transformation tools such as dbt, Spark, and Trino.
* Excellent knowledge of SQL.
* Good knowledge of Python.
* Good knowledge of open-source orchestration tools such as Airflow, Dagster, or Luigi.
* Experience with AI-powered assistants like Amazon Q that can streamline data engineering processes.
* Good knowledge of relational database systems.
* Good knowledge of event streaming platforms and message brokers like Kafka and RabbitMQ.
* Extensive experience in creating end-to-end data pipelines following the ELT approach.
* Understanding of the principles behind open table formats such as Apache Iceberg or Delta Lake (see the illustrative sketch following this list).
* Proficiency with Kubernetes and Docker/Podman.
* Good knowledge of data modelling tools.
* Good knowledge of online analytical data processing (OLAP) and data mining tools.
* Ability to participate in multilingual meetings.
* Ability to work with a high degree of rigour and method and, more specifically, to follow naming conventions and coding standards.
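
By way of illustration only, the sketch below shows a minimal Spark job writing to an Apache Iceberg table through a file-based catalog. The catalog name, warehouse path, table identifier, and runtime package version are assumptions chosen for the example and do not form part of this specification.

    """
    Minimal sketch of writing a Spark DataFrame to an Apache Iceberg table.
    Catalog name, warehouse path, table identifier, and the runtime package
    version are illustrative assumptions, not values from this specification.
    """
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("iceberg-sketch")
        # Iceberg runtime and SQL extensions (version chosen for illustration only)
        .config("spark.jars.packages",
                "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.0")
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        # A file-based ("hadoop") catalog pointing at a local warehouse directory
        .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.local.type", "hadoop")
        .config("spark.sql.catalog.local.warehouse", "file:///tmp/iceberg-warehouse")
        .getOrCreate()
    )

    # Small in-memory DataFrame standing in for transformed pipeline output
    df = spark.createDataFrame(
        [(1, "alpha"), (2, "beta")],
        schema="id INT, label STRING",
    )

    # Create or replace an Iceberg table in the configured catalog
    df.writeTo("local.db.sample_records").createOrReplace()

    # Read it back with plain SQL to confirm the write
    spark.sql("SELECT COUNT(*) FROM local.db.sample_records").show()

An equivalent setup using Delta Lake, or with Trino querying the same tables, would serve the same illustrative purpose.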