
Data Engineer

As a Data Engineer, you will design, build, and maintain scalable data pipelines and workflows to support our growing data ecosystem. You will focus on building production-ready ETL processes with Apache Airflow, integrating with diverse data stores, and ensuring all code meets rigorous development standards, including peer review, scalable design, and comprehensive test coverage. The ideal candidate is a proficient developer who treats data engineering as software engineering, with hands-on experience in Retrieval-Augmented Generation (RAG) pipelines and a track record of delivering reliable, maintainable systems.

Key Responsibilities

  • Develop and optimize ETL pipelines using Apache Airflow to ingest, transform, and load data from various sources into target systems.
  • Implement production-ready code for data workflows, ensuring scalability, fault tolerance, and adherence to best practices such as modular design, error handling, and automated testing (unit, integration, and end-to-end).
  • Collaborate with data scientists, analysts, and engineering teams to build and maintain RAG pipelines that enhance AI/ML applications with accurate, context-aware data retrieval.
  • Participate in code reviews to enforce high coding standards, promote clean, readable code, and integrate CI/CD practices for automated testing and deployment.
  • Monitor and troubleshoot data pipelines for performance, reliability, and data quality, implementing observability tools to detect and resolve issues proactively.
  • Design and optimize data storage solutions, integrating with relational and NoSQL databases to support real-time and batch processing needs.

Required Qualifications

  • Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • 5+ years of hands-on experience as a Data Engineer or in a similar role, with a proven background as a strong developer (e.g., proficiency in Python, SQL, and related languages).
  • Strong proficiency with Apache Airflow for orchestrating complex ETL workflows, including DAG creation, scheduling, and dependency management.
  • Demonstrated experience building scalable ETL pipelines that handle large datasets, with a focus on production-ready implementation including comprehensive test coverage (e.g., using pytest or similar frameworks).
  • Strong emphasis on software engineering practices: experience with peer code reviews and version control (e.g., Git), and a commitment to modular, documented, and scalable code that avoids common pitfalls such as brittle or unmaintainable pipelines.
  • Familiarity with data modeling, transformation, and integration in distributed environments.
  • Excellent problem-solving skills and the ability to work in a fast-paced, collaborative environment.

Preferred Qualifications

  • Experience with RAG pipelines, including vector databases and embedding techniques for AI-driven applications.
  • Hands-on experience with databases such as PostgreSQL (for relational data), StarRocks (for analytical workloads), Cassandra or ScyllaDB (for high-throughput NoSQL), and Qdrant (for vector search).
  • Knowledge of cloud data services (e.g., AWS Glue, Azure Data Factory) and orchestration tools beyond Airflow.
  • Familiarity with monitoring and observability tools like Prometheus or OpenSearch for data pipeline health.
  • Certifications in relevant technologies (e.g., Apache Airflow).