Skip to content
Back to all projects
completed

Real-Time Data Pipeline Engine

ETL pipeline framework for streaming and batch data workloads with schema evolution and data quality monitoring.

Python
Apache Airflow
Apache Spark
PostgreSQL
Docker
AWS S3
View source

System Evidence

  • medium-high throughput workload planning
  • Python, Apache Airflow, Apache Spark implementation surface
  • 70% latency-focused optimization target
Traffic Target
5,000 req/sec
Latency Gain
70% lower
Reliability
99.5% uptime
Daily Data
800GB

Overview

An ETL framework that unifies streaming and batch data processing under a single orchestration layer, with a focus on observability and reliability.

Problem

Data teams often need batch and streaming workflows to share quality checks, schema handling, and recovery patterns instead of living as separate one-off pipelines.

Architecture

  • Orchestration: Apache Airflow DAGs with dynamic task generation
  • Processing: Spark clusters for heavy transformations, Python for lightweight ETL
  • Storage: S3 data lake with PostgreSQL metadata store
  • Monitoring: Custom data quality framework with anomaly detection

Engineering Tradeoffs

  • Used Airflow for workflow visibility and retries instead of hiding orchestration inside scripts
  • Kept schema metadata in PostgreSQL so pipeline changes could be inspected and audited
  • Sent heavy transforms to Spark while keeping simpler jobs in Python for operational speed

Key Achievements

  • Supports daily processing across many pipeline jobs
  • Includes automated schema evolution handling
  • Adds recovery workflows to reduce manual intervention
  • Provides built-in data lineage tracking for compliance support

Validation Focus

  • Run backfill and replay workflows to verify recovery behavior
  • Track data freshness, anomaly rates, and failed-task retries
  • Confirm lineage metadata updates when schemas evolve