Consumer (ETL Worker)¶
Consumers process task messages from RabbitMQ, perform ETL, and store results in MongoDB (raw) and PostgreSQL (cleaned). They also handle failed ETL tasks via a Dead Letter Queue (DLQ) and expose metrics/logs for observability.
Responsibilities¶
- Fetch article-link messages from the Task Queue (RabbitMQ); a minimal consume loop is sketched after this list.
- Parse raw HTML into structured, analytics-ready records.
- Apply cleaning, normalization, and enrichment.
- Store:
    - MongoDB → raw data (data lake)
    - PostgreSQL → cleaned, analytics-ready data (data warehouse)
- Handle failed ETL tasks:
    - Messages that fail ETL are routed to the DLQ for manual inspection.
- Emit metrics and logs for observability:
    - Metrics → Prometheus
    - Logs → Promtail → Loki
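The consume loop below is a minimal sketch of this flow using `pika`. The broker host, queue name, and the `process` placeholder are illustrative assumptions rather than values from the actual codebase; the real worker runs the full parse/clean/store pipeline where `process` is called.

```python
import json

import pika


def process(task: dict) -> None:
    """Placeholder for the real ETL: fetch the article, parse, clean, store."""
    ...


def on_message(channel, method, properties, body):
    try:
        process(json.loads(body))  # article-link message → ETL
        channel.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # Reject without requeue so RabbitMQ dead-letters the message to the DLQ.
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)


connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))  # assumed host
channel = connection.channel()
channel.basic_qos(prefetch_count=10)  # cap unacked messages per worker
channel.basic_consume(queue="task_queue", on_message_callback=on_message)  # assumed queue name
channel.start_consuming()
```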
Implementation¶
- Language: Python 3.11
- Libraries: `pika`, `pandas`, `sqlalchemy`, `pymongo`
- Features:
    - Parallel processing of messages for high throughput.
    - Retry & DLQ integration for fault tolerance (see the wiring sketch after this list).
    - Metrics exposed to Prometheus.
    - Logs centralized via Promtail → Loki.
    - Fully containerized for Kubernetes deployment.
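As a sketch of how the fault-tolerance and metrics pieces might be wired together: the snippet below declares a dead-letter topology so that rejected messages land in a DLQ, and exposes Prometheus counters over HTTP via `prometheus_client`. The exchange/queue names (`dlx`, `dlq`, `task_queue`) and the metrics port are assumptions for illustration.

```python
import pika
from prometheus_client import Counter, start_http_server

# Incremented from the consume callback (see the loop sketched earlier).
PROCESSED = Counter("etl_messages_processed_total", "Messages processed successfully")
FAILED = Counter("etl_messages_failed_total", "Messages rejected and dead-lettered")

start_http_server(8000)  # Prometheus scrapes http://<pod>:8000/metrics (assumed port)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()

# Dead-letter topology (names assumed): messages nack'ed without requeue on
# task_queue are re-published by RabbitMQ to the "dlx" exchange and land in "dlq".
channel.exchange_declare(exchange="dlx", exchange_type="direct", durable=True)
channel.queue_declare(queue="dlq", durable=True)
channel.queue_bind(queue="dlq", exchange="dlx", routing_key="task_queue")
channel.queue_declare(
    queue="task_queue",
    durable=True,
    arguments={"x-dead-letter-exchange": "dlx"},  # route failures to dlx
)
```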
Data Flow¶
```mermaid
flowchart TD
    TaskQueue["Task Queue (Article Links)"] --> ConsumersNode
    ConsumersNode["Multiple Consumers"] --> MongoDB["Raw Storage"]
    ConsumersNode --> Postgres["Clean Storage"]
    ConsumersNode -->|Fail| DLQ["Dead Letter Queue"]
    ConsumersNode --> MetricsNode["Prometheus Metrics"]
    ConsumersNode --> LoggerNode["Logs"]
    LoggerNode --> Promtail["Promtail Agent"]
    Promtail --> Loki["Loki"]
```
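To make the two storage tiers concrete, here is a sketch of the store step with `pymongo`, `pandas`, and `sqlalchemy`. The connection strings, database/collection/table names, and cleaning rules are illustrative assumptions, not the project's actual configuration.

```python
import pandas as pd
from pymongo import MongoClient
from sqlalchemy import create_engine

# Connection strings are assumptions; in-cluster they would come from config/secrets.
mongo = MongoClient("mongodb://mongodb:27017")
engine = create_engine("postgresql+psycopg2://etl:etl@postgres:5432/warehouse")


def store_raw(doc: dict) -> None:
    # Data lake: persist the scrape untouched so it can be reprocessed later.
    mongo["lake"]["articles_raw"].insert_one(doc)


def store_clean(records: list[dict]) -> None:
    # Data warehouse: normalize with pandas, then append to the analytics table.
    df = pd.DataFrame(records)
    df["title"] = df["title"].str.strip()
    df = df.drop_duplicates(subset="url")
    df.to_sql("articles", engine, if_exists="append", index=False)
```

Keeping the untouched raw copy in MongoDB means cleaning bugs can be fixed and the affected records replayed without re-scraping the source sites.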
Key Highlights¶
- Parallel & scalable: KEDA spawns multiple Consumer pods based on Task Queue depth.
- Reliability: Failed ETL messages are captured in the DLQ for manual recovery.
- Observability: Metrics + logs provide real-time insights into processing rates, failures, and ETL throughput.
- Production-ready: Fully containerized and ready for Kubernetes deployment.