Message Queue (RabbitMQ)¶
The Message Queue acts as the backbone of the pipeline, decoupling producers from consumers and ensuring reliable, ordered, and durable delivery of scraping and ETL tasks.
Responsibilities¶
- Maintain Work Queue for Producers:
    - Stores date-based tasks generated by the primary producer.
    - Enables multiple Producers (scaled by KEDA) to pick pending dates and scrape them.
- Maintain Task Queue for Consumers:
    - Stores scraped article links + metadata.
    - Allows multiple Consumers to process tasks in parallel for ETL into storage systems.
- Provide Dead Letter Queue (DLQ):
    - Captures failed ETL messages for manual inspection and replay.
- Ensure the following guarantees (a declaration sketch follows this list):
    - Durable storage of messages (no data loss).
    - Acknowledgements for reliable message processing.
    - Retry & requeue support for transient failures.
    - Metrics integration for observability.
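How this topology might be declared is sketched below using the pika client. The queue and exchange names, the connection host, and the sample payload are illustrative assumptions, not the project's actual configuration.

```python
import json

import pika

# Illustrative names; the project's actual queue/exchange names may differ.
WORK_QUEUE = "work_queue"   # date-based tasks for Producers
TASK_QUEUE = "task_queue"   # scraped article links for Consumers (ETL)
DLQ = "task_queue.dlq"      # failed ETL messages, kept for inspection/replay
DLX = "dlx"                 # dead-letter exchange feeding the DLQ

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Dead-letter exchange and queue: rejected Task Queue messages land here.
channel.exchange_declare(exchange=DLX, exchange_type="direct", durable=True)
channel.queue_declare(queue=DLQ, durable=True)
channel.queue_bind(queue=DLQ, exchange=DLX, routing_key=TASK_QUEUE)

# Durable queues survive broker restarts; the Task Queue dead-letters
# failed messages to the DLX instead of dropping them.
channel.queue_declare(queue=WORK_QUEUE, durable=True)
channel.queue_declare(
    queue=TASK_QUEUE,
    durable=True,
    arguments={
        "x-dead-letter-exchange": DLX,
        "x-dead-letter-routing-key": TASK_QUEUE,
    },
)

# Persistent messages (delivery_mode=2) are written to disk, so a pending
# date task is not lost if the broker restarts before a Producer picks it up.
channel.basic_publish(
    exchange="",
    routing_key=WORK_QUEUE,
    body=json.dumps({"date": "2024-01-01"}),
    properties=pika.BasicProperties(delivery_mode=2),
)
connection.close()
```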
Implementation¶
- RabbitMQ with:
    - Durable queues (Work Queue, Task Queue, DLQ).
    - Prefetch settings to prevent consumer overload.
    - Acknowledgements for guaranteed delivery (a Consumer sketch follows this list).
    - Integration with the Prometheus RabbitMQ Exporter for queue-level metrics.
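A companion sketch of a Consumer (ETL worker) showing the prefetch limit, explicit acknowledgements, and dead-lettering on failure. The queue name and the process_article placeholder carry over from the sketch above and are likewise assumptions.

```python
import pika

TASK_QUEUE = "task_queue"  # illustrative name, matching the sketch above


def process_article(body: bytes) -> None:
    """Placeholder for the real ETL step (parse, transform, load to storage)."""


def on_message(ch, method, properties, body):
    try:
        process_article(body)
        # Acknowledge only after the ETL step succeeds (guaranteed delivery).
        ch.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # requeue=True would retry a transient failure; requeue=False
        # dead-letters the message to the DLQ for manual inspection and replay.
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)


connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Prefetch caps the number of unacknowledged messages per Consumer,
# preventing one worker from being flooded while others sit idle.
channel.basic_qos(prefetch_count=10)
channel.basic_consume(queue=TASK_QUEUE, on_message_callback=on_message)
channel.start_consuming()
```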
Data Flow¶
```mermaid
flowchart TD
    ProducerInit[Primary Producer] --> WorkQueue
    WorkQueue[Work Queue] --> RabbitMQNode
    RabbitMQNode[(RabbitMQ)] --> MultiProducers
    MultiProducers[Multiple Producers] --> TaskQueue
    TaskQueue[Task Queue] --> RabbitMQNode
    RabbitMQNode --> ConsumersNode
    ConsumersNode[Consumers - ETL Workers] --> DLQ
    DLQ[Dead Letter Queue]
    RabbitMQNode --> MetricsNode
    MetricsNode[Prometheus Exporter]
```
Key Highlights¶
- Two-tier queueing system:
    - Work Queue (date orchestration) → drives Producers.
    - Task Queue (article links) → drives Consumers.
- Autoscaling with KEDA: queue depth automatically adjusts the number of Producers and Consumers (see the ScaledObject sketch after this list).
- Reliability: DLQ captures all failures for manual recovery.
- Observability: Queue depth, processing rates, and failures are exposed as Prometheus metrics.
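As a rough illustration of the queue-depth autoscaling, a KEDA ScaledObject using the built-in rabbitmq trigger might look like the following. The deployment name, replica bounds, threshold, and the RABBITMQ_URL environment variable are all assumptions, not taken from the project.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: etl-consumer-scaler
spec:
  scaleTargetRef:
    name: etl-consumer            # Deployment running the Consumers
  minReplicaCount: 0
  maxReplicaCount: 10
  triggers:
    - type: rabbitmq
      metadata:
        queueName: task_queue     # scale Consumers on Task Queue depth
        mode: QueueLength         # target backlog per replica
        value: "20"
        hostFromEnv: RABBITMQ_URL # AMQP connection string
```

A second ScaledObject pointed at the Work Queue would scale the Producers in the same way.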