Components Overview
This page provides a technical breakdown of each component in the BBC News ETL pipeline.
Components
- Primary Producer & Scrapers: Generate work queue, scrape BBC News articles, deduplicate, and push tasks to RabbitMQ.
- Message Queue (RabbitMQ): Manages the Work Queue, Task Queue, and dead-letter queue (DLQ) for decoupling and reliability.
- Consumers (ETL Workers): Fetch tasks, parse and transform raw HTML into structured datasets, store in MongoDB & PostgreSQL, and handle DLQ messages.
- Storage Layer:
  - MongoDB: raw/unstructured HTML.
  - PostgreSQL: cleaned, structured analytics-ready data.
- Monitoring & Observability: Prometheus metrics, Grafana dashboards, Loki logs (via Promtail).
- Deployment & Orchestration: Docker, Kubernetes + KEDA for autoscaling, Helm charts, CI/CD pipelines.
1. Producer (Scraper)
- Purpose: Generate work queue, scrape live and archived news articles, deduplicate against MongoDB, and publish tasks to RabbitMQ.
- Implementation:
  - Python 3.11 + requests, BeautifulSoup4, Selenium, pika.
  - Handles rate limiting, retries, and error logging via Promtail.
- Key Features:
  - Modular scrapers supporting multiple sections.
  - Work Queue orchestration based on dates and statistical thresholds.
  - Horizontally scalable via KEDA.
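The deduplicate-then-publish step described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the `task_queue` name, the message shape, and the use of a URL hash as the deduplication key are all assumptions, and an in-memory set stands in for the MongoDB lookup.

```python
import hashlib
import json


def url_fingerprint(url: str) -> str:
    """Return a stable deduplication key: SHA-256 of the trimmed URL."""
    normalized = url.strip().rstrip("/")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def publish_new_tasks(channel, urls, seen, queue="task_queue", properties=None):
    """Publish one JSON task per unseen URL; return the URLs actually published.

    channel    -- object with a pika-style basic_publish method
    seen       -- set of fingerprints (assumed stand-in for the MongoDB lookup)
    queue      -- hypothetical queue name
    properties -- optional message properties, passed through unchanged
    """
    published = []
    for url in urls:
        key = url_fingerprint(url)
        if key in seen:
            continue  # duplicate: already stored or already queued
        body = json.dumps({"url": url, "fingerprint": key})
        channel.basic_publish(
            exchange="", routing_key=queue, body=body, properties=properties
        )
        seen.add(key)
        published.append(url)
    return published
```

With pika, `channel` would typically come from `pika.BlockingConnection(...).channel()`, and passing `properties=pika.BasicProperties(delivery_mode=2)` marks each message as persistent so it survives a broker restart on a durable queue.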
2. Message Queue (RabbitMQ)
- Purpose: Decouple producers and consumers while ensuring reliable task delivery.
- Implementation:
  - Maintains Work Queue, Task Queue, DLQ.
  - Durable queues with prefetch, acknowledgements, retries.
- Key Features:
  - Supports multiple producers and consumers.
  - Fault-tolerant messaging with DLQ for failed ETL tasks.
  - Prometheus metrics for queue depth, throughput, and failures.
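One way to set up durable queues that dead-letter into a DLQ is sketched below. The queue names, the prefetch value, and the choice to route dead letters through the default exchange are assumptions for illustration; `queue_declare` and `basic_qos` are the pika channel methods the calls are modelled on.

```python
WORK_QUEUE = "work_queue"  # hypothetical queue names
TASK_QUEUE = "task_queue"
DLQ = "dlq"


def dead_letter_arguments(dlq_name: str) -> dict:
    """Queue arguments that send rejected messages to the DLQ.

    An empty exchange name means the default exchange, which routes
    by queue name, so the routing key is simply the DLQ's name.
    """
    return {
        "x-dead-letter-exchange": "",
        "x-dead-letter-routing-key": dlq_name,
    }


def declare_topology(channel, prefetch_count=10):
    """Declare the three durable queues and cap unacknowledged deliveries."""
    # The DLQ itself has no dead-letter target.
    channel.queue_declare(queue=DLQ, durable=True)
    for name in (WORK_QUEUE, TASK_QUEUE):
        channel.queue_declare(
            queue=name, durable=True, arguments=dead_letter_arguments(DLQ)
        )
    # Each consumer holds at most prefetch_count unacknowledged messages.
    channel.basic_qos(prefetch_count=prefetch_count)
```

Durability covers the queue definition only; messages also need to be published as persistent to survive a broker restart.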
3. Consumer (ETL Worker)
- Purpose: Transform raw HTML into structured datasets and load them into MongoDB/PostgreSQL.
- Implementation:
  - Python ETL scripts + pandas, sqlalchemy, pymongo.
  - Retry and DLQ integration for failed messages.
- Key Features:
  - Parallel and scalable processing via KEDA.
  - Metrics exposed to Prometheus; logs sent via Promtail → Loki.
  - Handles failed ETL tasks with DLQ for manual inspection.
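The transform-and-acknowledge flow above might look roughly like this. To keep the sketch dependency-free it parses HTML with the standard library's `html.parser` rather than the libraries listed above; the record fields, the task message shape, and the `load_html` and `store` callables are hypothetical. A failed task is negatively acknowledged without requeueing, which sends it to the DLQ if the queue has a dead-letter target configured.

```python
import json
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collects the <h1> headline and <p> paragraph text from raw HTML."""

    def __init__(self):
        super().__init__()
        self._tag = None
        self.headline = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._tag = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag == "h1":
            self.headline += data
        elif self._tag == "p":
            self.paragraphs[-1] += data


def transform(raw_html: str, url: str) -> dict:
    """Turn raw HTML into a flat record; raise if no headline is found."""
    parser = ArticleParser()
    parser.feed(raw_html)
    parser.close()
    headline = parser.headline.strip()
    if not headline:
        raise ValueError(f"no headline found for {url}")
    body = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
    return {
        "url": url,
        "headline": headline,
        "body": body,
        "word_count": len(body.split()),
    }


def handle_delivery(channel, delivery_tag, body, load_html, store):
    """Process one task; ack on success, dead-letter on failure."""
    try:
        task = json.loads(body)
        record = transform(load_html(task["url"]), task["url"])
        store(record)
    except Exception:
        # requeue=False lets the broker route the message to the DLQ.
        channel.basic_nack(delivery_tag=delivery_tag, requeue=False)
        return False
    channel.basic_ack(delivery_tag=delivery_tag)
    return True
```

Acknowledging only after `store` returns means a worker crash mid-task leaves the message unacknowledged, so the broker redelivers it to another consumer.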
4. Storage Layer
- MongoDB: Raw/unstructured HTML storage (data lake).
- PostgreSQL: Cleaned, analytics-ready datasets.
- Features:
  - Versioned datasets for reproducibility.
  - Optimized for query performance and downstream analytics.
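The page does not say how dataset versions are tracked. One common approach, assumed here purely for illustration, is to make a version label part of the primary key so that each run's output can be read back unchanged. The sketch uses the standard library's `sqlite3` as a stand-in for PostgreSQL so that it runs without a server; the table and column names are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS articles (
    url             TEXT NOT NULL,
    dataset_version TEXT NOT NULL,
    headline        TEXT NOT NULL,
    word_count      INTEGER NOT NULL,
    PRIMARY KEY (url, dataset_version)
)
"""


def save_record(conn, record, dataset_version):
    """Insert or overwrite one cleaned record within a dataset version."""
    conn.execute(
        "INSERT OR REPLACE INTO articles "
        "(url, dataset_version, headline, word_count) VALUES (?, ?, ?, ?)",
        (record["url"], dataset_version, record["headline"], record["word_count"]),
    )
    conn.commit()


def load_version(conn, dataset_version):
    """Return all rows of one dataset version, ordered by URL."""
    cursor = conn.execute(
        "SELECT url, headline, word_count FROM articles "
        "WHERE dataset_version = ? ORDER BY url",
        (dataset_version,),
    )
    return cursor.fetchall()
```

`INSERT OR REPLACE` is SQLite syntax; on PostgreSQL the equivalent upsert is written with `INSERT ... ON CONFLICT (url, dataset_version) DO UPDATE`.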
5. Observability
- Prometheus: Collects metrics (queue depth, processing rates, error counts).
- Grafana: Dashboards visualize metrics and pipeline performance.
- Loki + Promtail: Centralized logging from all components for errors, retries, and debugging.
- Features:
  - Alerts for failures, bottlenecks, or anomalies.
  - Real-time monitoring for production-grade deployments.
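To make the metric types above concrete, the sketch below hand-renders queue depth, processed count, and error count in the Prometheus text exposition format. The metric names and help strings are assumptions; in a real deployment a client library such as `prometheus_client` would normally manage the counters and serve the scrape endpoint.

```python
def render_metrics(queue_depth, processed_total, errors_total):
    """Render pipeline metrics in the Prometheus text exposition format.

    queue_depth     -- dict mapping queue name to current message count
    processed_total -- tasks completed since start (monotonic counter)
    errors_total    -- tasks failed since start (monotonic counter)
    """
    lines = [
        "# HELP etl_queue_depth Messages waiting in each RabbitMQ queue.",
        "# TYPE etl_queue_depth gauge",
    ]
    # Gauges can go up or down; one sample per queue, told apart by label.
    for queue, depth in sorted(queue_depth.items()):
        lines.append(f'etl_queue_depth{{queue="{queue}"}} {depth}')
    # Counters only increase; rates are derived at query time.
    lines += [
        "# HELP etl_processed_total Tasks processed successfully.",
        "# TYPE etl_processed_total counter",
        f"etl_processed_total {processed_total}",
        "# HELP etl_errors_total Tasks that failed processing.",
        "# TYPE etl_errors_total counter",
        f"etl_errors_total {errors_total}",
    ]
    return "\n".join(lines) + "\n"
```

Processing rates are not exported directly: Prometheus derives them from counters at query time, for example with `rate(etl_processed_total[5m])`.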
6. Deployment & Orchestration
- Docker: Containerizes all components.
- Kubernetes + KEDA: Autoscaling, rolling updates, health checks.
- Helm charts: Simplify environment configuration and deployment.
- CI/CD: GitHub Actions for testing and automated deployment.
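Queue-driven autoscaling of the kind KEDA provides comes down to simple arithmetic: the desired replica count is roughly the queue length divided by a per-replica target, rounded up and clamped to configured bounds. The function below is a simplified model of that calculation, not KEDA's implementation, and the target and bounds are hypothetical values.

```python
import math


def desired_replicas(queue_length, target_per_replica, min_replicas=0, max_replicas=10):
    """Approximate the replica count for a queue-length scaling target.

    queue_length       -- messages currently waiting in the queue
    target_per_replica -- messages each worker is expected to handle
    min_replicas       -- lower bound (0 allows scaling to zero when idle)
    max_replicas       -- upper bound protecting the cluster
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_length / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))
```

With a target of 20 messages per worker, a backlog of 45 messages calls for 3 workers, an empty queue scales to the minimum, and a very large backlog is capped at the maximum.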