Kubernetes Deployment

This guide describes deploying the BBC News ETL pipeline in production-like Kubernetes environments, including KEDA autoscaling.


Prerequisites

  • Kubernetes cluster (Minikube, Kind, or cloud provider)
  • Helm >= 3.x
  • kubectl CLI
  • Optional: KEDA for autoscaling (install sketch below)
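
If KEDA is not already present in the cluster, it can be installed from its official Helm chart. A minimal sketch, assuming Helm 3 and cluster-admin access:

helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace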

Architecture

  • Primary Producer → initializes work queue
  • Multiple Producers → scrape articles, deduplicate, publish tasks
  • RabbitMQ → message broker for the Task Queue & DLQ (a hand-wired sketch follows this list)
  • Consumers (ETL Workers) → fetch and process tasks
  • MongoDB / PostgreSQL → data storage
  • Prometheus / Grafana / Loki + Promtail → observability
  • KEDA → scales producers/consumers based on queue length
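
The Task Queue → DLQ wiring can be reproduced by hand for local experimentation using rabbitmqadmin. A minimal sketch; the queue and exchange names (task_queue, dlx, dead_letter_queue) are illustrative, not the pipeline's actual names:

# Dead-letter exchange and the queue it routes into (names are assumptions)
rabbitmqadmin declare exchange name=dlx type=direct
rabbitmqadmin declare queue name=dead_letter_queue durable=true
rabbitmqadmin declare binding source=dlx destination=dead_letter_queue routing_key=task_queue

# Task queue that dead-letters rejected or expired messages to dlx
rabbitmqadmin declare queue name=task_queue durable=true arguments='{"x-dead-letter-exchange": "dlx"}'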

Deployment Steps

  1. Clone the repo:
git clone https://github.com/Rahul-404/bbc_news_etl_pipeline.git
cd bbc_news_etl_pipeline
  2. Install RabbitMQ, MongoDB, PostgreSQL, Prometheus, Grafana, and Loki using the Helm charts:
helm install rabbitmq ./helm/rabbitmq
helm install mongo ./helm/mongodb
helm install postgres ./helm/postgres
helm install prometheus ./helm/prometheus
helm install grafana ./helm/grafana
helm install loki ./helm/loki
  3. Deploy the Primary Producer, Producers, and Consumers (an illustrative consumer manifest follows the commands):
kubectl apply -f k8s/producers/
kubectl apply -f k8s/consumers/
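
The manifests in k8s/consumers/ are the source of truth; for orientation only, a consumer Deployment typically follows the shape of this minimal sketch (image name, labels, and environment variables are illustrative assumptions):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: consumer
spec:
  replicas: 1
  selector:
    matchLabels:
      app: consumer
  template:
    metadata:
      labels:
        app: consumer
    spec:
      containers:
        - name: consumer
          image: bbc-news-etl/consumer:latest  # illustrative image name
          env:
            - name: RABBITMQ_HOST              # illustrative variable
              value: rabbitmq
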
  4. Configure KEDA for horizontal scaling:

     • Producers scale based on Work Queue length.
     • Consumers scale based on Task Queue depth.
     • An example KEDA ScaledObject YAML is included in k8s/keda/; a minimal sketch follows below.
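
For reference, a minimal ScaledObject for the consumers might look like the sketch below; the Deployment name, queue name, and threshold are illustrative, and the authoritative manifests live in k8s/keda/:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: consumer-scaler
spec:
  scaleTargetRef:
    name: consumer               # illustrative Deployment name
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: rabbitmq
      metadata:
        queueName: task_queue    # illustrative queue name
        mode: QueueLength
        value: "20"              # target messages per replica
      authenticationRef:
        name: rabbitmq-auth      # TriggerAuthentication holding the RabbitMQ connection string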

  5. Verify Pods and Services:

kubectl get pods
kubectl get svc
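
A pod listing alone does not prove the workers are healthy; rollout status and recent logs do. A sketch, assuming a Deployment named consumer:

kubectl rollout status deployment/consumer
kubectl logs deployment/consumer --tail=50
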
  6. Access the dashboards (port-forward sketch below):

     • Grafana: http://<node-ip>:3000
     • Prometheus: http://<node-ip>:9090
     • Loki: http://<node-ip>:3100
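
On clusters without an external load balancer (e.g. Minikube or Kind), kubectl port-forward exposes the dashboards on localhost. The service names below assume the Helm release names from step 2; the actual names and ports may differ depending on each chart's values.yaml:

kubectl port-forward svc/grafana 3000:3000
kubectl port-forward svc/prometheus 9090:9090
kubectl port-forward svc/loki 3100:3100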

Notes

  • Each Producer pod contains its own Selenium driver for scraping.
  • DLQ handling is automatic: failed ETL messages remain in RabbitMQ for manual inspection (see the inspection sketch below).
  • Logging (Loki + Promtail) and metrics (Prometheus, visualized in Grafana) are integrated across the pipeline components.
  • Helm charts allow environment-specific configuration via values.yaml (override sketch below).
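
To inspect dead-lettered messages, queue depths can be listed from inside the RabbitMQ pod; the pod name rabbitmq-0 is an assumption based on a typical StatefulSet deployment:

kubectl exec -it rabbitmq-0 -- rabbitmqctl list_queues name messages

For environment-specific configuration, pass an override file at install time; values-prod.yaml is a hypothetical example, not a file shipped with the repo:

helm upgrade --install rabbitmq ./helm/rabbitmq -f values-prod.yaml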