FAQ
This section answers common questions and troubleshooting scenarios for the BBC News ETL pipeline.
General
Q1. What is the purpose of this pipeline?
A: To provide a production-grade, scalable ETL pipeline that scrapes BBC News articles, processes them into structured datasets, and stores them for analytics and downstream applications.
Q2. Can I run this pipeline locally without Kubernetes?
A: Yes. You can use the Docker Compose setup (docker-compose.yml) for local testing and development. Kubernetes is only required for production-grade deployments.
Producers (Scrapers)
Q3. Why do Producers need Selenium?
A: Some BBC News pages use JavaScript rendering. Each Producer pod runs with its own Selenium container (e.g., selenium/standalone-chrome) to scrape such dynamic pages.
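As a rough sketch (the service name selenium, the port, and the target URL below are illustrative assumptions, not the project's actual configuration), a Producer could reach its Selenium sidecar through Selenium's Remote WebDriver:

```python
# Minimal sketch of a Producer talking to its Selenium sidecar container.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")

# selenium/standalone-chrome exposes the WebDriver endpoint on port 4444.
driver = webdriver.Remote(
    command_executor="http://selenium:4444/wd/hub",
    options=options,
)
try:
    driver.get("https://www.bbc.com/news")
    rendered_html = driver.page_source  # HTML after JavaScript has executed
finally:
    driver.quit()
```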
Q4. How does the pipeline avoid duplicate articles?
A: Before publishing messages, Producers check MongoDB for existing records. If articles already exist or counts meet a statistical threshold, they are skipped.
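A minimal sketch of such a check, assuming hypothetical database, collection, and field names rather than the project's actual schema:

```python
# Hedged sketch of a duplicate check before publishing.
from pymongo import MongoClient

client = MongoClient("mongodb://mongo:27017")
collection = client["bbc_news"]["raw_articles"]

def should_publish(article_url: str) -> bool:
    """Return True only if no record with this URL exists yet."""
    return collection.count_documents({"url": article_url}, limit=1) == 0
```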
Q5. How are scraping rates controlled?
A: Scraping frequency is configurable via environment variables. Producers implement rate limiting, retries, and exponential backoff to prevent IP blocking.
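One way this could look in Python; the environment-variable names and default delays here are assumptions for illustration:

```python
# Illustrative retry wrapper with exponential backoff and jitter.
import os
import random
import time

import requests

MAX_RETRIES = int(os.getenv("SCRAPER_MAX_RETRIES", "5"))
BASE_DELAY = float(os.getenv("SCRAPER_BASE_DELAY", "1.0"))

def fetch_with_backoff(url: str) -> str:
    for attempt in range(MAX_RETRIES):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            if attempt == MAX_RETRIES - 1:
                raise
            # Exponential backoff with jitter to avoid hammering the site.
            time.sleep(BASE_DELAY * 2 ** attempt + random.uniform(0, 1))
```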
RabbitMQ
Q6. What happens if RabbitMQ crashes?
A: All queues are declared durable and messages are published as persistent, so they survive a RabbitMQ restart.
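For illustration, a durable queue declaration and persistent publish with pika might look like this (the queue name articles is an assumption):

```python
# Sketch of durable queue declaration and persistent publishing with pika.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()

# durable=True makes the queue definition survive a broker restart.
channel.queue_declare(queue="articles", durable=True)

# delivery_mode=2 marks the message itself as persistent.
channel.basic_publish(
    exchange="",
    routing_key="articles",
    body='{"url": "https://www.bbc.com/news/example"}',
    properties=pika.BasicProperties(delivery_mode=2),
)
connection.close()
```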
Q7. What is the Dead Letter Queue (DLQ)?
A: Messages that fail after multiple retries are redirected to the DLQ. These must be inspected and retried manually to ensure no data loss.
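A hedged sketch of how a main queue could be wired to a DLQ through a dead-letter exchange; the exchange and queue names are illustrative, not necessarily the project's:

```python
# Dead-letter wiring: rejected messages from "articles" land in "articles.dlq".
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()

# The DLQ itself and the exchange that feeds it.
channel.exchange_declare(exchange="dlx", exchange_type="direct", durable=True)
channel.queue_declare(queue="articles.dlq", durable=True)
channel.queue_bind(queue="articles.dlq", exchange="dlx", routing_key="articles")

# The main queue routes rejected messages to the dead-letter exchange.
channel.queue_declare(
    queue="articles",
    durable=True,
    arguments={
        "x-dead-letter-exchange": "dlx",
        "x-dead-letter-routing-key": "articles",
    },
)
```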
Consumers (ETL Workers)
Q8. What if a Consumer fails while processing a message?
A: RabbitMQ will requeue the message unless it repeatedly fails, in which case it is sent to the DLQ.
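A sketch of the ack/nack flow in a pika consumer, with a hypothetical process_article helper standing in for the real ETL step:

```python
# Illustrative consumer callback: ack on success, dead-letter on failure.
import json

import pika

def process_article(article: dict) -> None:
    """Placeholder for the real transform-and-load logic."""
    print("processing", article.get("url"))

def on_message(channel, method, properties, body):
    try:
        process_article(json.loads(body))
        channel.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # requeue=False routes the message to the DLQ configured on the queue;
        # requeue=True would put it straight back for another attempt.
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue="articles", on_message_callback=on_message)
channel.start_consuming()
```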
Q9. Why are both MongoDB and PostgreSQL used?
A:
- MongoDB stores raw HTML and unstructured data (data lake).
- PostgreSQL stores cleaned, analytics-ready datasets (data warehouse); a rough dual-write sketch follows below.
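A rough dual-write sketch, assuming illustrative connection strings, table, and column names rather than the project's actual schema:

```python
# Raw payload goes to MongoDB (data lake), cleaned row to PostgreSQL (warehouse).
import psycopg2
from pymongo import MongoClient

def store_article(raw_html: str, cleaned: dict) -> None:
    # Data lake: keep the raw, unstructured document.
    mongo = MongoClient("mongodb://mongo:27017")
    mongo["bbc_news"]["raw_articles"].insert_one(
        {"url": cleaned["url"], "html": raw_html}
    )

    # Data warehouse: keep only the cleaned, analytics-ready fields.
    with psycopg2.connect("postgresql://etl:etl@postgres:5432/news") as conn:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO articles (url, title, published_at) VALUES (%s, %s, %s)",
                (cleaned["url"], cleaned["title"], cleaned["published_at"]),
            )
```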
Q10. Can I scale Consumers independently?
A: Yes. RabbitMQ allows multiple Consumers to process messages in parallel. With Kubernetes + KEDA, Consumers auto-scale based on queue length.
Observability
Q11. Where can I see system health and metrics?
A:
- Grafana: dashboards for queue depth, processing rate, and error counts.
- Prometheus: raw metrics scraped from Producers, Consumers, and RabbitMQ; a sketch of exposing these metrics follows below.
- Loki: centralized logs (with Promtail as the agent).
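As an illustration, a worker could expose metrics for Prometheus to scrape with prometheus_client; the metric names and port below are assumptions, not the project's actual ones:

```python
# Exposes an HTTP /metrics endpoint that Prometheus can scrape.
import time

from prometheus_client import Counter, Gauge, start_http_server

ARTICLES_PROCESSED = Counter(
    "articles_processed_total", "Articles successfully processed"
)
PROCESSING_ERRORS = Counter("processing_errors_total", "Failed processing attempts")
QUEUE_DEPTH = Gauge("work_queue_depth", "Messages waiting in the work queue")

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes http://<pod>:8000/metrics
    while True:
        time.sleep(5)  # a real worker would update the metrics as it processes
```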
Q12. Why am I not seeing logs in Grafana Loki?
A: Check that:
- Promtail agents are running.
- Log paths are correctly mounted.
- The Loki service is reachable from Promtail (see the reachability check below).
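A quick reachability check, assuming Loki runs as a service named loki on its default HTTP port 3100:

```python
# Probe Loki's readiness endpoint from inside the cluster network.
import requests

try:
    response = requests.get("http://loki:3100/ready", timeout=5)
    print("Loki reachable:", response.status_code == 200)
except requests.RequestException as exc:
    print("Loki not reachable:", exc)
```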
Deployment
Q13. How are Producers scaled automatically?
A: The primary Producer creates a work queue of dates. Based on queue length, KEDA scales additional Producers to handle the load.
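A sketch of how the primary Producer might seed that work queue, one message per date; the queue name and date range are assumptions for illustration:

```python
# Seed the work queue with one message per date to scrape.
import json
from datetime import date, timedelta

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
channel = connection.channel()
channel.queue_declare(queue="scrape_dates", durable=True)

current, end = date(2024, 1, 1), date(2024, 1, 31)
while current <= end:
    channel.basic_publish(
        exchange="",
        routing_key="scrape_dates",
        body=json.dumps({"date": current.isoformat()}),
        properties=pika.BasicProperties(delivery_mode=2),
    )
    current += timedelta(days=1)

connection.close()
# KEDA watches the depth of this queue and scales Producer replicas accordingly.
```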
Q14. How can I configure environment variables for different environments (dev, staging, prod)?
A: Use Helm values.yaml files for Kubernetes, or .env files with Docker Compose.
Q15. How do I roll back a failed deployment?
A:
- With Helm: helm rollback <release-name> <revision>
- With Docker Compose: revert to the last known working image tag.
Infrastructure
Q16. Do I need cloud resources to run this?
A: No. The pipeline can run locally with Docker Compose or Kubernetes (Kind/Minikube). Cloud infrastructure is optional but recommended for production.
Q17. How is infrastructure provisioned?
A: Using Terraform for declarative resource management (e.g., Kubernetes clusters, databases, networking).
CI/CD
Q18. How is documentation deployed automatically?
A: A GitHub Actions workflow builds the MkDocs site and deploys it to GitHub Pages whenever changes are pushed to main.
Q19. How are Docker images versioned?
A: Each Docker image is tagged with the Git commit SHA for traceability and reproducibility.
Q20. What if my CI pipeline fails on pre-commit checks?
A: Run pre-commit run --all-files locally to fix formatting and linting issues before pushing your changes.