
FAQ

This section answers common questions and covers typical troubleshooting scenarios for the BBC News ETL pipeline.


General

Q1. What is the purpose of this pipeline? A: To provide a production-grade, scalable ETL pipeline that scrapes BBC News articles, processes them into structured datasets, and stores them for analytics and downstream applications.

Q2. Can I run this pipeline locally without Kubernetes? A: Yes. You can use the Docker Compose setup (docker-compose.yml) for local testing and development. Kubernetes is only required for production-grade deployments.


Producers (Scrapers)

Q3. Why do Producers need Selenium? A: Some BBC News pages use JavaScript rendering. Each Producer pod runs with its own Selenium container (e.g., selenium/standalone-chrome) to scrape such dynamic pages.
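
A minimal sketch of the rendering step, assuming the Producers use Selenium's Python bindings with a Remote WebDriver; the SELENIUM_URL variable and the example page are illustrative, not the pipeline's actual configuration:

```python
# Connect to a selenium/standalone-chrome sidecar and let it render the page.
import os

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

selenium_url = os.getenv("SELENIUM_URL", "http://localhost:4444/wd/hub")  # assumed variable name

options = Options()
options.add_argument("--headless=new")

driver = webdriver.Remote(command_executor=selenium_url, options=options)
try:
    driver.get("https://www.bbc.com/news")   # JavaScript executes inside the Chrome container
    rendered_html = driver.page_source       # fully rendered DOM, ready for parsing
finally:
    driver.quit()
```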

Q4. How does the pipeline avoid duplicate articles? A: Before publishing messages, Producers check MongoDB for existing records. An article is skipped if it is already stored, or if the existing record count already meets the configured threshold.
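
A minimal sketch of such a check with pymongo; the database, collection, and field names (and the exact threshold logic) are illustrative assumptions, not the pipeline's real schema:

```python
# Skip URLs already scraped, or dates that have already reached the threshold.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")        # assumed connection string
articles = client["bbc_news"]["raw_articles"]            # assumed database/collection names

def should_publish(url: str, published_date: str, threshold: int = 100) -> bool:
    """Return True only if the article is new and the date is not yet saturated."""
    if articles.count_documents({"url": url}, limit=1):
        return False                                     # article already stored
    if articles.count_documents({"published_date": published_date}) >= threshold:
        return False                                     # enough articles for this date
    return True
```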

Q5. How are scraping rates controlled? A: Scraping frequency is configurable via environment variables. Producers implement rate limiting, retries, and exponential backoff to prevent IP blocking.
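
A minimal sketch of this pattern in Python; the environment variable names and the requests-based fetch are illustrative assumptions, not the pipeline's actual configuration:

```python
# Env-driven rate limiting with retries and exponential backoff.
import os
import time

import requests

SCRAPE_DELAY = float(os.getenv("SCRAPE_DELAY_SECONDS", "1.0"))   # assumed variable name
MAX_RETRIES = int(os.getenv("MAX_RETRIES", "5"))                 # assumed variable name

def fetch(url: str) -> str:
    for attempt in range(MAX_RETRIES):
        time.sleep(SCRAPE_DELAY)                 # baseline delay between requests
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            time.sleep(2 ** attempt)             # exponential backoff: 1s, 2s, 4s, ...
    raise RuntimeError(f"Giving up on {url} after {MAX_RETRIES} attempts")
```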


RabbitMQ

Q6. What happens if RabbitMQ crashes? A: All queues are declared durable and messages are published with persistent delivery mode, so both the queues and their messages survive a RabbitMQ restart.
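
A minimal sketch of a durable, persistent setup with pika; the queue name is illustrative:

```python
# Declare a durable queue and publish a persistent message.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

channel.queue_declare(queue="articles", durable=True)     # queue definition survives restarts

channel.basic_publish(
    exchange="",
    routing_key="articles",
    body=b'{"url": "https://www.bbc.com/news/example"}',
    properties=pika.BasicProperties(delivery_mode=2),     # persistent message, written to disk
)
connection.close()
```

Durable queues preserve the queue definition across restarts; persistent delivery mode (delivery_mode=2) is what keeps the messages themselves on disk.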

Q7. What is the Dead Letter Queue (DLQ)? A: Messages that fail after multiple retries are redirected to the DLQ. These must be inspected and retried manually to ensure no data loss.
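
A minimal sketch of wiring a dead-letter exchange and queue with pika; the exchange and queue names are illustrative, not necessarily the pipeline's:

```python
# The DLQ is just another durable queue bound to a dedicated dead-letter exchange.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

channel.exchange_declare(exchange="dlx", exchange_type="direct", durable=True)
channel.queue_declare(queue="articles.dlq", durable=True)
channel.queue_bind(queue="articles.dlq", exchange="dlx", routing_key="articles")

# The main queue routes rejected or expired messages to the dead-letter exchange.
channel.queue_declare(
    queue="articles",
    durable=True,
    arguments={"x-dead-letter-exchange": "dlx", "x-dead-letter-routing-key": "articles"},
)
connection.close()
```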


Consumers (ETL Workers)

Q8. What if a Consumer fails while processing a message? A: RabbitMQ will requeue the message unless it repeatedly fails, in which case it is sent to the DLQ.
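
A minimal sketch of a Consumer callback that acks on success and rejects on failure; process_article is a hypothetical placeholder for the real ETL step:

```python
# Ack successful messages; nack failures so RabbitMQ can retry or dead-letter them.
import pika

def process_article(body: bytes) -> None:
    """Hypothetical placeholder for the real transform/load step."""
    print(body)

def on_message(channel, method, properties, body):
    try:
        process_article(body)
        channel.basic_ack(delivery_tag=method.delivery_tag)
    except Exception:
        # requeue=True puts the message back on the queue for another attempt;
        # requeue=False (with a dead-letter exchange configured, see Q7) sends it to the DLQ.
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=False)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_consume(queue="articles", on_message_callback=on_message)
channel.start_consuming()
```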

Q9. Why are both MongoDB and PostgreSQL used? A:

  • MongoDB stores raw HTML and unstructured data (data lake).
  • PostgreSQL stores cleaned, analytics-ready datasets (data warehouse).
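
A minimal sketch of this two-store write path, assuming pymongo and psycopg2; the database, collection, and table names are illustrative, not the pipeline's schema:

```python
# Raw payload goes to MongoDB (data lake); cleaned columns go to PostgreSQL (data warehouse).
import psycopg2
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")
pg = psycopg2.connect("dbname=warehouse user=etl password=etl host=localhost")

def store(article: dict) -> None:
    # Data lake: keep the raw, unstructured payload (including HTML) as-is.
    mongo["bbc_news"]["raw_articles"].insert_one(article)

    # Data warehouse: insert only the cleaned, analytics-ready columns.
    with pg, pg.cursor() as cur:
        cur.execute(
            "INSERT INTO articles (url, title, published_at) VALUES (%s, %s, %s)",
            (article["url"], article["title"], article["published_at"]),
        )
```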

Q10. Can I scale Consumers independently? A: Yes. RabbitMQ allows multiple Consumers to process messages in parallel. With Kubernetes + KEDA, Consumers auto-scale based on queue length.
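
On the code side, a common companion to this setup is a per-consumer prefetch limit, which keeps messages evenly distributed across replicas. A minimal sketch with pika; the autoscaling itself lives in the KEDA/Kubernetes manifests, not in Python:

```python
# Each Consumer replica limits itself to one unacknowledged message at a time,
# so RabbitMQ round-robins work fairly across however many replicas are running.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_qos(prefetch_count=1)   # fair dispatch: no replica hoards messages
```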


Observability

Q11. Where can I see system health and metrics? A:

  • Grafana: dashboards for queue depth, processing rate, error counts.
  • Prometheus: raw metrics scraped from Producers, Consumers, and RabbitMQ.
  • Loki: centralized logs (with Promtail as the agent).
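
A minimal sketch of how a Producer or Consumer could expose metrics for Prometheus to scrape, using prometheus_client; the metric names and port are illustrative assumptions:

```python
# Expose a /metrics endpoint that Prometheus scrapes on each pod.
import time

from prometheus_client import Counter, start_http_server

ARTICLES_PROCESSED = Counter("articles_processed_total", "Articles successfully processed")
PROCESSING_ERRORS = Counter("processing_errors_total", "Articles that failed processing")

start_http_server(8000)          # Prometheus scrapes http://<pod>:8000/metrics

while True:
    # ... do the real work, then record the outcome:
    ARTICLES_PROCESSED.inc()
    time.sleep(1)
```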

Q12. Why am I not seeing logs in Grafana Loki? A: Check that:

  • Promtail agents are running.
  • Log paths are correctly mounted.
  • Loki service is reachable from Promtail.

Deployment

Q13. How are Producers scaled automatically? A: The primary Producer creates a work queue of dates. Based on queue length, KEDA scales additional Producers to handle the load.
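
A minimal sketch of the primary Producer filling that work queue, one message per date, with pika; the queue name and date range are illustrative assumptions:

```python
# Publish one durable message per date; KEDA scales Producers on this queue's depth.
from datetime import date, timedelta

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="scrape_dates", durable=True)   # assumed queue name

current, end = date(2024, 1, 1), date(2024, 1, 31)
while current <= end:
    channel.basic_publish(
        exchange="",
        routing_key="scrape_dates",
        body=current.isoformat().encode(),
        properties=pika.BasicProperties(delivery_mode=2),
    )
    current += timedelta(days=1)

connection.close()
```

KEDA's RabbitMQ scaler then watches the depth of this queue and adds or removes Producer replicas accordingly.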

Q14. How can I configure environment variables for different environments (dev, staging, prod)? A: Use Helm values.yaml files or .env files in Docker Compose.

Q15. How do I rollback a failed deployment? A:

  • With Helm: helm rollback <release-name> <revision>
  • With Docker Compose: revert to the last known working image tag.

Infrastructure

Q16. Do I need cloud resources to run this? A: No. The pipeline can run locally with Docker Compose or Kubernetes (Kind/Minikube). Cloud infrastructure is optional but recommended for production.

Q17. How is infrastructure provisioned? A: Using Terraform for declarative resource management (e.g., Kubernetes clusters, databases, networking).


CI/CD

Q18. How is documentation deployed automatically? A: A GitHub Actions workflow builds the MkDocs site and deploys it to GitHub Pages whenever changes are pushed to main.

Q19. How are Docker images versioned? A: Each Docker image is tagged with the Git commit SHA for traceability and reproducibility.

Q20. What if my CI pipeline fails on pre-commit checks? A: Run pre-commit run --all-files locally to fix formatting and linting before pushing your changes.