Configuration

The BBC News ETL Pipeline is configured through environment variables, configuration files, and Helm values. This page documents the key configuration options for producers, consumers, message queues, storage, and observability.

1. Environment Variables

Environment variables let you change behavior without modifying code. The key variables are:

Producer / Scraper

Variable	Description	Default
`START_DATE`	Start date for scraping (YYYY-MM-DD)	`2025-01-01`
`CURRENT_DATE`	Current date for work queue generation	System date
`SCRAPE_INTERVAL`	Interval between scraping tasks (seconds)	`10`
`SELENIUM_URL`	URL to connect to Selenium WebDriver	`http://selenium:4444/wd/hub`
`RABBITMQ_HOST`	RabbitMQ hostname	`rabbitmq`
`RABBITMQ_QUEUE`	Task queue name	`task_queue`
`RABBITMQ_DLQ`	Dead Letter Queue name	`dlq_queue`
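
As an illustration of how a producer might read the variables above, here is a minimal Python sketch. The `ProducerConfig` class and `load_producer_config` function are hypothetical names, not part of the pipeline's code; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ProducerConfig:
    """Hypothetical container for the producer settings in the table above."""
    start_date: date
    current_date: date
    scrape_interval: int
    selenium_url: str
    rabbitmq_host: str
    rabbitmq_queue: str
    rabbitmq_dlq: str


def load_producer_config(env=None):
    """Read producer settings from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    current = env.get("CURRENT_DATE")
    return ProducerConfig(
        start_date=date.fromisoformat(env.get("START_DATE", "2025-01-01")),
        # Fall back to the system date when CURRENT_DATE is unset.
        current_date=date.fromisoformat(current) if current else date.today(),
        scrape_interval=int(env.get("SCRAPE_INTERVAL", "10")),
        selenium_url=env.get("SELENIUM_URL", "http://selenium:4444/wd/hub"),
        rabbitmq_host=env.get("RABBITMQ_HOST", "rabbitmq"),
        rabbitmq_queue=env.get("RABBITMQ_QUEUE", "task_queue"),
        rabbitmq_dlq=env.get("RABBITMQ_DLQ", "dlq_queue"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests, and parsing the dates and interval up front surfaces malformed values at startup rather than mid-scrape.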

Consumer / ETL Worker

Variable	Description	Default
`RABBITMQ_HOST`	RabbitMQ hostname	`rabbitmq`
`TASK_QUEUE`	Name of task queue to consume	`task_queue`
`DLQ_QUEUE`	Name of DLQ	`dlq_queue`
`MONGO_URI`	MongoDB connection string	`mongodb://mongo:27017`
`POSTGRES_URI`	PostgreSQL connection string	`postgresql://postgres:5432/news`
`MAX_RETRIES`	Number of retries before sending to DLQ	`3`
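
To make the interaction between `MAX_RETRIES`, `TASK_QUEUE`, and `DLQ_QUEUE` concrete, the sketch below shows one way a worker could decide where a failed message goes next. The `next_queue` function is hypothetical, and how the retry count is tracked (for example in a message header) is an assumption this page does not specify.

```python
import os


def next_queue(retry_count, env=None):
    """Return the queue a failed message should be published to next.

    `retry_count` is the number of retries already attempted for the message.
    """
    env = os.environ if env is None else env
    max_retries = int(env.get("MAX_RETRIES", "3"))
    if retry_count < max_retries:
        # Retries remain: put the message back on the task queue.
        return env.get("TASK_QUEUE", "task_queue")
    # Retries exhausted: route the message to the dead letter queue.
    return env.get("DLQ_QUEUE", "dlq_queue")
```

With the defaults, a message is retried three times and moves to `dlq_queue` on the failure that follows the third retry.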

Observability

Variable	Description	Default
`PROMETHEUS_ENDPOINT`	HTTP path where Prometheus metrics are exposed	`/metrics`
`LOKI_URL`	Loki endpoint for logs	`http://loki:3100/loki/api/v1/push`
`LOG_LEVEL`	Logging level	`INFO`
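
A component could apply `LOG_LEVEL` roughly as follows. This is a sketch using only Python's standard `logging` module; the logger name `etl` and the fallback to `INFO` for unrecognized values are assumptions, not documented behavior.

```python
import logging
import os

# Level names accepted by this sketch, mapped to stdlib logging levels.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def configure_logging(env=None):
    """Return a logger whose level follows LOG_LEVEL (default INFO)."""
    env = os.environ if env is None else env
    name = env.get("LOG_LEVEL", "INFO").upper()
    # Unknown names fall back to INFO instead of raising.
    level = _LEVELS.get(name, logging.INFO)
    logging.basicConfig(format="%(asctime)s %(levelname)s %(name)s %(message)s")
    logger = logging.getLogger("etl")  # hypothetical logger name
    logger.setLevel(level)
    return logger
```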

2. Configuration Files

Some components support YAML/JSON configuration files for advanced settings:

config/producers.yaml: Configure sections to scrape, concurrency, and retries.
config/consumers.yaml: Define ETL transformation rules, batch size, and retry policies.
config/helm/values.yaml: Customize deployment settings for Kubernetes (resource limits, replicas, KEDA scaling, secrets).
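
One common way to combine file-based settings with built-in defaults is a recursive merge, sketched below. It operates on already-parsed dictionaries (parsing the YAML itself, for example with PyYAML's `yaml.safe_load`, is left out), and the keys `concurrency`, `retries`, `max`, and `backoff_seconds` are illustrative placeholders rather than the pipeline's actual schema.

```python
def deep_merge(base, override):
    """Return a new dict with `override` recursively merged onto `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys: the override wins.
            merged[key] = value
    return merged


# Hypothetical defaults and file contents, for illustration only.
defaults = {"concurrency": 2, "retries": {"max": 3, "backoff_seconds": 5}}
from_file = {"retries": {"max": 5}}
settings = deep_merge(defaults, from_file)
```

Here `settings` keeps `concurrency` and `backoff_seconds` from the defaults while `retries.max` takes the value from the file. The same approach works for layering per-environment overrides on a shared base file.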

3. Helm Chart Configuration

For Kubernetes deployments:

Replicas: Scale producers and consumers based on queue depth using KEDA.
Resources: Set CPU/memory limits and requests per pod.
Secrets: Store database credentials, RabbitMQ credentials, and API keys securely.
Metrics & Logging: Configure Prometheus scraping, Grafana dashboards, and Loki endpoints.

Example snippet from values.yaml:

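# Producer (scraper) settings
producers:
  replicas: 2
  seleniumUrl: http://selenium:4444/wd/hub
  startDate: "2025-01-01"  # quoted so YAML keeps it as a string, not a date

# Consumer (ETL worker) settings
consumers:
  replicas: 2
  maxRetries: 3  # retries before a message goes to the DLQ

# Shared RabbitMQ settings
rabbitmq:
  host: rabbitmq
  taskQueue: task_queue
  dlqQueue: dlq_queue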

4. Notes & Best Practices

Version control: Keep config/*.yaml files under Git for reproducibility.
Environment separation: Use different values for local, staging, and production environments.
Secrets: Never hardcode passwords or API keys; use environment variables or Kubernetes secrets.
Dynamic scaling: Configure KEDA triggers carefully to avoid over/under-provisioning producers and consumers.
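
To reason about the dynamic scaling point, it can help to estimate how many replicas a given queue depth would request. The sketch below approximates queue-length-based autoscaling as a ceiling division clamped to a replica range. It is a simplification: real KEDA behavior also depends on polling intervals, cooldown periods, and stabilization windows, and the function and parameter names are hypothetical.

```python
import math


def desired_replicas(queue_length, target_per_replica, min_replicas, max_replicas):
    """Estimate the replica count for a queue depth.

    `target_per_replica` is the number of queued messages one replica
    is expected to handle before another replica is added.
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_length / target_per_replica)
    # Clamp to the configured range to avoid over- or under-provisioning.
    return max(min_replicas, min(max_replicas, wanted))
```

For example, 95 queued messages with a target of 20 per replica and a range of 1 to 10 gives 5 replicas. A target that is too low pushes the count toward the maximum under light load, while one that is too high lets the queue grow before new consumers start.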