Troubleshooting Runbooks

This section catalogues common operational issues and recovery steps for RAG Loom deployments.

Service Does Not Start

  1. Inspect logs:
    docker compose logs -f rag-service
  2. Verify dependencies:
    • Vector store container is running.
    • LLM provider credentials are valid.
  3. Restart the service:
    docker compose restart rag-service
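The steps above can be wrapped in one small helper. This is a sketch only: it assumes the compose service names rag-service and qdrant (the vector store), which may differ in your docker-compose.yml.

```shell
#!/bin/sh
# Sketch of an automated restart. Service names "rag-service" and
# "qdrant" are assumptions; adjust to match your docker-compose.yml.

# Return 0 if the named compose service is currently running.
service_running() {
  docker compose ps --status running --services 2>/dev/null | grep -qx "$1"
}

restart_rag() {
  if ! service_running qdrant; then
    echo "vector store is down; starting it first" >&2
    docker compose up -d qdrant
  fi
  docker compose restart rag-service
  docker compose logs --tail=50 rag-service
}

# restart_rag
```

Credential checks (step 2) are deliberately left manual, since they depend on which LLM provider is configured.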

Health Check Fails

  • Confirm the service port (default 8000) is free: lsof -i:8000.
  • Validate .env configuration (missing credentials often surface here).
  • For provider outages, switch to a fallback provider or reduce load until the primary recovers.
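A quick way to distinguish a slow start from a real failure is to poll the health endpoint for a while before escalating. The /health path and port 8000 below are assumptions; substitute your deployment's actual endpoint.

```shell
# Poll a health endpoint until it responds or a retry budget runs out.
# The URL default (port 8000, /health path) is an assumption; pass your
# deployment's real endpoint as the first argument.

wait_healthy() {
  url="${1:-http://localhost:8000/health}"
  attempts="${2:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if curl -fsS "$url" >/dev/null 2>&1; then
      echo "healthy"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "unhealthy after $attempts attempts" >&2
  return 1
}

# wait_healthy http://localhost:8000/health 5
```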

Ollama Errors

# Check running models
ollama list

# Restart the service
brew services restart ollama # macOS
# or
docker compose restart ollama

If downloads fail, remove and re-pull the model:

ollama rm mistral:7b
ollama pull mistral:7b
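Since large model downloads fail transiently, the rm/pull pair above can be retried automatically. A sketch, with the retry count and model tag as illustrative choices:

```shell
# Remove and re-pull an Ollama model, retrying the pull a few times
# since large downloads can fail transiently. Sketch only; adjust the
# retry count to taste.

repull_model() {
  model="${1:?usage: repull_model <model>}"
  ollama rm "$model" 2>/dev/null || true   # ignore "model not found"
  for attempt in 1 2 3; do
    if ollama pull "$model"; then
      echo "pulled $model"
      return 0
    fi
    echo "pull failed (attempt $attempt), retrying" >&2
  done
  return 1
}

# repull_model mistral:7b
```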

High Memory Usage

  • Monitor container usage: docker stats.
  • Switch to a smaller Ollama model (e.g. OLLAMA_MODEL=mistral:7b).
  • Tune chunk sizes in ingestion to minimise embedding footprint.
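The tuning knobs above might look like this as .env overrides. The chunking variable names (CHUNK_SIZE, CHUNK_OVERLAP) are illustrative, not confirmed keys; check your ingestion pipeline's configuration for the real ones.

```shell
# Example .env overrides for a smaller memory footprint.
# CHUNK_SIZE / CHUNK_OVERLAP are hypothetical names; verify against
# your ingestion config before relying on them.

export OLLAMA_MODEL=mistral:7b   # smaller model, as suggested above
export CHUNK_SIZE=512            # fewer tokens per chunk, smaller embedding batches
export CHUNK_OVERLAP=64          # keep some context across chunk boundaries
```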

Port Conflicts

# Identify processes bound to the stack's default ports
lsof -i:8000   # rag-service API
lsof -i:6333   # vector store (Qdrant default)
lsof -i:6379   # cache (Redis default)
# Forcefully free a port (kill -9 skips graceful shutdown)
sudo lsof -ti:8000 | xargs sudo kill -9

Adjust exposed ports in docker-compose.yml if conflicts persist.
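The per-port checks above can be looped into one report. A minimal sketch, relying on the same lsof invocation used in this runbook:

```shell
# Report which of the stack's default ports are already taken.
# Pass the ports to check as arguments.

check_ports() {
  for port in "$@"; do
    if lsof -i:"$port" >/dev/null 2>&1; then
      echo "port $port: in use"
    else
      echo "port $port: free"
    fi
  done
}

# check_ports 8000 6333 6379
```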

Slow Responses

  • Review dashboards for latency spikes using the exported Prometheus metrics and any custom Grafana boards bundled with your deployment.
  • Increase worker processes (WORKER_PROCESSES) or scale horizontally.
  • Check vector store load; upgrade storage or increase cache size.
  • Analyse LLM provider throughput—consider queueing or request shaping.
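Before tuning, it helps to measure latency from the client side. A sketch using curl's timing output; the /query endpoint and JSON payload are assumptions, so point this at a real route in your deployment.

```shell
# Measure end-to-end latency of a sample request with curl's built-in
# write-out timer. The /query route and payload below are illustrative
# assumptions, not confirmed API details.

time_request() {
  curl -s -o /dev/null -w '%{time_total}\n' \
    -X POST "http://localhost:8000/query" \
    -H 'Content-Type: application/json' \
    -d '{"question": "ping"}'
}

# time_request   # prints total request time in seconds
```

Comparing this client-side number against the Prometheus-reported server latency helps separate network overhead from processing time.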

Resetting the Stack

For full redeployments (note: down -v also deletes named volumes, including any stored vectors):

docker compose down -v
./start_production.sh

Rebuild images after major code changes:

docker compose build --no-cache rag-service
docker compose up -d rag-service
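Because the reset above is destructive, a guarded wrapper that demands an explicit flag can prevent accidents. A sketch; start_production.sh is the stack's own startup script from above.

```shell
# Guarded wrapper around the destructive reset: refuses to run unless
# an explicit --yes flag is passed, so the volume wipe cannot happen
# by accident.

reset_stack() {
  if [ "${1:-}" != "--yes" ]; then
    echo "refusing to wipe volumes; run: reset_stack --yes" >&2
    return 1
  fi
  docker compose down -v     # removes containers AND volumes
  ./start_production.sh
}

# reset_stack --yes
```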

Support Checkpoints

  • Logs: docker compose logs -f
  • Metrics: Grafana dashboards and Prometheus alerts
  • Tests: pytest test_service.py
  • Backups: Ensure recent snapshots exist before invasive changes

Combine this runbook with your organisation’s incident management playbook for complete coverage.