
Scaling & Performance

RAG Loom scales horizontally and vertically depending on workload characteristics. Use this guidance to size resources and tune performance.

Horizontal Scaling

Scale the FastAPI service within the docker-compose stack:

```bash
docker compose up -d --scale rag-service=3
```

Behind a load balancer (e.g., Nginx, Traefik), increase replica counts gradually while monitoring latency and error rates. Ensure the vector store can handle the increased concurrency; stateful backends such as Qdrant may require their own scaling strategy.
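As a sketch, Traefik can front the scaled replicas directly from the compose stack: its Docker provider discovers every rag-service container and load-balances across them. The service port (8000), router rule, and Traefik version below are assumptions to adapt to your setup:

```yaml
services:
  traefik:
    image: traefik:v2.11
    command:
      - "--providers.docker=true"
      - "--providers.docker.exposedbydefault=false"
      - "--entrypoints.web.address=:80"
    ports:
      - "80:80"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro

  rag-service:
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.rag.rule=PathPrefix(`/`)"
      - "traefik.http.routers.rag.entrypoints=web"
      - "traefik.http.services.rag.loadbalancer.server.port=8000"
```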

Worker Configuration

Gunicorn workers are controlled via environment variables:

| Variable | Description | Recommendation |
| --- | --- | --- |
| `WORKER_PROCESSES` | Number of worker processes | Start with CPU cores × 2 |
| `MAX_CONCURRENT_REQUESTS` | Concurrent request cap | Tune to protect downstream providers |
| `REQUEST_TIMEOUT` | Timeout per request (seconds) | Increase for long-running generation |
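For example, these could be set in the compose file's environment block; the values below are illustrative starting points rather than tuned defaults:

```yaml
services:
  rag-service:
    environment:
      WORKER_PROCESSES: "8"          # e.g. 4 CPU cores x 2
      MAX_CONCURRENT_REQUESTS: "32"  # cap in-flight requests to protect providers
      REQUEST_TIMEOUT: "120"         # seconds; raise for long-running generation
```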

Vector Store Optimisation

  • Qdrant: Configure collection parameters (e.g., hnsw_config) for recall vs. latency trade-offs; enable snapshots for durability (see the sketch after this list).
  • Redis: Ensure Redis is compiled with RedisJSON/RedisSearch for vector support; allocate enough memory and persistence policy.
  • Chroma: Suitable for local development; for production consider migrating to Qdrant or Redis.
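A minimal sketch of tuning hnsw_config when creating a Qdrant collection with the qdrant-client Python package; the collection name, vector size, and parameter values are assumptions to adapt to your embedding model and recall targets:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Larger m / ef_construct generally improves recall at the cost of memory and
# index build time; smaller values favour latency.
client.create_collection(
    collection_name="documents",  # hypothetical collection name
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=32, ef_construct=256),
)
```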

LLM Provider Considerations

| Provider | Notes |
| --- | --- |
| Ollama | Local inference; ensure ample RAM and SSD storage. Set `OLLAMA_NUM_PARALLEL` to control concurrency. |
| OpenAI | Network-latency bound; implement retry and backoff. Cache embeddings when possible. |
| Cohere | Similar to OpenAI; monitor rate limits. |
| Hugging Face | Use inference endpoints or host models yourself; plan for GPU utilisation. |
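As suggested for OpenAI above, a retry-with-backoff wrapper plus a simple embedding cache might look like the sketch below. It uses the tenacity and openai packages; the model name and in-memory cache are assumptions for illustration, not RAG Loom defaults:

```python
import hashlib

from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

client = OpenAI()  # reads OPENAI_API_KEY from the environment
_cache: dict[str, list[float]] = {}  # in-memory cache; swap for Redis in production


@retry(wait=wait_exponential(min=1, max=30), stop=stop_after_attempt(5))
def _embed(text: str) -> list[float]:
    # Retried with exponential backoff on transient failures and rate limits.
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding


def embed_cached(text: str) -> list[float]:
    # Avoid re-embedding identical chunks to cut token usage and latency.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = _embed(text)
    return _cache[key]
```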

Performance KPIs

| Metric | Target | Notes |
| --- | --- | --- |
| P95 latency | < 2 seconds for search, < 5 seconds for generate | Monitor separately by endpoint |
| Error rate | < 1% | Break down by provider |
| Document throughput | 200+ documents/hour | Depends on chunking strategy |
| Token usage | Track per provider | Optimise prompt template and chunk size |

Load Testing

Use tools such as Locust or k6 to simulate traffic:

```bash
locust -f tests/load/locustfile.py --host http://localhost:8000
```
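A minimal locustfile sketch for a mixed search/generate workload might look like this; the endpoint paths and payloads are assumptions and should be aligned with the actual API schema:

```python
from locust import HttpUser, between, task


class RagUser(HttpUser):
    wait_time = between(1, 3)  # think time between requests

    @task(3)
    def search(self):
        # Search-heavy mix: runs 3x as often as generation.
        self.client.post("/search", json={"query": "vector database tuning", "top_k": 5})

    @task(1)
    def generate(self):
        self.client.post("/generate", json={"query": "Summarise the scaling guidance"})
```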

Focus on:

  • Mixed workloads (ingest + search + generate).
  • Warm and cold cache scenarios.
  • Provider failover behaviour.

Resource Tuning

Docker Compose snippets for resource limits:

```yaml
services:
  rag-service:
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: "2.0"
        reservations:
          memory: 1G
  ollama:
    deploy:
      resources:
        limits:
          memory: 32G
```

Adjust based on monitoring data—especially for Ollama, where model size directly impacts RAM requirements.

Next Steps

After tuning for performance, establish runbooks in Troubleshooting and review Security Hardening before going live.