Ollama Integration

Ollama enables high-quality language models to run locally without third-party API calls. This guide explains how to install Ollama, connect it to RAG Loom, and optimise performance.

When to Use Ollama

  • Offline inference or strict data residency requirements.
  • Avoiding per-token API charges from hosted providers.
  • Rapid experimentation with community-maintained models before promoting to production.

If you prefer hosted providers, configure the relevant environment variables for OpenAI, Cohere, or Hugging Face instead.
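
For comparison, a hosted setup usually only swaps the provider block in .env. The variable names below are illustrative assumptions rather than confirmed RAG Loom keys; consult the relevant provider guide for the exact settings.

# Hypothetical hosted-provider settings (names are assumptions)
LLM_PROVIDER=openai
OPENAI_API_KEY=your-api-key
OPENAI_MODEL=gpt-4o-mini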

System Requirements

Tier         CPU        Memory   Storage      Notes
Development  4 cores    16 GB    20 GB        Suitable for 7B models
Staging      8 cores    32 GB    50 GB        Recommended for 13B models
Production   16 cores   64 GB    100 GB SSD   Supports 34B+ models; consider dedicated hardware

Apple Silicon (M1/M2) or GPU-backed Linux servers deliver the best throughput.

Installation

macOS (Homebrew)

brew install ollama
brew services start ollama
ollama --version

Linux

curl -fsSL https://ollama.ai/install.sh | sh
ollama serve &   # only needed if the installer did not register a background service

Docker

docker run -d --name ollama \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  ollama/ollama:latest

Model Management

# List installed models
ollama list

# Download recommended options
ollama pull mistral:7b
ollama pull llama2:13b

# Remove unused models
ollama rm llama2:70b

Create custom variants with a Modelfile:

cat <<'FILE' > Modelfile
FROM mistral:7b
PARAMETER temperature 0.6
PARAMETER top_p 0.9
SYSTEM "You are a retrieval-augmented assistant. Cite sources when available."
FILE

ollama create rag-assistant -f Modelfile
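
To sanity-check the variant before wiring it into RAG Loom, you can inspect the stored Modelfile and run a quick prompt:

# Confirm the parameters and system prompt were applied
ollama show rag-assistant --modelfile

# Smoke-test the custom variant
ollama run rag-assistant "Which sources should you cite?"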

Configuring RAG Loom

Update .env to point to the Ollama runtime:

LLM_PROVIDER=ollama
OLLAMA_BASE_URL=http://localhost:11434
OLLAMA_MODEL=rag-assistant
VECTOR_STORE_TYPE=chroma

Within Docker Compose, ensure the services can communicate:

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped

  rag-service:
    build: .
    environment:
      - LLM_PROVIDER=ollama
      - OLLAMA_BASE_URL=http://ollama:11434
      - OLLAMA_MODEL=rag-assistant
    depends_on:
      - ollama

# Declare the named volume so Compose can create it
volumes:
  ollama_data:
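
A minimal way to bring the stack up and confirm both containers are running is shown below. Note that models pulled or created on the host are stored outside the ollama_data volume, so repeat ollama pull (and ollama create, with the Modelfile copied into the container) for this deployment.

docker compose up -d
docker compose ps

# Models live in the container's volume, so pull them there
docker compose exec ollama ollama pull mistral:7b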

Install the Python client only if your application code calls Ollama directly:

pip install ollama
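
The snippet below is a minimal sketch of direct client usage, assuming the rag-assistant variant created earlier and the default base URL; RAG Loom itself does not need this when it talks to Ollama over HTTP.

# Sketch: call the Ollama runtime directly via the official Python client
from ollama import Client

# Point the client at the same runtime RAG Loom uses
client = Client(host="http://localhost:11434")

response = client.chat(
    model="rag-assistant",  # custom variant created above; any pulled model works
    messages=[{"role": "user", "content": "Summarise RAG Loom in two sentences."}],
)

print(response["message"]["content"])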

Verification

# Confirm Ollama responds
ollama run mistral:7b "Summarise RAG Loom in two sentences."

# Validate the FastAPI integration
curl http://localhost:8000/health
curl -X POST "http://localhost:8000/api/v1/generate" \
  -H "Content-Type: application/json" \
  -d '{
        "query": "List the deployment steps",
        "search_params": {"top_k": 3}
      }'

Performance Tuning

  • Set concurrency: export OLLAMA_NUM_PARALLEL=2 (or set it in Docker Compose, as shown below).
  • Optimise chunk size and overlap to limit prompt length.
  • Use SSD storage so large models load quickly.
  • Combine with the Scaling guide when running multiple instances.
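
When Ollama runs under Compose, the concurrency setting above can be applied on the service instead of via export. A minimal sketch that merges into the earlier definition (OLLAMA_KEEP_ALIVE is optional and controls how long a loaded model stays resident):

# Partial Compose override for the ollama service defined earlier
services:
  ollama:
    environment:
      - OLLAMA_NUM_PARALLEL=2
      - OLLAMA_KEEP_ALIVE=30m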

Model selection quick reference:

Model          RAM        Recommended Use
mistral:7b     8–10 GB    Fast development iteration
llama2:13b     16–20 GB   Balanced accuracy and speed
codellama:34b  32 GB+     Code-centric knowledge bases

Troubleshooting

Symptom                 Resolution
connection refused      Ensure the service is running (ps aux | grep ollama)
Model download stalls   Retry ollama pull, check connectivity, or switch mirrors
High memory usage       Switch to a smaller model or decrease OLLAMA_NUM_PARALLEL
Slow inference          Reduce temperature/top_p, upgrade hardware, or lower concurrent requests

Restart the runtime with brew services restart ollama (macOS) or docker compose restart ollama (Docker deployments).
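
If a restart does not resolve the symptom, two quick checks help narrow things down; the commands below assume the default port and the Compose service name used earlier.

# Confirm the API answers and list installed models
curl http://localhost:11434/api/tags

# Inspect recent runtime logs (Docker deployments)
docker compose logs --tail=50 ollama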

Security Notes

  • Restrict port 11434 to trusted networks or bind to localhost (see the sketch after this list).
  • Keep the Ollama binary and models up to date.
  • Snapshot downloaded models regularly so you can roll back when needed.
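
A minimal sketch of the localhost binding mentioned above, using the OLLAMA_HOST variable for native installs and a loopback-scoped port mapping for Docker:

# Native install: bind the API to the loopback interface only
export OLLAMA_HOST=127.0.0.1:11434
ollama serve

# Docker: publish the port on loopback rather than all interfaces
docker run -d --name ollama -p 127.0.0.1:11434:11434 \
  -v ollama_data:/root/.ollama ollama/ollama:latest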

Once Ollama is configured, monitor its health alongside other services using the observability tooling covered in the Operations guides.