# Ollama Provider

The Ollama provider enables local LLM inference using the Ollama platform. Run models like Llama 3, Mistral, and Gemma locally without sending data to external APIs, making it ideal for privacy-sensitive applications and development.
## Configuration

### Basic Setup
Configure Ollama in your agent:
```ruby
class OllamaAgent < ApplicationAgent
  generate_with :ollama, model: "deepseek-r1:latest"

  # @return [ActiveAgent::Generation]
  def ask
    prompt(message: params[:message])
  end
end
```

### Basic Usage Example
```ruby
response = OllamaAgent.with(
  message: "What is a design pattern?"
).ask.generate_now
```
### Configuration File

Set up Ollama in `config/active_agent.yml`:
```yaml
ollama: &ollama
  service: "Ollama"
  model: "gemma3:latest"
```

### Environment Variables
No API keys required. Optionally configure connection settings:

```bash
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=llama3
```

## Installing Ollama
### macOS/Linux

```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Start Ollama service
ollama serve

# Pull a model
ollama pull llama3
```

### Docker
```bash
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama pull llama3
```

## Supported Models
Ollama supports a wide range of open-source models that run locally on your machine. For the complete list of available models, see Ollama's Model Library.
### Popular Models
| Model | Sizes | Context Window | Best For |
|---|---|---|---|
| llama3 | 8B, 70B | 8K tokens | General purpose reasoning |
| mistral | 7B | 32K tokens | Balanced performance |
| gemma | 2B, 7B | 8K tokens | Lightweight, efficient |
| codellama | 7B, 13B, 34B | 16K tokens | Code generation and analysis |
| mixtral | 8x7B | 32K tokens | High quality, mixture of experts |
| phi | 2.7B | 2K tokens | Fast, small footprint |
| qwen | 0.5B to 72B | 32K tokens | Multilingual support |
| deepseek-r1 | 1.5B to 70B | 64K tokens | Advanced reasoning |
Recommended model identifiers:
- llama3 - Best for general use and reasoning
- codellama - Best for code-related tasks
- mistral - Best for long context understanding
### Quantized Models

Ollama offers quantized versions that reduce memory usage and increase speed with minimal quality loss. For example: `ollama pull qwen3:0.6b`
### List Installed Models

```bash
# List all locally available models
ollama list

# Pull a new model
ollama pull llama3

# Remove a model
ollama rm llama3
```

## Provider-Specific Parameters
### Required Parameters

- `model` - Model name (e.g., `"llama3"`, `"mistral"`)
### Sampling Parameters

- `temperature` - Controls randomness (0.0 to 1.0)
- `top_p` - Nucleus sampling parameter (0.0 to 1.0)
- `top_k` - Top-k sampling parameter (integer ≥ 0)
- `num_predict` - Maximum tokens to generate
- `seed` - For reproducible outputs (integer)
- `stop` - Array of stop sequences
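A minimal sketch of passing these sampling parameters, assuming they are given to `generate_with` the same way `temperature` is in the Advanced Options example below (the agent name and values are illustrative):

```ruby
class DeterministicOllamaAgent < ApplicationAgent
  generate_with :ollama,
    model: "llama3",
    temperature: 0.2,  # low randomness for more repeatable answers
    top_p: 0.9,        # nucleus sampling
    top_k: 40,         # top-k sampling
    num_predict: 256,  # cap the number of generated tokens
    seed: 42,          # reproducible outputs
    stop: ["\n\n"]     # stop generating at a blank line

  def ask
    prompt(message: params[:message])
  end
end
```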
### System Configuration

- `host` - Ollama server URL (default: `http://localhost:11434`)
- `keep_alive` - Keep model loaded in memory (e.g., `"5m"`, `"1h"`)
- `timeout` - Request timeout in seconds
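A minimal sketch of pointing an agent at a non-default Ollama server, assuming these options are accepted by `generate_with` like the parameters above (the host URL is illustrative):

```ruby
class RemoteOllamaAgent < ApplicationAgent
  generate_with :ollama,
    model: "llama3",
    host: "http://ollama.internal:11434", # illustrative non-default server URL
    keep_alive: "10m",                    # keep the model loaded between requests
    timeout: 120                          # request timeout in seconds

  def ask
    prompt(message: params[:message])
  end
end
```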
### Advanced Options

```ruby
class AdvancedOllamaAgent < ApplicationAgent
  generate_with :ollama,
    model: "llama3",
    temperature: 0.7,
    options: {
      num_ctx: 4096,        # Context window size
      num_gpu: 1,           # Number of GPUs to use
      num_thread: 8,        # Number of threads
      repeat_penalty: 1.1,  # Penalize repetition
      mirostat: 2,          # Mirostat sampling
      mirostat_tau: 5.0,    # Mirostat tau parameter
      mirostat_eta: 0.1     # Mirostat learning rate
    }

  def ask
    prompt(message: params[:message])
  end
end
```

### Embeddings
- `embedding_model` - Embedding model name (e.g., `"nomic-embed-text"`)
- `host` - Ollama server URL for embeddings
### Streaming

- `stream` - Enable streaming responses (boolean, default: false)
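A minimal sketch of turning streaming on for an Ollama-backed agent; how streamed chunks are consumed is covered in the Streaming documentation (the agent name is illustrative):

```ruby
class StreamingOllamaAgent < ApplicationAgent
  # stream: true enables streaming responses from the local Ollama server;
  # see the Streaming documentation for chunk-handling patterns.
  generate_with :ollama, model: "llama3", stream: true

  def ask
    prompt(message: params[:message])
  end
end
```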
## Local Inference

Run models completely offline. All inference happens on your machine, with no external API calls and no internet connection required after the initial model download.
Privacy Benefits:
- All data stays on your machine
- No external API calls
- No internet connection required after model download
- Full control over your data
## Performance Optimization

### Model Loading
Keep models in memory for faster responses:
```ruby
class FastOllamaAgent < ApplicationAgent
  generate_with :ollama,
    model: "llama3",
    keep_alive: "5m" # Keep model loaded for 5 minutes

  def quick_response
    prompt(message: params[:query])
  end
end
```

### Hardware Acceleration
Configure GPU usage for better performance:
```ruby
class GPUAgent < ApplicationAgent
  generate_with :ollama,
    model: "llama3",
    options: {
      num_gpu: -1, # Use all available GPUs
      main_gpu: 0  # Primary GPU index
    }

  def ask
    prompt(message: params[:message])
  end
end
```

### Quantization
Use quantized models for faster inference with less memory:
```bash
# Pull quantized versions
ollama pull llama3:8b-q4_0  # 4-bit quantization
ollama pull llama3:8b-q5_1  # 5-bit quantization
```

```ruby
class EfficientAgent < ApplicationAgent
  # Use a quantized model for faster inference
  generate_with :ollama, model: "qwen3:0.6b"

  def ask
    prompt(message: params[:message])
  end
end
```

## Structured Output
Ollama does not have native structured output support. However, many models can generate JSON through careful prompting. For comprehensive structured output patterns, see the Structured Output Documentation.
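A minimal sketch of prompt-based JSON extraction with validation; the agent, prompt wording, and the `response.message.content` accessor follow the usage pattern shown earlier and should be treated as illustrative:

```ruby
class JsonExtractionAgent < ApplicationAgent
  generate_with :ollama, model: "llama3", temperature: 0.2

  def extract
    prompt(message: <<~PROMPT)
      Extract the person's name and email from the text below.
      Respond with only a JSON object containing "name" and "email" keys.

      #{params[:text]}
    PROMPT
  end
end

# Always parse and validate the output yourself, since nothing enforces the schema:
response = JsonExtractionAgent.with(text: "Reach Jane at jane@example.com").extract.generate_now
begin
  data = JSON.parse(response.message.content) # assumes the response text accessor used elsewhere in these docs
rescue JSON::ParserError
  data = nil # fall back, retry, or raise depending on your application
end
```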
### Limitations
- No guaranteed JSON output - Depends on model following instructions
- No schema enforcement - Cannot guarantee specific field requirements
- Quality varies by model - Llama 3, Mixtral, and Mistral work best
- Requires validation - Always parse and validate responses
> **TIP**
> For applications requiring guaranteed schema conformance, use OpenAI with strict mode or Anthropic. For local processing, implement robust validation and error handling.
## Embeddings
Generate embeddings locally using Ollama's embedding models. For comprehensive embedding usage patterns, see the Embeddings Documentation.
### Available Embedding Models
| Model | Dimensions | Best For |
|---|---|---|
| nomic-embed-text | 768 | High-quality text embeddings |
| mxbai-embed-large | 1024 | Larger, higher-quality embeddings |
| all-minilm | 384 | Lightweight embeddings |
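A minimal sketch of generating an embedding locally, assuming the `embedding_model` option listed above and the `embed_now` flow described in the Embeddings documentation (the agent and action names are illustrative):

```ruby
class LocalEmbeddingAgent < ApplicationAgent
  generate_with :ollama, model: "llama3", embedding_model: "nomic-embed-text"

  def embed_text
    prompt(message: params[:text])
  end
end

# Assumes embed_now returns the embedding, as described in the Embeddings documentation.
embedding = LocalEmbeddingAgent.with(text: "Design patterns are reusable solutions.").embed_text.embed_now
```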
## Error Handling
Ollama-specific error handling for connection failures and missing models. For comprehensive error handling strategies, see the Error Handling Documentation.
### Common Ollama Errors

- `Errno::ECONNREFUSED` - Ollama service not running (start with `ollama serve`)
- `Net::OpenTimeout` - Connection timeout
- `ActiveAgent::GenerationError` - Model not found or generation failure

### Example
```ruby
class RobustOllamaAgent < ApplicationAgent
  generate_with :ollama, model: "llama3"

  rescue_from ::OpenAI::Errors::APIConnectionError do |error|
    Rails.logger.error "Ollama not running: #{error.message}"
    "Ollama is not running. Start it with: ollama serve"
  end

  rescue_from StandardError do |error|
    if error.message.include?("model not found")
      # Pull the model if it's not found
      # system("ollama pull #{generation_provider.model}")
      raise error # Re-raise for this example
    else
      raise
    end
  end

  def ask
    prompt(message: params[:message])
  end
end
```

## Best Practices
- Pre-pull models - Download models before first use: `ollama pull llama3`
- Monitor memory usage - Large models require significant RAM (8GB+ recommended)
- Use appropriate models - Balance size, speed, and capability for your use case
- Keep models loaded - Use the `keep_alive` parameter for frequently used models
- Implement fallbacks - Handle connection failures and missing models gracefully
- Use quantization - Reduce memory usage and increase speed with quantized models
- Test locally - Ensure models work in development before deployment
- Consider GPU - Use GPU acceleration for better performance with larger models
## Related Documentation
- Streaming - Real-time response streaming patterns
- Embeddings Framework - Complete guide to embeddings
- Configuration - Global provider setup
- Structured Output - Structured output patterns
- Providers Overview - Provider comparison
- Configuration Guide - Setup and configuration
- Error Handling - Error handling strategies
- Ollama Documentation - Official Ollama docs
- Ollama Model Library - Available models