Ollama Provider

The Ollama provider enables local LLM inference using the Ollama platform. Run models such as Llama 3, Mistral, and Gemma locally without sending data to external APIs, which makes it well suited for privacy-sensitive applications and local development.

Configuration

Basic Setup

Configure Ollama in your agent:

ruby
class OllamaAgent < ApplicationAgent
  generate_with :ollama, model: "deepseek-r1:latest"

  # @return [ActiveAgent::Generation]
  def ask
    prompt(message: params[:message])
  end
end

Basic Usage Example

ruby
response = OllamaAgent.with(
  message: "What is a design pattern?"
).ask.generate_now

Configuration File

Set up Ollama in config/active_agent.yml:

yaml
ollama: &ollama
  service: "Ollama"
  model: "gemma3:latest"

Environment Variables

No API keys required. Optionally configure connection settings:

bash
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=llama3
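
If you prefer to drive the configuration from these variables, a minimal sketch could read them in config/active_agent.yml. This assumes the file is processed through ERB, as Rails configuration files typically are, and that the host option listed under System Configuration below can also be set here:

yaml
ollama: &ollama
  service: "Ollama"
  # Falls back to sensible defaults when the variables are unset
  model: <%= ENV.fetch("OLLAMA_MODEL", "llama3") %>
  host: <%= ENV.fetch("OLLAMA_HOST", "http://localhost:11434") %>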

Installing Ollama

macOS/Linux

bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Start Ollama service
ollama serve

# Pull a model
ollama pull llama3

Docker

bash
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama pull llama3

Supported Models

Ollama supports a wide range of open-source models that run locally on your machine. For the complete list of available models, see Ollama's Model Library.

Model         Sizes           Context Window   Best For
llama3        8B, 70B         8K tokens        General purpose reasoning
mistral       7B              32K tokens       Balanced performance
gemma         2B, 7B          8K tokens        Lightweight, efficient
codellama     7B, 13B, 34B    16K tokens       Code generation and analysis
mixtral       8x7B            32K tokens       High quality, mixture of experts
phi           2.7B            2K tokens        Fast, small footprint
qwen          0.5B to 72B     32K tokens       Multilingual support
deepseek-r1   1.5B to 70B     64K tokens       Advanced reasoning

Recommended model identifiers:

  • llama3 - Best for general use and reasoning
  • codellama - Best for code-related tasks
  • mistral - Best for long context understanding

Quantized Models

Ollama offers quantized versions that reduce memory usage and increase speed with minimal quality loss. For example: ollama pull qwen3:0.6b

List Installed Models

bash
# List all locally available models
ollama list

# Pull a new model
ollama pull llama3

# Remove a model
ollama rm llama3

Provider-Specific Parameters

Required Parameters

  • model - Model name (e.g., "llama3", "mistral")

Sampling Parameters

  • temperature - Controls randomness (0.0 to 1.0)
  • top_p - Nucleus sampling parameter (0.0 to 1.0)
  • top_k - Top-k sampling parameter (integer ≥ 0)
  • num_predict - Maximum tokens to generate
  • seed - For reproducible outputs (integer)
  • stop - Array of stop sequences
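
As a sketch, these sampling parameters can be passed to generate_with alongside the model name, mirroring the temperature usage shown elsewhere in this guide; the values below are illustrative only:

ruby
class TunedOllamaAgent < ApplicationAgent
  generate_with :ollama,
    model: "llama3",
    temperature: 0.4,    # lower values give more deterministic output
    top_p: 0.9,          # nucleus sampling cutoff
    top_k: 40,           # sample from the 40 most likely tokens
    num_predict: 256,    # cap the number of generated tokens
    seed: 42,            # fixed seed for reproducible runs
    stop: ["\n\n"]       # stop generating at a blank line

  def ask
    prompt(message: params[:message])
  end
end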

System Configuration

  • host - Ollama server URL (default: http://localhost:11434)
  • keep_alive - Keep model loaded in memory (e.g., "5m", "1h")
  • timeout - Request timeout in seconds
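
A minimal sketch combining these settings; the host below is a hypothetical remote Ollama server, and the example assumes these options are accepted by generate_with like the other provider parameters:

ruby
class RemoteOllamaAgent < ApplicationAgent
  generate_with :ollama,
    model: "llama3",
    host: "http://ollama.internal:11434",  # hypothetical remote Ollama host
    keep_alive: "10m",                     # keep the model resident between requests
    timeout: 120                           # allow for slow first-token latency on large models

  def ask
    prompt(message: params[:message])
  end
end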

Advanced Options

ruby
class AdvancedOllamaAgent < ApplicationAgent
  generate_with :ollama,
    model: "llama3",
    temperature: 0.7,
    options: {
      num_ctx: 4096,         # Context window size
      num_gpu: 1,            # Number of GPUs to use
      num_thread: 8,         # Number of threads
      repeat_penalty: 1.1,   # Penalize repetition
      mirostat: 2,           # Mirostat sampling
      mirostat_tau: 5.0,     # Mirostat tau parameter
      mirostat_eta: 0.1      # Mirostat learning rate
    }

  def ask
    prompt(message: params[:message])
  end
end

Embeddings

  • embedding_model - Embedding model name (e.g., "nomic-embed-text")
  • host - Ollama server URL for embeddings

Streaming

  • stream - Enable streaming responses (boolean, default: false)
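
A minimal sketch that turns streaming on; how the streamed chunks are consumed depends on your ActiveAgent setup and is not shown here:

ruby
class StreamingOllamaAgent < ApplicationAgent
  # stream: true asks the provider to return tokens incrementally
  generate_with :ollama, model: "llama3", stream: true

  def ask
    prompt(message: params[:message])
  end
end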

Local Inference

Run models entirely offline: once a model has been downloaded, all inference happens on your machine with no external API calls and no internet connection required.
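
To confirm that requests are being served locally, you can query the Ollama HTTP API directly (assuming the default host and port):

bash
# Lists the models installed on the local Ollama server
curl http://localhost:11434/api/tags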

Privacy Benefits:

  • All data stays on your machine
  • No external API calls
  • No internet connection required after model download
  • Full control over your data

Performance Optimization

Model Loading

Keep models in memory for faster responses:

ruby
class FastOllamaAgent < ApplicationAgent
  generate_with :ollama,
    model: "llama3",
    keep_alive: "5m"  # Keep model loaded for 5 minutes

  def quick_response
    prompt(message: params[:query])
  end
end

Hardware Acceleration

Configure GPU usage for better performance:

ruby
class GPUAgent < ApplicationAgent
  generate_with :ollama,
    model: "llama3",
    options: {
      num_gpu: -1,  # Use all available GPUs
      main_gpu: 0   # Primary GPU index
    }

  def ask
    prompt(message: params[:message])
  end
end

Quantization

Use quantized models for faster inference with less memory:

bash
# Pull quantized versions
ollama pull llama3:8b-q4_0  # 4-bit quantization
ollama pull llama3:8b-q5_1  # 5-bit quantization

ruby
class EfficientAgent < ApplicationAgent
  # Use quantized model for faster inference
  generate_with :ollama, model: "qwen3:0.6b"

  def ask
    prompt(message: params[:message])
  end
end

Structured Output

Ollama does not have native structured output support. However, many models can generate JSON through careful prompting. For comprehensive structured output patterns, see the Structured Output Documentation.
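
As a sketch of prompt-based JSON extraction: the agent instructs the model to reply with JSON only, and the caller parses and validates the result. Reading the generated text via response.message.content is an assumption here; adapt it to your response object if it differs.

ruby
require "json"

class JsonOllamaAgent < ApplicationAgent
  generate_with :ollama, model: "llama3", temperature: 0.2

  def extract
    prompt(message: <<~PROMPT)
      Reply with JSON only, using the keys "name" and "email".
      Text: #{params[:text]}
    PROMPT
  end
end

# Always parse and validate rather than trusting the model's output
response = JsonOllamaAgent.with(text: "Reach Jane Doe at jane@example.com").extract.generate_now
begin
  data = JSON.parse(response.message.content)  # content accessor assumed; see note above
rescue JSON::ParserError
  data = nil  # retry, fall back, or surface an error when the JSON is malformed
end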

Limitations

  • No guaranteed JSON output - Depends on model following instructions
  • No schema enforcement - Cannot guarantee specific field requirements
  • Quality varies by model - Llama 3, Mixtral, and Mistral work best
  • Requires validation - Always parse and validate responses

TIP

For applications requiring guaranteed schema conformance, use OpenAI with strict mode or Anthropic. For local processing, implement robust validation and error handling.

Embeddings

Generate embeddings locally using Ollama's embedding models. For comprehensive embedding usage patterns, see the Embeddings Documentation.

Available Embedding Models

Model               Dimensions   Best For
nomic-embed-text    768          High-quality text embeddings
mxbai-embed-large   1024         Large embedding model
all-minilm          384          Lightweight embeddings
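
A minimal configuration sketch; it assumes the embedding_model option listed earlier can be set through generate_with, and defers the actual embedding calls to the Embeddings Documentation:

ruby
class EmbeddingOllamaAgent < ApplicationAgent
  # Pair a chat model with a local embedding model
  generate_with :ollama,
    model: "llama3",
    embedding_model: "nomic-embed-text"
end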

Error Handling

Ollama-specific error handling for connection failures and missing models. For comprehensive error handling strategies, see the Error Handling Documentation.

Common Ollama Errors

  • Errno::ECONNREFUSED - Ollama service not running (start with ollama serve)
  • Net::OpenTimeout - Connection timeout
  • ActiveAgent::GenerationError - Model not found or generation failure

Example

ruby
class RobustOllamaAgent < ApplicationAgent
  generate_with :ollama, model: "llama3"

  rescue_from ::OpenAI::Errors::APIConnectionError do |error|
    Rails.logger.error "Ollama not running: #{error.message}"
    "Ollama is not running. Start it with: ollama serve"
  end

  rescue_from StandardError do |error|
    if error.message.include?("model not found")
      # Pull the model if it's not found
      # system("ollama pull #{generation_provider.model}")
      raise error  # Re-raise for this example
    else
      raise
    end
  end

  def ask
    prompt(message: params[:message])
  end
end

Best Practices

  1. Pre-pull models - Download models before first use: ollama pull llama3
  2. Monitor memory usage - Large models require significant RAM (8GB+ recommended)
  3. Use appropriate models - Balance size, speed, and capability for your use case
  4. Keep models loaded - Use keep_alive parameter for frequently used models
  5. Implement fallbacks - Handle connection failures and missing models gracefully
  6. Use quantization - Reduce memory usage and increase speed with quantized models
  7. Test locally - Ensure models work in development before deployment
  8. Consider GPU - Use GPU acceleration for better performance with larger models