Ollama Provider

The Ollama provider enables local LLM inference using the Ollama platform. Run models such as Llama 3, Mistral, and Gemma locally without sending data to external APIs, which makes it well suited for privacy-sensitive applications and local development.

Configuration

Basic Setup

Configure Ollama in your agent:

ruby
class OllamaAgent < ApplicationAgent
  generate_with :ollama, model: "deepseek-r1:latest"

  # @return [ActiveAgent::Generation]
  def ask
    prompt(message: params[:message])
  end
end

Basic Usage Example

ruby
response = OllamaAgent.with(
  message: "What is a design pattern?"
).ask.generate_now

Configuration File

Set up Ollama in config/active_agent.yml:

yaml
ollama: &ollama
  service: "Ollama"
  model: "gemma3:latest"

Environment Variables

No API keys required. Optionally configure connection settings:

bash
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=llama3
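
If you prefer to drive the configuration from these variables, a minimal sketch could read them in config/active_agent.yml. This assumes the file is processed through ERB, as Rails configuration files typically are, and that the host option listed under System Configuration below can also be set here:

yaml
ollama: &ollama
  service: "Ollama"
  # Falls back to sensible defaults when the variables are unset
  model: <%= ENV.fetch("OLLAMA_MODEL", "llama3") %>
  host: <%= ENV.fetch("OLLAMA_HOST", "http://localhost:11434") %>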

Installing Ollama

macOS/Linux

bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh

# Start Ollama service
ollama serve

# Pull a model
ollama pull llama3

Docker

bash
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama pull llama3

Supported Models

Ollama supports a wide range of open-source models that run locally on your machine. For the complete list of available models, see Ollama's Model Library.

Model         Sizes           Context Window   Best For
llama3        8B, 70B         8K tokens        General purpose reasoning
mistral       7B              32K tokens       Balanced performance
gemma         2B, 7B          8K tokens        Lightweight, efficient
codellama     7B, 13B, 34B    16K tokens       Code generation and analysis
mixtral       8x7B            32K tokens       High quality, mixture of experts
phi           2.7B            2K tokens        Fast, small footprint
qwen          0.5B to 72B     32K tokens       Multilingual support
deepseek-r1   1.5B to 70B     64K tokens       Advanced reasoning

Recommended model identifiers:

  • llama3 - Best for general use and reasoning
  • codellama - Best for code-related tasks
  • mistral - Best for long context understanding

Quantized Models

Ollama offers quantized versions that reduce memory usage and increase speed with minimal quality loss. For example: ollama pull qwen3:0.6b

List Installed Models

bash
# List all locally available models
ollama list

# Pull a new model
ollama pull llama3

# Remove a model
ollama rm llama3

Provider-Specific Parameters

Required Parameters

  • model - Model name (e.g., "llama3", "mistral")

Sampling Parameters

  • temperature - Controls randomness (0.0 to 1.0)
  • top_p - Nucleus sampling parameter (0.0 to 1.0)
  • top_k - Top-k sampling parameter (integer ≥ 0)
  • num_predict - Maximum tokens to generate
  • seed - For reproducible outputs (integer)
  • stop - Array of stop sequences
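
As a sketch, these sampling parameters can be passed to generate_with alongside the model name, mirroring the temperature usage shown elsewhere in this guide; the values below are illustrative only:

ruby
class TunedOllamaAgent < ApplicationAgent
  generate_with :ollama,
    model: "llama3",
    temperature: 0.4,    # lower values give more deterministic output
    top_p: 0.9,          # nucleus sampling cutoff
    top_k: 40,           # sample from the 40 most likely tokens
    num_predict: 256,    # cap the number of generated tokens
    seed: 42,            # fixed seed for reproducible runs
    stop: ["\n\n"]       # stop generating at a blank line

  def ask
    prompt(message: params[:message])
  end
end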

System Configuration

  • host - Ollama server URL (default: http://localhost:11434)
  • keep_alive - Keep model loaded in memory (e.g., "5m", "1h")
  • timeout - Request timeout in seconds
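
A minimal sketch combining these settings; the host below is a hypothetical remote Ollama server, and the example assumes these options are accepted by generate_with like the other provider parameters:

ruby
class RemoteOllamaAgent < ApplicationAgent
  generate_with :ollama,
    model: "llama3",
    host: "http://ollama.internal:11434",  # hypothetical remote Ollama host
    keep_alive: "10m",                     # keep the model resident between requests
    timeout: 120                           # allow for slow first-token latency on large models

  def ask
    prompt(message: params[:message])
  end
end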

Advanced Options

ruby
class AdvancedOllamaAgent < ApplicationAgent
  generate_with :ollama,
    model: "llama3",
    temperature: 0.7,
    options: {
      num_ctx: 4096,         # Context window size
      num_gpu: 1,            # Number of GPUs to use
      num_thread: 8,         # Number of threads
      repeat_penalty: 1.1,   # Penalize repetition
      mirostat: 2,           # Mirostat sampling
      mirostat_tau: 5.0,     # Mirostat tau parameter
      mirostat_eta: 0.1      # Mirostat learning rate
    }

  def ask
    prompt(message: params[:message])
  end
end

Embeddings

  • embedding_model - Embedding model name (e.g., "nomic-embed-text")
  • host - Ollama server URL for embeddings

Streaming

  • stream - Enable streaming responses (boolean, default: false)
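
A minimal sketch that turns streaming on; how the streamed chunks are consumed depends on your ActiveAgent setup and is not shown here:

ruby
class StreamingOllamaAgent < ApplicationAgent
  # stream: true asks the provider to return tokens incrementally
  generate_with :ollama, model: "llama3", stream: true

  def ask
    prompt(message: params[:message])
  end
end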

Local Inference

Run models entirely offline: once a model has been downloaded, all inference happens on your machine with no external API calls and no internet connection required.
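
To confirm that requests are being served locally, you can query the Ollama HTTP API directly (assuming the default host and port):

bash
# Lists the models installed on the local Ollama server
curl http://localhost:11434/api/tags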

Privacy Benefits:

  • All data stays on your machine
  • No external API calls
  • No internet connection required after model download
  • Full control over your data

Performance Optimization

Model Loading

Keep models in memory for faster responses:

ruby
class FastOllamaAgent < ApplicationAgent
  generate_with :ollama,
    model: "llama3",
    keep_alive: "5m"  # Keep model loaded for 5 minutes

  def quick_response
    prompt(message: params[:query])
  end
end

Hardware Acceleration

Configure GPU usage for better performance:

ruby
class GPUAgent < ApplicationAgent
  generate_with :ollama,
    model: "llama3",
    options: {
      num_gpu: -1,  # Use all available GPUs
      main_gpu: 0   # Primary GPU index
    }

  def ask
    prompt(message: params[:message])
  end
end

Quantization

Use quantized models for faster inference with less memory:

bash
# Pull quantized versions
ollama pull llama3:8b-q4_0  # 4-bit quantization
ollama pull llama3:8b-q5_1  # 5-bit quantization

ruby
class EfficientAgent < ApplicationAgent
  # Use quantized model for faster inference
  generate_with :ollama, model: "qwen3:0.6b"

  def ask
    prompt(message: params[:message])
  end
end

Structured Output

Ollama does not have native structured output support. However, many models can generate JSON through careful prompting. For comprehensive structured output patterns, see the Structured Output Documentation.
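
As a sketch of prompt-based JSON extraction: the agent instructs the model to reply with JSON only, and the caller parses and validates the result. Reading the generated text via response.message.content is an assumption here; adapt it to your response object if it differs.

ruby
require "json"

class JsonOllamaAgent < ApplicationAgent
  generate_with :ollama, model: "llama3", temperature: 0.2

  def extract
    prompt(message: <<~PROMPT)
      Reply with JSON only, using the keys "name" and "email".
      Text: #{params[:text]}
    PROMPT
  end
end

# Always parse and validate rather than trusting the model's output
response = JsonOllamaAgent.with(text: "Reach Jane Doe at jane@example.com").extract.generate_now
begin
  data = JSON.parse(response.message.content)  # content accessor assumed; see note above
rescue JSON::ParserError
  data = nil  # retry, fall back, or surface an error when the JSON is malformed
end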

Limitations

  • No guaranteed JSON output - Depends on model following instructions
  • No schema enforcement - Cannot guarantee specific field requirements
  • Quality varies by model - Llama 3, Mixtral, and Mistral work best
  • Requires validation - Always parse and validate responses

TIP

For applications requiring guaranteed schema conformance, use OpenAI with strict mode or Anthropic. For local processing, implement robust validation and error handling.

Embeddings

Generate embeddings locally using Ollama's embedding models. For comprehensive embedding usage patterns, see the Embeddings Documentation.

Available Embedding Models

Model               Dimensions   Best For
nomic-embed-text    768          High-quality text embeddings
mxbai-embed-large   1024         Large embedding model
all-minilm          384          Lightweight embeddings
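
A minimal configuration sketch; it assumes the embedding_model option listed earlier can be set through generate_with, and defers the actual embedding calls to the Embeddings Documentation:

ruby
class EmbeddingOllamaAgent < ApplicationAgent
  # Pair a chat model with a local embedding model
  generate_with :ollama,
    model: "llama3",
    embedding_model: "nomic-embed-text"
end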

Error Handling

Ollama-specific error handling for connection failures and missing models. For comprehensive error handling strategies, see the Error Handling Documentation.

Common Ollama Errors

  • Errno::ECONNREFUSED - Ollama service not running (start with ollama serve)
  • Net::OpenTimeout - Connection timeout
  • ActiveAgent::GenerationError - Model not found or generation failure

Example

ruby
class RobustOllamaAgent < ApplicationAgent
  generate_with :ollama, model: "llama3"

  rescue_from ::OpenAI::Errors::APIConnectionError do |error|
    Rails.logger.error "Ollama not running: #{error.message}"
    "Ollama is not running. Start it with: ollama serve"
  end

  rescue_from StandardError do |error|
    if error.message.include?("model not found")
      # Pull the model if it's not found
      # system("ollama pull #{generation_provider.model}")
      raise error  # Re-raise for this example
    else
      raise
    end
  end

  def ask
    prompt(message: params[:message])
  end
end

Best Practices

  1. Pre-pull models - Download models before first use: ollama pull llama3
  2. Monitor memory usage - Large models require significant RAM (8GB+ recommended)
  3. Use appropriate models - Balance size, speed, and capability for your use case
  4. Keep models loaded - Use keep_alive parameter for frequently used models
  5. Implement fallbacks - Handle connection failures and missing models gracefully
  6. Use quantization - Reduce memory usage and increase speed with quantized models
  7. Test locally - Ensure models work in development before deployment
  8. Consider GPU - Use GPU acceleration for better performance with larger models