Data Extraction

Extract structured data from PDF resumes using AI-powered parsing.

Setup

bash
rails generate active_agent:agent resume_extractor parse --json-schema

This creates:

  • app/agents/resume_extractor_agent.rb - Agent class
  • app/views/agents/resume_extractor/instructions.md - Instructions
  • app/views/agents/resume_extractor/parse.json - JSON schema

Quick Start

Download this sample resume to test the agent: Sample Resume

ruby
# Read the PDF as binary and encode it as a base64 data URI
require "base64"
pdf_file = File.binread("sample_resume.pdf")
pdf_data = "data:application/pdf;base64,#{Base64.strict_encode64(pdf_file)}"

# Extract structured data
response = ResumeExtractorAgent.with(document: pdf_data).parse.generate_now

# Access parsed fields
resume = response.message.parsed_json
resume[:name]        # => "John Doe"
resume[:email]       # => "john.doe@example.com"
resume[:experience]  # => [{"job_title"=>"Senior Software Engineer", ...}]

JSON Message

activeagent/test/docs/examples/data_extraction_agent_examples_test.rb:45

json
{
  "name": "John Doe",
  "email": "john.doe@example.com",
  "phone": "(555) 123-4567",
  "education": [
    {
      "degree": "BS Computer Science",
      "institution": "Stanford University",
      "year": 2020
    }
  ],
  "experience": [
    {
      "job_title": "Senior Software Engineer",
      "company": "TechCorp",
      "duration": "2020-2024"
    }
  ]
}
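
The encoding step from the Quick Start can be factored into a small helper. The following is a minimal sketch, not part of the library: `pdf_data_uri`, `pdf_bytes_from_data_uri`, and `DATA_URI_PREFIX` are hypothetical names introduced for illustration. It uses `Base64.strict_encode64`, which emits no line breaks, so the result can be embedded directly in a data URI.

```ruby
require "base64"

# Hypothetical constant: the data URI prefix used throughout this page.
DATA_URI_PREFIX = "data:application/pdf;base64,"

# Wraps raw PDF bytes in a base64 data URI.
def pdf_data_uri(bytes)
  DATA_URI_PREFIX + Base64.strict_encode64(bytes)
end

# Reverses pdf_data_uri; useful for checking that no bytes were altered.
def pdf_bytes_from_data_uri(uri)
  raise ArgumentError, "not a PDF data URI" unless uri.start_with?(DATA_URI_PREFIX)

  Base64.strict_decode64(uri.delete_prefix(DATA_URI_PREFIX))
end

# Stand-in bytes; a real call would pass File.binread("sample_resume.pdf").
sample = "%PDF-1.4\n% stand-in for real PDF bytes\n".b
uri = pdf_data_uri(sample)
puts pdf_bytes_from_data_uri(uri) == sample # prints true
```

The round trip is a cheap way to confirm that the file was read in binary mode and encoded without corruption before it is sent to a provider.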

How It Works

The agent uses structured output to guarantee JSON matching your schema:

ruby
class ResumeExtractorAgent < ApplicationAgent
  generate_with :openai, model: "gpt-4o"

  def parse
    prompt(
      message: "Extract resume data into JSON.",
      document: params[:document],
      response_format: :json_schema # Loads parse.json schema
    )
  end
end

The corresponding schema in parse.json:

json
{
  "name": "resume_schema",
  "strict": true,
  "schema": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "email": { "type": "string" },
      "phone": { "type": "string" },
      "education": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "degree": { "type": "string" },
            "institution": { "type": "string" },
            "year": { "type": "integer" }
          },
          "required": ["degree", "institution", "year"],
          "additionalProperties": false
        }
      },
      "experience": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "job_title": { "type": "string" },
            "company": { "type": "string" },
            "duration": { "type": "string" }
          },
          "required": ["job_title", "company", "duration"],
          "additionalProperties": false
        }
      }
    },
    "required": ["name", "email", "phone", "education", "experience"],
    "additionalProperties": false
  }
}

Key features:

  • strict: true - Enforces exact schema compliance
  • additionalProperties: false - Rejects unexpected fields
  • Automatic JSON parsing - response.message.parsed_json returns a hash
  • Type validation - Ensures correct data types (string, integer, array)
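
A schema can be checked against these rules before it is sent to a provider. The sketch below assumes the provider's strict mode expects every object to set `additionalProperties: false` and to list all of its properties under `required` (this is how OpenAI documents strict structured output; other providers may differ). `strict_schema_problems` is a hypothetical helper, not an ActiveAgent API.

```ruby
require "json"

# Walks a JSON Schema (string keys) and returns a list of problems that
# would conflict with strict structured output.
def strict_schema_problems(schema, path = "$")
  problems = []
  case schema["type"]
  when "object"
    props = schema.fetch("properties", {})
    unless schema["additionalProperties"] == false
      problems << "#{path}: additionalProperties must be false"
    end
    missing = props.keys - schema.fetch("required", [])
    problems << "#{path}: not required: #{missing.join(', ')}" unless missing.empty?
    props.each do |name, sub|
      problems.concat(strict_schema_problems(sub, "#{path}.#{name}"))
    end
  when "array"
    problems.concat(strict_schema_problems(schema["items"], "#{path}[]")) if schema["items"]
  end
  problems
end

# A deliberately incomplete schema: the nested object omits "year" from
# "required" and does not set additionalProperties.
schema = JSON.parse(<<~SCHEMA)
  {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "education": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "degree": { "type": "string" },
            "year": { "type": "integer" }
          },
          "required": ["degree"]
        }
      }
    },
    "required": ["name", "education"],
    "additionalProperties": false
  }
SCHEMA

problems = strict_schema_problems(schema)
puts problems
```

For the parse.json file shown above, the check would be run on the inner `"schema"` object rather than on the wrapper that carries `"name"` and `"strict"`.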

Schema Options

Static Schema Files

Define schemas in JSON files under app/views/agents/resume_extractor/:

ruby
  response_format: :json_schema  # Loads parse.json automatically

When to use:

  • Standard data structures
  • Stable requirements
  • Team collaboration (reviewable JSON files)

Model-Generated Schemas

Generate schemas dynamically from your models:

ruby
class Resume
  include ActiveModel::Model
  include ActiveModel::Attributes
  include ActiveAgent::SchemaGenerator

  attribute :name, :string
  attribute :email, :string
  attribute :phone, :string
  attribute :education
  attribute :experience

  validates :name, presence: true, length: { minimum: 2 }
  validates :email, presence: true, format: { with: URI::MailTo::EMAIL_REGEXP }
  validates :phone, presence: true
end

Then pass the generated schema from the agent:

ruby
class ResumeExtractorAgent < ApplicationAgent
  generate_with :openai, model: "gpt-4o"

  def parse
    prompt(
      message: "Extract resume data into JSON.",
      document: params[:document],
      response_format: {
        type: "json_schema",
        json_schema: Resume.to_json_schema(strict: true, name: "resume_schema")
      }
    )
  end
end

When to use:

  • Existing ActiveRecord/ActiveModel classes
  • Schema mirrors database structure
  • Single source of truth for validations
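
To make the idea concrete, here is a rough sketch of how a schema could be derived from typed attributes. It is not how `ActiveAgent::SchemaGenerator` is implemented; `JSON_TYPES` and `schema_for` are hypothetical names, and the sketch covers only scalar attributes (nested fields such as education and experience would need their own item schemas).

```ruby
require "json"

# Hypothetical mapping from attribute type symbols to JSON Schema types.
JSON_TYPES = {
  string: "string",
  integer: "integer",
  float: "number",
  boolean: "boolean"
}.freeze

# Builds a strict object schema from a { attribute_name => type } hash.
# Raises KeyError for a type that has no mapping.
def schema_for(name, attributes)
  properties = attributes.to_h do |attr, type|
    [attr.to_s, { "type" => JSON_TYPES.fetch(type) }]
  end

  {
    "name" => name,
    "strict" => true,
    "schema" => {
      "type" => "object",
      "properties" => properties,
      "required" => properties.keys,
      "additionalProperties" => false
    }
  }
end

schema = schema_for("resume_schema", name: :string, email: :string, phone: :string)
puts JSON.pretty_generate(schema)
```

Deriving `required` from the property list keeps the two in sync, which is the main benefit of generating the schema rather than maintaining it by hand.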

Learn more: Structured Output

Common Patterns

Background Processing

For high-volume processing, move extraction into a background job:

ruby
class ResumeProcessingJob < ApplicationJob
  def perform(pdf_path)
    pdf_data = File.binread(pdf_path) # binary read keeps the PDF bytes intact
    pdf_url = "data:application/pdf;base64,#{Base64.strict_encode64(pdf_data)}"

    response = ResumeExtractorAgent.with(document: pdf_url).parse.generate_now

    Resume.create!(response.message.parsed_json) if response.success?
  end
end

# Enqueue jobs
Dir.glob("resumes/*.pdf").each do |path|
  ResumeProcessingJob.perform_later(path)
end
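
When enqueueing a whole directory, it can help to skip files that are not actually PDFs before spending a provider call on them. The sketch below checks for the `%PDF-` signature that PDF files begin with; `pdf_file?` and `enqueueable_pdfs` are hypothetical helpers, and the temporary directory stands in for a real `resumes/` folder.

```ruby
require "tmpdir"

# Returns true when the file starts with the "%PDF-" signature.
def pdf_file?(path)
  File.open(path, "rb") { |f| f.read(5) } == "%PDF-".b
end

# Lists *.pdf files in dir (sorted for stable ordering) that pass the check.
def enqueueable_pdfs(dir)
  Dir.glob(File.join(dir, "*.pdf")).sort.select { |path| pdf_file?(path) }
end

Dir.mktmpdir do |dir|
  File.binwrite(File.join(dir, "good.pdf"), "%PDF-1.4\n")
  File.binwrite(File.join(dir, "bad.pdf"), "<html>not a pdf</html>")

  # In an application, each path would be passed to
  # ResumeProcessingJob.perform_later(path).
  puts enqueueable_pdfs(dir).map { |p| File.basename(p) }.inspect
end
```

The signature check only filters out obviously wrong files; it does not prove the PDF is well formed.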

Consensus Validation

Ensure extraction accuracy by requiring multiple attempts to agree:

ruby
class ResumeExtractorAgent < ApplicationAgent
  generate_with :openai, model: "gpt-4o"

  # Require two extraction attempts to produce identical results
  around_prompt do |agent, action|
    attempt_one = action.call
    attempt_two = action.call

    next if attempt_one.message.parsed_json == attempt_two.message.parsed_json

    fail "Consensus not reached in #{agent.class.name}##{agent.action_name}: " \
         "Two attempts produced different results"
  end

  def parse
    prompt(
      message: "Extract resume data into JSON.",
      document: params[:document],
      response_format: :json_schema
    )
  end
end

This validates extraction reliability by running the agent twice and comparing results, at the cost of two provider calls per extraction. Useful for:

  • Critical data where accuracy is essential
  • Detecting inconsistent model outputs
  • Building confidence in extracted data
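
The same idea extends from "two identical attempts" to a quorum over several attempts. The following is a minimal, framework-independent sketch: `consensus` is a hypothetical helper, and the block stands in for a real extraction call (the stubbed `outputs` list simulates three model responses).

```ruby
# Calls the block up to `attempts` times and returns the first result that
# at least `quorum` attempts agree on. Raises if no result reaches quorum.
def consensus(attempts: 3, quorum: 2)
  tally = Hash.new(0)

  attempts.times do
    result = yield
    tally[result] += 1 # hashes with equal contents count as the same key
    return result if tally[result] >= quorum
  end

  raise "Consensus not reached after #{attempts} attempts"
end

# Stubbed extraction results: the second attempt disagrees on the email.
outputs = [
  { name: "John Doe", email: "john.doe@example.com" },
  { name: "John Doe", email: "jdoe@example.com" },
  { name: "John Doe", email: "john.doe@example.com" }
].each

agreed = consensus { outputs.next }
puts agreed[:email]
```

A quorum tolerates one stray response without failing the whole extraction, but each extra attempt is another provider call, so the attempt count is a cost and latency trade-off.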

Provider Support

Resume extraction works with providers that support:

  • PDF processing - Native or via plugins
  • Structured output - JSON schema validation
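
| Provider   | Model             | Notes                                 |
|------------|-------------------|---------------------------------------|
| OpenAI     | gpt-4o            | Native PDF support, structured output |
| OpenAI     | gpt-4o-mini       | Faster, lower cost                    |
| Anthropic  | claude-3-5-sonnet | Strong reasoning, base64 PDF          |
| OpenRouter | openai/gpt-4o     | Access via OpenRouter                 |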

TIP

OpenAI's GPT-4o models provide the best balance of accuracy and speed for resume extraction with native structured output support.

See Also