Data Extraction

Extract structured data from PDF resumes using AI-powered parsing.

Setup

bash

rails generate active_agent:agent resume_extractor parse --json-schema

This creates:

app/agents/resume_extractor_agent.rb - Agent class
app/views/agents/resume_extractor/instructions.md - Instructions
app/views/agents/resume_extractor/parse.json - JSON schema

Quick Start

Download this sample resume to test the agent: Sample Resume

ruby

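require "base64"

# Read the PDF as binary and encode it as a base64 data URL.
# `path` is the directory holding the downloaded sample; it is not defined in this snippet.
pdf_file = File.binread(path + "sample_resume.pdf")
pdf_data = "data:application/pdf;base64,#{Base64.strict_encode64(pdf_file)}"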

# Extract structured data
response = ResumeExtractorAgent.with(document: pdf_data).parse.generate_now

# Access parsed fields
resume = response.message.parsed_json
resume[:name]        # => "John Doe"
resume[:email]       # => "john.doe@example.com"
resume[:experience]  # => [{ jobTitle: "Senior Software Engineer", ... }] (keys follow parse.json)
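The encoding step above can be pulled into a small helper and checked with a round trip. This is a minimal sketch: `pdf_data_url` and `pdf_bytes_from` are hypothetical names, not part of Active Agent, and the sample bytes stand in for a real file read with `File.binread`.

```ruby
require "base64"

# Hypothetical helper: wrap raw PDF bytes in a base64 data URL.
def pdf_data_url(bytes)
  "data:application/pdf;base64,#{Base64.strict_encode64(bytes)}"
end

# Inverse of the helper above, useful for checking the payload survived encoding.
def pdf_bytes_from(data_url)
  Base64.strict_decode64(data_url.split(",", 2).last)
end

# Stand-in for File.binread("sample_resume.pdf"); real PDFs contain binary bytes like these.
sample = "%PDF-1.4\n%\xE2\xE3\xCF\xD3\n".b
url = pdf_data_url(sample)
round_trip_ok = pdf_bytes_from(url) == sample
```

Reading in binary mode matters here because text-mode reads can apply newline or encoding conversion on some platforms before the bytes reach the encoder.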


How It Works

The agent uses structured output to guarantee JSON matching your schema:

Files: resume_extractor_agent.rb, parse.json

ruby

class ResumeExtractorAgent < ApplicationAgent
  generate_with :openai, model: "gpt-4o"

  def parse
    prompt(
      message: "Extract resume data into JSON.",
      document: params[:document],
      response_format: :json_schema # Loads parse.json schema
    )
  end
end

json

{
  "name": "resume_schema",
  "strict": true,
  "schema": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "email": { "type": "string" },
      "phone": { "type": "string" },
      "education": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "degree": { "type": "string" },
            "institution": { "type": "string" },
            "year": { "type": "integer" }
          },
          "required": ["degree", "institution", "year"],
          "additionalProperties": false
        }
      },
      "experience": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "jobTitle": { "type": "string" },
            "company": { "type": "string" },
            "duration": { "type": "string" }
          },
          "required": ["jobTitle", "company", "duration"],
          "additionalProperties": false
        }
      }
    },
    "required": ["name", "email", "phone", "education", "experience"],
    "additionalProperties": false
  }
}

Key features:

strict: true - Enforces exact schema compliance
additionalProperties: false - Rejects unexpected fields
Automatic JSON parsing - response.message.parsed_json returns a hash
Type validation - Ensures correct data types (string, integer, array)
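To make the features listed above concrete, here is a minimal sketch of what those checks mean for a parsed hash. It is not a full JSON Schema validator and not how the provider enforces the schema; `schema_errors` is a hypothetical helper, and it assumes string keys as produced by `JSON.parse`.

```ruby
# Type predicates for the JSON Schema types used in parse.json.
TYPE_CHECKS = {
  "string"  => ->(v) { v.is_a?(String) },
  "integer" => ->(v) { v.is_a?(Integer) },
  "array"   => ->(v) { v.is_a?(Array) },
  "object"  => ->(v) { v.is_a?(Hash) }
}.freeze

# Return a list of human-readable violations (empty when data conforms).
def schema_errors(data, schema, path = "$")
  type = schema["type"]
  return ["#{path}: expected #{type}"] unless TYPE_CHECKS.fetch(type).call(data)

  case type
  when "object"
    props = schema.fetch("properties", {})
    # Required keys that are absent.
    errors = (schema.fetch("required", []) - data.keys).map { |k| "#{path}.#{k}: missing" }
    # Keys not declared in the schema, rejected when additionalProperties is false.
    if schema["additionalProperties"] == false
      errors += (data.keys - props.keys).map { |k| "#{path}.#{k}: unexpected" }
    end
    props.each do |key, sub|
      errors += schema_errors(data[key], sub, "#{path}.#{key}") if data.key?(key)
    end
    errors
  when "array"
    items = schema["items"]
    return [] unless items
    data.each_with_index.flat_map { |v, i| schema_errors(v, items, "#{path}[#{i}]") }
  else
    []
  end
end

education_schema = {
  "type" => "object",
  "properties" => {
    "degree" => { "type" => "string" },
    "institution" => { "type" => "string" },
    "year" => { "type" => "integer" }
  },
  "required" => %w[degree institution year],
  "additionalProperties" => false
}

good = { "degree" => "BSc", "institution" => "Example University", "year" => 2015 }
bad  = { "degree" => "BSc", "year" => "2015", "gpa" => 3.9 }

good_errors = schema_errors(good, education_schema)
bad_errors  = schema_errors(bad, education_schema)
# bad_errors reports the missing institution, the unexpected gpa, and the non-integer year.
```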

Schema Options

Static Schema Files

Define schemas in JSON files under app/views/agents/resume_extractor/:

ruby

  response_format: :json_schema  # Loads parse.json automatically

When to use:

Standard data structures
Stable requirements
Team collaboration (reviewable JSON files)
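Because static schema files are reviewed by hand, a small lint step can catch the two mistakes that strict mode commonly rejects: an object without `additionalProperties: false`, or a property missing from `required`. The sketch below is an assumption-laden illustration, not part of Active Agent; `strict_mode_issues` is a hypothetical helper and the inline schema stands in for the contents of parse.json.

```ruby
require "json"

# Walk a schema and collect strict-mode problems for every object node.
def strict_mode_issues(node, path = "schema")
  return [] unless node.is_a?(Hash)

  issues = []
  if node["type"] == "object"
    props = node.fetch("properties", {}).keys
    issues << "#{path}: additionalProperties must be false" unless node["additionalProperties"] == false
    missing = props - node.fetch("required", [])
    issues << "#{path}: not in required: #{missing.join(", ")}" unless missing.empty?
  end
  # Recurse into nested properties and array items.
  node.fetch("properties", {}).each do |key, child|
    issues.concat(strict_mode_issues(child, "#{path}.#{key}"))
  end
  issues.concat(strict_mode_issues(node["items"], "#{path}[]")) if node.key?("items")
  issues
end

# Stand-in for JSON.parse(File.read("app/views/agents/resume_extractor/parse.json"))["schema"].
schema = JSON.parse(<<~SCHEMA)
  {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "education": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": { "degree": { "type": "string" } },
          "required": []
        }
      }
    },
    "required": ["name", "education"],
    "additionalProperties": false
  }
SCHEMA

issues = strict_mode_issues(schema)
# The nested education item is flagged twice; the root object passes.
```

A check like this could run in a test suite so schema edits are validated before they reach the provider.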

Model-Generated Schemas

Generate schemas dynamically from your models:

Files: resume.rb, resume_extractor_agent.rb

ruby

class Resume
  include ActiveModel::Model
  include ActiveModel::Attributes
  include ActiveAgent::SchemaGenerator

  attribute :name, :string
  attribute :email, :string
  attribute :phone, :string
  attribute :education
  attribute :experience

  validates :name, presence: true, length: { minimum: 2 }
  validates :email, presence: true, format: { with: URI::MailTo::EMAIL_REGEXP }
  validates :phone, presence: true
end

ruby

class ResumeExtractorAgent < ApplicationAgent
  generate_with :openai, model: "gpt-4o"

  def parse
    prompt(
      message: "Extract resume data into JSON.",
      document: params[:document],
      response_format: {
        type: "json_schema",
        json_schema: Resume.to_json_schema(strict: true, name: "resume_schema")
      }
    )
  end
end

When to use:

Existing ActiveRecord/ActiveModel classes
Schema mirrors database structure
Single source of truth for validations
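To illustrate the idea of deriving a schema from attribute declarations, here is a minimal sketch of one possible mapping. It is not the implementation of `ActiveAgent::SchemaGenerator` or `to_json_schema`; `schema_from_attributes` and `JSON_TYPES` are hypothetical names, and the output shape simply mirrors the parse.json example.

```ruby
require "json"

# Hypothetical mapping from attribute type names to JSON Schema types.
JSON_TYPES = {
  string: "string",
  integer: "integer",
  float: "number",
  boolean: "boolean"
}.freeze

# Build a strict object schema from a { attribute_name => type } hash.
def schema_from_attributes(name, attributes)
  properties = attributes.to_h do |attr, type|
    [attr.to_s, { "type" => JSON_TYPES.fetch(type) }]
  end

  {
    "name" => name,
    "strict" => true,
    "schema" => {
      "type" => "object",
      "properties" => properties,
      "required" => properties.keys,   # strict mode: every property is required
      "additionalProperties" => false
    }
  }
end

resume_schema = schema_from_attributes(
  "resume_schema",
  { name: :string, email: :string, phone: :string }
)
puts JSON.pretty_generate(resume_schema)
```

Untyped attributes such as `education` and `experience` have no obvious JSON type, so a real generator needs extra information for them; this sketch raises `KeyError` for any type it does not know.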

Learn more: Structured Output

Common Patterns

Background Processing

For high-volume processing:

ruby

class ResumeProcessingJob < ApplicationJob
  def perform(pdf_path)
    pdf_data = File.binread(pdf_path) # binary read avoids encoding conversion of PDF bytes
    pdf_url = "data:application/pdf;base64,#{Base64.strict_encode64(pdf_data)}"

    response = ResumeExtractorAgent.with(document: pdf_url).parse.generate_now

    Resume.create!(response.message.parsed_json) if response.success?
  end
end

# Enqueue jobs
Dir.glob("resumes/*.pdf").each do |path|
  ResumeProcessingJob.perform_later(path)
end
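The schema shown earlier uses camelCase keys such as `jobTitle`, while Rails attributes are conventionally snake_case. If your model follows that convention, the extracted hash may need its keys normalized before it is persisted. A minimal sketch without ActiveSupport follows; `underscore` and `deep_underscore_keys` are hypothetical helpers and the sample values are made up.

```ruby
# Convert "jobTitle" to "job_title" without relying on ActiveSupport.
def underscore(key)
  key.to_s.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase
end

# Recursively rewrite hash keys, descending into nested hashes and arrays.
def deep_underscore_keys(value)
  case value
  when Hash  then value.to_h { |k, v| [underscore(k), deep_underscore_keys(v)] }
  when Array then value.map { |v| deep_underscore_keys(v) }
  else value
  end
end

# Made-up sample shaped like the extraction result.
extracted = {
  "name" => "John Doe",
  "experience" => [
    { "jobTitle" => "Senior Software Engineer", "company" => "Acme" }
  ]
}

normalized = deep_underscore_keys(extracted)
# "jobTitle" becomes "job_title"; keys that are already lowercase are unchanged.
```

In a Rails app, `deep_transform_keys` from ActiveSupport covers the same ground; the plain-Ruby version is shown only to keep the example self-contained.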

Consensus Validation

Increase confidence in extracted data by requiring multiple attempts to agree:

ruby

class ResumeExtractorAgent < ApplicationAgent
  generate_with :openai, model: "gpt-4o"

  # Require two extraction attempts to produce identical results
  around_prompt do |agent, action|
    attempt_one = action.call
    attempt_two = action.call

    next if attempt_one.message.parsed_json == attempt_two.message.parsed_json

    fail "Consensus not reached in #{agent.class.name}##{agent.action_name}: " \
         "Two attempts produced different results"
  end

  def parse
    prompt(
      message: "Extract resume data into JSON.",
      document: params[:document],
      response_format: :json_schema
    )
  end
end

This validates extraction reliability by running the agent twice and comparing results. Useful for:

Critical data where accuracy is essential
Detecting inconsistent model outputs
Building confidence in extracted data
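Requiring two identical results can be strict when the model occasionally varies a single field. One alternative, sketched here as an assumption rather than an Active Agent feature, is a majority vote over several attempts. `consensus` is a hypothetical helper; the block passed to it stands in for a call such as `ResumeExtractorAgent.with(document: pdf_data).parse.generate_now.message.parsed_json`.

```ruby
# Run the block `attempts` times and return the result a strict majority agrees on.
def consensus(attempts: 3)
  results = Array.new(attempts) { yield }
  # Hashes with equal content count as the same result, regardless of key order.
  winner, votes = results.tally.max_by { |_, count| count }
  unless votes > attempts / 2.0
    raise "Consensus not reached: best result had #{votes}/#{attempts} votes"
  end
  winner
end

# Stand-in for three extraction calls with one divergent answer.
outputs = [
  { "name" => "John Doe" },
  { "name" => "Jon Doe" },
  { "name" => "John Doe" }
].each

result = consensus(attempts: 3) { outputs.next }
# result is the hash that two of the three attempts produced.
```

Each extra attempt is another provider call, so the number of attempts trades cost and latency against confidence.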

Provider Support

Resume extraction works with providers that support:

PDF processing - Native or via plugins
Structured output - JSON schema validation

Recommended Providers

Provider     Model               Notes
OpenAI       gpt-4o              Native PDF support, structured output
OpenAI       gpt-4o-mini         Faster, lower cost
Anthropic    claude-3-5-sonnet   Strong reasoning, base64 PDF
OpenRouter   openai/gpt-4o       Access via OpenRouter

TIP

OpenAI's GPT-4o models provide the best balance of accuracy and speed for resume extraction with native structured output support.
