Data Extraction
Extract structured data from PDF resumes using AI-powered parsing.
Setup
rails generate active_agent:agent resume_extractor parse --json-schema
This creates:
- app/agents/resume_extractor_agent.rb - Agent class
- app/views/agents/resume_extractor/instructions.md - Instructions
- app/views/agents/resume_extractor/parse.json - JSON schema
Quick Start
Download this sample resume to test the agent: Sample Resume
# Read the PDF as binary and encode it as a data URI
pdf_file = File.binread(path + "sample_resume.pdf")
pdf_data = "data:application/pdf;base64,#{Base64.strict_encode64(pdf_file)}"
# Extract structured data
response = ResumeExtractorAgent.with(document: pdf_data).parse.generate_now
# Access parsed fields
resume = response.message.parsed_json
resume[:name] # => "John Doe"
resume[:email] # => "john.doe@example.com"
resume[:experience] # => [{"job_title"=>"Senior Software Engineer", ...}]
JSON Message
activeagent/test/docs/examples/data_extraction_agent_examples_test.rb:45
{
  "name": "John Doe",
  "email": "john.doe@example.com",
  "phone": "(555) 123-4567",
  "education": [
    {
      "degree": "BS Computer Science",
      "institution": "Stanford University",
      "year": 2020
    }
  ],
  "experience": [
    {
      "job_title": "Senior Software Engineer",
      "company": "TechCorp",
      "duration": "2020-2024"
    }
  ]
}
How It Works
The agent uses structured output to guarantee JSON matching your schema:
class ResumeExtractorAgent < ApplicationAgent
  generate_with :openai, model: "gpt-4o"

  def parse
    prompt(
      message: "Extract resume data into JSON.",
      document: params[:document],
      response_format: :json_schema # Loads parse.json schema
    )
  end
end
{
  "name": "resume_schema",
  "strict": true,
  "schema": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "email": { "type": "string" },
      "phone": { "type": "string" },
      "education": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "degree": { "type": "string" },
            "institution": { "type": "string" },
            "year": { "type": "integer" }
          },
          "required": ["degree", "institution", "year"],
          "additionalProperties": false
        }
      },
      "experience": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "job_title": { "type": "string" },
            "company": { "type": "string" },
            "duration": { "type": "string" }
          },
          "required": ["job_title", "company", "duration"],
          "additionalProperties": false
        }
      }
    },
    "required": ["name", "email", "phone", "education", "experience"],
    "additionalProperties": false
  }
}
Key features:
- strict: true - Enforces exact schema compliance
- additionalProperties: false - Rejects unexpected fields
- Automatic JSON parsing - response.message.content returns a hash
- Type validation - Ensures correct data types (string, integer, array)
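To make the type guarantee concrete, here is a minimal sketch of a defensive check on the parsed hash, written in plain Ruby. The `EXPECTED_TYPES` table and the `type_errors` helper are hypothetical names introduced for illustration and are not part of Active Agent. With `strict: true` the provider is expected to enforce the schema already, so a check like this is mostly useful as an assertion in tests.

```ruby
# Hypothetical helper, not part of Active Agent.
# Maps each top-level field of the resume schema to the Ruby class we expect.
EXPECTED_TYPES = {
  "name" => String,
  "email" => String,
  "phone" => String,
  "education" => Array,
  "experience" => Array
}.freeze

# Returns a list of human-readable problems. An empty list means every
# expected field is present with the expected type and nothing extra appears.
def type_errors(resume)
  data = resume.transform_keys(&:to_s) # accept symbol or string keys
  errors = []

  EXPECTED_TYPES.each do |field, klass|
    if !data.key?(field)
      errors << "missing field: #{field}"
    elsif !data[field].is_a?(klass)
      errors << "#{field} should be #{klass}, got #{data[field].class}"
    end
  end

  (data.keys - EXPECTED_TYPES.keys).each do |extra|
    errors << "unexpected field: #{extra}"
  end

  errors
end

sample = { name: "John Doe", email: "john.doe@example.com",
           phone: "(555) 123-4567", education: [], experience: [] }
type_errors(sample) # => []
```

The check only looks at top-level fields; nested items in `education` and `experience` would need the same treatment if you want to verify them too.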
Schema Options
Static Schema Files
Define schemas in JSON files under app/views/agents/resume_extractor/:
response_format: :json_schema # Loads parse.json automatically
When to use:
- Standard data structures
- Stable requirements
- Team collaboration (reviewable JSON files)
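Because static schema files are edited by hand, a small lint step can catch mistakes before they reach the provider. The sketch below assumes the two strict-mode rules visible in the schema on this page: every object sets `additionalProperties` to `false`, and every property is listed under `required`. The `strict_schema_problems` helper is a hypothetical name, not part of Active Agent, and the inline schema stands in for the contents of a file such as parse.json.

```ruby
require "json"

# Hypothetical lint helper, not part of Active Agent.
# Walks a JSON schema (as a Hash) and reports objects that break the two
# strict-mode rules assumed above.
def strict_schema_problems(schema, path = "$")
  problems = []

  case schema["type"]
  when "object"
    properties = schema.fetch("properties", {})
    required = schema.fetch("required", [])

    unless schema["additionalProperties"] == false
      problems << "#{path}: additionalProperties must be false"
    end

    missing = properties.keys - required
    unless missing.empty?
      problems << "#{path}: not listed in required: #{missing.join(', ')}"
    end

    # Recurse into each property so nested objects are checked as well.
    properties.each do |name, child|
      problems.concat(strict_schema_problems(child, "#{path}.#{name}"))
    end
  when "array"
    items = schema["items"]
    problems.concat(strict_schema_problems(items, "#{path}[]")) if items.is_a?(Hash)
  end

  problems
end

# Deliberately broken example: the nested object omits additionalProperties
# and leaves "degree" out of required.
definition = JSON.parse(<<~JSON)
  {
    "name": "resume_schema",
    "strict": true,
    "schema": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "education": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": { "degree": { "type": "string" } },
            "required": []
          }
        }
      },
      "required": ["name", "education"],
      "additionalProperties": false
    }
  }
JSON

puts strict_schema_problems(definition["schema"])
# $.education[]: additionalProperties must be false
# $.education[]: not listed in required: degree
```

Running a check like this in a test keeps schema reviews focused on the data model instead of on bookkeeping.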
Model-Generated Schemas
Generate schemas dynamically from your models:
class Resume
  include ActiveModel::Model
  include ActiveModel::Attributes
  include ActiveAgent::SchemaGenerator

  attribute :name, :string
  attribute :email, :string
  attribute :phone, :string
  attribute :education
  attribute :experience

  validates :name, presence: true, length: { minimum: 2 }
  validates :email, presence: true, format: { with: URI::MailTo::EMAIL_REGEXP }
  validates :phone, presence: true
end
class ResumeExtractorAgent < ApplicationAgent
  generate_with :openai, model: "gpt-4o"

  def parse
    prompt(
      message: "Extract resume data into JSON.",
      document: params[:document],
      response_format: {
        type: "json_schema",
        json_schema: Resume.to_json_schema(strict: true, name: "resume_schema")
      }
    )
  end
end
When to use:
- Existing ActiveRecord/ActiveModel classes
- Schema mirrors database structure
- Single source of truth for validations
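To illustrate the idea of deriving a schema from attribute declarations, here is a plain-Ruby sketch that does not depend on ActiveModel. It is not how `ActiveAgent::SchemaGenerator` is implemented; `JSON_TYPES`, `schema_from_attributes`, and the attribute-hash input format are all hypothetical and exist only to show the mapping from typed attributes to a strict JSON schema.

```ruby
# Illustrative only: a stand-in for the idea behind schema generation,
# NOT the implementation of ActiveAgent::SchemaGenerator.
JSON_TYPES = {
  string: "string",
  integer: "integer",
  float: "number",
  boolean: "boolean"
}.freeze

# attributes: Hash of attribute name => type symbol (hypothetical input format).
# Raises KeyError for a type that has no JSON mapping.
def schema_from_attributes(attributes, name:, strict: true)
  properties = attributes.to_h do |attr, type|
    [attr.to_s, { "type" => JSON_TYPES.fetch(type) }]
  end

  {
    "name" => name,
    "strict" => strict,
    "schema" => {
      "type" => "object",
      "properties" => properties,
      "required" => properties.keys, # every property is required
      "additionalProperties" => false
    }
  }
end

schema = schema_from_attributes(
  { name: :string, email: :string, phone: :string },
  name: "resume_schema"
)
schema["schema"]["required"] # => ["name", "email", "phone"]
```

Untyped attributes such as `education` and `experience` in the model above carry no type information, so a generator has to decide how to represent them; this sketch simply does not cover that case.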
Learn more: Structured Output
Common Patterns
Background Processing
For high-volume processing:
class ResumeProcessingJob < ApplicationJob
  def perform(pdf_path)
    # Read as binary so the PDF bytes are not altered before encoding
    pdf_data = File.binread(pdf_path)
    pdf_url = "data:application/pdf;base64,#{Base64.strict_encode64(pdf_data)}"
    response = ResumeExtractorAgent.with(document: pdf_url).parse.generate_now
    Resume.create!(response.message.content) if response.success?
  end
end

# Enqueue jobs
Dir.glob("resumes/*.pdf").each do |path|
  ResumeProcessingJob.perform_later(path)
end
Consensus Validation
Ensure extraction accuracy by requiring multiple attempts to agree:
class ResumeExtractorAgent < ApplicationAgent
  generate_with :openai, model: "gpt-4o"

  # Require two extraction attempts to produce identical results
  around_prompt do |agent, action|
    attempt_one = action.call
    attempt_two = action.call
    next if attempt_one.message.parsed_json == attempt_two.message.parsed_json

    fail "Consensus not reached in #{agent.class.name}##{agent.action_name}: " \
         "Two attempts produced different results"
  end

  def parse
    prompt(
      message: "Extract resume data into JSON.",
      document: params[:document],
      response_format: :json_schema
    )
  end
end
This validates extraction reliability by running the agent twice and comparing results. Useful for:
- Critical data where accuracy is essential
- Detecting inconsistent model outputs
- Building confidence in extracted data
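When two attempts disagree, it can help to know which fields differ instead of only that they differ. The sketch below is one possible way to report that, using plain hashes in place of real agent responses; `differing_fields` is a hypothetical helper, not part of Active Agent.

```ruby
# Hypothetical helper, not part of Active Agent.
# Returns the top-level keys whose values differ between two parsed results.
# Keys are compared as strings, and a key missing on one side reads as nil.
def differing_fields(first, second)
  a = first.transform_keys(&:to_s)
  b = second.transform_keys(&:to_s)

  (a.keys | b.keys).reject { |key| a[key] == b[key] }
end

attempt_one = { name: "John Doe", phone: "(555) 123-4567", email: "john.doe@example.com" }
attempt_two = { name: "John Doe", phone: "555-123-4567", email: "john.doe@example.com" }

differing_fields(attempt_one, attempt_two) # => ["phone"]
```

Including the list of differing fields in the failure message makes it easier to see whether the disagreement is a formatting difference or a genuinely different extraction.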
Provider Support
Resume extraction works with providers that support:
- PDF processing - Native or via plugins
- Structured output - JSON schema validation
Recommended Providers
| Provider | Model | Notes |
|---|---|---|
| OpenAI | gpt-4o | Native PDF support, structured output |
| OpenAI | gpt-4o-mini | Faster, lower cost |
| Anthropic | claude-3-5-sonnet | Strong reasoning, base64 PDF |
| OpenRouter | openai/gpt-4o | Access via OpenRouter |
TIP
OpenAI's GPT-4o models provide the best balance of accuracy and speed for resume extraction with native structured output support.
See Also
- Structured Output - JSON schema validation
- Messages - Multimodal content (PDFs, images)
- OpenAI Provider - Configuration details
- OpenRouter Provider - Alternative provider with 200+ models