Introducing Qualspec: LLM-Judged Testing for Ruby
Traditional testing works great when you can assert exact values. But how do you test whether an AI agent’s response is helpful? Or empathetic? Or stays in character?
You can’t write expect(response).to eq("the exact perfect answer"). LLM outputs are non-deterministic by design.
Qualspec solves this by using LLMs as judges. Define qualitative criteria in plain English, and let a model evaluate whether responses meet your standards.
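For example, instead of asserting an exact string, you describe what a good answer looks like and let the judge score it. Here is a minimal sketch using the RSpec helper covered later in this post (the response variable is whatever your agent returned):

# Brittle: breaks the moment the wording changes
expect(response).to eq("I'm sorry to hear that. Let me help.")

# Qualitative: passes for any response that meets the criteria
result = qualspec_evaluate(response, :empathetic)
expect(result).to be_passing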
New Models Drop Weekly. Can You Keep Up?
A new model drops. The benchmarks look promising. It’s 10x cheaper than what you’re using now.
But will it actually work for your use case?
You could manually test a few prompts. Copy-paste into a playground. Squint at the outputs. Repeat for the next model. And the next.
Or you could run your entire evaluation suite against every contender in parallel:
Qualspec.evaluation "Weekly Model Eval" do
candidates do
candidate "current", model: "anthropic/claude-3.5-sonnet"
candidate "challenger1", model: "google/gemini-2.0-flash"
candidate "challenger2", model: "deepseek/deepseek-chat"
candidate "challenger3", model: "openai/gpt-4o-mini"
end
scenario "complex reasoning" do
prompt "Explain why this Rails N+1 query is slow and fix it: ..."
eval :code_quality
eval "identifies the N+1 problem specifically"
eval "provides a working solution using includes or joins"
end
scenario "customer empathy" do
prompt "I've been on hold for 2 hours!"
eval :empathetic
eval :helpful
end
# ... 50 more scenarios covering your actual use cases
end
One command. All models. An HTML report showing exactly where each one excels or falls short. Now you can make data-driven decisions about model selection instead of guessing based on vibes and Twitter hype.
The model that wins MMLU might bomb your specific domain. The cheap one might be 90% as good at 10% the cost. You won’t know until you test your prompts against your criteria.
The Core Problem
When building AI-powered applications, you face a testing gap:
- Unit tests catch regressions but can’t evaluate quality
- Manual review doesn’t scale
- Exact string matching is brittle and misses the point
You need automated tests that understand intent, not just syntax.
RSpec Integration for Your Agents
Already have a production agent? Add qualitative tests alongside your existing specs:
require "qualspec/rspec"
RSpec.describe CustomerSupportAgent do
include Qualspec::RSpec::Helpers
it "handles frustrated customers appropriately" do
response = agent.call("This is the THIRD time I've called about this!")
result = qualspec_evaluate(response, :empathetic)
expect(result).to be_passing
end
end
The :empathetic rubric checks that your agent:
- Acknowledges the user’s frustration
- Doesn’t blame or talk down
- Offers concrete next steps
- Maintains a warm but professional tone
No regex. No fragile substring checks. Just clear criteria evaluated by an LLM judge.
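To make that concrete, the :empathetic rubric amounts to roughly the following, written with the define_rubric DSL shown below (illustrative only, not the gem's actual internal definition):

Qualspec.define_rubric :empathetic do
  criterion "Acknowledges the user's frustration"
  criterion "Doesn't blame or talk down to the user"
  criterion "Offers concrete next steps"
  criterion "Maintains a warm but professional tone"
end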
Built-in Rubrics
Qualspec ships with rubrics for common evaluation patterns:
- :tool_calling - Does the agent use tools correctly?
- :in_character - Does it maintain persona consistency?
- :safety - Does it refuse harmful requests appropriately?
- :helpful - Does it actually answer the question?
- :concise - Does it avoid unnecessary verbosity?
- :code_quality - Is generated code correct and idiomatic?
- :grounded - Does it stick to provided context without hallucinating?
- :follows_instructions - Does it respect format and constraints?
Or define your own:
Qualspec.define_rubric :my_brand_voice do
  criterion "Uses casual, friendly language"
  criterion "Avoids corporate jargon"
  criterion "Includes exactly one emoji per response"
end
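A custom rubric should then be usable anywhere a built-in one is, referenced by name just like :empathetic or :helpful. The snippet below assumes that, based on the scenario DSL shown earlier:

scenario "launch announcement" do
  prompt "Write a short announcement for our new feature."
  eval :my_brand_voice  # assumed: custom rubrics are referenced by symbol, like the built-ins
end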
Recording for CI
LLM calls are expensive and slow. Qualspec integrates with VCR to record responses and replay them in CI:
QUALSPEC_RECORD_MODE=once bundle exec rspec # Record
QUALSPEC_RECORD_MODE=none bundle exec rspec # Playback
Your qualitative tests run fast and deterministically in CI, while still catching regressions when you change prompts or agent logic.
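If you already manage VCR yourself, a standard configuration keeps cassettes in one place and keeps your API key out of the recordings. This is plain VCR setup, not Qualspec-specific, and the paths are illustrative:

# spec/support/vcr.rb -- illustrative location
require "vcr"

VCR.configure do |config|
  config.cassette_library_dir = "spec/cassettes"  # where recorded responses live
  config.hook_into :webmock                       # intercept HTTP from the LLM client
  config.filter_sensitive_data("<QUALSPEC_API_KEY>") { ENV["QUALSPEC_API_KEY"] }
end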
Get Started
gem install qualspec
export QUALSPEC_API_KEY=your_openrouter_key
Check out the GitHub repository for full documentation.
Stop guessing which model is best. Stop manually testing every new release. Define your criteria once, run against everything, and let the data decide.