EvalFox — Know exactly how good your agent is, all the time

How it works

From agent URL to quality scores in under 10 minutes

No infrastructure to manage. No code changes required. Just point EvalFox at your agent and get a full quality picture instantly.

Connect your agent

Paste your agent's endpoint URL and configure auth headers. EvalFox auto-detects the protocol — REST, SSE, WebSocket, or OpenAI-compatible — and fires a live probe in seconds.

Define your test suite

Upload your PRD or system prompt and EvalFox auto-generates test cases and judging criteria matched to your requirements. Or build tests manually — single-turn or full multi-turn conversations.

Watch scores run continuously

EvalFox evaluates every run across quality, safety, and performance dimensions. Regressions trigger instant alerts. Every score ships with LLM judge reasoning you can inspect and trust.

01 Connect

https://api.acme-agent.com/v1/chat REST ● Connected

02 Generate tests

Happy-path coverage 12

Edge & negative cases 8

Adversarial / jailbreak 6

+ 24 more generated from your PRD

03 Live score

87/100

48 / 50 tests passing

LIVE

Features

Everything your team needs to ship confident agents

Built for the full evaluation lifecycle — from first prototype to production monitoring.

Zero-code endpoint testing

Paste any agent URL. Configure request shape and response path with a visual builder. No SDK, no proxy, no backend changes — just a URL and you're evaluating.

PRD → test suite in seconds

Upload your product requirements doc. EvalFox parses it and generates happy-path, edge-case, negative, adversarial, and multi-turn tests — plus a matched set of judging criteria.

LLM-as-Judge with reasoning

Every score comes with a chain-of-thought explanation from a configurable judge model. Understand exactly why a test passed or failed, not just that it did.

Regression detection

Baseline pinning, automated comparison on every new run, and immediate alerts via Slack, email, or webhook the moment a dimension drops below threshold.

CI/CD integration

GitHub Actions, GitLab CI, and CircleCI plugins that trigger a full test suite on every push and block merges when the score falls below your quality bar.

Continuous monitoring

Schedule runs every 15 minutes or on deploy. In proxy mode, EvalFox scores a sample of real production traffic — a live quality signal with no synthetic overhead.

Multi-dimensional scoring

One score isn't enough. We track all of them.

EvalFox scores every agent response across quality, reliability, safety, and performance — each dimension tracked independently so you know exactly what shifted, and why.

Build your own scoring model

Correctness Safety Coherence Tone match Brand voice Code execution + Add custom dimension

Add unlimited custom dimensions, each with its own rubric and judge prompt.
Set custom weights so the composite EvalFox Score reflects what matters to you — weight guardrails at 40% or prioritize correctness, your call.
Bring your own judge model — GPT, Claude, or self-hosted — and tune thresholds per dimension.

Evaluation dimensions

Quality

Correctness, relevance, coherence, completeness

↑ 3

Safety

Guardrails, PII leakage, jailbreak resistance

↑ 1

Reliability

Context retention, tool call accuracy, hallucination rate

↑ 6

Performance

TTFT, total latency, token cost per call

↓ 8

Pricing

Start free, scale as you grow

Usage-based overages, transparent judge model cost pass-through, no surprises.

Free

forever

1 agent
50 test cases
100 runs / month
5 default scoring dimensions
Dashboard & run history

Get started

Starter

$49

per month

3 agents
500 test cases
2,000 runs / month
5 custom criteria
Regression alerts
CI/CD integration

Start trial

Know exactly how good your agent is, all the time.