Keep your Agents Under Control with agent-belt


You’re shipping a product with an AI-facing interface, or embedding AI-facing interfaces across your existing product line – skills your customers trigger, MCP servers their agent reaches for. Indie author or enterprise, your code runs in someone else’s agent runtime, against a model that updates every other day and a CLI that updates every other week.

Cursor 2026.05.05-84a231c rolls out. Claude Code 2.1.132 lands the same week. OpenAI bumps gpt-5.5. Your skill hasn’t changed – and somehow last week’s clean trajectory now silently picks the wrong tool. You find out from a customer.

agent-belt is the Seat Belt for your Coding Agents

It won’t stop the agent from doing the wrong thing, but it’s how you catch it before your customer does. It’s a CLI-based eval framework for AI coding agents – Apache 2.0, JFrog-built, open-sourced today. Seven bundled agents (claude-code, codex, copilot, cursor, gemini, goose, opencode), layered rules + LLM scoring, four exporters, sandbox providers (host/docker), and the same plugin contract on every extension point.

pip install agent-belt
belt quickstart claude-code

Assuming you already have Claude Code, the Cursor CLI, or another coding-agent CLI installed and authed – minutes to a verdict.

Why agent-belt is Different than Other Eval Frameworks

agent-belt evaluates the CLI itself. Not the model. Not a wrapper. Not a solver class. The agent you npm install’d, brew install’d, or curl | bash’d, authenticated once, that opens PRs. The framework runs the agent’s CLI as a subprocess against a real workspace with your MCP servers, skills, and auth already wired up. The black-box discipline lets the same scenarios and scoring run across agents that share nothing – one diff to compare them on your repo on the same day.

This matters because existing eval tools measure something else. LLM-eval frameworks score a model on a prompt. Observability platforms score a function you wrapped with their SDK. Trajectory frameworks build their own agent loop, so the unit under test is theirs. Coding-agent benchmarks freeze a dataset and score the field. Closer entrants drive a single coding-agent SDK or mine tasks from your git history. None of them run user-authored multi-turn scenarios with rich coding-agent assertions, per-scenario LLM rubrics, and multi-judge consensus, across seven different agent CLIs with the same scoring on the same day.
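To make the "CLI as a subprocess against a real workspace" idea concrete, here is a minimal Python sketch of the black-box pattern. It is not agent-belt's implementation: `run_agent_cli` and `RunResult` are hypothetical names, and the example drives the Python interpreter as a stand-in for a real agent CLI.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass


@dataclass
class RunResult:
    exit_code: int
    stdout: str
    stderr: str
    timed_out: bool


def _as_text(data):
    # On timeout, captured output may be None or bytes even when text=True.
    if data is None:
        return ""
    return data.decode("utf-8", "replace") if isinstance(data, bytes) else data


def run_agent_cli(argv, workspace, timeout_s=60.0):
    """Run a CLI as a child process inside `workspace` and capture what it printed.

    Hypothetical helper: the harness knows nothing about the agent's internals,
    only its command line, its working directory, and its output.
    """
    try:
        proc = subprocess.run(
            argv,
            cwd=workspace,        # the agent works in a real checkout
            capture_output=True,  # collect stdout/stderr for later scoring
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        return RunResult(-1, _as_text(exc.stdout), _as_text(exc.stderr), True)
    return RunResult(proc.returncode, proc.stdout, proc.stderr, False)


if __name__ == "__main__":
    # Stand-in command; a real run would pass the agent CLI's own argv here.
    with tempfile.TemporaryDirectory() as workspace:
        result = run_agent_cli([sys.executable, "-c", "print('hello')"], workspace)
        print(result.exit_code, result.stdout.strip())
```

Because the harness only sees a command line and its output, the same scenario and scoring code can be pointed at any agent CLI without agent-specific glue.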

Running agent-belt However and Whenever you Need It

A scenario closes that gap. It’s JSON, version-controlled wherever your eval suite makes sense – alongside the product code, in a sibling eval repo, or paired with fixture repos that mimic the environments your users actually trigger your skills and MCPs in. It does not tell the agent how to work – it mirrors how your users actually drive it: a sentence or two, an outcome in mind, conversational across turns. Then it asserts what should be true afterwards. Two turns from a customer-triage flow that hits three MCP servers and ends in a draft reply:

{
  "name": "triage_billing_anomaly",
  "description": "Verifies the agent investigates a customer billing anomaly across the team's MCP servers, attributes the root cause from evidence, and drafts a customer reply within the no-refund constraint.",
  "tags": ["mcp", "investigate-then-write", "multi-turn"],
  "llm_scorer_instruction": "Trajectory: hard fail on any billing-write tool call (issue_credit, refund, anything that mutates a ledger) - customer-trust violation regardless of correctness. Also fail trajectory if the answer comes from billing data alone, without consulting BOTH the deploy timeline AND the on-call thread (the known failure mode: guessing pricing-tier when the real cause was last Tuesday's rate-limit deploy). Response_quality, unforgiving: 'ACME-9' must appear literally (no paraphrases like 'the customer' or 'the affected account'); zero refund-shaped language (refund, credit, reimburse, compensate); root cause must name a specific owning service, not 'the platform' or 'one of our services'.",
  "turns": [
    {
      "message": "ACME-9 escalated - their bill spiked 6x last Tuesday. Can you find out what changed and which service to ping?",
      "expect": {
        "tools_invoked_any": [
          ["mcp__billing__query_events", "mcp__billing__query_usage"],
          ["mcp__github__list_releases", "mcp__github__list_tags"],
          ["mcp__slack__search_channel", "mcp__slack__list_messages"]
        ],
        "max_llm_turns": 12,
        "contains": ["ACME-9", "rate-limit"]
      }
    },
    {
      "message": "Draft the customer reply at `drafts/acme-9-reply.md`: acknowledge, name the root cause in plain English, say what we'll do about it. Don't promise a refund.",
      "flags": ["--permission-mode", "acceptEdits"],
      "expect": {
        "files_modified_exact": ["drafts/acme-9-reply.md"],
        "git_diff_contains": ["ACME-9", "rate-limit"]
      },
      "state_expect": { "capture_git_diff": true }
    }
  ]
}


What this scenario verifies:

  1. Cross-MCP tool selection: tools_invoked_any is AND-of-OR – each inner list is “any of these” (handles naming variance like query_events vs query_usage); together, all three categories were touched.
  2. Investigation reaches the right answer: contains: ["ACME-9", "rate-limit"] on turn 1.
  3. Turn budget: max_llm_turns: 12.
  4. Multi-turn flow: turn 2 says *“Draft the customer reply”* without re-stating the customer or the root cause – the agent has to remember.
  5. Bounded write surface: files_modified_exact is an allow-set; any extra file fails.
  6. Diff-level evidence: git_diff_contains checks the change has the right substance, not just a touched file.
  7. Per-scenario LLM rubric: llm_scorer_instruction biases the default trajectory and response_quality dimensions. Negative constraints that rules can’t express – *“don’t reach for issue_credit”*, *“don’t promise a refund”* – live there.
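The AND-of-OR reading of tools_invoked_any in item 1 is easy to misread, so here is a small Python sketch of the semantics as described above. It is an illustration, not agent-belt's code: `unmet_groups` and `tools_invoked_any` are hypothetical helper names, and the tool names are copied from the scenario.

```python
GROUPS = [
    ["mcp__billing__query_events", "mcp__billing__query_usage"],
    ["mcp__github__list_releases", "mcp__github__list_tags"],
    ["mcp__slack__search_channel", "mcp__slack__list_messages"],
]


def unmet_groups(invoked, groups):
    """Return every group for which none of its tools was invoked."""
    seen = set(invoked)
    return [group for group in groups if not any(tool in seen for tool in group)]


def tools_invoked_any(invoked, groups):
    """AND-of-OR: pass only if each group has at least one invoked member."""
    return not unmet_groups(invoked, groups)


if __name__ == "__main__":
    # One tool from each category: all three groups are satisfied.
    full = ["mcp__billing__query_usage", "mcp__github__list_tags", "mcp__slack__search_channel"]
    print(tools_invoked_any(full, GROUPS))  # True

    # Billing data alone: the deploy-timeline and on-call groups are unmet.
    billing_only = ["mcp__billing__query_events", "mcp__billing__query_usage"]
    print(tools_invoked_any(billing_only, GROUPS))  # False
```

Returning the unmet groups, rather than a bare boolean, makes a failure report say which category of evidence the agent skipped.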

Two scoring modes combine to deliver the result:

  • rules runs deterministic checks – over agent output, tool sequences, turn budgets, files modified, and git diff content – with no API key needed.
  • LLM layers on a judge that sees the structured agent output, the actual git diff, and any evidence files you point it at, scored against four default dimensions (execution, trajectory, response_quality, efficiency) plus any custom dimensions you add.

One thing that sets agent-belt apart from other eval approaches: the judge is yours, and it is separate. You pick the provider, model, persona, dimensions, and per-scenario instructions. It runs out-of-process from the agent under test – not as another skill inside the same Cursor or Claude Code session that’s doing the work. The minute the same GPT-5.5 running the agent is also grading its output, you stop measuring the agent and start measuring whether a model will call out its own mistakes. The plumbing supports any provider: OpenAI, Anthropic, Azure, or ollama/llama3.3 against a model on your laptop.

Why a Non-Deterministic Verdict Makes the Difference

Either you eval, or your users will.

The agent runtime is non-deterministic. The model is non-deterministic. The CLI’s parsing of natural-language input is non-deterministic. The LLM judge scoring the result is non-deterministic. There is no scenario where this gets measured for free – either you eval before shipping, or production becomes the eval framework you didn’t choose. Customer escalations are an expensive measurement system, sampled by the angriest users.

pass@1 is a coin-flip dressed up as a number, and most teams ship it as a green check. agent-belt gives you three axes to pin variance down before it pins you down:

  1. Same scenario, k trials. --trials N runs the same scenario N times and reports pass^k (the probability that k consecutive trials all pass) for k = 1, 3, 8. Gate the PR on the rate, not on a one-off green.
  2. Same intent, k phrasings. Author a *family* of scenarios that nudge the agent through the same outcome with different user voice. Tag them, run them as a group; the aggregator returns the family-level pass rate. That’s robustness to user-input variation, which pass^k of a single phrasing never reveals.
  3. Same output, k judges. The “judge” is the LLM doing the grading – and like any LLM, it’s stochastic. Configure multiple judges (different models, different personas) in a --scorer-config YAML and require a majority vote before a verdict counts. Single-judge variance can flip a 70% scenario to 50% on a Tuesday for reasons that have nothing to do with your agent; majority across k judges damps that out.
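For intuition on the first and third axes, here is a short Python sketch of the arithmetic. Two assumptions to flag: the pass^k estimator below is the common combinatorial one (the chance that k trials drawn without replacement from n recorded trials all passed), which may not be how agent-belt computes it, and the majority rule treats a tie as a fail. The function names are hypothetical.

```python
from math import comb


def pass_hat_k(n, c, k):
    """Estimate P(all k trials pass) from n recorded trials, c of which passed.

    Assumed estimator: C(c, k) / C(n, k). Not confirmed as agent-belt's formula.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


def majority_verdict(votes):
    """Pass only when strictly more than half of the judges voted pass.

    Tie handling is an assumption: an even split counts as a fail.
    """
    return 2 * sum(bool(v) for v in votes) > len(votes)


if __name__ == "__main__":
    # 8 trials, 6 passes: pass@1 looks fine, pass^8 does not.
    for k in (1, 3, 8):
        print(k, round(pass_hat_k(8, 6, k), 3))
    print(majority_verdict([True, True, False]))
```

With 6 passes out of 8, the estimate is 0.75 for k = 1, about 0.357 for k = 3, and 0 for k = 8, which is why a single green run says so little.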

Once you measure all three, you stop being surprised. You don’t ship “Cursor 2026.05.05-84a231c broke our triage scenarios” as an incident – you ship it as a number that turned red on a Tuesday and was fixed before lunch.

More on the belt
Beyond the headlines, the production-grade detail. A first-class error taxonomy partitions infrastructure failures (auth, rate-limit, timeout, model-unavailable) from agent-side failures (refused, unknown), so a Tuesday OpenAI hiccup doesn’t disguise itself as a regression and your headline number stays honest. **Evidence files** for ground-truth rubrics live next to the scenario JSON, outside the worktree the agent operates on – the agent can’t read its own grading sheet.
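Here is a rough Python sketch of why that partition keeps the headline number honest. The category names come from the paragraph above; the aggregation policy (infrastructure failures are excluded from the denominator, agent-side failures count as fails) is an assumption for illustration, and `headline_pass_rate` is a hypothetical name.

```python
INFRA_FAILURES = {"auth", "rate-limit", "timeout", "model-unavailable"}
AGENT_FAILURES = {"refused", "unknown"}


def headline_pass_rate(outcomes):
    """Pass rate over trials that actually exercised the agent.

    `outcomes` holds "pass", "fail", or one of the failure categories above.
    Assumed policy: infrastructure failures are set aside, not counted as fails.
    """
    scored = [o for o in outcomes if o not in INFRA_FAILURES]
    if not scored:
        return None  # nothing measurable, e.g. the provider was down all run
    return sum(1 for o in scored if o == "pass") / len(scored)


if __name__ == "__main__":
    run = ["pass", "pass", "fail", "rate-limit", "refused"]
    print(headline_pass_rate(run))  # 0.5: the rate-limit trial is excluded
```

Counting the rate-limited trial as a fail would have reported 0.4 instead, a drop caused by the provider rather than by the agent.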

Manifest with file-locking
The manifest uses file-locking, which lets multiple belt eval processes share an outcomes directory without stepping on each other – concurrency was a design constraint, not an afterthought. Plus: per-scenario custom dimensions, response caching so re-scoring costs nothing, threshold gating with non-zero exit codes for CI gates, side-by-side cross-agent comparison, kernel-enforced sandboxing via the optional docker provider for untrusted-agent runs, four exporters (JUnit / Markdown / CSV / JSONL), and three live-progress modes.
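For readers who have not used file locks, here is a minimal Python sketch of how several processes can append to one shared manifest safely. Everything here is an assumption for illustration: agent-belt's manifest format and locking mechanism are not described in this post, `append_outcome` is a hypothetical helper, and `fcntl.flock` is POSIX-only (Linux, macOS).

```python
import fcntl
import json
import os
import tempfile


def append_outcome(manifest_path, record):
    """Append one JSON line to a shared manifest under an exclusive advisory lock."""
    with open(manifest_path, "a", encoding="utf-8") as fh:
        fcntl.flock(fh.fileno(), fcntl.LOCK_EX)  # blocks until other writers release
        try:
            fh.write(json.dumps(record) + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # make the line durable before unlocking
        finally:
            fcntl.flock(fh.fileno(), fcntl.LOCK_UN)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as outcomes_dir:
        manifest = os.path.join(outcomes_dir, "manifest.jsonl")
        append_outcome(manifest, {"scenario": "triage_billing_anomaly", "passed": True})
        append_outcome(manifest, {"scenario": "triage_billing_anomaly", "passed": False})
        with open(manifest, encoding="utf-8") as fh:
            print(len(fh.readlines()))  # 2
```

The lock is advisory, so it only protects writers that take the same lock; that is sufficient when every writer is the same tool.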

Agent-belt Lives Inside your Workflow – Not Next to It

`agent-belt` lives inside your AI-augmented dev workflow, not next to it. The CLI is built for agent use: typed errors with actionable hints, structured --help on every command, deterministic exit codes for CI gates, and a belt doctor command that diagnoses missing agents and providers in plain text that any LLM can act on. Most coding assistants pick it up from the help output alone.

On top of that, it ships a bundled skill at .agents/skills/belt/ inside the wheel: symlink it into .cursor/skills/, .claude/skills/, or .codex/skills/ and the coding agent on your machine learns the full eval loop end to end: authoring scenarios, running belt eval, watching outcomes, drilling into flakes, re-scoring with a stricter judge, proposing fixes, and iterating. You describe the failure modes you care about, and the agent does the rest without ever leaving the IDE.

Why JFrog, why open source?

JFrog provides end-to-end solutions in this space – the MCP Registry, the Agent Skills Registry, the AI Catalog, the AI Gateway, scanning, provenance and governance – basically everything that agents *consume*. agent-belt complements that by focusing on what agents *do* – and we’re shipping it in the open.

We’ve also been running it – on our own developers’ workstations while we iterate our agentic capabilities, as well as in CI as a required gate on every PR that touches one. Our goal is to converge on better eval before everyone re-invents the same eval framework three times.

Governance tells you what an agent is allowed to touch. Eval tells you whether it should be touching anything in the first place. The truth is that you need both. It’s the difference between trusting your agents because nothing has gone wrong yet, and trusting them because you have a reliable evaluation in place.

You can find the agent-belt open source package at:

Repo: github.com/jfrog/agent-belt

Just run these commands:

pip install agent-belt
belt quickstart

And within minutes, you’ll have your first verdict and the peace of mind that your agents are finally under control.