bayesbench documentation
Bayesian sequential benchmarking for LLMs and agents.
bayesbench helps you stop evaluations as soon as posterior evidence is strong enough,
instead of evaluating every model on every example.
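The core idea, stopping once the posterior is decisive, can be sketched in plain Python. This is a conceptual illustration, not the bayesbench API: the helper names, the flat Beta(1, 1) prior, and the 70% pass-rate target are all assumptions made for the example.

```python
# Conceptual sketch of Bayesian sequential early stopping (NOT the bayesbench
# API): evaluate a model item by item against a 70% pass-rate target, and stop
# once the Beta posterior over the pass rate is decisive either way.
import random

random.seed(0)

def posterior_prob_above(successes, failures, target, draws=20000):
    """Monte Carlo estimate of P(pass_rate > target) under a Beta(1+s, 1+f) posterior."""
    hits = sum(random.betavariate(1 + successes, 1 + failures) > target
               for _ in range(draws))
    return hits / draws

def run_until_confident(results, target=0.7, confidence=0.95, min_samples=10):
    """Consume pass/fail results until the posterior clears the confidence bar."""
    s = f = 0
    for i, passed in enumerate(results, start=1):
        s += passed
        f += 1 - passed
        if i >= min_samples:
            p = posterior_prob_above(s, f, target)
            # Stop as soon as the evidence is decisive in either direction.
            if p >= confidence or p <= 1 - confidence:
                return i, p
    return len(results), posterior_prob_above(s, f, target)

# A strong model (90% true pass rate) should typically stop long before 200 items.
stream = [int(random.random() < 0.9) for _ in range(200)]
used, prob = run_until_confident(stream)
```

The payoff is the `used` count: with a clearly strong (or clearly weak) model, the loop exits after a handful of samples instead of scoring the full set.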
What you'll find in these docs
- Getting started: installation, first benchmark, CLI usage.
- Workflows: end-to-end templates for LLM and agentic evaluations.
- Examples gallery: copy-pasteable snippets mapped to the `examples/` folder.
- Concepts: confidence, early stopping, posterior choices, and tuning tips.
- Adapters: provider/framework integrations and when to use each.
- API reference: core classes and methods.
Why bayesbench
- Lower evaluation cost: stop when evidence is sufficient.
- Statistically principled: Bayesian posteriors and credible intervals.
- Flexible metrics: binary and continuous scoring.
- Practical integrations: OpenAI-compatible APIs, Anthropic, Hugging Face, Inspect, MTEB, and OpenClaw.
Typical evaluation journey
- Pick a workflow that matches your stack (Inspect, MTEB, OpenAI-compatible, OpenClaw).
- Start with `confidence=0.95` and a small `min_samples` for fast iteration.
- Run benchmarks and inspect the winner, P(A > B), and efficiency.
- Raise confidence to `0.99` for higher-stakes final runs.
- Export reports to CSV/JSON for tracking over time.
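The P(A > B) quantity inspected in the journey above can be sketched with two independent Beta posteriors. Again, this is an illustrative sketch, not the bayesbench API: the function name `p_a_gt_b`, the flat priors, and the example counts are assumptions.

```python
# Conceptual sketch (NOT the bayesbench API): estimate P(rate_A > rate_B) for
# two models scored on binary pass/fail, using independent Beta(1+w, 1+l)
# posteriors and Monte Carlo sampling.
import random

random.seed(0)

def p_a_gt_b(a_wins, a_losses, b_wins, b_losses, draws=20000):
    """Monte Carlo estimate of P(rate_A > rate_B) under flat-prior Beta posteriors."""
    hits = 0
    for _ in range(draws):
        if random.betavariate(1 + a_wins, 1 + a_losses) > \
           random.betavariate(1 + b_wins, 1 + b_losses):
            hits += 1
    return hits / draws

# 80/100 passes for A vs 60/100 for B: evidence for A should be strong.
prob = p_a_gt_b(80, 20, 60, 40)
decision = "A" if prob >= 0.95 else ("B" if prob <= 0.05 else "undecided")
```

Raising the confidence bar from 0.95 to 0.99 for final runs simply tightens the threshold that `prob` must clear before a winner is declared.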
Need a fast start?
Go to Getting started for a minimal script you can run in minutes.