bayesbench documentation
Bayesian sequential benchmarking for LLMs and agents.
bayesbench helps you stop evaluations as soon as posterior evidence is strong enough,
instead of evaluating every model on every example.
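The core idea, stopping once the posterior is decisive, can be sketched in plain Python. This is a conceptual illustration, not the bayesbench API: the helper names, the flat Beta(1, 1) prior, and the 70% pass-rate target are all assumptions made for the example.

```python
# Conceptual sketch of Bayesian sequential early stopping (NOT the bayesbench
# API): evaluate a model item by item against a 70% pass-rate target, and stop
# once the Beta posterior over the pass rate is decisive either way.
import random

random.seed(0)

def posterior_prob_above(successes, failures, target, draws=20000):
    """Monte Carlo estimate of P(pass_rate > target) under a Beta(1+s, 1+f) posterior."""
    hits = sum(random.betavariate(1 + successes, 1 + failures) > target
               for _ in range(draws))
    return hits / draws

def run_until_confident(results, target=0.7, confidence=0.95, min_samples=10):
    """Consume pass/fail results until the posterior clears the confidence bar."""
    s = f = 0
    for i, passed in enumerate(results, start=1):
        s += passed
        f += 1 - passed
        if i >= min_samples:
            p = posterior_prob_above(s, f, target)
            # Stop as soon as the evidence is decisive in either direction.
            if p >= confidence or p <= 1 - confidence:
                return i, p
    return len(results), posterior_prob_above(s, f, target)

# A strong model (90% true pass rate) should typically stop long before 200 items.
stream = [int(random.random() < 0.9) for _ in range(200)]
used, prob = run_until_confident(stream)
```

The payoff is the `used` count: with a clearly strong (or clearly weak) model, the loop exits after a handful of samples instead of scoring the full set.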
What you'll find in these docs
- Getting started: installation, first benchmark, CLI usage.
- Workflows: end-to-end templates for LLM and agentic evaluations.
- Examples gallery: copy-pasteable snippets mapped to the `examples/` folder.
- Concepts: confidence, early stopping, posterior choices, and tuning tips.
- Adapters: provider/framework integrations and when to use each.
- API reference: core classes and methods.
Why bayesbench
- Lower evaluation cost: stop when evidence is sufficient.
- Statistically principled: Bayesian posteriors and credible intervals.
- Flexible metrics: binary and continuous scoring.
- Practical integrations: OpenAI-compatible APIs, Anthropic, Hugging Face, Inspect, MTEB, and OpenClaw.
Typical evaluation journey
- Pick a workflow that matches your stack (Inspect, MTEB, OpenAI-compatible, OpenClaw).
- Start with `confidence=0.95` and a small `min_samples` for fast iteration.
- Run benchmarks and inspect the winner, P(A > B), and efficiency.
- Raise confidence to `0.99` for higher-stakes final runs.
- Export reports to CSV/JSON for tracking over time.
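The P(A > B) quantity inspected in the journey above can be sketched with two independent Beta posteriors. Again, this is an illustrative sketch, not the bayesbench API: the function name `p_a_gt_b`, the flat priors, and the example counts are assumptions.

```python
# Conceptual sketch (NOT the bayesbench API): estimate P(rate_A > rate_B) for
# two models scored on binary pass/fail, using independent Beta(1+w, 1+l)
# posteriors and Monte Carlo sampling.
import random

random.seed(0)

def p_a_gt_b(a_wins, a_losses, b_wins, b_losses, draws=20000):
    """Monte Carlo estimate of P(rate_A > rate_B) under flat-prior Beta posteriors."""
    hits = 0
    for _ in range(draws):
        if random.betavariate(1 + a_wins, 1 + a_losses) > \
           random.betavariate(1 + b_wins, 1 + b_losses):
            hits += 1
    return hits / draws

# 80/100 passes for A vs 60/100 for B: evidence for A should be strong.
prob = p_a_gt_b(80, 20, 60, 40)
decision = "A" if prob >= 0.95 else ("B" if prob <= 0.05 else "undecided")
```

Raising the confidence bar from 0.95 to 0.99 for final runs simply tightens the threshold that `prob` must clear before a winner is declared.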
Need a fast start?
Go to Getting started for a minimal script you can run in minutes.