Getting started
Installation
pip install bayesbench
Optional integrations:
pip install "bayesbench[openai]"
pip install "bayesbench[anthropic]"
pip install "bayesbench[huggingface]"
pip install "bayesbench[inspect]"
pip install "bayesbench[mteb]"
pip install "bayesbench[openclaw]"
pip install "bayesbench[all]"
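To confirm from Python that the install succeeded, one option is to query the installed distribution metadata. This is a minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration and is not part of bayesbench.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string if bayesbench is installed, otherwise None.
print(installed_version("bayesbench"))
```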
Minimal pairwise benchmark
from bayesbench import BayesianBenchmark
bench = BayesianBenchmark(confidence=0.95)
result = bench.compare(
    model_a=lambda p: big_model(p["question"]),
    model_b=lambda p: small_model(p["question"]),
    dataset=problems,
    score_fn=lambda p, r: int(r.strip() == p["answer"]),
    name="quickstart_exact_match",
)
print(result.winner)
print(result.p_a_beats_b)
print(result.efficiency)
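The quickstart assumes that `problems`, `big_model`, and `small_model` already exist. The sketch below shows one possible shape for them and what the exact-match `score_fn` returns for each response. The dataset and both "models" are hypothetical stand-ins, and the snippet deliberately does not call bayesbench, so it runs on its own.

```python
# Hypothetical dataset: each problem is a dict with "question" and "answer" keys,
# matching the keys the quickstart lambdas read.
problems = [
    {"question": "2 + 2", "answer": "4"},
    {"question": "3 * 3", "answer": "9"},
    {"question": "10 - 7", "answer": "3"},
]

_LOOKUP = {p["question"]: p["answer"] for p in problems}


def big_model(question: str) -> str:
    """Stand-in model: always right, but pads its output with whitespace."""
    return f" {_LOOKUP[question]}\n"


def small_model(question: str) -> str:
    """Stand-in model: always answers "4"."""
    return "4"


def score_fn(problem: dict, response: str) -> int:
    """Exact match after stripping whitespace, as in the quickstart: 1 or 0."""
    return int(response.strip() == problem["answer"])


scores_a = [score_fn(p, big_model(p["question"])) for p in problems]
scores_b = [score_fn(p, small_model(p["question"])) for p in problems]
print(scores_a, scores_b)
```

The `strip()` call matters here: without it, the padded responses from the first stand-in model would all score 0.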
Register multiple tasks
from bayesbench import BayesianBenchmark
bench = BayesianBenchmark(confidence=0.95, min_samples=5)
# Each task returns a pair for one problem: (model_a correct, model_b correct).
@bench.task(dataset=gsm8k, name="gsm8k")
def math_task(problem):
    return model_a(problem["q"]) == problem["a"], model_b(problem["q"]) == problem["a"]

@bench.task(dataset=mmlu, name="mmlu")
def reasoning_task(problem):
    return model_a(problem["q"]) == problem["a"], model_b(problem["q"]) == problem["a"]
report = bench.run(verbose=True)
print(report.summary())
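Each task above yields a pair of booleans per problem. To build intuition for how such pairs can turn into a probability that one model beats the other, here is a conceptual sketch using independent Beta(1, 1) priors and Monte Carlo sampling. It illustrates the general idea only and is not a description of how bayesbench computes its results; `prob_a_beats_b` is a hypothetical helper.

```python
import random


def prob_a_beats_b(outcomes, draws=20000, seed=0):
    """Estimate P(accuracy_A > accuracy_B) from (a_correct, b_correct) pairs.

    Assumption: each model's accuracy gets an independent Beta(1, 1) prior,
    updated with that model's successes and failures.
    """
    n = len(outcomes)
    a_hits = sum(1 for a, _ in outcomes if a)
    b_hits = sum(1 for _, b in outcomes if b)
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    wins = 0
    for _ in range(draws):
        theta_a = rng.betavariate(1 + a_hits, 1 + n - a_hits)
        theta_b = rng.betavariate(1 + b_hits, 1 + n - b_hits)
        wins += theta_a > theta_b
    return wins / draws


# Model A correct on every problem, model B on none: the estimate is close to 1.
print(prob_a_beats_b([(True, False)] * 20))
# Both models always correct: the estimate is close to 0.5.
print(prob_a_beats_b([(True, True)] * 10))
```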
Use continuous metrics
from bayesbench import BayesianBenchmark
from bayesbench.posteriors import NormalPosterior
bench = BayesianBenchmark(confidence=0.95, posterior_factory=NormalPosterior)
result = bench.compare(
    model_a=translation_model_a,
    model_b=translation_model_b,
    dataset=translation_set,
    score_fn=lambda p, r: compute_bleu(r, p["reference"]),
    name="bleu_eval",
)
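The example above leaves `compute_bleu` undefined. Any function that maps a response and a reference to a float should fit the same `score_fn` shape. As a runnable placeholder, the sketch below uses clipped unigram precision, which is only the 1-gram ingredient of BLEU and not a full BLEU implementation; `unigram_precision` is a hypothetical helper, so swap in a real metric for actual evaluations.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference.

    Counts are clipped, so repeating a matching word does not inflate the score.
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    overlap = sum(
        min(count, ref_counts[token])
        for token, count in Counter(cand_tokens).items()
    )
    return overlap / len(cand_tokens)


def score_fn(problem: dict, response: str) -> float:
    """Continuous score in [0, 1], with the same (problem, response) signature."""
    return unigram_precision(response, problem["reference"])


print(score_fn({"reference": "the cat sat on the mat"}, "the cat sat"))
```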
CLI usage
bayesbench my_benchmark.py
bayesbench my_benchmark.py --confidence 0.99 --min-samples 10 --skip-threshold 0.90
bayesbench --version
Your benchmark file should expose either:
- bench = BayesianBenchmark(...), or
- a @suite-decorated class.
What to read next
- Workflows for complete templates.
- Concepts for tuning confidence and stopping behavior.
- Examples gallery for scripts in examples/.