Workflow playbooks
This page provides practical, end-to-end workflows for common benchmarking setups. Each workflow includes when to use it, a reference implementation, and expected outputs.
Workflow 1: LLM A/B with OpenAI-compatible APIs
Use when: comparing two chat/completion models served through OpenAI-compatible endpoints (OpenAI, Groq, Together, Ollama, vLLM).
from bayesbench import BayesianBenchmark
from bayesbench.adapters.openai_compat import openai_model

bench = BayesianBenchmark(confidence=0.95, min_samples=5)
model_a = openai_model("gpt-4o")
model_b = openai_model("llama-3.1-70b-versatile", base_url="https://api.groq.com/openai/v1")

# `problems` is your own dataset, defined elsewhere; each item needs an "answer" key.
result = bench.compare(
    model_a=model_a,
    model_b=model_b,
    dataset=problems,
    # Score 1 for an exact match with the reference answer, else 0.
    score_fn=lambda p, r: int(r.strip() == p["answer"]),
    name="openai_compat_ab",
)
print(result.winner, result.p_a_beats_b, result.efficiency)
Typical output interpretation:
- winner tells you which model has stronger posterior support.
- p_a_beats_b indicates how certain that conclusion is.
- efficiency measures the cost saved by stopping early.
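To make the p_a_beats_b figure more concrete, here is a minimal sketch of how a probability of this kind can be estimated for a binary metric. It is not bayesbench's implementation: the function name prob_a_beats_b, the uniform Beta(1, 1) prior, and the win counts are all assumptions chosen for illustration, and only the Python standard library is used.

```python
import random

def prob_a_beats_b(wins_a, n_a, wins_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_a > rate_b) under Beta(1, 1) priors.

    Hypothetical helper for illustration; not part of bayesbench.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        # Draw one plausible success rate per model from its Beta posterior.
        rate_a = rng.betavariate(1 + wins_a, 1 + n_a - wins_a)
        rate_b = rng.betavariate(1 + wins_b, 1 + n_b - wins_b)
        hits += rate_a > rate_b
    return hits / draws

# Made-up counts: A solved 42 of 50 problems, B solved 30 of 50.
p = prob_a_beats_b(wins_a=42, n_a=50, wins_b=30, n_b=50)
p_tie = prob_a_beats_b(wins_a=10, n_a=20, wins_b=10, n_b=20)
print(f"P(A beats B) ~ {p:.3f}; with identical counts ~ {p_tie:.3f}")
```

A value close to 1 means the data strongly favor model A, a value close to 0 favors model B, and a value near 0.5 means the data do not yet separate the two.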
Workflow 2: Multi-task suite for release gating
Use when: you need one report across multiple benchmarks (math, reasoning, coding, etc.).
from bayesbench import BayesianBenchmark

bench = BayesianBenchmark(confidence=0.95)

# model_a, model_b, the datasets, and score_reasoning are assumed to be defined elsewhere.
# Each task returns a pair of scores: (score for model A, score for model B).
@bench.task(dataset=math_problems, name="math")
def math_task(problem):
    return model_a(problem) == problem["answer"], model_b(problem) == problem["answer"]

@bench.task(dataset=reasoning_problems, name="reasoning")
def reasoning_task(problem):
    return score_reasoning(problem, model_a(problem)), score_reasoning(problem, model_b(problem))

report = bench.run(verbose=True)
print(report.summary())
Why teams use this:
- One unified report for go/no-go decisions.
- Early stopping can reduce cost on each task independently.
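The per-task early stopping mentioned above can be sketched as follows. This is an illustration under stated assumptions, not bayesbench's algorithm: sequential_compare is a hypothetical helper, the Beta(1, 1) prior and the Monte Carlo estimate are choices made for the sketch, and "saved" is taken here to mean the fraction of the dataset left unevaluated, which may differ from how the library defines efficiency.

```python
import random

def sequential_compare(outcomes_a, outcomes_b, confidence=0.95,
                       min_samples=5, draws=4000, seed=0):
    """Stop as soon as P(A beats B) is decisive in either direction.

    Returns (winner, samples_used, fraction_of_dataset_skipped).
    Hypothetical helper for illustration; not part of bayesbench.
    """
    rng = random.Random(seed)
    total = len(outcomes_a)
    wins_a = wins_b = 0
    for n, (a, b) in enumerate(zip(outcomes_a, outcomes_b), start=1):
        wins_a += a
        wins_b += b
        if n < min_samples:
            continue  # never decide on fewer than min_samples items
        # Monte Carlo estimate of P(rate_a > rate_b) under Beta(1, 1) priors.
        hits = sum(
            rng.betavariate(1 + wins_a, 1 + n - wins_a)
            > rng.betavariate(1 + wins_b, 1 + n - wins_b)
            for _ in range(draws)
        )
        p = hits / draws
        if p >= confidence:
            return "A", n, 1 - n / total
        if p <= 1 - confidence:
            return "B", n, 1 - n / total
    return None, total, 0.0  # no decisive result; whole dataset consumed

# Made-up per-item 0/1 outcomes for two tasks of 40 items each.
tasks = {
    "math": ([1] * 40, [0] * 40),             # A always right, B always wrong
    "reasoning": ([1, 0] * 20, [1, 0] * 20),  # identical outcomes
}
results = {name: sequential_compare(a, b) for name, (a, b) in tasks.items()}
for name, (winner, used, saved) in results.items():
    print(f"{name}: winner={winner} samples={used} saved={saved:.0%}")
```

Because each task keeps its own counts and its own stopping point, a clear-cut task can finish after a handful of items while an ambiguous one runs through its whole dataset.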
Workflow 3: Inspect-native dataset and model pipeline
Use when: your existing evaluation stack is built around AISI Inspect.
from inspect_ai.dataset import hf_dataset, FieldSpec
from bayesbench import BayesianBenchmark
from bayesbench.adapters.inspect_ai import from_inspect_dataset, inspect_model, exact_match_score
bench = BayesianBenchmark(confidence=0.95)
problems = from_inspect_dataset(
    hf_dataset(
        "openai/gsm8k",
        name="main",  # GSM8K ships "main" and "socratic" configurations; one must be chosen
        split="test",
        sample_fields=FieldSpec(input="question", target="answer"),
    )
)
model_a = inspect_model("openai/gpt-4o")
model_b = inspect_model("openai/gpt-4o-mini")
@bench.task(dataset=problems, name="gsm8k")
def gsm8k(problem):
    return exact_match_score(problem, model_a(problem)), exact_match_score(problem, model_b(problem))
print(bench.run().summary())
Workflow 4: Embedding benchmarking with MTEB
Use when: comparing embedding models on STS/semantic similarity tasks.
from bayesbench import BayesianBenchmark
from bayesbench.posteriors import NormalPosterior
from bayesbench.adapters.mteb import mteb_sts_dataset, st_model, sts_score_fn
bench = BayesianBenchmark(confidence=0.95, posterior_factory=NormalPosterior)
result = bench.compare(
    model_a=st_model("sentence-transformers/all-mpnet-base-v2"),
    model_b=st_model("sentence-transformers/all-MiniLM-L6-v2"),
    score_fn=sts_score_fn,
    dataset=mteb_sts_dataset("STSBenchmark", max_samples=500),
    name="mteb_sts",
)
print(result.summary())
Workflow 5: Agent-vs-agent with OpenClaw
Use when: evaluating full agent loops (planning, tools, retries) instead of raw completions.
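from bayesbench import BayesianBenchmark
from bayesbench.adapters.openclaw import openclaw_agent

bench = BayesianBenchmark(confidence=0.95)

# react_agent, planner_executor_agent, and tasks are assumed to be defined elsewhere;
# each task needs an "expected" key holding the reference output.
agent_a = openclaw_agent(react_agent)
agent_b = openclaw_agent(planner_executor_agent)

result = bench.compare(
    model_a=agent_a,
    model_b=agent_b,
    dataset=tasks,
    # Score 1 when the agent's final output matches the expected string exactly, else 0.
    score_fn=lambda p, out: int(out.strip() == p["expected"]),
    name="openclaw_agent_match",
)
print(result.summary())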
Workflow tuning checklist
- Binary metric? Use default posterior (BetaPosterior).
- Continuous metric? Use NormalPosterior.
- High-stakes decisions? Increase confidence to 0.99.
- Noisy tasks? Increase min_samples.
- Reporting to stakeholders? Include both winner and efficiency.
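For the continuous-metric item in the checklist, the sketch below shows why a normal model fits where a Beta model does not: scores such as similarity correlations are real-valued, so the comparison is made on the mean of paired score differences. This is a simplified normal approximation written for illustration, not the library's NormalPosterior; the helper name prob_a_beats_b_continuous and the score values are assumptions.

```python
from statistics import NormalDist, mean, stdev

def prob_a_beats_b_continuous(scores_a, scores_b):
    """P(mean score of A > mean score of B) from paired per-item scores.

    Hypothetical helper using a flat-prior normal approximation;
    not part of bayesbench.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    if len(diffs) < 2:
        raise ValueError("need at least two paired scores")
    std_err = stdev(diffs) / len(diffs) ** 0.5
    if std_err == 0:
        # All differences identical: the sign of the difference decides.
        d = diffs[0]
        return 1.0 if d > 0 else 0.0 if d < 0 else 0.5
    # Approximate posterior for the mean difference.
    posterior = NormalDist(mu=mean(diffs), sigma=std_err)
    return 1 - posterior.cdf(0.0)

# Made-up per-item scores for two models on the same six items.
scores_a = [0.82, 0.79, 0.91, 0.85, 0.77, 0.88]
scores_b = [0.80, 0.75, 0.86, 0.85, 0.70, 0.84]
p_ab = prob_a_beats_b_continuous(scores_a, scores_b)
p_ba = prob_a_beats_b_continuous(scores_b, scores_a)
print(f"P(A beats B) ~ {p_ab:.4f}, P(B beats A) ~ {p_ba:.4f}")
```

Noisier scores widen the standard error, which keeps the probability closer to 0.5 for longer; that is the situation in which raising min_samples or confidence guards against a premature decision.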