API reference

Core classes

BayesianBenchmark

BayesianBenchmark(
    confidence: float = 0.95,
    skip_threshold: float = 0.85,
    min_samples: int = 3,
    posterior_factory: Callable = BetaPosterior,
)
Method                                         Returns          Description
.task(name, dataset, posterior_factory)       decorator        Register a named evaluation task
.compare(model_a, model_b, score_fn, dataset)  TaskResult       Direct pairwise comparison
.compare_async(...)                            TaskResult       Async pairwise comparison
.run(verbose=False)                            BenchmarkReport  Run all registered tasks
.run_async()                                   BenchmarkReport  Async run over all registered tasks
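
A minimal pairwise-comparison sketch. The top-level import path, the shape of a model callable (one dataset example in, a string out), and the score_fn contract (1.0 for success, 0.0 for failure) are assumptions here, not part of this reference:

from bayesbench import BayesianBenchmark  # assumed import path

# Toy dataset; each item is one evaluation problem.
dataset = [
    {"question": "2 + 2", "answer": "4"},
    {"question": "3 * 3", "answer": "9"},
]

def model_a(example):
    return "4"        # placeholder model callable

def model_b(example):
    return "five"     # placeholder model callable

def score_fn(output, example):
    # Assumed contract: 1.0 on success, 0.0 on failure.
    return float(output == example["answer"])

bench = BayesianBenchmark(confidence=0.95, min_samples=3)
result = bench.compare(model_a, model_b, score_fn, dataset)
print(result.winner, f"P(A beats B) = {result.p_a_beats_b:.3f}")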

BayesianRanker

BayesianRanker(
    confidence: float = 0.95,
    skip_threshold: float = 0.85,
    min_samples: int = 5,
    posterior_factory: Callable = BetaPosterior,
)
Method                                   Returns        Description
.add_model(name, fn)                     self           Register a model callable
.evaluate                                decorator      Register a scoring function
.rank(dataset, score_fn, verbose=False)  RankingResult  Rank models with Bayesian comparisons
.rank_async(dataset, score_fn)           RankingResult  Async ranking
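
A ranking sketch reusing the toy dataset, models, and score_fn from the BayesianBenchmark example above; since .add_model returns self, registrations can be chained:

from bayesbench import BayesianRanker  # assumed import path

ranker = BayesianRanker(confidence=0.95, min_samples=5)

# .add_model returns self, so registrations chain.
ranker.add_model("baseline", model_a).add_model("candidate", model_b)

ranking = ranker.rank(dataset, score_fn, verbose=True)

The .evaluate decorator offers an alternative way to register a scoring function; this sketch passes score_fn to .rank directly, matching the signature above.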

Result objects

TaskResult

Attribute                 Description
winner                    "model_a", "model_b", or None
p_a_beats_b               Posterior probability that A beats B
efficiency                Fraction of evaluations saved
problems_tested           Number of problems actually evaluated
total_problems            Total dataset size
posterior_a, posterior_b  Final posterior objects
skipped                   Whether the task was skipped as non-discriminating
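
An illustrative helper that reads only the attributes listed above; the result argument is any TaskResult, for example one returned by BayesianBenchmark.compare():

def describe(result):
    # `result` is a TaskResult as documented above.
    if result.skipped:
        print("task skipped as non-discriminating")
        return
    print(f"winner = {result.winner}, P(A beats B) = {result.p_a_beats_b:.3f}")
    print(f"evaluated {result.problems_tested} of {result.total_problems} problems "
          f"({result.efficiency:.0%} of evaluations saved)")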

BenchmarkReport

Attribute / Method  Description
task_results        List of task-level outcomes
overall_efficiency  Aggregate fraction of evaluations saved
winners             Mapping from task name to winner
summary()           Text summary for quick review
to_dict()           Serialize the report to a plain dictionary
to_dataframe()      Export to a pandas DataFrame (requires pandas)
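
A sketch of consuming a report, assuming bench is a BayesianBenchmark with tasks already registered via the .task decorator (see the benchmark-file example under CLI below):

report = bench.run(verbose=True)

print(report.summary())                 # human-readable overview
print(report.winners)                   # mapping from task name to winner
print(f"{report.overall_efficiency:.0%} of evaluations saved")

payload = report.to_dict()              # plain dict, e.g. for JSON logging
df = report.to_dataframe()              # requires pandas to be installed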

CLI

bayesbench my_benchmark.py
bayesbench my_benchmark.py --confidence 0.99 --min-samples 10 --skip-threshold 0.90
bayesbench --version

Benchmark files should expose either:

  • bench = BayesianBenchmark(...), or
  • a @suite-decorated class.
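
A minimal sketch of the first form. The import path and the per-example contract of a @bench.task-decorated function (score one example, returning 1.0 for success and 0.0 for failure) are assumptions:

# my_benchmark.py
from bayesbench import BayesianBenchmark  # assumed import path

DATASET = [
    {"question": "2 + 2", "answer": "4"},
    {"question": "3 * 3", "answer": "9"},
]

def my_model(question):
    return "4"  # placeholder for a real model call

bench = BayesianBenchmark(confidence=0.95)  # module-level name the CLI looks for

@bench.task("arithmetic", DATASET)
def arithmetic(example):
    # Assumed contract: score one example, 1.0 for success, 0.0 for failure.
    return float(my_model(example["question"]) == example["answer"])

Invoking bayesbench my_benchmark.py then runs the registered tasks.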