Examples gallery

The repository ships runnable examples under examples/. Use these as practical templates and adapt to your own tasks.

1) `quickstart.py` — smallest useful benchmark

Best when you want to verify setup quickly before integrating providers.

Best when comparing two text-generation models using exact match or rubric scoring.

Best when you need a leaderboard rather than only pairwise A/B.

Best when your models and datasets come from external frameworks.

Focus: adapters that normalize framework-specific APIs into benchmark callables.
Good for teams with existing evaluation stacks.

Best when you already use Inspect datasets/tasks and want Bayesian stopping.

Best when evaluating embedding quality on STS-style tasks.

Run quickstart.py.
Choose either llm_comparison.py or mteb_example.py based on your metric type.
Move to multi_model_ranking.py when comparing 3+ models.
Adopt adapter examples to connect your production providers.