INIT-05 · Beta · Launch: 2025-12 · Owner: ML Ops
Evaluation Suite
Benchmark models on your tasks. Reproducible runs and custom metrics.
/ Overview
Evaluation Suite lets you benchmark models on your own tasks. Define datasets, run reproducible evals, and track custom metrics. Compare runs side-by-side and plug results into your CI.
/ The_Problem
Off-the-shelf benchmarks don't match your product. You need evals on your data and your metrics—reproducible, versioned, and comparable across runs. Evaluation Suite gives you that without building it yourself.
/ Reproducible_evals
Define datasets and metrics once; run them across model versions and configs. Every eval is tied to a dataset version and code commit so you can reproduce and compare any run.
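The workflow above can be sketched in plain Python. This is an illustrative mock, not the product's API: the names (`run_eval`, `dataset_version`, `exact_match`) and the run-record shape are assumptions; the point is that each run carries a content-addressed dataset version and a code commit, so any two runs are reproducible and comparable.

```python
import hashlib
import json

def dataset_version(examples):
    # Content hash: the same data always yields the same version id,
    # so a run can be reproduced against the exact dataset it used.
    blob = json.dumps(examples, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

def exact_match(prediction, reference):
    # A custom metric is just a function of (prediction, reference).
    return 1.0 if prediction.strip() == reference.strip() else 0.0

def run_eval(model_fn, examples, metric, code_commit):
    # Tie every run to a dataset version and a code commit.
    scores = [metric(model_fn(ex["input"]), ex["expected"]) for ex in examples]
    return {
        "dataset_version": dataset_version(examples),
        "code_commit": code_commit,
        "score": sum(scores) / len(scores),
    }

examples = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

# Two hypothetical model configs, compared side by side on the same dataset.
baseline = lambda q: {"2+2": "4"}.get(q, "unknown")
candidate = lambda q: {"2+2": "4", "capital of France": "Paris"}.get(q, "unknown")

run_a = run_eval(baseline, examples, exact_match, code_commit="abc1234")
run_b = run_eval(candidate, examples, exact_match, code_commit="abc1234")
print(run_a["score"], run_b["score"])  # baseline vs candidate
```

Because both runs record the same `dataset_version` and `code_commit`, a score difference can only come from the model config, which is what makes side-by-side comparison meaningful.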
“We stopped arguing about which model was better. The eval suite runs on every PR and the numbers are the answer.”
/ Status
Beta
/ Roadmap
P1 · Custom datasets: upload, version, split
P2 · Metrics & runs: reproducible evals, history
P3 · CI plugin: block merge on regressions
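The P3 item above, blocking merges on regressions, reduces to a small gate: compare a candidate run's metrics against a baseline and fail the CI job if anything drops. This is a sketch of that idea, not the plugin itself; the metric names, scores, and `tolerance` parameter are illustrative.

```python
def regression_gate(baseline, candidate, tolerance=0.0):
    # Collect every metric where the candidate falls more than
    # `tolerance` below the baseline score.
    return {
        name: (baseline[name], candidate.get(name, 0.0))
        for name in baseline
        if candidate.get(name, 0.0) < baseline[name] - tolerance
    }

baseline = {"exact_match": 0.90, "latency_ok": 0.99}
candidate = {"exact_match": 0.85, "latency_ok": 0.99}

regressions = regression_gate(baseline, candidate, tolerance=0.02)
# In CI, a non-empty result would exit non-zero and block the merge:
#   sys.exit(1 if regressions else 0)
print(regressions)
```

A tolerance band matters in practice: without one, ordinary run-to-run noise in eval scores would block merges that introduced no real regression.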