EPOB End-to-End Project Orchestration Benchmark

Submission

Evaluation process

EPOB keeps results comparable by fixing the protocol, versioning reference anchors, and publishing evidence packets rather than relying on a single model-as-judge.

Tier 0

Submission Smoke

Validate the submission schema, run-bundle shape, tool boundaries, and scorer compatibility. Smoke runs do not enter the public leaderboard.
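As an illustration of what a smoke check could look like, here is a minimal Python sketch. The manifest layout, the field names (`schema_version`, `scorer_version`, `runs`, `allowed_tools`, `tools_used`), and the `smoke_check` function are assumptions made for this example; they are not EPOB's published schema or tooling.

```python
from typing import Any

# Hypothetical required fields and their expected types.
# EPOB's real submission schema is not specified in this document.
REQUIRED_FIELDS = {
    "schema_version": str,
    "scorer_version": str,
    "runs": list,
    "allowed_tools": list,
}


def smoke_check(manifest: dict[str, Any], supported_scorers: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the smoke check passed."""
    problems = []

    # 1. Schema and bundle shape: every required field exists with the right type.
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in manifest:
            problems.append(f"missing field: {name}")
        elif not isinstance(manifest[name], expected_type):
            problems.append(f"wrong type for {name}: expected {expected_type.__name__}")
    if problems:
        return problems  # deeper checks need a well-formed manifest

    # 2. Scorer compatibility: the declared scorer version must be supported.
    if manifest["scorer_version"] not in supported_scorers:
        problems.append(f"unsupported scorer version: {manifest['scorer_version']}")

    # 3. Tool boundaries: no run may use a tool outside the declared set.
    allowed = set(manifest["allowed_tools"])
    for i, run in enumerate(manifest["runs"]):
        if not isinstance(run, dict):
            problems.append(f"run {i} is not an object")
            continue
        extra = sorted(set(run.get("tools_used", [])) - allowed)
        if extra:
            problems.append(f"run {i} used tools outside the boundary: {extra}")

    return problems
```

Returning a list of problems rather than raising on the first failure lets a submitter fix everything in one pass, which suits a pre-leaderboard gate.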

Tier 1

Public Comparable

Run with a fixed task set, seed set, resource profile, timeout, evidence schema, and scorer version. Only these runs can enter public leaderboard snapshots.
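One way to make "fixed" checkable is to treat the six pinned parameters as an immutable record and compare fingerprints. The sketch below assumes hypothetical field names and a hypothetical `RunProfile` class; it shows the idea, not EPOB's actual configuration format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunProfile:
    """Hypothetical record of the parameters a comparable run pins down."""
    task_set: str
    seeds: tuple[int, ...]
    resource_profile: str
    timeout_s: int
    evidence_schema: str
    scorer_version: str

    def fingerprint(self) -> str:
        # Canonical JSON (sorted keys, no extra whitespace) so equal profiles
        # always hash to the same value.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def comparable(a: RunProfile, b: RunProfile) -> bool:
    """Two runs are comparable only if every pinned parameter matches."""
    return a.fingerprint() == b.fingerprint()
```

A usage example with placeholder values:

```python
base = RunProfile("tasks-v1", (0, 1, 2), "cpu-small", 600, "evidence-v1", "scorer-v1")
same = RunProfile("tasks-v1", (0, 1, 2), "cpu-small", 600, "evidence-v1", "scorer-v1")
longer = RunProfile("tasks-v1", (0, 1, 2), "cpu-small", 900, "evidence-v1", "scorer-v1")

print(comparable(base, same))    # True
print(comparable(base, longer))  # False: the timeout differs
```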

Tier 2

Robustness Track

Expand across task families, providers, and model endpoints for deeper sensitivity analysis. This track costs more to run and is not the default submission path.
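To see why this track is more expensive, it helps to enumerate the cells such an expansion produces. The following sketch assumes a hypothetical `robustness_cells` helper and placeholder names for families, providers, and endpoints; the real EPOB grid is not described here.

```python
from itertools import product


def robustness_cells(
    task_families: list[str],
    endpoints_by_provider: dict[str, list[str]],
    seeds: list[int],
) -> list[dict]:
    """Enumerate every (family, provider, endpoint, seed) combination."""
    cells = []
    # Sort providers so the cell order is stable across runs.
    for provider, endpoints in sorted(endpoints_by_provider.items()):
        for family, endpoint, seed in product(task_families, endpoints, seeds):
            cells.append(
                {
                    "task_family": family,
                    "provider": provider,
                    "endpoint": endpoint,
                    "seed": seed,
                }
            )
    return cells
```

The cell count is the number of task families times the total number of endpoints times the number of seeds, so adding one endpoint adds a full slice of runs rather than a single run.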

Tier 3

Audit And Reproducibility

Rerun selected cells, preserve hashes, inspect evidence packets, and publish limitations when reviewer agreement or formal venue checks are incomplete.
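A rerun audit against preserved hashes could be sketched as follows. The `audit_rerun` function, the cell identifiers, and the choice of SHA-256 are assumptions for illustration. The sketch also assumes that a cell's evidence is byte-for-byte reproducible, which may not hold for nondeterministic model endpoints.

```python
import hashlib


def digest(payload: bytes) -> str:
    """SHA-256 hex digest of an evidence payload."""
    return hashlib.sha256(payload).hexdigest()


def audit_rerun(recorded: dict[str, str], rerun: dict[str, bytes]) -> dict[str, str]:
    """Compare rerun evidence against preserved hashes, cell by cell.

    recorded maps a cell id to the hash preserved from the original run.
    rerun maps a cell id to the raw evidence bytes produced by the rerun.
    Each cell is reported as "match", "mismatch", or "missing".
    """
    report = {}
    for cell_id, expected in recorded.items():
        if cell_id not in rerun:
            report[cell_id] = "missing"
        elif digest(rerun[cell_id]) == expected:
            report[cell_id] = "match"
        else:
            report[cell_id] = "mismatch"
    return report
```

Cells reported as "mismatch" or "missing" are natural candidates for the published limitations mentioned above, since they mark where reproducibility could not be confirmed.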

Contact

Evaluation intake is open by email.

Send questions about Public Comparable submissions, artifact handoff requests, or evaluator access inquiries to info@epob.us.

Reference Anchors

Fixed protocol, versioned anchors