EPOB End-to-End Project Orchestration Benchmark

Reviewer Onboarding

Reliability and reproducibility paths

EPOB separates human scoring reliability from full experimental reproducibility. Scoring reviewers evaluate the same frozen run artifacts with the same rubric. Replication reviewers rerun the protocol under the same model, seed, harness, container, and resource profile.

Direct download

Reviewer packet

Download the 60-packet blinded scoring sample, which includes the review protocol, a CSV template, a JSONL manifest, and the packet JSON files.
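After unpacking the download, a reviewer may want to confirm that every packet listed in the manifest is actually present. The sketch below is one way to do that in Python. The `packet_id` field name and the `<id>.json` file naming are assumptions for illustration only; the real manifest schema is defined by the review protocol in the packet.

```python
import json
from pathlib import Path


def load_manifest(manifest_path):
    """Parse a JSONL manifest: one JSON object per non-empty line."""
    entries = []
    with open(manifest_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def missing_packets(entries, packet_dir, id_field="packet_id"):
    """Return manifest IDs that have no matching <id>.json file in packet_dir.

    Both the id_field name and the <id>.json naming scheme are hypothetical.
    """
    packet_dir = Path(packet_dir)
    return [
        entry[id_field]
        for entry in entries
        if not (packet_dir / f"{entry[id_field]}.json").is_file()
    ]
```

An empty list from `missing_packets` would mean every manifest entry has a packet file on disk.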

Path A

Scoring Reliability

Independent reviewers score anonymized packets to test whether the rubric produces consistent judgments from the same evidence.
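Consistency between two reviewers can be quantified in several ways. This page does not say which statistic EPOB reports, so the sketch below uses Cohen's kappa purely as an illustration. It treats rubric scores as unordered categories; if the rubric is ordinal, a weighted statistic may fit better.

```python
from collections import Counter


def cohens_kappa(scores_a, scores_b):
    """Cohen's kappa for two reviewers who scored the same packets in the same order."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    n = len(scores_a)
    # Fraction of packets on which the two reviewers gave the same score.
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    # Agreement expected by chance, from each reviewer's own score frequencies.
    freq_a = Counter(scores_a)
    freq_b = Counter(scores_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical category throughout
    return (observed - expected) / (1.0 - expected)
```

For example, `cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])` has observed agreement 0.75 and chance agreement 0.5, which gives a kappa of 0.5.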

Path B

Experimental Reproducibility

Independent replication reruns selected cells to test whether the benchmark protocol and frozen configuration reproduce comparable outcomes.
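Before comparing outcomes, a replicator can check that the rerun actually used the frozen configuration. The sketch below compares the five settings named on this page (model, seed, harness, container, resource profile). The dictionary keys and the idea of storing the configuration as a flat dictionary are assumptions, not the EPOB format.

```python
# Hypothetical key names for the settings a replication must hold constant.
FROZEN_FIELDS = ("model", "seed", "harness", "container", "resource_profile")


def config_mismatches(frozen, rerun, fields=FROZEN_FIELDS):
    """Return {field: (frozen_value, rerun_value)} for every field that differs."""
    return {
        field: (frozen.get(field), rerun.get(field))
        for field in fields
        if frozen.get(field) != rerun.get(field)
    }
```

An empty result means the rerun matches the frozen configuration on every listed field; any entry identifies a setting to fix before the rerun counts as a replication.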

Boundary

Do not mix roles

Scoring reliability does not require model access. Experimental reproducibility requires the execution environment, and replicators should preserve the evidence generated by each fresh run.

Contact

Start by email

Write to info@epob.us to ask reviewer questions, confirm conflicts, and coordinate the expected time window.

Scoring Reliability

What independent scorers receive

Scoring Output

What the reliability study produces

Experimental Reproducibility

What an independent replication needs

Workflow

One-page onboarding checklist

1. Choose path

Select scoring reliability, experimental reproducibility, or both. Do not ask scoring reviewers to rerun models.

2. Confirm independence

Record conflicts, prior participation, and whether the reviewer has seen framework names or paper rankings.

3. Distribute packet

Provide only the materials required for the selected path, with secrets removed and framework identities masked where possible.

4. Return evidence

Collect CSVs, adjudication notes, replication logs, hashes, and aggregate deltas before writing any reliability claim.
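Two items in the last step, hashes and aggregate deltas, can be sketched briefly. The example below assumes SHA-256 as the hash and takes "aggregate delta" to mean the replicated mean score minus the original mean score; neither choice is confirmed by this page.

```python
import hashlib
from statistics import mean


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a returned file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def aggregate_delta(original_scores, replicated_scores):
    """Replicated mean minus original mean (an assumed definition of the delta)."""
    return mean(replicated_scores) - mean(original_scores)
```

Recording the digest of each returned CSV or log alongside the delta lets a later reader confirm that a claim was computed from the same files that were collected.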