Reviewer Onboarding

Reliability and reproducibility paths

EPOB separates human scoring reliability from full experimental reproducibility. Scoring reviewers evaluate the same frozen run artifacts with the same rubric. Replication reviewers rerun the protocol under the same model, seed, harness, container, and resource profile.

Direct download

Reviewer packet

Download the 60-packet blinded scoring sample, which includes the review protocol, CSV template, JSONL manifest, and packet JSON files.

Download the reviewer packet SHA-256 checksum to confirm that the packet arrived intact.
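A downloaded packet can be checked against the published checksum before any scoring starts. The sketch below is one minimal way to do that in Python. The file names and the assumption that the checksum file uses the common `sha256sum` layout (hex digest first, then the file name) are illustrative and not taken from the EPOB distribution.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_packet(packet_path, checksum_path):
    """Compare the packet digest with the first token of the checksum file.

    Assumption: the checksum file starts with the hex digest, as in
    the output of `sha256sum`.
    """
    expected = Path(checksum_path).read_text().split()[0].lower()
    return sha256_of_file(packet_path) == expected
```

A call such as `verify_packet("reviewer_packet.zip", "reviewer_packet.zip.sha256")` (hypothetical names) returns `True` only when the digests match; a `False` result means the packet should be downloaded again before use.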

Path A: Scoring Reliability

Independent reviewers score anonymized packets to test whether the rubric produces consistent judgments from the same evidence.

Path B: Experimental Reproducibility

Independent replication reruns selected cells to test whether the benchmark protocol and frozen configuration reproduce comparable outcomes.

Boundary

Do not mix roles

Scoring reliability does not require model access. Experimental reproducibility requires the execution environment, and replication reviewers should preserve fresh run evidence.

Contact

Start by email

Email info@epob.us to ask reviewer questions, confirm conflicts, and coordinate the expected time window.

Scoring Reliability

What independent scorers receive

The same frozen run artifacts, including task description, constraints, trace, validation output, and final deliverables.
A fixed scoring rubric for Plan Quality, Assignment Quality, Coordination, Deliverable Quality, Efficiency, invalid-run flags, and failure labels.
Calibration examples that show how evidence maps to score bands without revealing the official ranking.
An anonymized scoring CSV template using framework aliases such as F01, F02, and F03.
A conflict rule confirming the reviewer did not design EPOB, generate the run evidence, or write the benchmark claims being assessed.
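To make the anonymized template concrete, the sketch below assigns aliases such as F01, F02, and F03 in a shuffled order and writes an empty scoring CSV with one row per packet. The column names, the `seed` parameter, and the framework names in the usage note are hypothetical; the actual EPOB template may be laid out differently.

```python
import csv
import io
import random

# Assumed column names derived from the rubric categories listed above.
SCORE_COLUMNS = [
    "plan_quality",
    "assignment_quality",
    "coordination",
    "deliverable_quality",
    "efficiency",
    "invalid_run",
    "failure_label",
]


def build_alias_map(framework_names, seed):
    """Map each framework name to an alias (F01, F02, ...) in shuffled order.

    Shuffling keeps alias numbers from mirroring alphabetical order.
    The map itself must stay with the coordinator, not the reviewers.
    """
    names = sorted(set(framework_names))
    random.Random(seed).shuffle(names)
    return {name: f"F{index:02d}" for index, name in enumerate(names, start=1)}


def scoring_template(packets, alias_map):
    """Return CSV text with blank score cells for each (packet_id, framework)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["packet_id", "framework_alias"] + SCORE_COLUMNS)
    blanks = [""] * len(SCORE_COLUMNS)
    for packet_id, framework in packets:
        writer.writerow([packet_id, alias_map[framework]] + blanks)
    return buffer.getvalue()
```

For example, `scoring_template([("P001", "alpha")], build_alias_map(["alpha", "beta"], seed=7))` yields a header row plus one row in which the framework name appears only as its alias.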

Scoring Output

What the reliability study produces

Raw independent reviewer CSVs are preserved before averaging or adjudication.
Inter-rater reliability is reported for continuous category scores where enough paired ratings exist.
Agreement on invalid-run flags and failure labels is reported separately from numeric score agreement.
Adjudication notes explain large disagreements by returning to the trace and artifact packet rather than changing the rubric after the fact.
The paper may claim independent scoring reliability only for the reviewed sample and only after these statistics are complete.
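The outputs above do not name specific statistics. As one possible illustration, the sketch below computes a two-way random-effects intraclass correlation, ICC(2,1), for continuous category scores and Cohen's kappa for categorical labels such as invalid-run flags. Both choices, and the function names, are assumptions rather than the method EPOB prescribes.

```python
import math
from collections import Counter


def icc2_1(scores):
    """ICC(2,1), absolute agreement, for a packets-by-reviewers score table.

    scores: list of rows, one row per packet, one score per reviewer.
    Needs at least two packets and two reviewers, with no missing ratings.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labelling the same packets."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return math.nan  # kappa is undefined when only one label is ever used
    return (observed - expected) / (1 - expected)
```

Keeping the two functions separate mirrors the reporting rule above: numeric agreement and label agreement are computed from different columns of the raw reviewer CSVs and are not merged into one number.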

Experimental Reproducibility

What an independent replication needs

The same task instances, dataset slice, scorer version, timeout policy, and evidence schema.
The same model, seed, harness, container, and resource profile used by the frozen comparison cell.
Access to the benchmark runner, container image or Dockerfile, environment template, and non-secret configuration manifest.
A requirement to preserve fresh run traces, final artifacts, hashes, logs, and aggregate comparison outputs.
A disclosure path for blocked cells, invalid runs, provider drift, model-endpoint changes, or hardware differences.
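The requirements above imply a check that a fresh replication run uses the same frozen settings as the comparison cell, with any difference disclosed rather than silently accepted. The sketch below compares two configuration dictionaries field by field. The field names and the example values are hypothetical and only mirror the items listed above.

```python
# Assumed manifest keys, mirroring "model, seed, harness, container,
# and resource profile" from the requirements above.
FROZEN_FIELDS = ("model", "seed", "harness", "container", "resource_profile")


def config_drift(frozen, fresh, fields=FROZEN_FIELDS):
    """Return {field: (frozen_value, fresh_value)} for every mismatch.

    A missing key is reported as None so that it shows up as drift.
    """
    drift = {}
    for field in fields:
        frozen_value = frozen.get(field)
        fresh_value = fresh.get(field)
        if frozen_value != fresh_value:
            drift[field] = (frozen_value, fresh_value)
    return drift


# Illustrative values only.
frozen_cell = {
    "model": "model-x",
    "seed": 1234,
    "harness": "runner-1.0",
    "container": "sha256:aaaa",
    "resource_profile": "4cpu-16gb",
}
fresh_run = dict(frozen_cell, container="sha256:bbbb")
print(config_drift(frozen_cell, fresh_run))
```

An empty result means the listed fields match. A non-empty result is the kind of information the disclosure path is meant to capture, for example a changed container image or model endpoint.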

Workflow

One-page onboarding checklist

1. Choose path
Select scoring reliability, experimental reproducibility, or both. Do not ask scoring reviewers to rerun models.

2. Confirm independence
Record conflicts, prior participation, and whether the reviewer has seen framework names or paper rankings.

3. Distribute packet
Provide only the materials required for the selected path, with secrets removed and framework identities masked where possible.

4. Return evidence
Collect CSVs, adjudication notes, replication logs, hashes, and aggregate deltas before writing any reliability claim.
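As a rough illustration of the aggregate deltas mentioned in the last step, the sketch below subtracts each frozen cell score from its replicated score and lists cells that have no replication result. The cell identifiers and the single-number score per cell are assumptions made for the example.

```python
def aggregate_deltas(frozen, replicated):
    """Return (deltas, missing) for two {cell_id: aggregate_score} mappings.

    deltas:  replicated score minus frozen score for each shared cell.
    missing: frozen cells with no replicated score, e.g. blocked cells.
    """
    deltas = {}
    missing = []
    for cell, frozen_score in frozen.items():
        if cell in replicated:
            deltas[cell] = replicated[cell] - frozen_score
        else:
            missing.append(cell)
    return deltas, missing
```

Reporting the missing cells alongside the deltas keeps blocked or invalid cells visible instead of dropping them from the comparison.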