Scoring Reliability
Independent reviewers score anonymized packets to test whether the rubric produces consistent judgments from the same evidence.
Reviewer Onboarding
EPOB separates human scoring reliability from full experimental reproducibility. Scoring reviewers evaluate the same frozen run artifacts with the same rubric. Replication reviewers rerun the protocol under the same model, seed, harness, container, and resource profile.
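One way to make "the same model, seed, harness, container, and resource profile" checkable is to record those five items for each run and compare them before trusting a rerun. The sketch below is illustrative only: the `RunConfig` class, its field names, and the example values are hypothetical and are not taken from EPOB's materials.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunConfig:
    # Hypothetical field names mirroring the five items named in the text.
    model: str
    seed: int
    harness: str
    container: str
    resource_profile: str


def config_mismatches(original: RunConfig, rerun: RunConfig) -> dict:
    """Return {field: (original_value, rerun_value)} for every field that differs."""
    a, b = asdict(original), asdict(rerun)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


# Placeholder values for illustration; real identifiers come from the frozen run.
original = RunConfig("model-x", 1234, "harness-1.0", "sha256:aaa", "cpu-8x32")
rerun = RunConfig("model-x", 1235, "harness-1.0", "sha256:aaa", "cpu-8x32")

print(config_mismatches(original, rerun))  # {'seed': (1234, 1235)}
```

An empty result would indicate that the rerun matches the frozen configuration on every recorded field; any non-empty result is worth resolving before comparing outcomes.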
Download the 60-packet blinded scoring sample with the review protocol, CSV template, JSONL manifest, and packet JSON files.
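Before scoring, a reviewer may want to confirm that the download is complete. The following is a minimal sketch, assuming the JSONL manifest holds one JSON object per line and that each object carries a packet identifier. The key name `packet_id` is a hypothetical placeholder, since the actual manifest schema is not described here.

```python
import json

EXPECTED_PACKETS = 60  # the sample is described as a 60-packet set


def load_manifest(lines):
    """Parse JSONL lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def check_manifest(entries, expected=EXPECTED_PACKETS):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    # "packet_id" is an assumed key name, not a documented field.
    ids = [entry.get("packet_id") for entry in entries]
    if len(entries) != expected:
        problems.append(f"expected {expected} packets, found {len(entries)}")
    if None in ids:
        problems.append("entry without packet_id")
    if len(set(ids)) != len(ids):
        problems.append("duplicate packet_id")
    return problems


# Synthetic manifest lines for illustration only.
sample = [json.dumps({"packet_id": f"P{i:03d}"}) for i in range(60)]
print(check_manifest(load_manifest(sample)))  # []
```

With a real file, the same functions could be fed from `open(path, encoding="utf-8")`, since iterating a text file yields its lines.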
Independent replication reruns selected cells to test whether the benchmark protocol and frozen configuration reproduce comparable outcomes.
Scoring reliability does not require model access. Experimental reproducibility requires the execution environment and should preserve fresh run evidence.
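Whether the rubric "produces consistent judgments" can be quantified with an inter-rater agreement statistic. The text does not name one, so the sketch below uses Cohen's kappa for two reviewers as one common choice, not as EPOB's stated method. The scores are made-up categorical rubric levels.

```python
from collections import Counter


def cohens_kappa(scores_a, scores_b):
    """Cohen's kappa for two reviewers scoring the same packets in the same order."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(scores_a)
    # Observed agreement: share of packets given the same score by both reviewers.
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    # Chance agreement from each reviewer's marginal score distribution.
    count_a, count_b = Counter(scores_a), Counter(scores_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical category throughout
    return (observed - expected) / (1 - expected)


# Made-up scores for six packets, for illustration only.
reviewer_a = [1, 2, 2, 3, 3, 3]
reviewer_b = [1, 2, 3, 3, 3, 2]
print(round(cohens_kappa(reviewer_a, reviewer_b), 3))  # 0.455
```

Kappa corrects raw agreement for agreement expected by chance. For ordinal rubrics or more than two reviewers, a weighted kappa or Krippendorff's alpha may be a better fit.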
Contact info@epob.us to ask reviewer questions, confirm conflicts, and coordinate the expected time window.
Workflow
Select scoring reliability, experimental reproducibility, or both. Do not ask scoring reviewers to rerun models.
Record conflicts, prior participation, and whether the reviewer has seen framework names or paper rankings.
Provide only the materials required for the selected path, with secrets removed and framework identities masked where possible.
Collect CSVs, adjudication notes, replication logs, hashes, and aggregate deltas before writing any reliability claim.
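The final collection step can be partly automated. The sketch below shows one possible approach: hash each returned artifact with SHA-256 so it can be referenced unambiguously, and compute a simple aggregate delta between original and rerun scores. The definition of "aggregate delta" as a difference of means is an assumption, and the CSV bytes and scores are invented for illustration.

```python
import hashlib
from statistics import mean


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of an artifact's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def aggregate_delta(original_scores, rerun_scores):
    """Mean rerun score minus mean original score (an assumed definition)."""
    return mean(rerun_scores) - mean(original_scores)


# Invented artifact contents and scores, for illustration only.
csv_bytes = b"packet_id,score\nP001,3\nP002,4\n"
digest = sha256_hex(csv_bytes)
print(len(digest))                            # 64
print(aggregate_delta([3, 4, 5], [3, 4, 8]))  # 1
```

Recording the digest alongside each CSV or log lets a later reader confirm that a reliability claim refers to exactly the files that were collected.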