eval-report-workflow
openbooklet.com/s/eval-report-workflow@1.0.0
Create an evaluation report for a README by selecting models, estimating costs, running evaluations, and formatting results tables. Use when user asks to make/create/generate an evaluation report. Trigger when the user asks you to run the "Make An Evaluation Report" workflow.
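A minimal sketch of what this workflow could look like programmatically, assuming the inspect_ai Python API and an inspect_evals registry task; the task name, model identifiers, metric key, and sample limit are illustrative placeholders, not taken from the skill itself.

```python
# Hypothetical sketch: run one task against several candidate models and
# tabulate accuracy for a README results table. Names below are assumptions.
from inspect_ai import eval

MODELS = ["openai/gpt-4o-mini", "anthropic/claude-3-5-haiku-latest"]  # assumed model IDs

# a small limit keeps the cost estimate low while checking the pipeline end to end
logs = eval("inspect_evals/gsm8k", model=MODELS, limit=10)

print(f"{'model':40} {'accuracy':>10}")
for log in logs:
    # each EvalLog carries the model that was run and its aggregate metrics
    accuracy = log.results.scores[0].metrics["accuracy"].value
    print(f"{log.eval.model:40} {accuracy:>10.3f}")
```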
Use Inspect Scout to analyze agent trajectories from evaluation log files. Runs default and custom scanners to detect external failures, formatting issues, reward hacking, and ethical refusals. Use when user asks to check/analyze agent trajectories. Trigger when the user asks you to run the "Check Agent Trajectories" workflow.
This skill covers three CI/GitHub Actions-related workflows:
Fix code quality issues identified in a code quality review stored in agent_artefacts/code_quality/<topic>/. Systematically addresses issues found by the code-quality-review-all skill for ANY code quality topic, with validation and testing at each step. Use when user asks to fix issues from a code quality review, or asks to fix issues from agent_artefacts/code_quality/<topic>.
Review all evaluations in the repository against a single code quality standard. Checks ALL evals against ONE standard for periodic quality reviews. Use when user asks to review/audit/check all evaluations for a specific topic or standard. Do NOT use for reviewing a single eval (use eval-quality-workflow instead) or for test coverage (use ensure-test-coverage instead).
Implement a new evaluation from an issue or paper. Guides through requirements gathering, architecture design, scaffolding, implementation, testing, and quality checks. Use when user asks to create/implement/build a new evaluation.
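As a rough illustration of the scaffolding step, here is a minimal Inspect evaluation sketch: an in-memory dataset, the built-in generate() solver, and an exact-match scorer. The task name and samples are invented for the example.

```python
# Minimal sketch of a new Inspect evaluation task.
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

@task
def my_new_eval() -> Task:
    # tiny in-memory dataset; a real eval would load samples from a source dataset
    dataset = [
        Sample(input="What is 2 + 2?", target="4"),
        Sample(input="Name the capital of France.", target="Paris"),
    ]
    return Task(dataset=dataset, solver=generate(), scorer=match())
```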
Pick a specific area of the codebase and go deep: follow PR threads, review comments, linked issues, and commit messages. Merge any genuinely useful findings into `agent_artefacts/repo_context/REPO_CONTEXT.md`.
Fix or review a single evaluation against all EVALUATION_CHECKLIST.md standards. Use "fix" mode to refactor an eval into compliance, or "review" mode to assess compliance without making changes. Use when user asks to fix, review, or check an evaluation's quality. Trigger when the user asks you to run the "Fix An Evaluation" or "Review An Evaluation" workflow. Do NOT use for reviewing ALL evals against a single code quality standard (use code-quality-review-all instead).
Investigate datasets from HuggingFace, CSV, or JSON files to understand their structure, fields, and data quality. Trigger whenever you need to explore or inspect a dataset yourself without using pre-written scripts.
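A brief exploratory sketch of that kind of inspection for a HuggingFace dataset, assuming the datasets library; the dataset name and config are placeholders.

```python
# Exploratory look at a dataset's structure, fields, and basic quality.
from datasets import load_dataset

ds = load_dataset("openai/gsm8k", "main", split="test")  # assumed dataset/config

print(ds.num_rows)   # how many records
print(ds.features)   # field names and types
print(ds[0])         # one full example

# quick quality check: count empty values per field
for field in ds.features:
    empty = sum(1 for row in ds if not row[field])
    print(f"{field}: {empty} empty values")
```

CSV or JSON files can be explored the same way by loading them into a pandas DataFrame and inspecting dtypes, null counts, and a few sample rows.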
View and analyse Inspect evaluation log files using the Python API. Trigger whenever you need to look at a .eval file yourself without using pre-written scripts.
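A sketch of reading a .eval file with the inspect_ai Python API; the log path is a placeholder and attribute names may differ slightly between versions.

```python
# Read an Inspect evaluation log and summarise results and samples.
from inspect_ai.log import read_eval_log

log = read_eval_log("logs/2024-01-01T00-00-00_my-eval.eval")  # placeholder path

print(log.status)                      # e.g. "success" or "error"
print(log.eval.task, log.eval.model)   # which task ran on which model

if log.results:
    for score in log.results.scores:
        print(score.name, {k: m.value for k, m in score.metrics.items()})

# individual samples, with inputs and per-sample scores
for sample in (log.samples or [])[:3]:
    print(sample.id, sample.scores)
```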
Write an Architectural Decision Record (ADR) to document a significant design choice. Use when user asks to write/create/document an ADR, or to record an architectural decision.
Auto-indexed from UKGovernmentBEIS/inspect_evals
Related Skills
graceful-error-recovery
Use this skill when a tool call, command, or API request fails. Diagnose the root cause systematically before retrying or changing approach. Do not retry the same failing call without first understanding why it failed.
audience-aware-communication
Use this skill when writing any explanation, documentation, or response that will be read by someone else. Match vocabulary, depth, and format to the audience's expertise level before writing.
Refactoring Expert
Expert in systematic code refactoring, code smell detection, and structural optimization. Use PROACTIVELY when encountering duplicated code, long methods, complex conditionals, or any code quality issues. Detects code smells and applies proven refactoring techniques without changing external behavior.
Research Expert
Specialized research expert for parallel information gathering. Use for focused research tasks with clear objectives and structured output requirements.
clarify-ambiguous-requests
Use this skill when the user's request is ambiguous, under-specified, or could be interpreted in multiple ways. If proceeding with a wrong assumption would waste significant work, always ask exactly one focused clarifying question before doing anything.
structured-step-by-step-reasoning
Use this skill for any problem that involves multiple steps, tradeoffs, or non-trivial logic. Think out loud before answering to improve accuracy and transparency. Apply whenever the answer is not immediately obvious.