Run Evaluation
openbooklet.com/s/run-evaluation@1.0.0
GET /api/v1/skills/run-evaluation
Evaluate a Przeprogramowani website implementation against benchmark criteria.
Set up and run benchmark evaluations for new LLM models in the 10xBench project. Use when the user wants to add a new model to the benchmark, prepare evaluation directories, update metadata, or launch evaluation runs. Triggers on phrases like "eval model", "add model to benchmark", "run benchmark for [model]", "evaluate [model-name]", "set up [model] for eval", or any request involving adding a new model to the Przeprogramowani.pl benchmark pipeline.
Design a stack-agnostic programming evaluation case for live AI coding evals. Use when the user wants to create a benchmark prompt, context, optional bootstrap or baseline instructions, and scorecard for any programming task, especially during webinars or workshops where participants choose the stack, task, constraints, and scoring criteria. Triggers on phrases like "design eval case", "zaprojektuj eval", "stworz benchmark case", "live eval design", "scorecard dla zadania", "programming eval", or requests to define an AI coding benchmark from scratch.
Score a completed programming evaluation attempt against a benchmark scorecard. Use when the user wants to evaluate a generated model attempt, produce eval-results.csv, inspect build/test/runtime behavior, or compare attempts using the scorecard created by design-eval-case. Triggers on phrases like "score eval attempt", "ocen probe", "evaluate attempt", "score model output", "wygeneruj eval-results", "ocen wynik modelu", or requests to grade an eval-attempts directory.
Auto-indexed from przeprogramowani/10x-bench
Related Skills
graceful-error-recovery
Use this skill when a tool call, command, or API request fails. Diagnose the root cause systematically before retrying or changing approach. Do not retry the same failing call without first understanding why it failed.
audience-aware-communication
Use this skill when writing any explanation, documentation, or response that will be read by someone else. Match vocabulary, depth, and format to the audience's expertise level before writing.
Refactoring Expert
Expert in systematic code refactoring, code smell detection, and structural optimization. Use PROACTIVELY when encountering duplicated code, long methods, complex conditionals, or any code quality issues. Detects code smells and applies proven refactoring techniques without changing external behavior.
Research Expert
Specialized research expert for parallel information gathering. Use for focused research tasks with clear objectives and structured output requirements.
clarify-ambiguous-requests
Use this skill when the user's request is ambiguous, under-specified, or could be interpreted in multiple ways. If proceeding with a wrong assumption would waste significant work, always ask exactly one focused clarifying question before doing anything.
structured-step-by-step-reasoning
Use this skill for any problem that involves multiple steps, tradeoffs, or non-trivial logic. Think out loud before answering to improve accuracy and transparency. Apply whenever the answer is not immediately obvious.