judge
Blind evaluation pattern. Agent spawns an independent judge to evaluate an idea, plan, strategy, statement, or implementation. The judge sees the repo but not the conversation. Verdict goes to a shared markdown file.
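A minimal sketch of the blind-judge pattern, assuming a hypothetical `spawnAgent` helper and output filename; the skill's actual spawning mechanism and file layout may differ:

```typescript
// Hypothetical sketch of the blind-judge pattern: the judge sees the repo
// path and the artifact under review, but never the parent conversation.
import { appendFile } from "node:fs/promises";

type SpawnAgent = (opts: { prompt: string; cwd: string }) => Promise<string>;

async function blindJudge(
  spawnAgent: SpawnAgent, // assumed: runs a fresh agent with no shared history
  repoPath: string,
  artifact: string,
): Promise<void> {
  const verdict = await spawnAgent({
    cwd: repoPath, // the judge may read the repo...
    prompt: `Evaluate the following independently:\n\n${artifact}`, // ...but gets no prior chat
  });
  // The verdict goes to a shared markdown file that the parent agent reads back.
  await appendFile("JUDGE_VERDICTS.md", `\n## Verdict\n\n${verdict}\n`);
}
```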
RAGScore
Generate QA datasets & evaluate RAG systems. Privacy-first, any LLM, local or cloud.
resolve-agent-reviews
Resolve PR review bot findings on current PR. Fetches unanswered bot comments, evaluates each finding, fixes real bugs, dismisses false positives, replies to every comment, and watches for new findings until bots go quiet.
fleet-triage
Cognitive triage of fleet audit findings. Read issue content, evaluate labeling accuracy, and determine open/close/dispatch/relabel actions for fleet-managed issues. Use when triaging undispatched issues or reviewing audit scan results.
idea-check
Pre-build reality check. Scans GitHub, HN, npm, PyPI, and Product Hunt for existing competitors before you build. Use when starting a new project, evaluating a side project idea, or doing a build-vs-buy decision.
Judges Panel
45 judges that evaluate AI-generated code for security, cost, and quality with built-in AST analysis.
Umbrela Eval
Analyze and compare umbrela evaluation results across backends, models, and configurations.
decomplect
Architectural code analysis for design quality. Evaluates simplicity (Rich Hickey), functional core/imperative shell (Gary Bernhardt), and coupling (Constantine & Yourdon). Use for design review or architectural assessment.
Debug Java Applications in Eclipse
Full interactive debugging using Eclipse's JDT debugger. Set breakpoints, step through code, inspect variables, evaluate expressions, and hot-swap changes.
think
Deliberate reasoning skill: enforce multi-step analysis, hypothesis testing, and option evaluation before answering complex questions
admet-prediction
Predict comprehensive ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) properties for drug candidate molecules using GraphMVP ensemble models. Use this skill when: (1) Predicting blood-brain barrier penetration, (2) Assessing side effect profiles, (3) Estimating Caco-2 permeability, half-life, or LD50 toxicity, (4) Evaluating drug-likeness and safety of molecules.
career-ops
AI job search command center -- evaluate offers, generate CVs, scan portals, track applications
Ci1t Mcp
CI-1T prediction stability engine. Detect ghosts, evaluate drift, monitor fleets. 20 tools.
magpie
Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces. Creates kernel config YAMLs, discovers kernels in a project, and queries GPU specs. Use when the user mentions Magpie, kernel analyze or compare, HIP/CUDA kernel evaluation, vLLM/SGLang benchmark, gap analysis, TraceLens, creating kernel configs, or discovering GPU kernels.
agent-project-development
This skill should be used when the user asks to "start an LLM project", "design batch pipeline", "evaluate task-model fit", "structure agent project", or mentions pipeline architecture, agent-assisted development, cost estimation, or choosing between LLM and traditional approaches. NOT for evaluating agent quality or building evaluation rubrics (use agent-evaluation), NOT for multi-agent coordination or agent handoffs (use multi-agent-patterns).
autocontext
Iterative strategy generation and evaluation system. Use when the user wants to evaluate agent output quality, run improvement loops, queue tasks for background evaluation, check run status, or discover available scenarios. Provides LLM-based judging with rubric-driven scoring.
Io.Github.Sind00/Flippa Mcp
Search, analyze, and evaluate online businesses for sale on Flippa.com marketplace.
predict-content
Predict what an article will say, then score how much it exceeded the prediction. Use when the user wants to evaluate whether an article is worth reading.
codefix
Fix BLOCK and WARN findings from CODEREVIEW.md. Reads findings as a spec, applies minimal targeted fixes. Never evaluates its own work; that is the reviewer's job. Invoked by /codereview, not directly by users.
audit-and-fix
Composite: security audit -> production upgrade -> self-evaluation. Use when user says 'audit', 'check the codebase', 'find and fix issues', or 'is this production-ready'.
12-factor-apps
Perform 12-Factor App compliance analysis on any codebase. Use when evaluating application architecture, auditing SaaS applications, or reviewing cloud-native applications against the original 12-Factor methodology.
agent-estimation
Accurately estimate AI agent work effort using the agent's own operational units (tool-call rounds) instead of human time. Use when asked to estimate, scope, plan, or evaluate how long a coding task will take. Prevents the common failure mode where agents anchor to human developer timelines and massively overestimate. Outputs a structured breakdown with round counts, risk factors, and a final wallclock conversion.
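As an illustration of the round-based approach, here is a hypothetical sketch; the item names, risk multipliers, and seconds-per-round figure are made up for the example and are not part of the skill:

```typescript
// Hypothetical sketch: estimate in tool-call rounds, then convert to wallclock last.
interface WorkItem {
  name: string;
  rounds: number;      // estimated tool-call rounds, not human hours
  riskFactor: number;  // multiplier for uncertainty (1.0 = well understood)
}

function estimate(items: WorkItem[], secondsPerRound = 20): void {
  let total = 0;
  for (const item of items) {
    const adjusted = Math.ceil(item.rounds * item.riskFactor);
    total += adjusted;
    console.log(`${item.name}: ~${adjusted} rounds`);
  }
  // Wallclock conversion happens once at the end, never per item,
  // to avoid anchoring each line to human-developer timelines.
  const minutes = Math.round((total * secondsPerRound) / 60);
  console.log(`Total: ~${total} rounds (~${minutes} min wallclock)`);
}

estimate([
  { name: "read existing code", rounds: 4, riskFactor: 1.0 },
  { name: "implement change", rounds: 8, riskFactor: 1.5 },
  { name: "run tests and fix", rounds: 6, riskFactor: 1.25 },
]);
```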
Scite
Ground answers in scientific literature. Search full text, evaluate trust, access full-text articles
trulens-evaluation-setup
Configure feedback functions and selectors for TruLens evaluations
Harness – Multi-Agent Orchestration
Orchestrate complex tasks through Planning → Generation → Evaluation → Retro. Fresh sub-agents per checkpoint prevent drift. Retro accumulates learning across tasks.
api-reviewer
Guidelines for reviewing API design in the Hex1b codebase. Use when evaluating public APIs, reviewing accessibility modifiers, or assessing whether new APIs follow project conventions.
Io.Github.OtherVibes/Mcp As A Judge
MCP as a Judge: a behavioral MCP that strengthens AI coding assistants via explicit LLM evaluations
business-model-canvas
Design and analyze business models using the Business Model Canvas framework. Use when evaluating startups, planning new products, pivoting existing businesses, or understanding how companies create and capture value.
opc
OPC – One Person Company. Digraph-based task pipeline with independent multi-role evaluation. Builds, reviews, analyzes, and brainstorms with specialist agents. Every path ends with evaluation. /opc <task>, /opc -i <task>, /opc <role> [role...]
skill-conductor
Create, edit, evaluate, and package agent skills. Use when building a new skill from scratch, improving an existing skill, running evals to test a skill, benchmarking skill performance, optimizing a skill's description for better triggering, reviewing third-party skills for quality, or packaging skills for distribution. Not for using skills or general coding tasks.
logtape
Use this skill when writing any code that uses LogTape for logging in JavaScript or TypeScript. Covers getting loggers, the structured message syntax, configuration, library author rules, context, lazy evaluation, testing, and common mistakes to avoid. Trigger whenever the user is adding logging to a project, debugging log output, or integrating LogTape with a framework.
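A minimal usage sketch, assuming the `@logtape/logtape` package as documented; configuration option names and the lazy-properties callback form should be checked against the current LogTape docs for your version:

```typescript
// Minimal LogTape sketch: configure once at the application entry point,
// then get child loggers by category elsewhere.
import { configure, getConsoleSink, getLogger } from "@logtape/logtape";

await configure({
  sinks: { console: getConsoleSink() },
  loggers: [{ category: "my-app", sinks: ["console"] }],
});

const logger = getLogger(["my-app", "auth"]);

// Structured message syntax: placeholders are filled from the properties object.
logger.info("User {userId} signed in from {ip}", { userId: 42, ip: "203.0.113.7" });

// Lazy evaluation: pass a callback so expensive values are only computed
// when the log level is actually enabled (form assumed from the docs).
logger.debug("Session state: {state}", () => ({ state: JSON.stringify({ ttl: 3600 }) }));
```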
customer-service-expert
Expert guidance for improving customer service assistants. Use when optimizing UX, response time, tone, wording, conversation flow, or evaluating customer service quality.
check-trajectories-workflow
Use Inspect Scout to analyze agent trajectories from evaluation log files. Runs default and custom scanners to detect external failures, formatting issues, reward hacking, and ethical refusals. Use when user asks to check/analyze agent trajectories. Trigger when the user asks you to run the "Check Agent Trajectories" workflow.
fast-mlx
Optimize MLX code for performance and memory. Use when asked to implement or speed up MLX models or algorithms, reduce latency/throughput bottlenecks, tune lazy evaluation, type promotion, fast ops, compilation, memory use, or profiling.
add-benchmark
Guide for adding a new benchmark or training environment to NeMo-Gym. Use when the user asks to add, create, or integrate a benchmark, evaluation, training environment, or resources server into NeMo-Gym. Also use when wrapping an existing 3rd-party benchmark library. Covers the full workflow: data preparation, resources server implementation, agent wiring, YAML config, testing, and reward profiling (baselining). Triggered by: "add benchmark", "new resources server", "integrate benchmark", "wrap benchmark", "add training environment", "add eval".
arc-quick
Lightweight 4-phase pipeline: Plan -> Work+Evaluate -> Review -> Mend. Chains devise --quick -> strive (with evaluator loop) -> appraise -> mend in one command. Work phase iterates up to max_iterations (default 3) with ward checks and quality signal detection between passes. Stagnation detection prevents infinite loops. Mend phase auto-fixes P1/P2 findings from the review TOME. Accepts a prompt string or existing plan file path. Recommends /rune:arc for complex plans (8+ tasks) unless --force is passed. Use when: "quick run", "fast pipeline", "plan and build", "nhanh", "chay nhanh", "quick arc", "simple pipeline", "4 steps", "plan work review mend", "quick", "arc-quick". <example> user: "/rune:arc-quick add a health check endpoint" assistant: "Starting quick pipeline: plan -> work -> review..." </example> <example> user: "/rune:arc-quick plans/my-plan.md" assistant: "Running quick pipeline on existing plan..." </example> <example> user: "/rune:arc-quick plans/complex-plan.md --force" assistant: "Force-running quick pipeline (skipping complexity warning)..." </example>
app-store-review
Evaluates code against Apple's App Store Review Guidelines. Use this skill when reviewing iOS, macOS, tvOS, watchOS, or visionOS app code (Swift, Objective-C, React Native, or Expo) to identify potential App Store rejection issues before submission. Triggers on tasks involving app review preparation, compliance checking, or App Store submission readiness.
value-realization
Analyze whether end users will discover clear value in product ideas. Use when: discussing product concepts, evaluating features, planning marketing strategies, analyzing user adoption problems, or when the user expresses uncertainty about product direction (e.g., 'evaluate this product idea', 'will users adopt this', 'why aren't users retaining', 'analyze the value proposition', 'product-market fit', 'user adoption analysis').
zoonk-business
Business decision-making framework for AI agents. Use when making strategic decisions, evaluating trade-offs, or ensuring alignment with Zoonk's mission and values.
ai-evals
Help users create and run AI evaluations. Use when someone is building evals for LLM products, measuring model quality, creating test cases, designing rubrics, or trying to systematically measure AI output quality.
evalyn-analyze
Use when analyzing evalyn evaluation results, investigating failures, comparing runs, or understanding agent performance
jetbrains-debugger
Guide for using JetBrains IDE Debugger MCP tools to programmatically debug applications. TRIGGER when ANY of these MCP tools are available: list_run_configurations, execute_run_configuration, start_debug_session, stop_debug_session, get_debug_session_status, list_debug_sessions, set_breakpoint, remove_breakpoint, list_breakpoints, resume_execution, pause_execution, step_over, step_into, step_out, run_to_line, get_stack_trace, select_stack_frame, list_threads, get_variables, set_variable, get_source_context, evaluate_expression. Use when debugging any application, investigating bugs, tracing execution flow, inspecting runtime state, or when the user says "debug", "breakpoint", "step through", "inspect variable", "why is this returning X", "trace execution", or similar debugging-related requests. PREFER the debugger over reading code and guessing when runtime behavior is unclear.
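A sketch of a typical tool-call sequence, assuming a generic MCP client behind a hypothetical `callTool` helper; the tool names come from the list above, but the argument shapes are illustrative only:

```typescript
// Hypothetical debugging sequence built from the MCP tools listed above.
// `callTool` stands in for whatever MCP client invocation the agent uses.
type CallTool = (name: string, args: Record<string, unknown>) => Promise<unknown>;

async function inspectReturnValue(callTool: CallTool): Promise<void> {
  // 1. Stop at the suspicious line instead of guessing from source alone.
  await callTool("set_breakpoint", { file: "src/OrderService.java", line: 128 });

  // 2. Launch the app under the debugger and wait for the breakpoint to hit.
  await callTool("start_debug_session", { configuration: "Run OrderService" });

  // 3. Inspect runtime state at the paused frame.
  const frames = await callTool("get_stack_trace", {});
  const locals = await callTool("get_variables", {});
  const result = await callTool("evaluate_expression", { expression: "order.total()" });
  console.log({ frames, locals, result });

  // 4. Step past the line and resume once the behavior is understood.
  await callTool("step_over", {});
  await callTool("resume_execution", {});
}
```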
OpenClaw Knowledge Coach
An OpenClaw-native skill for local knowledge retention. Import knowledge, generate practice, evaluate answers, and produce insight cards – all powered by the host agent's model, with **zero extra API key configuration**.
Mcp
MCP server providing OpenFeature SDK installation guides and OFREP flag evaluation
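For reference, flag evaluation through the OpenFeature server SDK looks roughly like this; a sketch assuming the `@openfeature/server-sdk` package, with the OFREP provider registration elided since its package name depends on your setup:

```typescript
// Sketch of OpenFeature flag evaluation (the backing provider, e.g. an
// OFREP-compatible one, is assumed to be registered elsewhere at startup).
import { OpenFeature } from "@openfeature/server-sdk";

async function isCheckoutV2Enabled(userId: string): Promise<boolean> {
  const client = OpenFeature.getClient();
  // The evaluation context lets the provider target specific users or segments.
  return client.getBooleanValue("checkout-v2", false, { targetingKey: userId });
}
```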
byob
Create custom LLM evaluation benchmarks using the BYOB decorator framework. Use when the user wants to (1) create a new benchmark from a dataset, (2) pick or write a scorer, (3) compile and run a BYOB benchmark, (4) containerize a benchmark, or (5) use LLM-as-Judge evaluation. Triggers on mentions of BYOB, custom benchmark, bring your own benchmark, scorer, or benchmark compilation.
browse-environments
Discover and inspect verifiers environments through the Prime ecosystem. Use when asked to find environments on the Hub, compare options, inspect metadata, check action status, pull local copies for inspection, or choose environment starting points before evaluation, training, or migration work.
evals-context
Provides context about the Roo Code evals system structure in this monorepo. Use when tasks mention "evals", "evaluation", "eval runs", "eval exercises", or working with the evals infrastructure. Helps distinguish between the evals execution system (packages/evals, apps/web-evals) and the public website evals display page (apps/web-roo-code/src/app/evals).
attend
Route upstream epistemic deficits and evaluate execution-time risks during AI operations. Scans for unresolved upstream protocol needs, materializes intent into tasks, classifies each for risk signals, delegates low-risk tasks to executor, and surfaces elevated-risk findings for user judgment. Type: (ExecutionBlind, User, EVALUATE, ExecutionContext) → SituatedExecution. Alias: Prosoche (προσοχή).
agent-reviews
Review and fix PR review bot findings on current PR, loop until resolved. Fetches unanswered bot comments, evaluates each finding, fixes real bugs, dismisses false positives, and replies to every comment with the outcome.
evidence-heavy-evaluator
Generate an evidence-first, read-only repository evaluation report with deterministic scoring and actionable recommendations. Use when the user asks to assess readiness, maintainability, release-readiness, documentation gaps, or engineering health and wants auditable artifacts (`json` + `markdown` + raw command logs).