evaluation

Skills tagged with #evaluation

@horiacristescu

judge

Blind evaluation pattern. Agent spawns an independent judge to evaluate an idea, plan, strategy, statement, or implementation. The judge sees the repo but not the conversation. Verdict goes to a shared markdown file.

horiacristescu/claude-playbook-plugin+1 more
19d ago
100
@HZYAI
MCP

RAGScore

Generate QA datasets & evaluate RAG systems. Privacy-first, any LLM, local or cloud.

mcp, github, ai, rag, llm
HZYAI/RagScore
19d ago
0
@pbakaus

resolve-agent-reviews

Resolve PR review bot findings on current PR. Fetches unanswered bot comments, evaluates each finding, fixes real bugs, dismisses false positives, replies to every comment, and watches for new findings until bots go quiet.

pbakaus/agent-reviews+2 more
19d ago
1050
@google-labs-code

fleet-triage

Cognitive triage of fleet audit findings. Read issue content, evaluate labeling accuracy, and determine open/close/dispatch/relabel actions for fleet-managed issues. Use when triaging undispatched issues or reviewing audit scan results.

google-labs-code/jules-sdk+5 more
18d ago
440
@mnemox-ai

idea-check

Pre-build reality check. Scans GitHub, HN, npm, PyPI, and Product Hunt for existing competitors before you build. Use when starting a new project, evaluating a side project idea, or doing a build-vs-buy decision.

mnemox-ai/idea-reality-mcp+2 more
18d ago
3210
@kevinrabun
MCP

Judges Panel

45 judges that evaluate AI-generated code for security, cost, and quality, with built-in AST analysis.

mcp, github, ai
kevinrabun/judges
19d ago
0
@castorini

Umbrela Eval

Analyze and compare umbrela evaluation results across backends, models, and configurations.

castorini/umbrela+3 more
18d ago
540
@shanev

decomplect

Architectural code analysis for design quality. Evaluates simplicity (Rich Hickey), functional core/imperative shell (Gary Bernhardt), and coupling (Constantine & Yourdon). Use for design review or architectural assessment.

shanev/skills+1 more
18d ago
710
@gradusnikov

Debug Java Applications in Eclipse

Full interactive debugging using Eclipse's JDT debugger. Set breakpoints, step through code, inspect variables, evaluate expressions, and hot-swap changes.

gradusnikov/eclipse-chatgpt-plugin+3 more
18d ago
1360
@AIDotNet

think

Deliberate reasoning skill: enforce multi-step analysis, hypothesis testing, and option evaluation before answering complex questions

AIDotNet/OpenDeepWiki
18d ago
2.9K 0
@PharMolix

admet-prediction

Predict comprehensive ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) properties for drug candidate molecules using GraphMVP ensemble models. Use this skill when: (1) Predicting blood-brain barrier penetration, (2) Assessing side effect profiles, (3) Estimating Caco-2 permeability, half-life, or LD50 toxicity, (4) Evaluating drug-likeness and safety of molecules.

admet, toxicity, drug-discovery, pharmacokinetics, graphmvp
PharMolix/OpenBioMed+41 more
18d ago
9780
@santifer

career-ops

AI job search command center -- evaluate offers, generate CVs, scan portals, track applications

santifer/career-ops+1 more
19d ago
5.4K 0
@collapseindex
MCP

CI-1T MCP

CI-1T prediction stability engine. Detect ghosts, evaluate drift, monitor fleets. 20 tools.

mcp, github
collapseindex/ci-1t-mcp
19d ago
0
@AMD-AGI

magpie

Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces. Creates kernel config YAMLs, discovers kernels in a project, and queries GPU specs. Use when the user mentions Magpie, kernel analyze or compare, HIP/CUDA kernel evaluation, vLLM/SGLang benchmark, gap analysis, TraceLens, creating kernel configs, or discovering GPU kernels.

AMD-AGI/Magpie
18d ago
450
@viktorbezdek

agent-project-development

This skill should be used when the user asks to "start an LLM project", "design batch pipeline", "evaluate task-model fit", "structure agent project", or mentions pipeline architecture, agent-assisted development, cost estimation, or choosing between LLM and traditional approaches. NOT for evaluating agent quality or building evaluation rubrics (use agent-evaluation), NOT for multi-agent coordination or agent handoffs (use multi-agent-patterns).

viktorbezdek/skillstack+48 more
3d ago
50
@greyhaven-ai

autocontext

Iterative strategy generation and evaluation system. Use when the user wants to evaluate agent output quality, run improvement loops, queue tasks for background evaluation, check run status, or discover available scenarios. Provides LLM-based judging with rubric-driven scoring.

greyhaven-ai/autocontext+1 more
19d ago
6640
@sind00
MCP

io.github.sind00/flippa-mcp

Search, analyze, and evaluate online businesses for sale on Flippa.com marketplace.

mcp, github, search
sind00/flippa-mcp
19d ago
0
@marcus-crane

predict-content

Predict what an article will say, then score how much it exceeded the prediction. Use when the user wants to evaluate whether an article is worth reading.

marcus-crane/dotfiles
18d ago
390
@peterzat

codefix

Fix BLOCK and WARN findings from CODEREVIEW.md. Reads findings as a spec, applies minimal targeted fixes. Never evaluates its own work; that is the reviewer's job. Invoked by /codereview, not directly by users.

peterzat/zat.env+1 more
19d ago
50
@ShaheerKhawaja

audit-and-fix

Composite: security audit -> production upgrade -> self-evaluation. Use when user says 'audit', 'check the codebase', 'find and fix issues', or 'is this production-ready'.

ShaheerKhawaja/ProductionOS+55 more
18d ago
60
@existential-birds

12-factor-apps

Perform 12-Factor App compliance analysis on any codebase. Use when evaluating application architecture, auditing SaaS applications, or reviewing cloud-native applications against the original 12-Factor methodology.

existential-birds/beagle+61 more
6d ago
380
@ZhangHanDong

agent-estimation

Accurately estimate AI agent work effort using the agent's own operational units (tool-call rounds) instead of human time. Use when asked to estimate, scope, plan, or evaluate how long a coding task will take. Prevents the common failure mode where agents anchor to human developer timelines and massively overestimate. Outputs a structured breakdown with round counts, risk factors, and a final wallclock conversion.

ZhangHanDong/agent-estimation
18d ago
670
@mcp-registry
MCP

Scite

Ground answers in scientific literature. Search full text, evaluate trust, access full-text articles

mcp, ai, search
19d ago
0
@truera

trulens-evaluation-setup

Configure feedback functions and selectors for TruLens evaluations

trulens, llm, evaluation, feedback, selectors
truera/trulens+4 more
18d ago
3.2K 0
@stone16

Harness — Multi-Agent Orchestration

Orchestrate complex tasks through Planning → Generation → Evaluation → Retro. Fresh sub-agents per checkpoint prevent drift. Retro accumulates learning across tasks.

stone16/harness-engineering-skills+1 more
16d ago
90
@mitchdenny

api-reviewer

Guidelines for reviewing API design in the Hex1b codebase. Use when evaluating public APIs, reviewing accessibility modifiers, or assessing whether new APIs follow project conventions.

mitchdenny/hex1b+8 more
19d ago
1140
@mcp-registry
MCP

io.github.othervibes/mcp-as-a-judge

MCP as a Judge: a behavioral MCP that strengthens AI coding assistants via explicit LLM evaluations

mcpgithubaillm
19d ago
0
@flpbalada

business-model-canvas

Design and analyze business models using the Business Model Canvas framework. Use when evaluating startups, planning new products, pivoting existing businesses, or understanding how companies create and capture value.

flpbalada/my-opencode-config+29 more
18d ago
1840
@iamtouchskyer

opc

OPC — One Person Company. Digraph-based task pipeline with independent multi-role evaluation. Builds, reviews, analyzes, and brainstorms with specialist agents. Every path ends with evaluation. /opc <task>, /opc -i <task>, /opc <role> [role...]

iamtouchskyer/opc
18d ago
840
@smixs

skill-conductor

Create, edit, evaluate, and package agent skills. Use when building a new skill from scratch, improving an existing skill, running evals to test a skill, benchmarking skill performance, optimizing a skill's description for better triggering, reviewing third-party skills for quality, or packaging skills for distribution. Not for using skills or general coding tasks.

smixs/skill-conductor
18d ago
430
@dahlia

logtape

Use this skill when writing any code that uses LogTape for logging in JavaScript or TypeScript. Covers getting loggers, the structured message syntax, configuration, library author rules, context, lazy evaluation, testing, and common mistakes to avoid. Trigger whenever the user is adding logging to a project, debugging log output, or integrating LogTape with a framework.

dahlia/logtape
18d ago
1.7K 0
@GitHamza0206

customer-service-expert

Expert guidance for improving customer service assistants. Use when optimizing UX, response time, tone, wording, conversation flow, or evaluating customer service quality.

GitHamza0206/simba
19d ago
1.4K0
@UKGovernmentBEIS

check-trajectories-workflow

Use Inspect Scout to analyze agent trajectories from evaluation log files. Runs default and custom scanners to detect external failures, formatting issues, reward hacking, and ethical refusals. Use when user asks to check/analyze agent trajectories. Trigger when the user asks you to run the "Check Agent Trajectories" workflow.

UKGovernmentBEIS/inspect_evals+11 more
18d ago
4060
@awni

fast-mlx

Optimize MLX code for performance and memory. Use when asked to implement or speed up MLX models or algorithms, reduce latency/throughput bottlenecks, tune lazy evaluation, type promotion, fast ops, compilation, memory use, or profiling.

awni/mlx-skills
18d ago
760
@NVIDIA-NeMo

add-benchmark

Guide for adding a new benchmark or training environment to NeMo-Gym. Use when the user asks to add, create, or integrate a benchmark, evaluation, training environment, or resources server into NeMo-Gym. Also use when wrapping an existing 3rd-party benchmark library. Covers the full workflow: data preparation, resources server implementation, agent wiring, YAML config, testing, and reward profiling (baselining). Triggered by: "add benchmark", "new resources server", "integrate benchmark", "wrap benchmark", "add training environment", "add eval".

NVIDIA-NeMo/Gym
18d ago
7220
@vinhnxv

arc-quick

Lightweight 4-phase pipeline: Plan -> Work+Evaluate -> Review -> Mend. Chains devise --quick -> strive (with evaluator loop) -> appraise -> mend in one command. Work phase iterates up to max_iterations (default 3) with ward checks and quality signal detection between passes. Stagnation detection prevents infinite loops. Mend phase auto-fixes P1/P2 findings from the review TOME. Accepts a prompt string or existing plan file path. Recommends /rune:arc for complex plans (8+ tasks) unless --force is passed. Use when: "quick run", "fast pipeline", "plan and build", "nhanh", "chay nhanh", "quick arc", "simple pipeline", "4 steps", "plan work review mend", "quick", "arc-quick". <example> user: "/rune:arc-quick add a health check endpoint" assistant: "Starting quick pipeline: plan -> work -> review..." </example> <example> user: "/rune:arc-quick plans/my-plan.md" assistant: "Running quick pipeline on existing plan..." </example> <example> user: "/rune:arc-quick plans/complex-plan.md --force" assistant: "Force-running quick pipeline (skipping complexity warning)..." </example>

vinhnxv/rune+24 more
16d ago
50
@safaiyeh

app-store-review

Evaluates code against Apple's App Store Review Guidelines. Use this skill when reviewing iOS, macOS, tvOS, watchOS, or visionOS app code (Swift, Objective-C, React Native, or Expo) to identify potential App Store rejection issues before submission. Triggers on tasks involving app review preparation, compliance checking, or App Store submission readiness.

safaiyeh/app-store-review-skill
19d ago
420
@Done-0

value-realization

Analyze whether end users will discover clear value in product ideas. Use when: discussing product concepts, evaluating features, planning marketing strategies, analyzing user adoption problems, or when the user expresses uncertainty about product direction (e.g., 'evaluate this product idea', 'will users adopt this', 'why aren't users retaining', 'analyze the value proposition', 'product-market fit', 'user adoption analysis').

Done-0/value-realization
19d ago
4960
@zoonk

zoonk-business

Business decision-making framework for AI agents. Use when making strategic decisions, evaluating trade-offs, or ensuring alignment with Zoonk's mission and values.

zoonk/zoonk+6 more
18d ago
1020
@RefoundAI

ai-evals

Help users create and run AI evaluations. Use when someone is building evals for LLM products, measuring model quality, creating test cases, designing rubrics, or trying to systematically measure AI output quality.

RefoundAI/lenny-skills+73 more
18d ago
4230
@shihongDev

evalyn-analyze

Use when analyzing evalyn evaluation results, investigating failures, comparing runs, or understanding agent performance

shihongDev/evalyn+3 more
18d ago
2550
@hechtcarmel

jetbrains-debugger

Guide for using JetBrains IDE Debugger MCP tools to programmatically debug applications. TRIGGER when ANY of these MCP tools are available: list_run_configurations, execute_run_configuration, start_debug_session, stop_debug_session, get_debug_session_status, list_debug_sessions, set_breakpoint, remove_breakpoint, list_breakpoints, resume_execution, pause_execution, step_over, step_into, step_out, run_to_line, get_stack_trace, select_stack_frame, list_threads, get_variables, set_variable, get_source_context, evaluate_expression. Use when debugging any application, investigating bugs, tracing execution flow, inspecting runtime state, or when the user says "debug", "breakpoint", "step through", "inspect variable", "why is this returning X", "trace execution", or similar debugging-related requests. PREFER the debugger over reading code and guessing when runtime behavior is unclear.

hechtcarmel/jetbrains-debugger-mcp-plugin
18d ago
400
@Sibo-Zhao

OpenClaw Knowledge Coach

An OpenClaw-native skill for local knowledge retention. Import knowledge, generate practice, evaluate answers, and produce insight cards — all powered by the host agent's model, with **zero extra API key configuration**.

Sibo-Zhao/OpenPraxis
18d ago
3680
@mcp-registry
MCP

Mcp

MCP server providing OpenFeature SDK installation guides and OFREP flag evaluation

mcp
19d ago
0
@NVIDIA-NeMo

byob

Create custom LLM evaluation benchmarks using the BYOB decorator framework. Use when the user wants to (1) create a new benchmark from a dataset, (2) pick or write a scorer, (3) compile and run a BYOB benchmark, (4) containerize a benchmark, or (5) use LLM-as-Judge evaluation. Triggers on mentions of BYOB, custom benchmark, bring your own benchmark, scorer, or benchmark compilation.

NVIDIA-NeMo/Evaluator+2 more
18d ago
2290
@PrimeIntellect-ai

browse-environments

Discover and inspect verifiers environments through the Prime ecosystem. Use when asked to find environments on the Hub, compare options, inspect metadata, check action status, pull local copies for inspection, or choose environment starting points before evaluation, training, or migration work.

PrimeIntellect-ai/verifiers+5 more
19d ago
3.9K 0
@RooCodeInc

evals-context

Provides context about the Roo Code evals system structure in this monorepo. Use when tasks mention "evals", "evaluation", "eval runs", "eval exercises", or working with the evals infrastructure. Helps distinguish between the evals execution system (packages/evals, apps/web-evals) and the public website evals display page (apps/web-roo-code/src/app/evals).

RooCodeInc/Roo-Code+2 more
18d ago
22.7K 0
@jongwony

attend

Route upstream epistemic deficits and evaluate execution-time risks during AI operations. Scans for unresolved upstream protocol needs, materializes intent into tasks, classifies each for risk signals, delegates low-risk tasks to executor, and surfaces elevated-risk findings for user judgment. Type: (ExecutionBlind, User, EVALUATE, ExecutionContext) → SituatedExecution. Alias: Prosoche(προσοχή).

jongwony/epistemic-protocols+29 more
1d ago
800
@PaulKinlan

agent-reviews

Review and fix PR review bot findings on current PR, loop until resolved. Fetches unanswered bot comments, evaluates each finding, fixes real bugs, dismisses false positives, and replies to every comment with the outcome.

PaulKinlan/Co-do+5 more
18d ago
750
@0xSero

evidence-heavy-evaluator

Generate an evidence-first, read-only repository evaluation report with deterministic scoring and actionable recommendations. Use when the user asks to assess readiness, maintainability, release-readiness, documentation gaps, or engineering health and wants auditable artifacts (`json` + `markdown` + raw command logs).

0xSero/vllm-studio
19d ago
2890