evaluation

Skills tagged with #evaluation

@horiacristescu

judge

Blind evaluation pattern. Agent spawns an independent judge to evaluate an idea, plan, strategy, statement, or implementation. The judge sees the repo but not the conversation. Verdict goes to a shared markdown file.

horiacristescu/claude-playbook-plugin+1 more
19d ago
100
@HZYAI
MCP

RAGScore

Generate QA datasets & evaluate RAG systems. Privacy-first, any LLM, local or cloud.

mcp, github, ai, rag, llm
HZYAI/RagScore
19d ago
0
@pbakaus

resolve-agent-reviews

Resolve PR review bot findings on current PR. Fetches unanswered bot comments, evaluates each finding, fixes real bugs, dismisses false positives, replies to every comment, and watches for new findings until bots go quiet.

pbakaus/agent-reviews+2 more
19d ago
1050
@google-labs-code

fleet-triage

Cognitive triage of fleet audit findings. Read issue content, evaluate labeling accuracy, and determine open/close/dispatch/relabel actions for fleet-managed issues. Use when triaging undispatched issues or reviewing audit scan results.

google-labs-code/jules-sdk+5 more
18d ago
440
@mnemox-ai

idea-check

Pre-build reality check. Scans GitHub, HN, npm, PyPI, and Product Hunt for existing competitors before you build. Use when starting a new project, evaluating a side project idea, or doing a build-vs-buy decision.

mnemox-ai/idea-reality-mcp+2 more
18d ago
3210
@kevinrabun
MCP

Judges Panel

45 judges that evaluate AI-generated code for security, cost, and quality, with built-in AST analysis.

mcp, github, ai
kevinrabun/judges
19d ago
0
@castorini

Umbrela Eval

Analyze and compare umbrela evaluation results across backends, models, and configurations.

castorini/umbrela+3 more
18d ago
540
@shanev

decomplect

Architectural code analysis for design quality. Evaluates simplicity (Rich Hickey), functional core/imperative shell (Gary Bernhardt), and coupling (Constantine & Yourdon). Use for design review or architectural assessment.

shanev/skills+1 more
18d ago
710
@gradusnikov

Debug Java Applications in Eclipse

Full interactive debugging using Eclipse's JDT debugger. Set breakpoints, step through code, inspect variables, evaluate expressions, and hot-swap changes.

gradusnikov/eclipse-chatgpt-plugin+3 more
18d ago
1360
@AIDotNet

think

Deliberate reasoning skill: enforce multi-step analysis, hypothesis testing, and option evaluation before answering complex questions

AIDotNet/OpenDeepWiki
18d ago
2.9K 0
@PharMolix

admet-prediction

Predict comprehensive ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) properties for drug candidate molecules using GraphMVP ensemble models. Use this skill when: (1) Predicting blood-brain barrier penetration, (2) Assessing side effect profiles, (3) Estimating Caco-2 permeability, half-life, or LD50 toxicity, (4) Evaluating drug-likeness and safety of molecules.

admet, toxicity, drug-discovery, pharmacokinetics, graphmvp
PharMolix/OpenBioMed+41 more
18d ago
9780
@santifer

career-ops

AI job search command center -- evaluate offers, generate CVs, scan portals, track applications

santifer/career-ops+1 more
19d ago
5.4K 0
@collapseindex
MCP

CI-1T MCP

CI-1T prediction stability engine. Detect ghosts, evaluate drift, monitor fleets. 20 tools.

mcp, github
collapseindex/ci-1t-mcp
19d ago
0
@AMD-AGI

magpie

Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces. Creates kernel config YAMLs, discovers kernels in a project, and queries GPU specs. Use when the user mentions Magpie, kernel analyze or compare, HIP/CUDA kernel evaluation, vLLM/SGLang benchmark, gap analysis, TraceLens, creating kernel configs, or discovering GPU kernels.

AMD-AGI/Magpie
18d ago
450
@viktorbezdek

agent-project-development

This skill should be used when the user asks to "start an LLM project", "design batch pipeline", "evaluate task-model fit", "structure agent project", or mentions pipeline architecture, agent-assisted development, cost estimation, or choosing between LLM and traditional approaches. NOT for evaluating agent quality or building evaluation rubrics (use agent-evaluation), NOT for multi-agent coordination or agent handoffs (use multi-agent-patterns).

viktorbezdek/skillstack+48 more
3d ago
50
@greyhaven-ai

autocontext

Iterative strategy generation and evaluation system. Use when the user wants to evaluate agent output quality, run improvement loops, queue tasks for background evaluation, check run status, or discover available scenarios. Provides LLM-based judging with rubric-driven scoring.

greyhaven-ai/autocontext+1 more
19d ago
6640
@sind00
MCP

io.github.sind00/flippa-mcp

Search, analyze, and evaluate online businesses for sale on Flippa.com marketplace.

mcp, github, search
sind00/flippa-mcp
19d ago
0
@marcus-crane

predict-content

Predict what an article will say, then score how much it exceeded the prediction. Use when the user wants to evaluate whether an article is worth reading.

marcus-crane/dotfiles
18d ago
390
@peterzat

codefix

Fix BLOCK and WARN findings from CODEREVIEW.md. Reads findings as a spec, applies minimal targeted fixes. Never evaluates its own work; that is the reviewer's job. Invoked by /codereview, not directly by users.

peterzat/zat.env+1 more
19d ago
50
@ShaheerKhawaja

audit-and-fix

Composite: security audit -> production upgrade -> self-evaluation. Use when user says 'audit', 'check the codebase', 'find and fix issues', or 'is this production-ready'.

ShaheerKhawaja/ProductionOS+55 more
18d ago
60
@existential-birds

12-factor-apps

Perform 12-Factor App compliance analysis on any codebase. Use when evaluating application architecture, auditing SaaS applications, or reviewing cloud-native applications against the original 12-Factor methodology.

existential-birds/beagle+61 more
6d ago
380
@ZhangHanDong

agent-estimation

Accurately estimate AI agent work effort using the agent's own operational units (tool-call rounds) instead of human time. Use when asked to estimate, scope, plan, or evaluate how long a coding task will take. Prevents the common failure mode where agents anchor to human developer timelines and massively overestimate. Outputs a structured breakdown with round counts, risk factors, and a final wallclock conversion.

ZhangHanDong/agent-estimation
18d ago
670
@mcp-registry
MCP

Scite

Ground answers in scientific literature. Search full text, evaluate trust, access full-text articles

mcp, ai, search
19d ago
0
@truera

trulens-evaluation-setup

Configure feedback functions and selectors for TruLens evaluations

trulens, llm, evaluation, feedback, selectors
truera/trulens+4 more
18d ago
3.2K 0
@stone16

Harness — Multi-Agent Orchestration

Orchestrate complex tasks through Planning → Generation → Evaluation → Retro. Fresh sub-agents per checkpoint prevent drift. Retro accumulates learning across tasks.

stone16/harness-engineering-skills+1 more
16d ago
90
@mitchdenny

api-reviewer

Guidelines for reviewing API design in the Hex1b codebase. Use when evaluating public APIs, reviewing accessibility modifiers, or assessing whether new APIs follow project conventions.

mitchdenny/hex1b+8 more
19d ago
1140
@mcp-registry
MCP

io.github.othervibes/mcp-as-a-judge

MCP as a Judge: a behavioral MCP that strengthens AI coding assistants via explicit LLM evaluations

mcpgithubaillm
19d ago
0
@flpbalada

business-model-canvas

Design and analyze business models using the Business Model Canvas framework. Use when evaluating startups, planning new products, pivoting existing businesses, or understanding how companies create and capture value.

flpbalada/my-opencode-config+29 more
18d ago
1840
@iamtouchskyer

opc

OPC — One Person Company. Digraph-based task pipeline with independent multi-role evaluation. Builds, reviews, analyzes, and brainstorms with specialist agents. Every path ends with evaluation. /opc <task>, /opc -i <task>, /opc <role> [role...]

iamtouchskyer/opc
18d ago
840
@smixs

skill-conductor

Create, edit, evaluate, and package agent skills. Use when building a new skill from scratch, improving an existing skill, running evals to test a skill, benchmarking skill performance, optimizing a skill's description for better triggering, reviewing third-party skills for quality, or packaging skills for distribution. Not for using skills or general coding tasks.

smixs/skill-conductor
18d ago
430
@dahlia

logtape

Use this skill when writing any code that uses LogTape for logging in JavaScript or TypeScript. Covers getting loggers, the structured message syntax, configuration, library author rules, context, lazy evaluation, testing, and common mistakes to avoid. Trigger whenever the user is adding logging to a project, debugging log output, or integrating LogTape with a framework.

dahlia/logtape
18d ago
1.7K 0
@GitHamza0206

customer-service-expert

Expert guidance for improving customer service assistants. Use when optimizing UX, response time, tone, wording, conversation flow, or evaluating customer service quality.

GitHamza0206/simba
19d ago
1.4K0
@UKGovernmentBEIS

check-trajectories-workflow

Use Inspect Scout to analyze agent trajectories from evaluation log files. Runs default and custom scanners to detect external failures, formatting issues, reward hacking, and ethical refusals. Use when user asks to check/analyze agent trajectories. Trigger when the user asks you to run the "Check Agent Trajectories" workflow.

UKGovernmentBEIS/inspect_evals+11 more
18d ago
4060
@awni

fast-mlx

Optimize MLX code for performance and memory. Use when asked to implement or speed up MLX models or algorithms, reduce latency/throughput bottlenecks, tune lazy evaluation, type promotion, fast ops, compilation, memory use, or profiling.

awni/mlx-skills
18d ago
760
@NVIDIA-NeMo

add-benchmark

Guide for adding a new benchmark or training environment to NeMo-Gym. Use when the user asks to add, create, or integrate a benchmark, evaluation, training environment, or resources server into NeMo-Gym. Also use when wrapping an existing 3rd-party benchmark library. Covers the full workflow: data preparation, resources server implementation, agent wiring, YAML config, testing, and reward profiling (baselining). Triggered by: "add benchmark", "new resources server", "integrate benchmark", "wrap benchmark", "add training environment", "add eval".

NVIDIA-NeMo/Gym
18d ago
7220
@vinhnxv

arc-quick

Lightweight 4-phase pipeline: Plan -> Work+Evaluate -> Review -> Mend. Chains devise --quick -> strive (with evaluator loop) -> appraise -> mend in one command. Work phase iterates up to max_iterations (default 3) with ward checks and quality signal detection between passes. Stagnation detection prevents infinite loops. Mend phase auto-fixes P1/P2 findings from the review TOME. Accepts a prompt string or existing plan file path. Recommends /rune:arc for complex plans (8+ tasks) unless --force is passed. Use when: "quick run", "fast pipeline", "plan and build", "nhanh", "chay nhanh", "quick arc", "simple pipeline", "4 steps", "plan work review mend", "quick", "arc-quick". <example> user: "/rune:arc-quick add a health check endpoint" assistant: "Starting quick pipeline: plan -> work -> review..." </example> <example> user: "/rune:arc-quick plans/my-plan.md" assistant: "Running quick pipeline on existing plan..." </example> <example> user: "/rune:arc-quick plans/complex-plan.md --force" assistant: "Force-running quick pipeline (skipping complexity warning)..." </example>

vinhnxv/rune+24 more
16d ago
50
@safaiyeh

app-store-review

Evaluates code against Apple's App Store Review Guidelines. Use this skill when reviewing iOS, macOS, tvOS, watchOS, or visionOS app code (Swift, Objective-C, React Native, or Expo) to identify potential App Store rejection issues before submission. Triggers on tasks involving app review preparation, compliance checking, or App Store submission readiness.

safaiyeh/app-store-review-skill
19d ago
420
@Done-0

value-realization

Analyze whether end users will discover clear value in product ideas. Use when: discussing product concepts, evaluating features, planning marketing strategies, analyzing user adoption problems, or when the user expresses uncertainty about product direction (e.g., 'evaluate this product idea', 'will users adopt this', 'why aren't users retaining', 'analyze the value proposition', 'product-market fit', 'user adoption analysis').

Done-0/value-realization
19d ago
4960
@zoonk

zoonk-business

Business decision-making framework for AI agents. Use when making strategic decisions, evaluating trade-offs, or ensuring alignment with Zoonk's mission and values.

zoonk/zoonk+6 more
18d ago
1020
@RefoundAI

ai-evals

Help users create and run AI evaluations. Use when someone is building evals for LLM products, measuring model quality, creating test cases, designing rubrics, or trying to systematically measure AI output quality.

RefoundAI/lenny-skills+73 more
18d ago
4230
@shihongDev

evalyn-analyze

Use when analyzing evalyn evaluation results, investigating failures, comparing runs, or understanding agent performance

shihongDev/evalyn+3 more
18d ago
2550
@hechtcarmel

jetbrains-debugger

Guide for using JetBrains IDE Debugger MCP tools to programmatically debug applications. TRIGGER when ANY of these MCP tools are available: list_run_configurations, execute_run_configuration, start_debug_session, stop_debug_session, get_debug_session_status, list_debug_sessions, set_breakpoint, remove_breakpoint, list_breakpoints, resume_execution, pause_execution, step_over, step_into, step_out, run_to_line, get_stack_trace, select_stack_frame, list_threads, get_variables, set_variable, get_source_context, evaluate_expression. Use when debugging any application, investigating bugs, tracing execution flow, inspecting runtime state, or when the user says "debug", "breakpoint", "step through", "inspect variable", "why is this returning X", "trace execution", or similar debugging-related requests. PREFER the debugger over reading code and guessing when runtime behavior is unclear.

hechtcarmel/jetbrains-debugger-mcp-plugin
18d ago
400
@Sibo-Zhao

OpenClaw Knowledge Coach

An OpenClaw-native skill for local knowledge retention. Import knowledge, generate practice, evaluate answers, and produce insight cards — all powered by the host agent's model, with **zero extra API key configuration**.

Sibo-Zhao/OpenPraxis
18d ago
3680
@mcp-registry
MCP

Mcp

MCP server providing OpenFeature SDK installation guides and OFREP flag evaluation

mcp
19d ago
0
@NVIDIA-NeMo

byob

Create custom LLM evaluation benchmarks using the BYOB decorator framework. Use when the user wants to (1) create a new benchmark from a dataset, (2) pick or write a scorer, (3) compile and run a BYOB benchmark, (4) containerize a benchmark, or (5) use LLM-as-Judge evaluation. Triggers on mentions of BYOB, custom benchmark, bring your own benchmark, scorer, or benchmark compilation.

NVIDIA-NeMo/Evaluator+2 more
18d ago
2290
@PrimeIntellect-ai

browse-environments

Discover and inspect verifiers environments through the Prime ecosystem. Use when asked to find environments on the Hub, compare options, inspect metadata, check action status, pull local copies for inspection, or choose environment starting points before evaluation, training, or migration work.

PrimeIntellect-ai/verifiers+5 more
19d ago
3.9K 0
@RooCodeInc

evals-context

Provides context about the Roo Code evals system structure in this monorepo. Use when tasks mention "evals", "evaluation", "eval runs", "eval exercises", or working with the evals infrastructure. Helps distinguish between the evals execution system (packages/evals, apps/web-evals) and the public website evals display page (apps/web-roo-code/src/app/evals).

RooCodeInc/Roo-Code+2 more
18d ago
22.7K 0
@jongwony

attend

Route upstream epistemic deficits and evaluate execution-time risks during AI operations. Scans for unresolved upstream protocol needs, materializes intent into tasks, classifies each for risk signals, delegates low-risk tasks to executor, and surfaces elevated-risk findings for user judgment. Type: (ExecutionBlind, User, EVALUATE, ExecutionContext) → SituatedExecution. Alias: Prosoche(προσοχή).

jongwony/epistemic-protocols+29 more
1d ago
800
@PaulKinlan

agent-reviews

Review and fix PR review bot findings on current PR, loop until resolved. Fetches unanswered bot comments, evaluates each finding, fixes real bugs, dismisses false positives, and replies to every comment with the outcome.

PaulKinlan/Co-do+5 more
18d ago
750
@0xSero

evidence-heavy-evaluator

Generate an evidence-first, read-only repository evaluation report with deterministic scoring and actionable recommendations. Use when the user asks to assess readiness, maintainability, release-readiness, documentation gaps, or engineering health and wants auditable artifacts (`json` + `markdown` + raw command logs).

0xSero/vllm-studio
19d ago
2890