judge
Blind evaluation pattern. Agent spawns an independent judge to evaluate an idea, plan, strategy, statement, or implementation. The judge sees the repo but not the conversation. Verdict goes to a shared markdown file.
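A minimal sketch of the blind-judge pattern, assuming a hypothetical `spawnAgent` helper and output filename; the skill's actual spawning mechanism and file layout may differ:

```typescript
// Hypothetical sketch of the blind-judge pattern: the judge sees the repo
// path and the artifact under review, but never the parent conversation.
import { appendFile } from "node:fs/promises";

type SpawnAgent = (opts: { prompt: string; cwd: string }) => Promise<string>;

async function blindJudge(
  spawnAgent: SpawnAgent, // assumed: runs a fresh agent with no shared history
  repoPath: string,
  artifact: string,
): Promise<void> {
  const verdict = await spawnAgent({
    cwd: repoPath, // the judge may read the repo...
    prompt: `Evaluate the following independently:\n\n${artifact}`, // ...but gets no prior chat
  });
  // The verdict goes to a shared markdown file that the parent agent reads back.
  await appendFile("JUDGE_VERDICTS.md", `\n## Verdict\n\n${verdict}\n`);
}
```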
RAGScore
Generate QA datasets & evaluate RAG systems. Privacy-first, any LLM, local or cloud.
resolve-agent-reviews
Resolve PR review bot findings on current PR. Fetches unanswered bot comments, evaluates each finding, fixes real bugs, dismisses false positives, replies to every comment, and watches for new findings until bots go quiet.
fleet-triage
Cognitive triage of fleet audit findings. Read issue content, evaluate labeling accuracy, and determine open/close/dispatch/relabel actions for fleet-managed issues. Use when triaging undispatched issues or reviewing audit scan results.
idea-check
Pre-build reality check. Scans GitHub, HN, npm, PyPI, and Product Hunt for existing competitors before you build. Use when starting a new project, evaluating a side project idea, or doing a build-vs-buy decision.
Judges Panel
45 judges that evaluate AI-generated code for security, cost, and quality with built-in AST analysis.
Umbrela Eval
Analyze and compare umbrela evaluation results across backends, models, and configurations.
decomplect
Architectural code analysis for design quality. Evaluates simplicity (Rich Hickey), functional core/imperative shell (Gary Bernhardt), and coupling (Constantine & Yourdon). Use for design review or architectural assessment.
Debug Java Applications in Eclipse
Full interactive debugging using Eclipse's JDT debugger. Set breakpoints, step through code, inspect variables, evaluate expressions, and hot-swap changes.
think
Deliberate reasoning skill: enforce multi-step analysis, hypothesis testing, and option evaluation before answering complex questions
admet-prediction
Predict comprehensive ADMET (Absorption, Distribution, Metabolism, Excretion, Toxicity) properties for drug candidate molecules using GraphMVP ensemble models. Use this skill when: (1) Predicting blood-brain barrier penetration, (2) Assessing side effect profiles, (3) Estimating Caco-2 permeability, half-life, or LD50 toxicity, (4) Evaluating drug-likeness and safety of molecules.
career-ops
AI job search command center -- evaluate offers, generate CVs, scan portals, track applications
Ci1t Mcp
CI-1T prediction stability engine. Detect ghosts, evaluate drift, monitor fleets. 20 tools.
magpie
Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces. Creates kernel config YAMLs, discovers kernels in a project, and queries GPU specs. Use when the user mentions Magpie, kernel analyze or compare, HIP/CUDA kernel evaluation, vLLM/SGLang benchmark, gap analysis, TraceLens, creating kernel configs, or discovering GPU kernels.
agent-project-development
This skill should be used when the user asks to "start an LLM project", "design batch pipeline", "evaluate task-model fit", "structure agent project", or mentions pipeline architecture, agent-assisted development, cost estimation, or choosing between LLM and traditional approaches. NOT for evaluating agent quality or building evaluation rubrics (use agent-evaluation), NOT for multi-agent coordination or agent handoffs (use multi-agent-patterns).
autocontext
Iterative strategy generation and evaluation system. Use when the user wants to evaluate agent output quality, run improvement loops, queue tasks for background evaluation, check run status, or discover available scenarios. Provides LLM-based judging with rubric-driven scoring.
Io.Github.Sind00/Flippa Mcp
Search, analyze, and evaluate online businesses for sale on Flippa.com marketplace.
predict-content
Predict what an article will say, then score how much it exceeded the prediction. Use when the user wants to evaluate whether an article is worth reading.
codefix
Fix BLOCK and WARN findings from CODEREVIEW.md. Reads findings as a spec, applies minimal targeted fixes. Never evaluates its own work; that is the reviewer's job. Invoked by /codereview, not directly by users.
audit-and-fix
Composite: security audit -> production upgrade -> self-evaluation. Use when user says 'audit', 'check the codebase', 'find and fix issues', or 'is this production-ready'.
12-factor-apps
Perform 12-Factor App compliance analysis on any codebase. Use when evaluating application architecture, auditing SaaS applications, or reviewing cloud-native applications against the original 12-Factor methodology.
agent-estimation
Accurately estimate AI agent work effort using the agent's own operational units (tool-call rounds) instead of human time. Use when asked to estimate, scope, plan, or evaluate how long a coding task will take. Prevents the common failure mode where agents anchor to human developer timelines and massively overestimate. Outputs a structured breakdown with round counts, risk factors, and a final wallclock conversion.
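As an illustration of the round-based approach, here is a hypothetical sketch; the item names, risk multipliers, and seconds-per-round figure are made up for the example and are not part of the skill:

```typescript
// Hypothetical sketch: estimate in tool-call rounds, then convert to wallclock last.
interface WorkItem {
  name: string;
  rounds: number;      // estimated tool-call rounds, not human hours
  riskFactor: number;  // multiplier for uncertainty (1.0 = well understood)
}

function estimate(items: WorkItem[], secondsPerRound = 20): void {
  let total = 0;
  for (const item of items) {
    const adjusted = Math.ceil(item.rounds * item.riskFactor);
    total += adjusted;
    console.log(`${item.name}: ~${adjusted} rounds`);
  }
  // Wallclock conversion happens once at the end, never per item,
  // to avoid anchoring each line to human-developer timelines.
  const minutes = Math.round((total * secondsPerRound) / 60);
  console.log(`Total: ~${total} rounds (~${minutes} min wallclock)`);
}

estimate([
  { name: "read existing code", rounds: 4, riskFactor: 1.0 },
  { name: "implement change", rounds: 8, riskFactor: 1.5 },
  { name: "run tests and fix", rounds: 6, riskFactor: 1.25 },
]);
```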
Scite
Ground answers in scientific literature. Search full text, evaluate trust, access full-text articles
trulens-evaluation-setup
Configure feedback functions and selectors for TruLens evaluations
Harness – Multi-Agent Orchestration
Orchestrate complex tasks through Planning → Generation → Evaluation → Retro. Fresh sub-agents per checkpoint prevent drift. Retro accumulates learning across tasks.
api-reviewer
Guidelines for reviewing API design in the Hex1b codebase. Use when evaluating public APIs, reviewing accessibility modifiers, or assessing whether new APIs follow project conventions.
Io.Github.OtherVibes/Mcp As A Judge
MCP as a Judge: a behavioral MCP that strengthens AI coding assistants via explicit LLM evaluations
business-model-canvas
Design and analyze business models using the Business Model Canvas framework. Use when evaluating startups, planning new products, pivoting existing businesses, or understanding how companies create and capture value.
opc
OPC – One Person Company. Digraph-based task pipeline with independent multi-role evaluation. Builds, reviews, analyzes, and brainstorms with specialist agents. Every path ends with evaluation. /opc <task>, /opc -i <task>, /opc <role> [role...]
skill-conductor
Create, edit, evaluate, and package agent skills. Use when building a new skill from scratch, improving an existing skill, running evals to test a skill, benchmarking skill performance, optimizing a skill's description for better triggering, reviewing third-party skills for quality, or packaging skills for distribution. Not for using skills or general coding tasks.
logtape
Use this skill when writing any code that uses LogTape for logging in JavaScript or TypeScript. Covers getting loggers, the structured message syntax, configuration, library author rules, context, lazy evaluation, testing, and common mistakes to avoid. Trigger whenever the user is adding logging to a project, debugging log output, or integrating LogTape with a framework.
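A minimal usage sketch, assuming the `@logtape/logtape` package as documented; configuration option names and the lazy-properties callback form should be checked against the current LogTape docs for your version:

```typescript
// Minimal LogTape sketch: configure once at the application entry point,
// then get child loggers by category elsewhere.
import { configure, getConsoleSink, getLogger } from "@logtape/logtape";

await configure({
  sinks: { console: getConsoleSink() },
  loggers: [{ category: "my-app", sinks: ["console"] }],
});

const logger = getLogger(["my-app", "auth"]);

// Structured message syntax: placeholders are filled from the properties object.
logger.info("User {userId} signed in from {ip}", { userId: 42, ip: "203.0.113.7" });

// Lazy evaluation: pass a callback so expensive values are only computed
// when the log level is actually enabled (form assumed from the docs).
logger.debug("Session state: {state}", () => ({ state: JSON.stringify({ ttl: 3600 }) }));
```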
customer-service-expert
Expert guidance for improving customer service assistants. Use when optimizing UX, response time, tone, wording, conversation flow, or evaluating customer service quality.
check-trajectories-workflow
Use Inspect Scout to analyze agent trajectories from evaluation log files. Runs default and custom scanners to detect external failures, formatting issues, reward hacking, and ethical refusals. Use when user asks to check/analyze agent trajectories. Trigger when the user asks you to run the "Check Agent Trajectories" workflow.
fast-mlx
Optimize MLX code for performance and memory. Use when asked to implement or speed up MLX models or algorithms, reduce latency/throughput bottlenecks, tune lazy evaluation, type promotion, fast ops, compilation, memory use, or profiling.
add-benchmark
Guide for adding a new benchmark or training environment to NeMo-Gym. Use when the user asks to add, create, or integrate a benchmark, evaluation, training environment, or resources server into NeMo-Gym. Also use when wrapping an existing 3rd-party benchmark library. Covers the full workflow: data preparation, resources server implementation, agent wiring, YAML config, testing, and reward profiling (baselining). Triggered by: "add benchmark", "new resources server", "integrate benchmark", "wrap benchmark", "add training environment", "add eval".
arc-quick
Lightweight 4-phase pipeline: Plan -> Work+Evaluate -> Review -> Mend. Chains devise --quick -> strive (with evaluator loop) -> appraise -> mend in one command. Work phase iterates up to max_iterations (default 3) with ward checks and quality signal detection between passes. Stagnation detection prevents infinite loops. Mend phase auto-fixes P1/P2 findings from the review TOME. Accepts a prompt string or existing plan file path. Recommends /rune:arc for complex plans (8+ tasks) unless --force is passed. Use when: "quick run", "fast pipeline", "plan and build", "nhanh", "chay nhanh", "quick arc", "simple pipeline", "4 steps", "plan work review mend", "quick", "arc-quick". <example> user: "/rune:arc-quick add a health check endpoint" assistant: "Starting quick pipeline: plan -> work -> review..." </example> <example> user: "/rune:arc-quick plans/my-plan.md" assistant: "Running quick pipeline on existing plan..." </example> <example> user: "/rune:arc-quick plans/complex-plan.md --force" assistant: "Force-running quick pipeline (skipping complexity warning)..." </example>
app-store-review
Evaluates code against Apple's App Store Review Guidelines. Use this skill when reviewing iOS, macOS, tvOS, watchOS, or visionOS app code (Swift, Objective-C, React Native, or Expo) to identify potential App Store rejection issues before submission. Triggers on tasks involving app review preparation, compliance checking, or App Store submission readiness.
value-realization
Analyze whether end users will discover clear value in product ideas. Use when: discussing product concepts, evaluating features, planning marketing strategies, analyzing user adoption problems, or when the user expresses uncertainty about product direction (e.g., 'evaluate this product idea', 'will users adopt this', 'why aren't users retaining', 'analyze the value proposition', 'product-market fit', 'user adoption analysis').
zoonk-business
Business decision-making framework for AI agents. Use when making strategic decisions, evaluating trade-offs, or ensuring alignment with Zoonk's mission and values.
ai-evals
Help users create and run AI evaluations. Use when someone is building evals for LLM products, measuring model quality, creating test cases, designing rubrics, or trying to systematically measure AI output quality.
evalyn-analyze
Use when analyzing evalyn evaluation results, investigating failures, comparing runs, or understanding agent performance
jetbrains-debugger
Guide for using JetBrains IDE Debugger MCP tools to programmatically debug applications. TRIGGER when ANY of these MCP tools are available: list_run_configurations, execute_run_configuration, start_debug_session, stop_debug_session, get_debug_session_status, list_debug_sessions, set_breakpoint, remove_breakpoint, list_breakpoints, resume_execution, pause_execution, step_over, step_into, step_out, run_to_line, get_stack_trace, select_stack_frame, list_threads, get_variables, set_variable, get_source_context, evaluate_expression. Use when debugging any application, investigating bugs, tracing execution flow, inspecting runtime state, or when the user says "debug", "breakpoint", "step through", "inspect variable", "why is this returning X", "trace execution", or similar debugging-related requests. PREFER the debugger over reading code and guessing when runtime behavior is unclear.
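A sketch of a typical tool-call sequence, assuming a generic MCP client behind a hypothetical `callTool` helper; the tool names come from the list above, but the argument shapes are illustrative only:

```typescript
// Hypothetical debugging sequence built from the MCP tools listed above.
// `callTool` stands in for whatever MCP client invocation the agent uses.
type CallTool = (name: string, args: Record<string, unknown>) => Promise<unknown>;

async function inspectReturnValue(callTool: CallTool): Promise<void> {
  // 1. Stop at the suspicious line instead of guessing from source alone.
  await callTool("set_breakpoint", { file: "src/OrderService.java", line: 128 });

  // 2. Launch the app under the debugger and wait for the breakpoint to hit.
  await callTool("start_debug_session", { configuration: "Run OrderService" });

  // 3. Inspect runtime state at the paused frame.
  const frames = await callTool("get_stack_trace", {});
  const locals = await callTool("get_variables", {});
  const result = await callTool("evaluate_expression", { expression: "order.total()" });
  console.log({ frames, locals, result });

  // 4. Step past the line and resume once the behavior is understood.
  await callTool("step_over", {});
  await callTool("resume_execution", {});
}
```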
OpenClaw Knowledge Coach
An OpenClaw-native skill for local knowledge retention. Import knowledge, generate practice, evaluate answers, and produce insight cards – all powered by the host agent's model, with **zero extra API key configuration**.
Mcp
MCP server providing OpenFeature SDK installation guides and OFREP flag evaluation
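For reference, flag evaluation through the OpenFeature server SDK looks roughly like this; a sketch assuming the `@openfeature/server-sdk` package, with the OFREP provider registration elided since its package name depends on your setup:

```typescript
// Sketch of OpenFeature flag evaluation (the backing provider, e.g. an
// OFREP-compatible one, is assumed to be registered elsewhere at startup).
import { OpenFeature } from "@openfeature/server-sdk";

async function isCheckoutV2Enabled(userId: string): Promise<boolean> {
  const client = OpenFeature.getClient();
  // The evaluation context lets the provider target specific users or segments.
  return client.getBooleanValue("checkout-v2", false, { targetingKey: userId });
}
```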
byob
Create custom LLM evaluation benchmarks using the BYOB decorator framework. Use when the user wants to (1) create a new benchmark from a dataset, (2) pick or write a scorer, (3) compile and run a BYOB benchmark, (4) containerize a benchmark, or (5) use LLM-as-Judge evaluation. Triggers on mentions of BYOB, custom benchmark, bring your own benchmark, scorer, or benchmark compilation.
browse-environments
Discover and inspect verifiers environments through the Prime ecosystem. Use when asked to find environments on the Hub, compare options, inspect metadata, check action status, pull local copies for inspection, or choose environment starting points before evaluation, training, or migration work.
evals-context
Provides context about the Roo Code evals system structure in this monorepo. Use when tasks mention "evals", "evaluation", "eval runs", "eval exercises", or working with the evals infrastructure. Helps distinguish between the evals execution system (packages/evals, apps/web-evals) and the public website evals display page (apps/web-roo-code/src/app/evals).
attend
Route upstream epistemic deficits and evaluate execution-time risks during AI operations. Scans for unresolved upstream protocol needs, materializes intent into tasks, classifies each for risk signals, delegates low-risk tasks to executor, and surfaces elevated-risk findings for user judgment. Type: (ExecutionBlind, User, EVALUATE, ExecutionContext) → SituatedExecution. Alias: Prosoche (προσοχή).
agent-reviews
Review and fix PR review bot findings on current PR, loop until resolved. Fetches unanswered bot comments, evaluates each finding, fixes real bugs, dismisses false positives, and replies to every comment with the outcome.
evidence-heavy-evaluator
Generate an evidence-first, read-only repository evaluation report with deterministic scoring and actionable recommendations. Use when the user asks to assess readiness, maintainability, release-readiness, documentation gaps, or engineering health and wants auditable artifacts (`json` + `markdown` + raw command logs).