evals

Skills tagged with #evals

@castorini

Umbrela Eval

Analyze and compare UMBRELA evaluation results across backends, models, and configurations.

castorini/umbrela +3 more
18d ago
54 · 0
@iris-eval
MCP

io.github.iris-eval/mcp-server

The agent eval standard for MCP. Score every agent output for quality, safety, and cost.

mcp · github
iris-eval/mcp-server
19d ago
0
@AgriciDaniel

skill-forge-benchmark

Benchmark Claude Code skill performance with variance analysis, tracking pass rate, execution time, and token usage across iterations. Runs multiple trials per eval for statistical reliability, aggregates results into benchmark.json, and generates comparison reports between skill versions. Use when user says "benchmark skill", "measure skill performance", "skill metrics", "compare skill versions", "skill performance", "track skill improvement", "skill regression test", or "skill A/B test".

AgriciDaniel/skill-forge +6 more
18d ago
42 · 0
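
The multi-trial aggregation skill-forge-benchmark describes can be pictured with a short sketch. Everything below is a hypothetical illustration, assuming per-trial records with pass, duration_s, and tokens fields; the skill's actual benchmark.json schema may differ.

```python
import json
import statistics

# Hypothetical per-trial records; the skill's real schema may differ.
trials = [
    {"pass": True,  "duration_s": 12.4, "tokens": 8150},
    {"pass": True,  "duration_s": 11.9, "tokens": 7980},
    {"pass": False, "duration_s": 14.2, "tokens": 9020},
]

durations = [t["duration_s"] for t in trials]
benchmark = {
    "trials": len(trials),
    "pass_rate": sum(t["pass"] for t in trials) / len(trials),
    "duration_mean_s": statistics.mean(durations),
    # Variance across iterations is what single-run numbers miss.
    "duration_stdev_s": statistics.stdev(durations),
    "tokens_mean": statistics.mean(t["tokens"] for t in trials),
}

# Aggregate results into benchmark.json, per the skill description.
with open("benchmark.json", "w") as f:
    json.dump(benchmark, f, indent=2)
```
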
@smixs

skill-conductor

Create, edit, evaluate, and package agent skills. Use when building a new skill from scratch, improving an existing skill, running evals to test a skill, benchmarking skill performance, optimizing a skill's description for better triggering, reviewing third-party skills for quality, or packaging skills for distribution. Not for using skills or general coding tasks.

smixs/skill-conductor
18d ago
43 · 0
@dbsectrainer
MCP

MCP Eval Runner

A standardized testing harness for MCP servers and agent workflows.

mcp · github · ai
dbsectrainer/mcp-eval-runner
19d ago
0
@NVIDIA-NeMo

add-benchmark

Guide for adding a new benchmark or training environment to NeMo-Gym. Use when the user asks to add, create, or integrate a benchmark, evaluation, training environment, or resources server into NeMo-Gym. Also use when wrapping an existing 3rd-party benchmark library. Covers the full workflow: data preparation, resources server implementation, agent wiring, YAML config, testing, and reward profiling (baselining). Triggered by: "add benchmark", "new resources server", "integrate benchmark", "wrap benchmark", "add training environment", "add eval".

NVIDIA-NeMo/Gym
18d ago
722 · 0
@RefoundAI

ai-evals

Help users create and run AI evaluations. Use when someone is building evals for LLM products, measuring model quality, creating test cases, designing rubrics, or trying to systematically measure AI output quality.

RefoundAI/lenny-skills +73 more
18d ago
423 · 0
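
For a sense of what "building evals" means in practice, here is a minimal sketch: test cases, a rubric-style grader, and an aggregate pass rate. The names and the substring rubric are illustrative assumptions, not this skill's actual API.

```python
# Illustrative only: a tiny eval loop with a substring rubric.
test_cases = [
    {"input": "Summarize our refund policy.", "must_include": ["30 days"]},
    {"input": "List supported regions.", "must_include": ["EU", "US"]},
]

def grade(output: str, case: dict) -> bool:
    # Simplest possible rubric: every required phrase appears in the output.
    return all(phrase in output for phrase in case["must_include"])

def run_eval(model_fn, cases) -> float:
    # Pass rate across all test cases.
    return sum(grade(model_fn(c["input"]), c) for c in cases) / len(cases)

# Stub model for demonstration; swap in a real LLM call.
print(run_eval(lambda prompt: "Refunds within 30 days; EU and US supported.",
               test_cases))
```
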
@RooCodeInc

evals-context

Provides context about the Roo Code evals system structure in this monorepo. Use when tasks mention "evals", "evaluation", "eval runs", "eval exercises", or working with the evals infrastructure. Helps distinguish between the evals execution system (packages/evals, apps/web-evals) and the public website evals display page (apps/web-roo-code/src/app/evals).

RooCodeInc/Roo-Code +2 more
18d ago
22.7K · 0
@przeprogramowani

10x-eval-model

Set up and run benchmark evaluations for new LLM models in the 10xBench project. Use when the user wants to add a new model to the benchmark, prepare evaluation directories, update metadata, or launch evaluation runs. Triggers on phrases like "eval model", "add model to benchmark", "run benchmark for [model]", "evaluate [model-name]", "set up [model] for eval", or any request involving adding a new model to the Przeprogramowani.pl benchmark pipeline.

przeprogramowani/10x-bench +5 more
3d ago
5 · 0
@openai

bootstrap-realtime-eval

Bootstrap a new realtime eval folder inside this cookbook repo by choosing the right harness from examples/evals/realtime_evals, scaffolding prompt/tools/data files, generating a useful README, and validating it with smoke, full eval, and test runs. Use when a user wants to start a new crawl, walk, or run realtime eval in this repository.

openai/openai-cookbook
18d ago
72.1K · 0
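
The scaffolding step bootstrap-realtime-eval performs might look roughly like the sketch below. The subdirectory names and README text are assumptions for illustration; the real harnesses under examples/evals/realtime_evals define the actual layout.

```python
from pathlib import Path

# Hypothetical scaffold for a new realtime eval folder; treat these
# paths and file names as placeholders, not the cookbook's real layout.
def scaffold(name: str) -> Path:
    root = Path("examples/evals/realtime_evals") / name
    for sub in ("prompts", "tools", "data"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    (root / "README.md").write_text(
        f"# {name}\n\nValidate with a smoke run, then a full eval, then tests.\n"
    )
    return root

scaffold("my_new_eval")
```
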
@salespeak-ai

buyer-eval

Structured B2B software vendor evaluation for buyers. Researches your company, asks domain-expert questions, engages vendor AI agents via the Salespeak Frontdoor API, scores vendors across 7 dimensions, and produces a comparative recommendation with evidence transparency. Use when asked to evaluate, compare, or research B2B software vendors.

salespeak-ai/buyer-eval-skill
18d ago
49 · 0
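
Mechanically, buyer-eval's seven-dimension scoring reduces to a weighted average. The dimension names and weights below are invented for illustration; the skill defines its own rubric.

```python
# Invented dimensions and weights; the skill's actual rubric differs.
WEIGHTS = {
    "product_fit": 0.25, "security": 0.20, "integrations": 0.15,
    "support": 0.10, "pricing": 0.10, "roadmap": 0.10, "references": 0.10,
}
assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9

def overall(scores: dict[str, float]) -> float:
    # Weighted average of 0-10 per-dimension scores.
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

vendors = {
    "vendor_a": {"product_fit": 8, "security": 7, "integrations": 9,
                 "support": 6, "pricing": 7, "roadmap": 8, "references": 9},
    "vendor_b": {"product_fit": 9, "security": 8, "integrations": 6,
                 "support": 8, "pricing": 6, "roadmap": 7, "references": 7},
}

# Rank vendors by weighted score to drive a comparative recommendation.
for name in sorted(vendors, key=lambda v: overall(vendors[v]), reverse=True):
    print(f"{name}: {overall(vendors[name]):.2f}")
```
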
@dcramer

agent-design-review

Designs, reviews, and iterates on LLM agents and agent-like workflows. Use when asked to "design an agent", "review this agent", "improve our system prompt", "optimize prompts for caching", "improve tool calling", "reduce hallucinated tool calls", "add structured outputs", "decide if this should be multi-agent", "reduce false positives", "tune agent thresholds", or "build evals for this agent". Covers architecture choice, cache-friendly prompt templates, tool and schema design, runtime loops, trust boundaries, and eval-driven iteration.

dcramer/peated +1 more
19d ago
90 · 0
@PKU-Alignment

VLA Remote Train Eval

PKU-Alignment/VLA-Arena
18d ago
132 · 0
@letta-ai

analyzing-eval-errors

Investigate errors in letta_evals runs by parsing results JSONL, cross-referencing agent and run state on the Letta server via the Python SDK, and producing structured error reports. Use when an eval run has errors, crashes, or unexpected failures that need diagnosis.

letta-ai/letta-evals
18d ago
60 · 0
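
The first step analyzing-eval-errors describes, parsing the results JSONL, might look like the sketch below. The file name and the status/error fields are assumptions about the record shape, and the cross-referencing against the Letta server via the Python SDK is omitted.

```python
import json
from collections import Counter

# Assumed record shape: one JSON object per line with "status" and
# "error" fields; letta_evals' actual schema may differ.
errors = []
with open("results.jsonl") as f:
    for line in f:
        record = json.loads(line)
        if record.get("status") == "error":
            errors.append(record)

# Bucket failures by the leading token of the error message so a
# structured report can show what dominates the run.
by_type = Counter(str(e.get("error", "unknown")).split(":")[0] for e in errors)
for error_type, count in by_type.most_common():
    print(f"{count:4d}  {error_type}")
```
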
@reidemeister94

ai-agent-bench

Use when the user wants to benchmark or compare AI agents (Claude Code, Codex, OpenCode) on a refactoring, perf, or code-change task in the current repo. Use when the user says "compare agents", "benchmark Claude vs Codex", "agent eval", "measure agent", "AI agent comparison", "agent trial", or /ai-agent-bench.

reidemeister94/development-skills +15 more
10d ago
9 · 0
@haoxiang-xu

pupu-test-api

Use when running QA / regression tests against PuPu, when verifying a code change actually works in the running app, or when reading PuPu UI/state without screenshotting manually. Triggers on tasks like "test that PuPu still creates chats correctly", "verify the new model selector works end-to-end", "send a message and check the response", "what's PuPu's current state?". Phase 1 covers chat lifecycle, message send (blocking), model/toolkit/character switching, logs, state snapshot, screenshot, eval.

haoxiang-xu/PuPu
12d ago
27 · 0
@iliaal

evolve-skill

Run the full skill evolution pipeline: harvest sessions, discover signals, build a golden dataset, eval the baseline, evolve via DSPy, and compare scores.

iliaal/compound-engineering-plugin +24 more
12d ago
5 · 0