Benchmark Evaluation

Skills tagged with #Benchmark Evaluation

@AMD-AGI

magpie

Performs GPU kernel correctness and performance evaluation, as well as LLM inference benchmarking, with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces. Creates kernel config YAMLs, discovers kernels in a project, and queries GPU specs. Use when the user mentions Magpie, kernel analysis or comparison, HIP/CUDA kernel evaluation, vLLM/SGLang benchmarks, gap analysis, TraceLens, creating kernel configs, or discovering GPU kernels.

AMD-AGI/Magpie
18d ago
450
@smixs

skill-conductor

Create, edit, evaluate, and package agent skills. Use when building a new skill from scratch, improving an existing skill, running evals to test a skill, benchmarking skill performance, optimizing a skill's description for better triggering, reviewing third-party skills for quality, or packaging skills for distribution. Not for using skills or general coding tasks.

smixs/skill-conductor
18d ago
430
@NVIDIA-NeMo

add-benchmark

Guide for adding a new benchmark or training environment to NeMo-Gym. Use when the user asks to add, create, or integrate a benchmark, evaluation, training environment, or resources server into NeMo-Gym. Also use when wrapping an existing third-party benchmark library. Covers the full workflow: data preparation, resources server implementation, agent wiring, YAML config, testing, and reward profiling (baselining). Triggered by: "add benchmark", "new resources server", "integrate benchmark", "wrap benchmark", "add training environment", "add eval".

NVIDIA-NeMo/Gym
18d ago
7220
@NVIDIA-NeMo

byob

Create custom LLM evaluation benchmarks using the BYOB decorator framework. Use when the user wants to (1) create a new benchmark from a dataset, (2) pick or write a scorer, (3) compile and run a BYOB benchmark, (4) containerize a benchmark, or (5) use LLM-as-Judge evaluation. Triggers on mentions of BYOB, custom benchmark, bring your own benchmark, scorer, or benchmark compilation.

NVIDIA-NeMo/Evaluator +2 more
18d ago
2290
@przeprogramowani

10x-eval-model

Set up and run benchmark evaluations for new LLM models in the 10xBench project. Use when the user wants to add a new model to the benchmark, prepare evaluation directories, update metadata, or launch evaluation runs. Triggers on phrases like "eval model", "add model to benchmark", "run benchmark for [model]", "evaluate [model-name]", "set up [model] for eval", or any request involving adding a new model to the Przeprogramowani.pl benchmark pipeline.

przeprogramowani/10x-bench +5 more
3d ago
50
@TenureAI

experiment-execution

Execute AI/ML experiments locally or remotely with environment, runtime, and logging controls. Prefer invoking via research-workflow. TRIGGER when: user asks to run/launch/start/resume/monitor a training job, evaluation, or benchmark, or a plan is ready for execution, or experiment needs rerun/recovery. DO NOT TRIGGER when: research investigation only (use deep-research), planning without execution (use research-plan), or env setup without launch (use project-context).

TenureAI/PhD-Zero +4 more
18d ago
390
@agentscope-ai

auto-arena

Automatically evaluate and compare multiple AI models or agents without pre-existing test data. Generates test queries from a task description, collects responses from all target endpoints, auto-generates evaluation rubrics, runs pairwise comparisons via a judge model, and produces win-rate rankings with reports and charts. Supports checkpoint resume, incremental endpoint addition, and judge model hot-swap. Use when the user asks to compare, benchmark, or rank multiple models or agents on a custom task, or run an arena-style evaluation.

agentscope-ai/OpenJudge +8 more
18d ago
4720
@AgriciDaniel

ads-apple

Apple Search Ads (ASA) deep analysis for mobile app advertisers. Evaluates campaign structure, bid health, Creative Sets, MMP attribution, budget pacing, TAP coverage (Today/Search/Product Pages), and goal CPA benchmarks by country. Use when the user says "Apple Search Ads", "ASA", "App Store ads", "Apple ads", or "Search Ads", or is advertising a mobile app on iOS.

AgriciDaniel/claude-ads +18 more
18d ago
9600
@pinchbench

pinchbench

Run PinchBench benchmarks to evaluate OpenClaw agent performance across real-world tasks. Use when testing model capabilities, comparing models, submitting benchmark results to the leaderboard, or checking how well your OpenClaw setup handles calendar, email, research, coding, and multi-step workflows.

pinchbench/skill
18d ago
5470