cuda-kernels
Provides guidance for writing and benchmarking optimized CUDA kernels for NVIDIA GPUs (H100, A100, T4) targeting the HuggingFace diffusers and transformers libraries. Supports models like LTX-Video, Stable Diffusion, LLaMA, Mistral, and Qwen. Integrates with the HuggingFace Kernels Hub (get_kernel) for loading pre-compiled kernels, and includes benchmarking scripts to compare kernel performance against baseline implementations.
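A minimal sketch of the get_kernel flow this entry mentions, timing a Kernels Hub kernel against a PyTorch baseline. The kernels-community/activation repo and its gelu_fast entry point follow the Kernels Hub examples; treat both as placeholders for whichever kernel you actually load:

```python
import torch
import torch.nn.functional as F
from kernels import get_kernel

# Load a pre-compiled kernel from the Hub (placeholder repo/function names).
activation = get_kernel("kernels-community/activation")

x = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
out = torch.empty_like(x)

def bench_ms(fn, iters=100):
    fn()  # warmup so setup cost isn't measured
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

baseline = bench_ms(lambda: F.gelu(x))
hub = bench_ms(lambda: activation.gelu_fast(out, x))
print(f"baseline {baseline:.3f} ms vs hub kernel {hub:.3f} ms")
```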
Io.Github.MetriLLM/Metrillm
Benchmark local LLM models — speed, quality & hardware fitness verdict from any MCP client
benchmark-translate
Run a quality benchmark of the /translate skill by selecting stratified test keys, capturing ground truth, translating, judging with sub-agents, and compiling a regression report. Invoke with /benchmark-translate.
perfup
Autonomous performance optimization: research, PoC, benchmark, implement, review, PR
foundry-testing
ALWAYS load before writing or modifying Foundry test files (.t.sol). Covers fuzz testing, gas benchmarks, naming conventions, and test patterns.
skill-forge-benchmark
Benchmark Claude Code skill performance with variance analysis, tracking pass rate, execution time, and token usage across iterations. Runs multiple trials per eval for statistical reliability, aggregates results into benchmark.json, and generates comparison reports between skill versions. Use when user says "benchmark skill", "measure skill performance", "skill metrics", "compare skill versions", "skill performance", "track skill improvement", "skill regression test", or "skill A/B test".
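As a rough sketch of the aggregation step described here (field names are illustrative assumptions, not the skill's actual benchmark.json schema):

```python
import json
import statistics

# Raw trials for one eval; multiple trials give the variance estimate.
trials = [
    {"passed": True, "seconds": 41.2, "tokens": 18300},
    {"passed": True, "seconds": 38.9, "tokens": 17950},
    {"passed": False, "seconds": 55.0, "tokens": 24100},
]

times = [t["seconds"] for t in trials]
summary = {
    "pass_rate": sum(t["passed"] for t in trials) / len(trials),
    "mean_seconds": statistics.mean(times),
    "stdev_seconds": statistics.stdev(times),  # run-to-run variance
    "mean_tokens": statistics.mean(t["tokens"] for t in trials),
}

with open("benchmark.json", "w") as f:
    json.dump(summary, f, indent=2)
```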
magpie
Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces. Creates kernel config YAMLs, discovers kernels in a project, and queries GPU specs. Use when the user mentions Magpie, kernel analyze or compare, HIP/CUDA kernel evaluation, vLLM/SGLang benchmark, gap analysis, TraceLens, creating kernel configs, or discovering GPU kernels.
ec2-cluster-provision
Uses the code in this repository to provision an EC2 cluster for benchmarking purposes.
skill-conductor
Create, edit, evaluate, and package agent skills. Use when building a new skill from scratch, improving an existing skill, running evals to test a skill, benchmarking skill performance, optimizing a skill's description for better triggering, reviewing third-party skills for quality, or packaging skills for distribution. Not for using skills or general coding tasks.
ATOM Pricing Intelligence
The Global Price Benchmark for AI Inference. 1,600+ SKUs, 40+ vendors, 14 price indexes.
add-example-doc-model-env
Adds example documentation for a new model or environment in RLinf (RST pages in the docs gallery for both English and Chinese). Use when adding a new embodied or reasoning example, or new benchmark (e.g. LIBERO, ManiSkill).
add-benchmark
Guide for adding a new benchmark or training environment to NeMo-Gym. Use when the user asks to add, create, or integrate a benchmark, evaluation, training environment, or resources server into NeMo-Gym. Also use when wrapping an existing 3rd-party benchmark library. Covers the full workflow: data preparation, resources server implementation, agent wiring, YAML config, testing, and reward profiling (baselining). Triggered by: "add benchmark", "new resources server", "integrate benchmark", "wrap benchmark", "add training environment", "add eval".
benchmark-context
Automatically benchmark your custom memory implementation against established systems like Supermemory. Set up a public benchmark, or create your own. Compare solutions on quality, latency, features, and cost with a simple UI and CLI.
Io.Github.Nuance Dev/Rival
Query AI model benchmarks, pricing, and comparisons from rival.tips
golang-benchmark
Golang benchmarking, profiling, and performance measurement. Use when writing, running, or comparing Go benchmarks, profiling hot paths with pprof, interpreting CPU/memory/trace profiles, analyzing results with benchstat, setting up CI benchmark regression detection, or investigating production performance with Prometheus runtime metrics. Also use when the developer needs deep analysis on a specific performance indicator - this skill provides the measurement methodology, while golang-performance provides the optimization patterns.
bls-oews-api
Query the BLS Occupational Employment and Wage Statistics (OEWS) API for market wage data by occupation, geography, and industry. Trigger for any mention of BLS, Bureau of Labor Statistics, OEWS, OES, occupational wages, market wages, salary data, wage percentiles, median wage, mean wage, labor market rates, SOC codes, or geographic wage differentials. Also trigger when the user needs to compare wages across metro areas, benchmark contractor labor rates against market data, support IGCE development with market wage research, or validate price proposals against BLS data. Complements the GSA CALC+ skill (ceiling rates from awarded contracts) by providing independent market wage data from employer surveys. Together they form a complete pricing toolkit - BLS OEWS for what the market pays, CALC+ for what GSA contractors charge.
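For orientation, a hedged sketch of the kind of request this skill wraps. The endpoint and payload shape are the public BLS timeseries API v2; the series ID is a placeholder, not a real OEWS series:

```python
import requests

payload = {
    "seriesid": ["OEU_PLACEHOLDER_SERIES_ID"],  # hypothetical OEWS series ID
    "startyear": "2023",
    "endyear": "2023",
    # "registrationkey": "YOUR_BLS_KEY",  # optional; raises daily query limits
}
resp = requests.post(
    "https://api.bls.gov/publicAPI/v2/timeseries/data/",
    json=payload,
    timeout=30,
)
resp.raise_for_status()
for series in resp.json()["Results"]["series"]:
    for point in series["data"]:
        print(series["seriesID"], point["year"], point["value"])
```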
ham
Set up and maintain Hierarchical Agent Memory (HAM) for Codex using scoped local memory files. Trigger on "go ham", "set up HAM", "ham commands", "ham help", "ham route", "ham remove", "ham update", "ham status", "ham benchmark", "ham baseline start", "ham baseline stop", "ham metrics clear", "HAM savings", "HAM stats", "HAM dashboard", "HAM sandwich", "HAM insights", "HAM carbon", or "HAM audit".
byob
Create custom LLM evaluation benchmarks using the BYOB decorator framework. Use when the user wants to (1) create a new benchmark from a dataset, (2) pick or write a scorer, (3) compile and run a BYOB benchmark, (4) containerize a benchmark, or (5) use LLM-as-Judge evaluation. Triggers on mentions of BYOB, custom benchmark, bring your own benchmark, scorer, or benchmark compilation.
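BYOB's real decorators aren't shown in this catalog, so the following is a self-contained toy illustrating the decorator-framework idea only; every name in it is invented for illustration and is not BYOB's API:

```python
from typing import Callable

BENCHMARKS: dict[str, dict] = {}

def benchmark(name: str, dataset: list[dict]):
    """Register a model-calling function as a benchmark over a dataset."""
    def wrap(run_model: Callable[[str], str]):
        BENCHMARKS[name] = {"dataset": dataset, "run": run_model}
        return run_model
    return wrap

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip() == reference.strip())

@benchmark("toy-qa", dataset=[{"input": "2+2?", "target": "4"}])
def run(prompt: str) -> str:
    return "4"  # stand-in for a real model call

def score(name: str, scorer: Callable = exact_match) -> float:
    b = BENCHMARKS[name]
    results = [scorer(b["run"](ex["input"]), ex["target"]) for ex in b["dataset"]]
    return sum(results) / len(results)

print(score("toy-qa"))  # 1.0
```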
Skill Score
One of the primary purposes of this package is to generate forecast methods as benchmarks.
build-and-compile
Building, compiling, and resolving dependency issues across languages
Pensiata - Bulgarian Pension Fund Analytics
Bulgarian pension fund analytics — NAV data, metrics, rankings, and benchmarks.
benchmark-model
Benchmark inference performance for a specific model
bench
Run benchmarks and analyze PDF parsing performance
10x-eval-model
Set up and run benchmark evaluations for new LLM models in the 10xBench project. Use when the user wants to add a new model to the benchmark, prepare evaluation directories, update metadata, or launch evaluation runs. Triggers on phrases like "eval model", "add model to benchmark", "run benchmark for [model]", "evaluate [model-name]", "set up [model] for eval", or any request involving adding a new model to the Przeprogramowani.pl benchmark pipeline.
experiment-execution
Execute AI/ML experiments locally or remotely with environment, runtime, and logging controls. Prefer invoking via research-workflow. TRIGGER when: user asks to run/launch/start/resume/monitor a training job, evaluation, or benchmark, or a plan is ready for execution, or experiment needs rerun/recovery. DO NOT TRIGGER when: research investigation only (use deep-research), planning without execution (use research-plan), or env setup without launch (use project-context).
best-practices
Industry best practices audit. Researches best practices (via WebSearch or agent-browser), analyzes current implementation, then uses /codex-brainstorm adversarial debate to confirm alignment. Outputs gap analysis report. Use when: auditing against industry standards, checking compliance with best practices, benchmarking implementation quality. Not for: code review (use /codex-review), architecture design (use /codex-architect).
Io.Github.Daichi Kudo/Llm Advisor
Real-time LLM/VLM benchmarks, pricing, and recommendations. 336+ models, 5 sources.
bench-fast
Run a benchmark with minimal iterations to generate assembly code via BenchmarkDotNet's DisassemblyDiagnoser. Use to generate one big assembly code file per benchmark.
knowledge-worker-salaries
Comprehensive global knowledge worker salary data with total market value calculations, sector breakdowns, geographic comparisons, and authoritative sources. USE WHEN discussing knowledge worker compensation, salary benchmarking, economic analysis of professional labor markets, or AI impact on wages.
Autoresearch: Autonomous Experiment Loop
An autonomous optimization loop where Claude edits code, runs a benchmark, measures a metric, and keeps improvements or reverts, repeating forever until stopped. Inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch) and [pi-autoresearch](https://github.com/davebcn87/pi-a
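The loop itself fits in a few lines. In this sketch, propose_edit and bench.py are placeholders (in autoresearch the editor is the agent itself), and the metric is assumed to be lower-is-better:

```python
import subprocess

def propose_edit() -> None:
    """Placeholder for the agent (or a mutation script) editing the code."""
    raise NotImplementedError("plug in the editing step")

def run_benchmark() -> float:
    """Run the benchmark script and parse the metric it prints."""
    out = subprocess.run(
        ["python", "bench.py"], capture_output=True, text=True, check=True
    )
    return float(out.stdout.strip())

best = run_benchmark()
while True:  # runs until interrupted, as the skill describes
    propose_edit()
    score = run_benchmark()
    if score < best:  # keep the improvement
        best = score
        subprocess.run(["git", "commit", "-am", f"metric {score}"], check=True)
    else:  # revert the edit
        subprocess.run(["git", "checkout", "--", "."], check=True)
```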
auto-arena
Automatically evaluate and compare multiple AI models or agents without pre-existing test data. Generates test queries from a task description, collects responses from all target endpoints, auto-generates evaluation rubrics, runs pairwise comparisons via a judge model, and produces win-rate rankings with reports and charts. Supports checkpoint resume, incremental endpoint addition, and judge model hot-swap. Use when the user asks to compare, benchmark, or rank multiple models or agents on a custom task, or run an arena-style evaluation.
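The win-rate step reduces to counting pairwise verdicts. A sketch under an assumed verdict format (auto-arena's actual schema may differ):

```python
from collections import Counter

# One judge verdict per (query, endpoint pair): the winning endpoint's name.
verdicts = [
    {"pair": ("model-a", "model-b"), "winner": "model-a"},
    {"pair": ("model-a", "model-c"), "winner": "model-c"},
    {"pair": ("model-b", "model-c"), "winner": "model-c"},
]

wins = Counter(v["winner"] for v in verdicts)
games: Counter = Counter()
for v in verdicts:
    games.update(v["pair"])  # both endpoints in the pair played one game

ranking = sorted(((m, wins[m] / games[m]) for m in games),
                 key=lambda x: x[1], reverse=True)
for model, rate in ranking:
    print(f"{model}: {rate:.0%} win rate")
```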
Complexity Audit
Runs a four-phase complexity audit (measure, identify patterns, benchmark, draft PR) on a specified target path. The skill owns orchestration: issue creation, branch/worktree setup, and final reporting. The `complexity-auditor` agent owns execution.
Add a benchmark scenario
`internal/bench/` is remindb's external-facing benchmark: it compares "how many tokens does an agent consume to do X via remindb tools" against "how many would they consume doing X with `grep` / `cat` / `find`". The output is a token-savings table rendered via `text/tabwriter`. It's invoked from
ads-apple
Apple Search Ads (ASA) deep analysis for mobile app advertisers. Evaluates campaign structure, bid health, Creative Sets, MMP attribution, budget pacing, TAP coverage (Today/Search/Product Pages), and goal CPA benchmarks by country. Use when user says "Apple Search Ads", "ASA", "App Store ads", "Apple ads", "Search Ads", or is advertising a mobile app on iOS.
codspeed-optimize
Autonomously optimize code for performance using CodSpeed benchmarks, flamegraph analysis, and iterative improvement. Use this skill whenever the user wants to make code faster, reduce CPU usage, optimize memory, improve throughput, find performance bottlenecks, or asks to 'optimize', 'speed up', 'make faster', 'reduce latency', 'improve performance', or points at a CodSpeed benchmark result wanting improvements. Also trigger when the user mentions a slow function, a regression, or wants to understand where time is spent in their code.
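For reference, a minimal pytest-codspeed style benchmark of the kind this skill iterates on; the function under test is an illustrative stand-in, and the marker usage follows the pytest-codspeed docs:

```python
import pytest

def slow_unique(items: list[int]) -> list[int]:
    # Deliberately quadratic; an optimization pass might use dict.fromkeys.
    out: list[int] = []
    for x in items:
        if x not in out:
            out.append(x)
    return out

@pytest.mark.benchmark  # measured by CodSpeed when pytest-codspeed is installed
def test_unique_bench():
    assert slow_unique(list(range(200)) * 5) == list(range(200))
```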
Io.Github.Rswcf/Deepviews
Free financial data: company analysis, DCF, comps, benchmarks, screening. No API key.
benchmarking-cairo
Use when profiling Cairo functions, measuring step counts, analyzing resource usage, generating call-graph PNGs, or launching pprof to visualize Cairo execution traces
ai-agent-bench
Use when the user wants to benchmark or compare AI agents (Claude Code, Codex, OpenCode) on a refactoring, perf, or code-change task in the current repo. Use when user says compare agents, benchmark Claude vs Codex, agent eval, measure agent, AI agent comparison, agent trial, /ai-agent-bench.
pinchbench
Run PinchBench benchmarks to evaluate OpenClaw agent performance across real-world tasks. Use when testing model capabilities, comparing models, submitting benchmark results to the leaderboard, or checking how well your OpenClaw setup handles calendar, email, research, coding, and multi-step workflows.
turbo-benchmark
Run performance benchmarks for TurboAPI. Use when testing performance, checking for regressions, or comparing against FastAPI.
Io.Github.Rog0x/Perf
Benchmark, memory, Big O analysis for AI agents
benchmark
Run scalex performance benchmarks, profiling, and timing analysis. Use this skill whenever the user asks to benchmark scalex, measure performance, profile index/query times, compare before/after performance of a change, investigate bottlenecks, or mentions "benchmark", "perf", "how fast", "timing", "hyperfine", "profile", "flame graph", "profiling", "--timings", "slow", "bottleneck", "regression", "memory", "heap", "GC", "allocation". Also use proactively after implementing performance improvements to verify gains. Covers 6 layers: built-in --timings, hyperfine benchmarks, async-profiler flame graphs, JFR recording, microbenchmarks, and memory profiling.
benchmark-docs
Manages benchmark documentation across multiple platforms. Use when updating benchmark results, adding performance data, or documenting jq comparison benchmarks. Triggers on terms like "benchmark", "performance", "jq comparison", "benchmark results", "update benchmarks".