cuda-kernels
Provides guidance for writing and benchmarking optimized CUDA kernels for NVIDIA GPUs (H100, A100, T4) targeting the HuggingFace diffusers and transformers libraries. Supports models like LTX-Video, Stable Diffusion, LLaMA, Mistral, and Qwen. Integrates with the HuggingFace Kernels Hub (get_kernel) for loading pre-compiled kernels, and includes benchmarking scripts to compare kernel performance against baseline implementations.
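A minimal sketch of the get_kernel flow this entry mentions, timing a Kernels Hub kernel against a PyTorch baseline. The kernels-community/activation repo and its gelu_fast entry point follow the Kernels Hub examples; treat both as placeholders for whichever kernel you actually load:

```python
import torch
import torch.nn.functional as F
from kernels import get_kernel

# Load a pre-compiled kernel from the Hub (placeholder repo/function names).
activation = get_kernel("kernels-community/activation")

x = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
out = torch.empty_like(x)

def bench_ms(fn, iters=100):
    fn()  # warmup so setup cost isn't measured
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

baseline = bench_ms(lambda: F.gelu(x))
hub = bench_ms(lambda: activation.gelu_fast(out, x))
print(f"baseline {baseline:.3f} ms vs hub kernel {hub:.3f} ms")
```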
Io.Github.MetriLLM/Metrillm
Benchmark local LLM models — speed, quality & hardware fitness verdict from any MCP client
benchmark-translate
Run a quality benchmark of the /translate skill by selecting stratified test keys, capturing ground truth, translating, judging with sub-agents, and compiling a regression report. Invoke with /benchmark-translate.
perfup
Autonomous performance optimization: research, PoC, benchmark, implement, review, PR
foundry-testing
ALWAYS load before writing or modifying Foundry test files (.t.sol). Covers fuzz testing, gas benchmarks, naming conventions, and test patterns.
skill-forge-benchmark
Benchmark Claude Code skill performance with variance analysis, tracking pass rate, execution time, and token usage across iterations. Runs multiple trials per eval for statistical reliability, aggregates results into benchmark.json, and generates comparison reports between skill versions. Use when user says "benchmark skill", "measure skill performance", "skill metrics", "compare skill versions", "skill performance", "track skill improvement", "skill regression test", or "skill A/B test".
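As a rough sketch of the aggregation step described here (field names are illustrative assumptions, not the skill's actual benchmark.json schema):

```python
import json
import statistics

# Raw trials for one eval; multiple trials give the variance estimate.
trials = [
    {"passed": True, "seconds": 41.2, "tokens": 18300},
    {"passed": True, "seconds": 38.9, "tokens": 17950},
    {"passed": False, "seconds": 55.0, "tokens": 24100},
]

times = [t["seconds"] for t in trials]
summary = {
    "pass_rate": sum(t["passed"] for t in trials) / len(trials),
    "mean_seconds": statistics.mean(times),
    "stdev_seconds": statistics.stdev(times),  # run-to-run variance
    "mean_tokens": statistics.mean(t["tokens"] for t in trials),
}

with open("benchmark.json", "w") as f:
    json.dump(summary, f, indent=2)
```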
magpie
Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces. Creates kernel config YAMLs, discovers kernels in a project, and queries GPU specs. Use when the user mentions Magpie, kernel analyze or compare, HIP/CUDA kernel evaluation, vLLM/SGLang benchmark, gap analysis, TraceLens, creating kernel configs, or discovering GPU kernels.
ec2-cluster-provision
Uses the code in this repository to provision an EC2 cluster for benchmarking purposes.
skill-conductor
Create, edit, evaluate, and package agent skills. Use when building a new skill from scratch, improving an existing skill, running evals to test a skill, benchmarking skill performance, optimizing a skill's description for better triggering, reviewing third-party skills for quality, or packaging skills for distribution. Not for using skills or general coding tasks.
ATOM Pricing Intelligence
The Global Price Benchmark for AI Inference. 1,600+ SKUs, 40+ vendors, 14 price indexes.
add-example-doc-model-env
Adds example documentation for a new model or environment in RLinf (RST pages in the docs gallery for both English and Chinese). Use when adding a new embodied or reasoning example, or new benchmark (e.g. LIBERO, ManiSkill).
add-benchmark
Guide for adding a new benchmark or training environment to NeMo-Gym. Use when the user asks to add, create, or integrate a benchmark, evaluation, training environment, or resources server into NeMo-Gym. Also use when wrapping an existing 3rd-party benchmark library. Covers the full workflow: data preparation, resources server implementation, agent wiring, YAML config, testing, and reward profiling (baselining). Triggered by: "add benchmark", "new resources server", "integrate benchmark", "wrap benchmark", "add training environment", "add eval".
benchmark-context
Automatically benchmark your custom memory implementation against established systems like Supermemory. Set up a public benchmark, or create your own. Compare solutions on quality, latency, features, and cost with a simple UI and CLI.
Io.Github.Nuance Dev/Rival
Query AI model benchmarks, pricing, and comparisons from rival.tips
golang-benchmark
Golang benchmarking, profiling, and performance measurement. Use when writing, running, or comparing Go benchmarks, profiling hot paths with pprof, interpreting CPU/memory/trace profiles, analyzing results with benchstat, setting up CI benchmark regression detection, or investigating production performance with Prometheus runtime metrics. Also use when the developer needs deep analysis on a specific performance indicator - this skill provides the measurement methodology, while golang-performance provides the optimization patterns.
bls-oews-api
Query the BLS Occupational Employment and Wage Statistics (OEWS) API for market wage data by occupation, geography, and industry. Trigger for any mention of BLS, Bureau of Labor Statistics, OEWS, OES, occupational wages, market wages, salary data, wage percentiles, median wage, mean wage, labor market rates, SOC codes, or geographic wage differentials. Also trigger when the user needs to compare wages across metro areas, benchmark contractor labor rates against market data, support IGCE development with market wage research, or validate price proposals against BLS data. Complements the GSA CALC+ skill (ceiling rates from awarded contracts) by providing independent market wage data from employer surveys. Together they form a complete pricing toolkit - BLS OEWS for what the market pays, CALC+ for what GSA contractors charge.
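For orientation, a hedged sketch of the kind of request this skill wraps. The endpoint and payload shape are the public BLS timeseries API v2; the series ID is a placeholder, not a real OEWS series:

```python
import requests

payload = {
    "seriesid": ["OEU_PLACEHOLDER_SERIES_ID"],  # hypothetical OEWS series ID
    "startyear": "2023",
    "endyear": "2023",
    # "registrationkey": "YOUR_BLS_KEY",  # optional; raises daily query limits
}
resp = requests.post(
    "https://api.bls.gov/publicAPI/v2/timeseries/data/",
    json=payload,
    timeout=30,
)
resp.raise_for_status()
for series in resp.json()["Results"]["series"]:
    for point in series["data"]:
        print(series["seriesID"], point["year"], point["value"])
```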
ham
Set up and maintain Hierarchical Agent Memory (HAM) for Codex using scoped local memory files. Trigger on "go ham", "set up HAM", "ham commands", "ham help", "ham route", "ham remove", "ham update", "ham status", "ham benchmark", "ham baseline start", "ham baseline stop", "ham metrics clear", "HAM savings", "HAM stats", "HAM dashboard", "HAM sandwich", "HAM insights", "HAM carbon", or "HAM audit".
byob
Create custom LLM evaluation benchmarks using the BYOB decorator framework. Use when the user wants to (1) create a new benchmark from a dataset, (2) pick or write a scorer, (3) compile and run a BYOB benchmark, (4) containerize a benchmark, or (5) use LLM-as-Judge evaluation. Triggers on mentions of BYOB, custom benchmark, bring your own benchmark, scorer, or benchmark compilation.
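BYOB's real decorators aren't shown in this catalog, so the following is a self-contained toy illustrating the decorator-framework idea only; every name in it is invented for illustration and is not BYOB's API:

```python
from typing import Callable

BENCHMARKS: dict[str, dict] = {}

def benchmark(name: str, dataset: list[dict]):
    """Register a model-calling function as a benchmark over a dataset."""
    def wrap(run_model: Callable[[str], str]):
        BENCHMARKS[name] = {"dataset": dataset, "run": run_model}
        return run_model
    return wrap

def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip() == reference.strip())

@benchmark("toy-qa", dataset=[{"input": "2+2?", "target": "4"}])
def run(prompt: str) -> str:
    return "4"  # stand-in for a real model call

def score(name: str, scorer: Callable = exact_match) -> float:
    b = BENCHMARKS[name]
    results = [scorer(b["run"](ex["input"]), ex["target"]) for ex in b["dataset"]]
    return sum(results) / len(results)

print(score("toy-qa"))  # 1.0
```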
Skill Score
One of the primary purposes of this package is to generate forecast methods as benchmarks.
build-and-compile
Building, compiling, and resolving dependency issues across languages
Pensiata - Bulgarian Pension Fund Analytics
Bulgarian pension fund analytics — NAV data, metrics, rankings, and benchmarks.
benchmark-model
Benchmark inference performance for a specific model
bench
Run benchmarks and analyze PDF parsing performance
10x-eval-model
Set up and run benchmark evaluations for new LLM models in the 10xBench project. Use when the user wants to add a new model to the benchmark, prepare evaluation directories, update metadata, or launch evaluation runs. Triggers on phrases like "eval model", "add model to benchmark", "run benchmark for [model]", "evaluate [model-name]", "set up [model] for eval", or any request involving adding a new model to the Przeprogramowani.pl benchmark pipeline.
experiment-execution
Execute AI/ML experiments locally or remotely with environment, runtime, and logging controls. Prefer invoking via research-workflow. TRIGGER when: user asks to run/launch/start/resume/monitor a training job, evaluation, or benchmark, or a plan is ready for execution, or experiment needs rerun/recovery. DO NOT TRIGGER when: research investigation only (use deep-research), planning without execution (use research-plan), or env setup without launch (use project-context).
best-practices
Industry best practices audit. Researches best practices (via WebSearch or agent-browser), analyzes current implementation, then uses /codex-brainstorm adversarial debate to confirm alignment. Outputs gap analysis report. Use when: auditing against industry standards, checking compliance with best practices, benchmarking implementation quality. Not for: code review (use /codex-review), architecture design (use /codex-architect).
Io.Github.Daichi Kudo/Llm Advisor
Real-time LLM/VLM benchmarks, pricing, and recommendations. 336+ models, 5 sources.
bench-fast
Run a benchmark with minimal iterations to generate assembly code via BenchmarkDotNet's DisassemblyDiagnoser. Use to generate one big assembly code file per benchmark.
knowledge-worker-salaries
Comprehensive global knowledge worker salary data with total market value calculations, sector breakdowns, geographic comparisons, and authoritative sources. USE WHEN discussing knowledge worker compensation, salary benchmarking, economic analysis of professional labor markets, or AI impact on wages.
Autoresearch: Autonomous Experiment Loop
An autonomous optimization loop where Claude edits code, runs a benchmark, measures a metric, and keeps improvements or reverts, repeating forever until stopped. Inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch) and [pi-autoresearch](https://github.com/davebcn87/pi-a
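The loop itself fits in a few lines. In this sketch, propose_edit and bench.py are placeholders (in autoresearch the editor is the agent itself), and the metric is assumed to be lower-is-better:

```python
import subprocess

def propose_edit() -> None:
    """Placeholder for the agent (or a mutation script) editing the code."""
    raise NotImplementedError("plug in the editing step")

def run_benchmark() -> float:
    """Run the benchmark script and parse the metric it prints."""
    out = subprocess.run(
        ["python", "bench.py"], capture_output=True, text=True, check=True
    )
    return float(out.stdout.strip())

best = run_benchmark()
while True:  # runs until interrupted, as the skill describes
    propose_edit()
    score = run_benchmark()
    if score < best:  # keep the improvement
        best = score
        subprocess.run(["git", "commit", "-am", f"metric {score}"], check=True)
    else:  # revert the edit
        subprocess.run(["git", "checkout", "--", "."], check=True)
```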
auto-arena
Automatically evaluate and compare multiple AI models or agents without pre-existing test data. Generates test queries from a task description, collects responses from all target endpoints, auto-generates evaluation rubrics, runs pairwise comparisons via a judge model, and produces win-rate rankings with reports and charts. Supports checkpoint resume, incremental endpoint addition, and judge model hot-swap. Use when the user asks to compare, benchmark, or rank multiple models or agents on a custom task, or run an arena-style evaluation.
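The win-rate step reduces to counting pairwise verdicts. A sketch under an assumed verdict format (auto-arena's actual schema may differ):

```python
from collections import Counter

# One judge verdict per (query, endpoint pair): the winning endpoint's name.
verdicts = [
    {"pair": ("model-a", "model-b"), "winner": "model-a"},
    {"pair": ("model-a", "model-c"), "winner": "model-c"},
    {"pair": ("model-b", "model-c"), "winner": "model-c"},
]

wins = Counter(v["winner"] for v in verdicts)
games: Counter = Counter()
for v in verdicts:
    games.update(v["pair"])  # both endpoints in the pair played one game

ranking = sorted(((m, wins[m] / games[m]) for m in games),
                 key=lambda x: x[1], reverse=True)
for model, rate in ranking:
    print(f"{model}: {rate:.0%} win rate")
```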
Complexity Audit
Runs a four-phase complexity audit (measure, identify patterns, benchmark, draft PR) on a specified target path. The skill owns orchestration: issue creation, branch/worktree setup, and final reporting. The `complexity-auditor` agent owns execution.
Add a benchmark scenario
`internal/bench/` is remindb's external-facing benchmark: it compares "how many tokens does an agent consume to do X via remindb tools" against "how many would they consume doing X with `grep` / `cat` / `find`". The output is a token-savings table rendered via `text/tabwriter`. It's invoked from
ads-apple
Apple Search Ads (ASA) deep analysis for mobile app advertisers. Evaluates campaign structure, bid health, Creative Sets, MMP attribution, budget pacing, TAP coverage (Today/Search/Product Pages), and goal CPA benchmarks by country. Use when user says "Apple Search Ads", "ASA", "App Store ads", "Apple ads", "Search Ads", or is advertising a mobile app on iOS.
codspeed-optimize
Autonomously optimize code for performance using CodSpeed benchmarks, flamegraph analysis, and iterative improvement. Use this skill whenever the user wants to make code faster, reduce CPU usage, optimize memory, improve throughput, find performance bottlenecks, or asks to 'optimize', 'speed up', 'make faster', 'reduce latency', 'improve performance', or points at a CodSpeed benchmark result wanting improvements. Also trigger when the user mentions a slow function, a regression, or wants to understand where time is spent in their code.
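For reference, a minimal pytest-codspeed style benchmark of the kind this skill iterates on; the function under test is an illustrative stand-in, and the marker usage follows the pytest-codspeed docs:

```python
import pytest

def slow_unique(items: list[int]) -> list[int]:
    # Deliberately quadratic; an optimization pass might use dict.fromkeys.
    out: list[int] = []
    for x in items:
        if x not in out:
            out.append(x)
    return out

@pytest.mark.benchmark  # measured by CodSpeed when pytest-codspeed is installed
def test_unique_bench():
    assert slow_unique(list(range(200)) * 5) == list(range(200))
```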
Io.Github.Rswcf/Deepviews
Free financial data: company analysis, DCF, comps, benchmarks, screening. No API key.
benchmarking-cairo
Use when profiling Cairo functions, measuring step counts, analyzing resource usage, generating call-graph PNGs, or launching pprof to visualize Cairo execution traces
ai-agent-bench
Use when the user wants to benchmark or compare AI agents (Claude Code, Codex, OpenCode) on a refactoring, perf, or code-change task in the current repo. Use when user says compare agents, benchmark Claude vs Codex, agent eval, measure agent, AI agent comparison, agent trial, /ai-agent-bench.
pinchbench
Run PinchBench benchmarks to evaluate OpenClaw agent performance across real-world tasks. Use when testing model capabilities, comparing models, submitting benchmark results to the leaderboard, or checking how well your OpenClaw setup handles calendar, email, research, coding, and multi-step workflows.
turbo-benchmark
Run performance benchmarks for TurboAPI. Use when testing performance, checking for regressions, or comparing against FastAPI.
Io.Github.Rog0x/Perf
Benchmark, memory, Big O analysis for AI agents
benchmark
Run scalex performance benchmarks, profiling, and timing analysis. Use this skill whenever the user asks to benchmark scalex, measure performance, profile index/query times, compare before/after performance of a change, investigate bottlenecks, or mentions "benchmark", "perf", "how fast", "timing", "hyperfine", "profile", "flame graph", "profiling", "--timings", "slow", "bottleneck", "regression", "memory", "heap", "GC", "allocation". Also use proactively after implementing performance improvements to verify gains. Covers 6 layers: built-in --timings, hyperfine benchmarks, async-profiler flame graphs, JFR recording, microbenchmarks, and memory profiling.
benchmark-docs
Manages benchmark documentation across multiple platforms. Use when updating benchmark results, adding performance data, or documenting jq comparison benchmarks. Triggers on terms like "benchmark", "performance", "jq comparison", "benchmark results", "update benchmarks".