Benchmarking

Skills tagged with #Benchmarking

@huggingface

cuda-kernels

Provides guidance for writing and benchmarking optimized CUDA kernels for NVIDIA GPUs (H100, A100, T4) targeting HuggingFace diffusers and transformers libraries. Supports models like LTX-Video, Stable Diffusion, LLaMA, Mistral, and Qwen. Includes integration with HuggingFace Kernels Hub (get_kernel) for loading pre-compiled kernels. Includes benchmarking scripts to compare kernel performance against baseline implementations.

huggingface/kernels
18d ago
4980
@MetriLLM
MCP

io.github.MetriLLM/metrillm

Benchmark local LLM models — speed, quality & hardware fitness verdict from any MCP client

mcp · github · llm
MetriLLM/metrillm
19d ago
0
@shapeshift

benchmark-translate

Run a quality benchmark of the /translate skill by selecting stratified test keys, capturing ground truth, translating, judging with sub-agents, and compiling a regression report. Invoke with /benchmark-translate.

shapeshift/web+4 more
18d ago
2010
@raullenchai

perfup

Autonomous performance optimization: research, PoC, benchmark, implement, review, PR

raullenchai/Rapid-MLX
3d ago
1.6K · 0
@towns-protocol

foundry-testing

ALWAYS load before writing or modifying Foundry test files (.t.sol). Covers fuzz testing, gas benchmarks, naming conventions, and test patterns.

towns-protocol/towns+2 more
18d ago
660
@AgriciDaniel

skill-forge-benchmark

Benchmark Claude Code skill performance with variance analysis, tracking pass rate, execution time, and token usage across iterations. Runs multiple trials per eval for statistical reliability, aggregates results into benchmark.json, and generates comparison reports between skill versions. Use when user says "benchmark skill", "measure skill performance", "skill metrics", "compare skill versions", "skill performance", "track skill improvement", "skill regression test", or "skill A/B test".

AgriciDaniel/skill-forge+6 more
18d ago
420
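The skill-forge-benchmark entry above aggregates multiple trials into pass rate, timing variance, and token usage. A minimal sketch of that kind of aggregation, assuming made-up field names and a hypothetical `benchmark.json` layout (not the skill's actual schema):

```python
import json
import statistics

# Hypothetical trial results: each trial records pass/fail, wall time, and tokens.
trials = [
    {"passed": True,  "seconds": 12.4, "tokens": 8100},
    {"passed": True,  "seconds": 11.9, "tokens": 7900},
    {"passed": False, "seconds": 15.2, "tokens": 9400},
    {"passed": True,  "seconds": 12.8, "tokens": 8300},
]

def aggregate(trials):
    times = [t["seconds"] for t in trials]
    return {
        "trials": len(trials),
        # fraction of trials that passed
        "pass_rate": sum(t["passed"] for t in trials) / len(trials),
        "mean_seconds": statistics.mean(times),
        # run-to-run variance is why multiple trials per eval matter
        "stdev_seconds": statistics.stdev(times),
        "mean_tokens": statistics.mean(t["tokens"] for t in trials),
    }

summary = aggregate(trials)
with open("benchmark.json", "w") as f:
    json.dump(summary, f, indent=2)
print(summary["pass_rate"])  # 0.75
```

Comparing two skill versions then reduces to diffing two such summaries, with the stdev indicating whether a pass-rate delta is within noise.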
@AMD-AGI

magpie

Performs GPU kernel correctness and performance evaluation and LLM inference benchmarking with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and runs gap analysis on torch traces. Creates kernel config YAMLs, discovers kernels in a project, and queries GPU specs. Use when the user mentions Magpie, kernel analyze or compare, HIP/CUDA kernel evaluation, vLLM/SGLang benchmark, gap analysis, TraceLens, creating kernel configs, or discovering GPU kernels.

AMD-AGI/Magpie
18d ago
450
@datafusion-contrib

ec2-cluster-provision

Uses the code in this repository to provision an EC2 cluster for benchmarking.

datafusion-contrib/datafusion-distributed+1 more
19d ago
710
@smixs

skill-conductor

Create, edit, evaluate, and package agent skills. Use when building a new skill from scratch, improving an existing skill, running evals to test a skill, benchmarking skill performance, optimizing a skill's description for better triggering, reviewing third-party skills for quality, or packaging skills for distribution. Not for using skills or general coding tasks.

smixs/skill-conductor
18d ago
430
@A7OM-AI
MCP

ATOM Pricing Intelligence

The Global Price Benchmark for AI Inference. 1,600+ SKUs, 40+ vendors, 14 price indexes.

mcp · github · ai
A7OM-AI/atom-mcp-server
19d ago
0
@RLinf

add-example-doc-model-env

Adds example documentation for a new model or environment in RLinf (RST pages in the docs gallery for both English and Chinese). Use when adding a new embodied or reasoning example, or new benchmark (e.g. LIBERO, ManiSkill).

RLinf/RLinf+3 more
19d ago
2.8K · 0
@NVIDIA-NeMo

add-benchmark

Guide for adding a new benchmark or training environment to NeMo-Gym. Use when the user asks to add, create, or integrate a benchmark, evaluation, training environment, or resources server into NeMo-Gym. Also use when wrapping an existing 3rd-party benchmark library. Covers the full workflow: data preparation, resources server implementation, agent wiring, YAML config, testing, and reward profiling (baselining). Triggered by: "add benchmark", "new resources server", "integrate benchmark", "wrap benchmark", "add training environment", "add eval".

NVIDIA-NeMo/Gym
18d ago
7220
@supermemoryai

benchmark-context

Automatically benchmark your custom memory implementation against established systems like Supermemory. Set up a public benchmark, or create your own. Compare solutions against quality, latency, features and cost, easily, with a simple UI and CLI.

supermemoryai/memorybench
18d ago
1930
@nuance-dev
MCP

io.github.nuance-dev/rival

Query AI model benchmarks, pricing, and comparisons from rival.tips

mcp · github · ai
nuance-dev/rival-mcp
19d ago
0
@samber

golang-benchmark

Golang benchmarking, profiling, and performance measurement. Use when writing, running, or comparing Go benchmarks, profiling hot paths with pprof, interpreting CPU/memory/trace profiles, analyzing results with benchstat, setting up CI benchmark regression detection, or investigating production performance with Prometheus runtime metrics. Also use when the developer needs deep analysis on a specific performance indicator - this skill provides the measurement methodology, while golang-performance provides the optimization patterns.

samber/cc-skills-golang+30 more
8d ago
120
@1102tools

bls-oews-api

Query the BLS Occupational Employment and Wage Statistics (OEWS) API for market wage data by occupation, geography, and industry. Trigger for any mention of BLS, Bureau of Labor Statistics, OEWS, OES, occupational wages, market wages, salary data, wage percentiles, median wage, mean wage, labor market rates, SOC codes, or geographic wage differentials. Also trigger when the user needs to compare wages across metro areas, benchmark contractor labor rates against market data, support IGCE development with market wage research, or validate price proposals against BLS data. Complements the GSA CALC+ skill (ceiling rates from awarded contracts) by providing independent market wage data from employer surveys. Together they form a complete pricing toolkit - BLS OEWS for what the market pays, CALC+ for what GSA contractors charge.

1102tools/federal-contracting-skills+19 more
18d ago
60
@kromahlusenii-ops

ham

Set up and maintain Hierarchical Agent Memory (HAM) for Codex using scoped local memory files. Trigger on "go ham", "set up HAM", "ham commands", "ham help", "ham route", "ham remove", "ham update", "ham status", "ham benchmark", "ham baseline start", "ham baseline stop", "ham metrics clear", "HAM savings", "HAM stats", "HAM dashboard", "HAM sandwich", "HAM insights", "HAM carbon", or "HAM audit".

kromahlusenii-ops/ham+1 more
18d ago
280
@NVIDIA-NeMo

byob

Create custom LLM evaluation benchmarks using the BYOB decorator framework. Use when the user wants to (1) create a new benchmark from a dataset, (2) pick or write a scorer, (3) compile and run a BYOB benchmark, (4) containerize a benchmark, or (5) use LLM-as-Judge evaluation. Triggers on mentions of BYOB, custom benchmark, bring your own benchmark, scorer, or benchmark compilation.

NVIDIA-NeMo/Evaluator+2 more
18d ago
2290
@alexhallam

Skill Score

A primary purpose of this package is to generate forecast methods that serve as benchmarks.

alexhallam/tablespoon
18d ago
410
@vstorm-co

build-and-compile

Building, compiling, and resolving dependency issues across languages

build · compile · dependencies · benchmark
vstorm-co/pydantic-deepagents+8 more
18d ago
4510
@nvasilevv
MCP

Pensiata - Bulgarian Pension Fund Analytics

Bulgarian pension fund analytics — NAV data, metrics, rankings, and benchmarks.

mcp · github
nvasilevv/pensiata-mcp
19d ago
0
@trymirai

benchmark-model

Benchmark inference performance for a specific model

trymirai/uzu+5 more
18d ago
1.5K · 0
@opendataloader-project

bench

Run benchmark and analyze PDF parsing performance

opendataloader-project/opendataloader-pdf+1 more
19d ago
2.0K · 0
@przeprogramowani

10x-eval-model

Set up and run benchmark evaluations for new LLM models in the 10xBench project. Use when the user wants to add a new model to the benchmark, prepare evaluation directories, update metadata, or launch evaluation runs. Triggers on phrases like "eval model", "add model to benchmark", "run benchmark for [model]", "evaluate [model-name]", "set up [model] for eval", or any request involving adding a new model to the Przeprogramowani.pl benchmark pipeline.

przeprogramowani/10x-bench+5 more
3d ago
50
@TenureAI

experiment-execution

Execute AI/ML experiments locally or remotely with environment, runtime, and logging controls. Prefer invoking via research-workflow. TRIGGER when: user asks to run/launch/start/resume/monitor a training job, evaluation, or benchmark, or a plan is ready for execution, or experiment needs rerun/recovery. DO NOT TRIGGER when: research investigation only (use deep-research), planning without execution (use research-plan), or env setup without launch (use project-context).

TenureAI/PhD-Zero+4 more
18d ago
390
@sd0xdev

best-practices

Industry best practices audit. Researches best practices (via WebSearch or agent-browser), analyzes current implementation, then uses /codex-brainstorm adversarial debate to confirm alignment. Outputs gap analysis report. Use when: auditing against industry standards, checking compliance with best practices, benchmarking implementation quality. Not for: code review (use /codex-review), architecture design (use /codex-architect).

sd0xdev/sd0x-dev-flow+72 more
19d ago
890
@Daichi-Kudo
MCP

io.github.daichi-kudo/llm-advisor

Real-time LLM/VLM benchmarks, pricing, and recommendations. 336+ models, 5 sources.

mcp · github · ai · llm
Daichi-Kudo/llm-advisor-mcp
19d ago
0
@bitfaster

bench-fast

Run a benchmark with minimal iterations to generate assembly code via BenchmarkDotNet's DisassemblyDiagnoser. Use to generate one big assembly code file per benchmark.

bitfaster/BitFaster.Caching+2 more
18d ago
5880
@danielmiessler

knowledge-worker-salaries

Comprehensive global knowledge worker salary data with total market value calculations, sector breakdowns, geographic comparisons, and authoritative sources. USE WHEN discussing knowledge worker compensation, salary benchmarking, economic analysis of professional labor markets, or AI impact on wages.

danielmiessler/Substrate
18d ago
7610
@proyecto26

Autoresearch: Autonomous Experiment Loop

An autonomous optimization loop where Claude edits code, runs a benchmark, measures a metric, and keeps improvements or reverts — repeating forever until stopped. Inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch) and [pi-autoresearch](https://github.com/davebcn87/pi-a

proyecto26/autoresearch-ai-plugin+1 more
18d ago
50
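The Autoresearch entry above describes an accept/revert loop: edit, benchmark, keep the change if the metric improves, otherwise discard it. A toy sketch of that loop, where `mutate` and `measure` are stand-ins for "edit the code" and "run the benchmark" (not part of any real API):

```python
import random

random.seed(0)

def mutate(params):
    # stand-in for "Claude edits code": small random perturbation
    return [p + random.uniform(-0.1, 0.1) for p in params]

def measure(params):
    # stand-in benchmark metric: lower is better
    return sum(p * p for p in params)

params = [1.0, -1.0]
best = measure(params)
for _ in range(200):
    candidate = mutate(params)
    score = measure(candidate)
    if score < best:
        # keep the improvement
        params, best = candidate, score
    # otherwise revert by simply discarding the candidate
print(best < 2.0)  # True: the loop improves on the initial score of 2.0
```

The real skill runs this indefinitely until stopped; the essential invariant is the same, since the current state is only ever replaced by a measured improvement.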
@agentscope-ai

auto-arena

Automatically evaluate and compare multiple AI models or agents without pre-existing test data. Generates test queries from a task description, collects responses from all target endpoints, auto-generates evaluation rubrics, runs pairwise comparisons via a judge model, and produces win-rate rankings with reports and charts. Supports checkpoint resume, incremental endpoint addition, and judge model hot-swap. Use when the user asks to compare, benchmark, or rank multiple models or agents on a custom task, or run an arena-style evaluation.

agentscope-ai/OpenJudge+8 more
18d ago
4720
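The auto-arena entry above turns pairwise judge verdicts into win-rate rankings. A minimal sketch of that last step, using made-up model names and verdicts (a real arena run would collect these from a judge model over generated queries):

```python
from collections import defaultdict

# Each verdict: (model_a, model_b, winner as judged).
verdicts = [
    ("gpt-x",   "llama-y",   "gpt-x"),
    ("gpt-x",   "mistral-z", "gpt-x"),
    ("llama-y", "mistral-z", "llama-y"),
    ("gpt-x",   "llama-y",   "llama-y"),
]

def win_rates(verdicts):
    wins, games = defaultdict(int), defaultdict(int)
    for a, b, winner in verdicts:
        games[a] += 1
        games[b] += 1
        wins[winner] += 1
    return {m: wins[m] / games[m] for m in games}

# Rank models by win rate, best first.
ranking = sorted(win_rates(verdicts).items(), key=lambda kv: -kv[1])
for model, rate in ranking:
    print(f"{model}: {rate:.2f}")
```

Checkpoint resume and incremental endpoint addition fall out naturally from this shape: new verdicts just append to the table and the rates are recomputed.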
@ruska-ai

Complexity Audit

Runs a four-phase complexity audit (measure, identify patterns, benchmark, draft PR) on a specified target path. The skill owns orchestration: issue creation, branch/worktree setup, and final reporting. The `complexity-auditor` agent owns execution.

ruska-ai/orchestra+5 more
18d ago
120
@radimsem

Add a benchmark scenario

`internal/bench/` is remindb's external-facing benchmark — it compares "how many tokens does an agent consume to do X via remindb tools" against "how many would they consume doing X with `grep` / `cat` / `find`". The output is a token-savings table rendered via `text/tabwriter`. It's invoked from

radimsem/remindb+4 more
1h ago
460
@AgriciDaniel

ads-apple

Apple Search Ads (ASA) deep analysis for mobile app advertisers. Evaluates campaign structure, bid health, Creative Sets, MMP attribution, budget pacing, TAP coverage (Today/Search/Product Pages), and goal CPA benchmarks by country. Use when user says "Apple Search Ads", "ASA", "App Store ads", "Apple ads", "Search Ads", or is advertising a mobile app on iOS.

AgriciDaniel/claude-ads+18 more
18d ago
9600
@CodSpeedHQ

codspeed-optimize

Autonomously optimize code for performance using CodSpeed benchmarks, flamegraph analysis, and iterative improvement. Use this skill whenever the user wants to make code faster, reduce CPU usage, optimize memory, improve throughput, find performance bottlenecks, or asks to 'optimize', 'speed up', 'make faster', 'reduce latency', 'improve performance', or points at a CodSpeed benchmark result wanting improvements. Also trigger when the user mentions a slow function, a regression, or wants to understand where time is spent in their code.

CodSpeedHQ/codspeed+1 more
19d ago
1060
@Rswcf
MCP

io.github.rswcf/deepviews

Free financial data: company analysis, DCF, comps, benchmarks, screening. No API key.

mcp · github · api
Rswcf/v0-valuation-tool
19d ago
0
@keep-starknet-strange

benchmarking-cairo

Use when profiling Cairo functions, measuring step counts, analyzing resource usage, generating call-graph PNGs, or launching pprof to visualize Cairo execution traces

keep-starknet-strange/garaga+1 more
18d ago
2570
@reidemeister94

ai-agent-bench

Use when the user wants to benchmark or compare AI agents (Claude Code, Codex, OpenCode) on a refactoring, perf, or code-change task in the current repo. Use when user says compare agents, benchmark Claude vs Codex, agent eval, measure agent, AI agent comparison, agent trial, /ai-agent-bench.

reidemeister94/development-skills+15 more
10d ago
90
@pinchbench

pinchbench

Run PinchBench benchmarks to evaluate OpenClaw agent performance across real-world tasks. Use when testing model capabilities, comparing models, submitting benchmark results to the leaderboard, or checking how well your OpenClaw setup handles calendar, email, research, coding, and multi-step workflows.

pinchbench/skill
18d ago
5470
@justrach

turbo-benchmark

Run performance benchmarks for TurboAPI. Use when testing performance, checking for regressions, or comparing against FastAPI.

justrach/turboAPI+4 more
18d ago
470
@rog0x
MCP

io.github.rog0x/perf

Benchmark, memory, Big O analysis for AI agents

mcp · github · ai · memory
rog0x/mcp-perf-tools
19d ago
0
@nguyenyou

benchmark

Run scalex performance benchmarks, profiling, and timing analysis. Use this skill whenever the user asks to benchmark scalex, measure performance, profile index/query times, compare before/after performance of a change, investigate bottlenecks, or mentions "benchmark", "perf", "how fast", "timing", "hyperfine", "profile", "flame graph", "profiling", "--timings", "slow", "bottleneck", "regression", "memory", "heap", "GC", "allocation". Also use proactively after implementing performance improvements to verify gains. Covers 6 layers: built-in --timings, hyperfine benchmarks, async-profiler flame graphs, JFR recording, microbenchmarks, and memory profiling.

nguyenyou/scalex+2 more
18d ago
470
@rust-works

benchmark-docs

Manages benchmark documentation across multiple platforms. Use when updating benchmark results, adding performance data, or documenting jq comparison benchmarks. Triggers on terms like "benchmark", "performance", "jq comparison", "benchmark results", "update benchmarks".

rust-works/succinctly+8 more
18d ago
450