cuda-kernels
Provides guidance for writing and benchmarking optimized CUDA kernels for NVIDIA GPUs (H100, A100, T4), targeting the HuggingFace diffusers and transformers libraries. Supports models such as LTX-Video, Stable Diffusion, LLaMA, Mistral, and Qwen. Integrates with the HuggingFace Kernels Hub (get_kernel) for loading pre-compiled kernels, and includes benchmarking scripts to compare kernel performance against baseline implementations.
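The benchmarking pattern such scripts follow (warm up, time repeated runs, report the median) can be sketched on CPU. This is a hypothetical harness, not the skill's actual script; real CUDA kernel timing would additionally call torch.cuda.synchronize or use torch.cuda.Event before reading the clock.

```python
import time
import statistics

def bench(fn, *args, warmup=3, iters=20):
    """Median wall-clock time of fn(*args) after a few warmup runs."""
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)

# Hypothetical baseline vs. "optimized kernel" pair:
# a pure-Python loop against the C-level builtin.
def baseline_sum(xs):
    total = 0.0
    for v in xs:
        total += v
    return total

data = [float(i) for i in range(100_000)]
t_baseline = bench(baseline_sum, data)
t_kernel = bench(sum, data)
print(f"speedup: {t_baseline / t_kernel:.1f}x")
```

The warmup runs matter most on GPU, where the first launch pays one-time compilation and caching costs that would otherwise skew the comparison.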
magpie
Performs GPU kernel correctness and performance evaluation, as well as LLM inference benchmarking, with Magpie. Analyzes single or multiple kernels (HIP/CUDA/PyTorch), compares kernel implementations, runs vLLM/SGLang benchmarks with profiling and TraceLens, and performs gap analysis on torch traces. Creates kernel config YAMLs, discovers kernels in a project, and queries GPU specs. Use when the user mentions Magpie, kernel analysis or comparison, HIP/CUDA kernel evaluation, vLLM/SGLang benchmarks, gap analysis, TraceLens, creating kernel configs, or discovering GPU kernels.
agentic-data-science-competition
AI agent-driven Kaggle competition workflow. Learn from real competition experience: score stabilization patterns, submission troubleshooting, kernel workflows, GPU task delegation, and the spec-driven development approach that achieved top leaderboard positions. Use when: working on any Kaggle competition, analyzing submission failures, setting up automated pipelines, or replicating top notebook solutions.
diffusion-perf
Deprecated alias (merged into diffusion-kernel).
ipsw
Apple firmware and binary reverse engineering with the ipsw CLI tool. Use when analyzing iOS/macOS binaries, disassembling functions in dyld_shared_cache, dumping Objective-C headers from private frameworks, downloading IPSWs or kernelcaches, extracting entitlements, analyzing Mach-O files, or researching Apple security. Triggers on requests involving Apple RE, iOS internals, kernel analysis, KEXT extraction, or vulnerability research on Apple platforms.
awq-quantization
Activation-aware weight quantization for 4-bit LLM compression with 3x speedup and minimal accuracy loss. Use when deploying large models (7B-70B) on limited GPU memory, when you need faster inference than GPTQ with better accuracy preservation, or for instruction-tuned and multimodal models. MLSys 2024 Best Paper Award winner.
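As a rough illustration of the 4-bit grouped storage format such methods produce, here is a numpy sketch under assumed conventions (group size of 128, asymmetric round-to-nearest with a per-group zero point). It is not the actual AWQ implementation, which additionally rescales salient weight channels based on activation statistics before quantizing.

```python
import numpy as np

def quantize_4bit_groups(w, group_size=128):
    """Asymmetric round-to-nearest 4-bit quantization per group of weights.
    Hypothetical sketch: each group stores a float scale and an integer
    zero point alongside its int4 codes."""
    w = w.reshape(-1, group_size)
    wmin = w.min(axis=1, keepdims=True)
    wmax = w.max(axis=1, keepdims=True)
    scale = (wmax - wmin) / 15.0          # 4 bits -> 16 levels
    zero = np.round(-wmin / scale)        # shifts the range to [0, 15]
    q = np.clip(np.round(w / scale) + zero, 0, 15)
    return q.astype(np.uint8), scale, zero

def dequantize(q, scale, zero):
    return (q.astype(np.float32) - zero) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4096,)).astype(np.float32)
q, scale, zero = quantize_4bit_groups(w)
w_hat = dequantize(q, scale, zero).reshape(-1)
max_err = np.abs(w - w_hat).max()
```

Packing two 4-bit codes per byte gives the roughly 4x memory reduction that lets 7B-70B models fit on smaller GPUs; the per-group scale and zero point are the small float overhead on top.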
Io.Github.RightNow AI/Forge MCP Server
Turn PyTorch code into fast CUDA/Triton kernels on real datacenter GPUs, with up to 14x speedup.
Kernel MCP Server
Access Kernel's cloud-based browsers and app actions via MCP (remote HTTP + OAuth).
add-jit-kernel
Step-by-step tutorial for adding a new lightweight JIT CUDA kernel to sglang's jit_kernel module.
anti-cheat-systems
Guide for modern game anti-cheat architecture, Windows kernel monitoring, and detection tradeoffs. Use this skill when analyzing EAC, BattlEye, Vanguard, FACEIT AC, kernel callbacks, handle protection, manual-map detection, boot-start drivers, BYOVD, DMA threats, or behavioral telemetry in game security research.
3dgs-code-reviewer
Review 3D Gaussian Splatting implementation code for correctness, performance bugs, and best practices. Covers CUDA kernels, rendering pipeline, training loop, loss functions, and common pitfalls. Detects 42+ known bug patterns.
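For concreteness, one correctness property such a review checks is the 3D covariance construction from the scale and rotation parameterization, Sigma = R S S^T R^T, which is positive semi-definite by construction. The following numpy sketch assumes the common (w, x, y, z) quaternion layout used in reference 3DGS code; the function names are illustrative, not the skill's API.

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix from a quaternion (w, x, y, z)."""
    w, x, y, z = q / np.linalg.norm(q)  # normalizing here; skipping this is a common pitfall
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def covariance_3d(scales, quat):
    """Sigma = R S S^T R^T: symmetric and PSD by construction."""
    R = quat_to_rot(np.asarray(quat, dtype=float))
    S = np.diag(np.asarray(scales, dtype=float))
    M = R @ S
    return M @ M.T

cov = covariance_3d([0.5, 1.0, 2.0], [0.9, 0.1, 0.2, 0.3])
eigvals = np.linalg.eigvalsh(cov)
```

Because R is orthogonal, the eigenvalues of the covariance are exactly the squared scales; a review can assert that, plus symmetry, as a cheap sanity check on any reimplementation.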
add-reference-tests
Add pytest tests to validate reference implementations in flashinfer_trace against FlashInfer or SGLang ground truth. Use when validating kernel definitions, adding tests for new op_types, or verifying reference implementations are correct.
HelioSPICE
Spacecraft ephemeris made easy — auto-managed SPICE kernels for heliophysics
channel-telegram
Run and operate a Telegram channel worker for LionClaw using the kernel channel bridge APIs.
kernel-analysis-vphone600
Analyze vphone600 kernel artifacts using the local symbol database and XNU source tree. Use when working on kernel reverse engineering, address-to-symbol lookup, release-vs-research kernel comparison, or patch analysis for vphone600 variants in this repository.
XHelio-SPICE
Spacecraft ephemeris made easy — auto-managed SPICE kernels for heliophysics
kernelgen-flagos
Unified GPU kernel operator generation skill. Automatically detects the target repository type (FlagGems, vLLM, or general Python/Triton) and dispatches to the appropriate specialized sub-skill. Also includes a feedback submission sub-skill for bug reports. Use this skill when the user wants to generate a GPU kernel operator, create a Triton kernel, or says things like "generate an operator", "create a kernel for X", or "/kernelgen-flagos". This single skill replaces the need to install kernelgen-general, kernelgen-for-flaggems, kernelgen-for-vllm, and kernelgen-submit-feedback separately.