llm-as-judge

Skills tagged with #llm-as-judge

pydantic-evals

Test and evaluate AI agents and LLM outputs using a code-first evaluation framework with strong typing. Use when the user wants to: (1) create evaluation datasets with test cases for AI agents, (2) define evaluators (deterministic, LLM-as-Judge, custom, or span-based), (3) run evaluations and generate reports, (4) compare model performance across experiments, (5) integrate evaluations with Pydantic AI agents, (6) set up observability with Logfire, (7) generate test datasets using LLMs, or (8) implement regression testing for AI systems.

Fuenfgeld/pydantic-ai-skills

14h ago

@NVIDIA-NeMo

byob

Create custom LLM evaluation benchmarks using the BYOB decorator framework. Use when the user wants to (1) create a new benchmark from a dataset, (2) pick or write a scorer, (3) compile and run a BYOB benchmark, (4) containerize a benchmark, or (5) use LLM-as-Judge evaluation. Triggers on mentions of BYOB, custom benchmark, bring your own benchmark, scorer, or benchmark compilation.

NVIDIA-NeMo/Evaluator (+2 more)

2mo ago

2290