📦 Advanced Evaluation — LLM Evaluation

v1.0.0

One-click LLM-as-judge: automatically compare outputs across multiple models, generate scoring rubrics, detect and mitigate evaluation bias, and produce visual reports, making model selection and quality tracking more efficient.

by @karmaent (KarmaENT)
Last updated: 2026/4/1
Security scan
VirusTotal: harmless
OpenClaw: safe (high confidence)
The skill is internally coherent: its instructions, lack of installs, and lack of required credentials line up with its stated purpose of providing evaluation patterns and prompt templates for LLM-as-judge workflows.
Review recommendations
This skill is a focused playbook for building LLM-as-judge systems and appears coherent and low-risk. Before installing or using it:
(1) Review whether you want the evaluator to produce chain-of-thought justifications — these can reveal internal reasoning or sensitive prompt/context and can be disabled if you want only scores.
(2) When evaluating private or sensitive content, ensure your evaluation pipeline does not send that data to external models or services you don't control.
(3) Prefer usin...
Detailed analysis
Purpose and capabilities
Name and description match the SKILL.md content: the doc is a detailed playbook for LLM-based evaluation (direct scoring, pairwise comparison, rubrics, bias mitigation). There are no requests for unrelated binaries, credentials, or system access that would be out of scope for an evaluation skill.
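
To make the rubric idea concrete, here is a minimal sketch of how a scoring rubric might be represented and rendered into judge-prompt text; the criteria and names are hypothetical examples, not the skill's own templates:

```python
# Hypothetical rubric representation for LLM-based evaluation; criteria
# and names below are illustrative examples only.
RUBRIC = {
    "accuracy": "Factual claims are correct and verifiable (1-5).",
    "coverage": "All parts of the task are addressed (1-5).",
    "clarity":  "The writing is concise and easy to follow (1-5).",
}

def render_rubric(rubric: dict[str, str]) -> str:
    """Render a criterion->description mapping into judge-prompt text.

    Scoring one named criterion at a time tends to be more reliable than
    asking for a single holistic score.
    """
    lines = "\n".join(f"- {name}: {desc}" for name, desc in rubric.items())
    return "Score each criterion separately:\n" + lines
```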
Instruction scope
The instructions are detailed and focused on evaluation techniques, prompting patterns, bias mitigation, calibration, and statistical analysis. One notable instruction pattern is to require justifications (chain-of-thought) before giving scores; this is appropriate for reliability but can expose evaluator reasoning that some users may prefer to keep private. The skill does not instruct the agent to read local files, environment variables, or send results to external endpoints beyond whatever evaluation pipeline the user implements.
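
To make that pattern concrete, here is a minimal sketch of a direct-scoring judge prompt with the justification step toggleable; build_judge_prompt and want_justification are illustrative names, not taken from the skill:

```python
# Hypothetical sketch of the "justification before score" pattern noted
# above; names are illustrative, not the skill's own templates.

def build_judge_prompt(task: str, response: str,
                       want_justification: bool = True) -> str:
    """Build a direct-scoring judge prompt on a 1-5 scale.

    With want_justification=True the judge reasons before scoring, which
    is generally more reliable but may expose prompt/context details in
    the justification text. With False, only a bare score is requested.
    """
    header = (
        "You are an impartial evaluator. Rate the response to the task "
        "below on a 1-5 scale for overall quality.\n\n"
        f"Task:\n{task}\n\nResponse:\n{response}\n\n"
    )
    if want_justification:
        return header + (
            "First write a brief justification analyzing strengths and "
            "weaknesses, then give the score on its own final line as "
            "'Score: <1-5>'."
        )
    # Score-only mode: nothing but the number, so no reasoning is revealed.
    return header + "Output only 'Score: <1-5>' and nothing else."
```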
Installation mechanism
Instruction-only skill with no install spec and no code files. This minimizes disk writes and arbitrary code execution; there are no download URLs, packages, or binaries installed by this skill.
Credential requirements
The skill declares no required environment variables, credentials, or config paths. The guidance to use separate models for generation vs evaluation is sensible but not enforced by any hidden credential requests.
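
That separation is easy to encode explicitly in a pipeline config; a tiny hypothetical sketch (the model identifiers are placeholders):

```python
# Hypothetical pipeline config showing the generator-vs-evaluator split;
# model identifiers are placeholders, not real model names.
EVAL_CONFIG = {
    "generator_model": "model-under-test-v1",   # produces the outputs being judged
    "evaluator_model": "independent-judge-v2",  # scores them; a different model
                                                # helps avoid self-enhancement bias
}
```

Using a judge from a different model family than the generator also reduces self-enhancement bias, where a model rates its own style of output more favorably.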
Persistence and permissions
The always flag is false and the skill is user-invocable. It does not request persistent system presence or modify other skills' configuration. Autonomous invocation is allowed (the platform default) but is not combined with other privilege escalations here.
Security comes in layers; review the code before running it.

Runtime dependencies

No special dependencies

Versions

latest · v1.0.0 · 2026/4/1

Initial release of advanced-evaluation, a comprehensive skill for building robust LLM evaluation systems.
- Provides actionable guidance for implementing LLM-as-judge in automated pipelines.
- Explains evaluation methods: direct scoring vs. pairwise comparison, with reliability and bias considerations.
- Details systemic LLM biases (e.g., position, length, self-enhancement) and mitigation strategies (see the sketch after this list).
- Outlines metric selection frameworks for different evaluation tasks.
- Supplies prompt templates and protocols for direct scoring, pairwise comparison, and rubric creation.
- Offers practical patterns for evaluation pipeline design and rubric adaptation by domain.
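
As a concrete illustration of the position-bias mitigation mentioned above, here is a minimal sketch that runs each pairwise comparison in both presentation orders and only accepts a verdict that survives the swap; the judge callable is a stand-in for whatever evaluator model the pipeline uses:

```python
# Minimal sketch of order-swap debiasing for pairwise LLM-as-judge.
# `judge` is a stand-in for a real evaluator-model call; everything here
# is illustrative, not the skill's own code.
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]

def debiased_compare(task: str, out_a: str, out_b: str,
                     judge: Callable[[str, str, str], Verdict]) -> Verdict:
    """Run the comparison in both presentation orders; a verdict only
    counts if it survives the swap, cancelling position bias."""
    first = judge(task, out_a, out_b)    # A shown first
    second = judge(task, out_b, out_a)   # B shown first
    # Map the swapped-order verdict back to the original labels.
    second_unswapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    if first == second_unswapped:
        return first
    return "tie"  # orders disagree: treat as no clear winner

if __name__ == "__main__":
    # A pathologically position-biased toy judge that always picks the
    # answer shown first; the swap catches it and the result is "tie".
    first_slot_judge = lambda task, a, b: "A"
    print(debiased_compare("q", "x", "y", first_slot_judge))  # -> tie
```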


Install command

Official: npx clawhub@latest install advanced-evaluation
Mirror (CN acceleration): npx clawhub@latest install advanced-evaluation --registry https://cn.longxiaskill.com
Data source: ClawHub · Chinese localization: 龙虾技能库