llm-benchmark-analyst: LLM Benchmark Analyst

Name: llm-benchmark-analyst (LLM Benchmark Analyst)
Author: Chekhovin

v1.0.0

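Search and analyze large language model (LLM) benchmark results within a fixed benchmark universe, and produce evidence-based reports on a model's strengths and weaknesses or summaries of domain leaders.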

by @chekhovin (Chekhovin)·MIT-0

Tags: API tools, automation, AI model access, developer tools

License

MIT-0

Last updated

2026/4/11

Security scan

VirusTotal

Benign

OpenClaw

Suspicious

medium confidence

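The skill is consistent with its stated purpose (benchmark research and reporting), but Unicode control characters (a prompt-injection signal) were detected in SKILL.md, and the package comes from an unknown source with no homepage. These issues require manual inspection before installation.

Assessment recommendations

The skill appears to do what it claims (search benchmark leaderboards and produce evidence-based reports) and is low-risk with respect to installation and credential requests. However:
1) SKILL.md was flagged for Unicode control characters. Open SKILL.md in a raw/text editor that shows hidden/control characters and verify that there are no hidden or altered instructions.
2) Confirm the skill's source/owner before installing (no homepage is provided).
3) Because it relies on web browsing and image extraction, avoid running it with sensitive accounts or data until you trust it.
4) If autonomous invocation is allowed, first test it interactively with harmless queries and monitor the agent's outbound requests.
5) Manually check the URLs cited in the references/ files. They contain many external links, and the skill will instruct the agent to browse those sites, so make sure this complies with your policies.
If you like, I can highlight any non-ASCII/control characters in SKILL.md or provide a clean, display-only version for review. ...

Detailed analysis

✓ Purpose and capability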

Name, description, and bundled reference files align: the skill's goal is structured benchmark search and reporting; it restricts scope to the provided references and doesn't request unrelated credentials or system access. The instruction-only design (no binaries, no env vars, no installs) is proportionate to the stated purpose.

⚠ Instruction scope

SKILL.md instructs the agent to browse web pages and perform multimodal extraction (text/image/canvas). That is functionally coherent, but the static scan flagged 'unicode-control-chars' in the SKILL.md (a prompt-injection pattern). Unicode control characters can hide or obfuscate instructions and may be used to manipulate or subvert the evaluation or runtime behavior. Inspect the raw SKILL.md for hidden control characters and verify that no hidden directives or altered text exist.

✓ Install mechanism

No install spec and no code files beyond reference docs — lowest install risk. Nothing is downloaded or written to disk by the package itself.

✓ Credential requirements

The skill declares no required environment variables, credentials, or config paths. Its needs (web browsing, multimodal extraction) are reasonable for the described functionality and do not demand secrets or broad system access.

ℹ Persistence and privileges

always:false and default autonomous invocation are set (normal). Because the skill can be invoked autonomously and instructs web crawling and image extraction, it can perform network retrievals during runs — this is expected for a research/reporting skill but increases the operational blast radius if the skill contained hidden or malicious instructions. Combine this with the prompt-injection signal when deciding whether to enable autonomous runs.

Security is layered; review the code before running.

License

MIT-0

Free to use, modify, and redistribute, with no attribution required.

Runtime dependencies

No special dependencies

Version

latest · v1.0.0 · 2026/3/14

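Initial release of the LLM Benchmark Analyst skill.
- Enables searching and analyzing LLM benchmark results within a fixed, pre-approved benchmark universe.
- Produces evidence-based reports on model strengths/weaknesses, domain leaders, benchmark explanations, and predecessor comparisons.
- Enforces strict model/version normalization and prefers official benchmark sources, with detailed provenance for every score.
- Integrates workflows for benchmark selection, multimodal leaderboard extraction, overlap expansion, and defect-warning application.
- Output follows structured templates and always includes scope, summary, evidence table, comparisons, data defects, and exclusions.
- Does not invent or guess missing data; maintains strict reporting fidelity and transparency.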

Install command

Official: npx clawhub@latest install llm-benchmark-analyst

Mirror: npx clawhub@latest install llm-benchmark-analyst --registry https://cn.clawhub-mirror.com

Skill documentation

Overview

Use this skill to research benchmark evidence and write structured reports about:

a single model's strengths and weaknesses
best models in a capability domain
what a benchmark measures and how trustworthy it is
predecessor vs current-model progress

Default to the user's language. Never invent scores, ranks, dates, benchmark variants, or missing table values.

Core constraints

Restrict the benchmark universe to references/benchmark-source.md. If a benchmark is not in that file, exclude it.
Use references/core-dimensions.md to collapse scattered benchmarks into a small set of report dimensions.
Follow references/search-playbook.md for routing, overlap expansion, evidence gathering, and comparison anchors.
Follow references/report-template.md for output structure.
Apply references/data-defect-warnings.md benchmark by benchmark, inline and again in the limitations section.
Prefer official benchmark or benchmark-author pages. Use aggregators mainly to discover links and context.
Record the evaluation mode exactly: benchmark version, split, difficulty, public/private, verified/original, with-tools/without-tools, pass@k, and any visible sub-score names.
Keep score units exact. Do not average incompatible metrics into a fake composite.
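
The last constraint (no fake composites) can be made mechanical. The following is a minimal sketch, assuming scores are held as (value, unit) pairs; the function name `composite` and the sample values are hypothetical and not part of the skill.

```python
def composite(scores):
    """Average (value, unit) pairs, refusing to mix different units."""
    if not scores:
        raise ValueError("no scores to combine")
    units = {unit for _value, unit in scores}
    if len(units) > 1:
        raise ValueError(f"incompatible units, refusing to average: {sorted(units)}")
    return sum(value for value, _unit in scores) / len(scores)


# Placeholder values for illustration only; these are not real benchmark results.
same_unit = [(61.0, "pass@1 %"), (59.0, "pass@1 %")]
mixed_units = [(61.0, "pass@1 %"), (1280.0, "Elo")]

print(composite(same_unit))
try:
    composite(mixed_units)
except ValueError as err:
    print(err)
```

Sharing a unit is necessary but not sufficient: whether two benchmarks with the same unit are comparable enough to average remains a judgment call for the analyst.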

Required workflow

Normalize the model identity before searching

- Resolve exact provider, family, generation, version suffix, and release label.
- Put time and version first. Reject ambiguous aliases like claude, gemini pro, gpt latest, or qwen max until you have the exact currently relevant model string for the searched leaderboard rows.
- Capture the evaluation time point or access date for every key score.
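
The alias rule in this step could be enforced by a small guard that runs before any search. This is a hypothetical sketch: the alias set is copied from the examples in this step, and the function name and the sample model string are assumptions for illustration.

```python
# Aliases named in the step above; a real deployment would likely need a longer list.
AMBIGUOUS_ALIASES = {"claude", "gemini pro", "gpt latest", "qwen max"}


def is_ambiguous(model_name: str) -> bool:
    """Return True if the name is a bare alias rather than an exact model string."""
    normalized = " ".join(model_name.lower().split())  # collapse case and whitespace
    return normalized in AMBIGUOUS_ALIASES


# "example-model-2.1" is a made-up identifier, not a real model.
for name in ["Gemini  Pro", "example-model-2.1"]:
    print(name, "->", "reject" if is_ambiguous(name) else "accept")
```

A set lookup only catches known aliases. It does not prove that an accepted string matches a leaderboard row, so the manual resolution step is still required.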

Route the request through core dimensions before web crawling

- Start with references/core-dimensions.md to select the primary dimension(s).
- Then list candidate benchmarks inside those dimensions.
- Only then start website-by-website retrieval.
- Keep the first pass narrow and token-efficient: start from the best 3-6 benchmarks for the asked domain, then expand only if needed.

Expand beyond section labels

- Do not let the source document's headings blind you.
- After selecting the primary dimension, inspect benchmark descriptions and overlap tags to find relevant benchmarks that live in other sections.
- Example: a coding analysis may need coding benchmarks, agentic coding benchmarks, general benchmarks with coding components, and research/math benchmarks with strong code components.
- Example: a multimodal analysis may need vision benchmarks, OCR, GUI/computer-use, multimodal deep-research, and omni/video/audio benchmarks.

Collect evidence in this order

- official leaderboard or benchmark site
- benchmark paper or benchmark README
- benchmark-author blog or release note
- trusted aggregator
- vendor blog only as secondary evidence, clearly labeled as vendor-reported if no independent leaderboard row exists
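
The evidence order above amounts to a fixed priority list. A minimal sketch, assuming each candidate source is a dict with a `source_type` key; the key names and type labels are hypothetical shorthand for the five tiers, not identifiers defined by the skill.

```python
# Tiers in the order given above; a lower index means a more preferred source.
SOURCE_PRIORITY = [
    "official_leaderboard",
    "paper_or_readme",
    "author_blog_or_release_note",
    "trusted_aggregator",
    "vendor_blog",
]


def best_source(candidates):
    """Pick the candidate whose source type ranks highest in SOURCE_PRIORITY."""
    return min(candidates, key=lambda c: SOURCE_PRIORITY.index(c["source_type"]))


def provenance_label(candidate, has_independent_row):
    """Label vendor blogs as vendor-reported when no independent row exists."""
    if candidate["source_type"] == "vendor_blog" and not has_independent_row:
        return "vendor-reported"
    return candidate["source_type"]


candidates = [
    {"source_type": "vendor_blog"},
    {"source_type": "trusted_aggregator"},
]
print(best_source(candidates)["source_type"])
print(provenance_label(candidates[0], has_independent_row=False))
```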

Use multimodal extraction when the leaderboard is not machine-readable

- If the page uses images, canvas, screenshots, or chart-only rendering and plain text extraction misses the table, inspect screenshots or page images.
- Extract only values that are clearly visible.
- Mark the provenance as image-extracted.
- If the image is unreadable or partially occluded, say so instead of guessing.

Apply anchor comparisons

- For code or agentic coding, compare against the latest available Claude Opus, latest Claude Sonnet, and latest GPT family model.
- For multimodal analysis, compare against the latest available Gemini model. Add the latest GPT multimodal model if relevant.
- For intelligence or reasoning analysis, compare against the latest available GPT family model.
- Never assume which model is currently latest. Search that first.

Apply predecessor comparison

- If data exists, compare the target model with its immediate predecessor or last broadly comparable prior generation from the same provider/family.
- Only compare like-for-like benchmark variants. If the predecessor only appears under a different benchmark mode, say the comparison is not clean.
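
The like-for-like rule can be sketched as a comparison that returns no delta when the benchmark, variant, or unit differ. The dict keys, the function name, and the scores below are assumptions for illustration, not values from any leaderboard.

```python
from typing import Optional


def clean_delta(target: dict, predecessor: dict) -> Optional[float]:
    """Return target minus predecessor, or None if the rows are not like-for-like."""
    for key in ("benchmark", "variant", "unit"):
        if target[key] != predecessor[key]:
            return None  # not a clean comparison; report that instead of a number
    return target["score"] - predecessor["score"]


# Placeholder rows; the scores are invented for the example.
target = {"benchmark": "bench-a", "variant": "verified", "unit": "%", "score": 70.0}
pred_same = {"benchmark": "bench-a", "variant": "verified", "unit": "%", "score": 65.0}
pred_other = {"benchmark": "bench-a", "variant": "original", "unit": "%", "score": 65.0}

print(clean_delta(target, pred_same))
print(clean_delta(target, pred_other))
```

Returning `None` rather than a number forces the caller to state that the comparison is not clean, which is what this step asks for.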

Attach defect warnings

- Any benchmark with a known quality or methodology issue must carry an inline warning from references/data-defect-warnings.md.
- If the report's conclusion depends heavily on warned benchmarks, lower confidence and say so explicitly.
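
One possible reading of "depends heavily" is a share-of-benchmarks threshold. The sketch below is hypothetical throughout: the catalog entry is a placeholder (real warnings live in references/data-defect-warnings.md), and the 0.5 threshold is an assumption that the skill does not specify.

```python
# Placeholder catalog; the benchmark name and warning text are invented.
WARNINGS = {"bench-x": "placeholder warning text"}


def annotate(benchmarks, warnings):
    """Pair each cited benchmark with its inline warning, or None if it has none."""
    return [(name, warnings.get(name)) for name in benchmarks]


def confidence(benchmarks, warnings, threshold=0.5):
    """Return 'lowered' when the share of warned benchmarks reaches the threshold."""
    if not benchmarks:
        return "normal"
    warned = sum(1 for name in benchmarks if name in warnings)
    return "lowered" if warned / len(benchmarks) >= threshold else "normal"


cited = ["bench-x", "bench-y", "bench-z"]
print(annotate(cited, WARNINGS))
print(confidence(cited, WARNINGS))
print(confidence(["bench-x"], WARNINGS))
```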

Decision rules

When the user asks for the best models in a domain, do not use only one benchmark. Use a cluster of relevant benchmarks and explain why each one matters.
When the user asks what a model is good or bad at, synthesize at the core-dimension level first, then support with benchmark evidence.
When benchmark scores conflict, prefer freshness, exact version match, official source quality, and the number of agreeing benchmarks over one standout score.
Treat very small gaps as non-decisive when the benchmark is noisy, image-extracted, or known to be unstable.
Always include one short clause describing what each benchmark actually tests.
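
The small-gap rule above could be encoded as a simple check. This is a sketch under stated assumptions: the text does not define "very small", so the `min_gap` default of 1.0 is an arbitrary placeholder, and `noisy` stands in for any of the three conditions (noisy, image-extracted, or unstable).

```python
def gap_is_decisive(score_a: float, score_b: float, noisy: bool, min_gap: float = 1.0) -> bool:
    """Treat gaps below min_gap as non-decisive when the benchmark is flagged as noisy."""
    gap = abs(score_a - score_b)
    if noisy and gap < min_gap:
        return False
    return gap > 0


# Placeholder scores, not real results.
print(gap_is_decisive(70.2, 69.9, noisy=True))
print(gap_is_decisive(70.2, 69.9, noisy=False))
print(gap_is_decisive(75.0, 69.9, noisy=True))
```

A fixed threshold is crude. An appropriate `min_gap` depends on the benchmark's unit and variance, so it would need to be set per benchmark.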

Minimum evidence to capture

For every benchmark you cite, capture:

benchmark name
what it tests in one short phrase
exact model row name
exact score and unit
rank or relative placement if visible
benchmark variant, split, or mode
date or access time point
source quality note if not official
data warning if applicable
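
The fields above map naturally onto a record type. A minimal sketch, assuming one record per cited benchmark row; the class and field names are hypothetical, and the score is kept as text so it is stored exactly as shown on the source page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Evidence:
    benchmark: str                      # benchmark name
    tests: str                          # what it tests, in one short phrase
    model_row: str                      # exact model row name as shown on the source
    score: str                          # kept as text to preserve the exact value
    unit: str
    variant: str                        # benchmark variant, split, or mode
    accessed: str                       # date or access time point
    rank: Optional[str] = None          # only if visible
    source_note: Optional[str] = None   # only if the source is not official
    warning: Optional[str] = None       # only if a data warning applies


def missing_required(item: Evidence) -> List[str]:
    """List the always-required fields that were left empty."""
    required = ("benchmark", "tests", "model_row", "score", "unit", "variant", "accessed")
    return [name for name in required if not getattr(item, name).strip()]


# Placeholder row; none of these values come from a real leaderboard.
row = Evidence(
    benchmark="bench-a",
    tests="placeholder description",
    model_row="example-model-1.0",
    score="61.0",
    unit="pass@1 %",
    variant="",
    accessed="2026-01-01",
)
print(missing_required(row))
```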

Output expectations

Use the matching template in references/report-template.md.

At minimum, every substantive report must include:

a scope and identity section
a short executive summary
strengths
weaknesses or gaps
evidence table
comparison section
data-defect warnings and confidence
methodology or exclusions
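
A draft report can be checked against this minimum list before delivery. The sketch below is hypothetical: it uses naive case-insensitive substring matching on the section names listed above, while the actual heading wording is defined in references/report-template.md and may differ.

```python
# Section names as listed above, not the template's real headings.
REQUIRED_SECTIONS = [
    "scope and identity",
    "executive summary",
    "strengths",
    "weaknesses or gaps",
    "evidence table",
    "comparison",
    "data-defect warnings and confidence",
    "methodology or exclusions",
]


def missing_sections(report_text: str) -> list:
    """Return the required section names that do not appear in the report text."""
    lowered = report_text.lower()
    return [name for name in REQUIRED_SECTIONS if name not in lowered]


draft = "Scope and identity\nExecutive summary\nStrengths\n"
print(missing_sections(draft))
```

Substring matching can give false positives (for example, the word "comparison" appearing in body text), so this works as a quick pre-check and not as validation.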

Resource map

references/core-dimensions.md: benchmark routing and de-fragmentation map
references/search-playbook.md: token-efficient search order, overlap expansion, and comparison rules
references/data-defect-warnings.md: warning catalog and ready-to-use caution language
references/report-template.md: output structures for single-model, domain-leader, and benchmark-explainer tasks
references/benchmark-source.md: full allowed benchmark universe copied from the user's benchmark document

Example tasks

analyze gpt-5's coding and agentic coding strengths and weaknesses, and compare it with the latest claude opus, claude sonnet, and gpt model
find the best multimodal models right now using only the approved benchmark list and explain each benchmark briefly
write a report on qwen's reasoning strengths, benchmark gaps, predecessor comparison, and all data-quality caveats
tell me which models lead in deep research and search, with benchmark-specific warnings and freshness notes

Data source: ClawHub · Chinese localization: 龙虾技能库
