Model Benchmark — Skill Tool
v0.1.0 · In-depth evaluation of each model's real-world performance on OpenClaw, supporting multi-dimensional assessment of Chinese comprehension, coding, reasoning, and tool calling.
Security Scan
OpenClaw
Suspicious
Medium confidence: The skill is coherent for benchmarking models, but its runtime instructions reference obtaining and using multiple external API keys and editing OpenClaw config without declaring or explaining how secrets will be provided or stored; this mismatch warrants caution.
Assessment Advice
This skill appears to be a legitimate benchmarking instruction set, but it refers to obtaining and using multiple external provider API keys without declaring them in the metadata or describing how to provide or store them. Before installing or using it: (1) Confirm how you'll supply provider keys; prefer ephemeral or least-privilege keys and avoid pasting long-lived secrets into third-party UIs. (2) Understand where keys will be stored (models.json) and check file permissions; back up the orig...
ℹ Purpose & Capabilities
The name, description, and SKILL.md consistently describe a model benchmarking framework and include sensible test cases and report format. The SKILL.md also legitimately references adding providers to OpenClaw's models.json and using provider API keys for GLM-5, Qwen, etc., which is expected for a benchmarking skill that talks to external models.
ℹ Instruction Scope
The instructions stay within benchmarking scope (test items, scoring, report format). They reference specific operational items: editing OpenClaw models.json to add providers, using a local proxy at 127.0.0.1:8766, and acquiring provider API keys. They do not instruct the agent to read unrelated system files or exfiltrate data, but they do not specify safe handling or storage of credentials.
✓ Install Mechanism
No install spec and no code files are provided (instruction-only), so nothing will be written to disk or installed by the skill itself. This is the lowest-risk install model.
⚠ Credential Requirements
The SKILL.md explicitly lists provider API Key needs (GLM-5, Qwen, etc.) but the skill metadata declares no required environment variables or primary credential. That mismatch means the skill may expect the user/agent to supply secrets via models.json or prompts at runtime; the skill gives no guidance on where keys are stored, what permissions are needed, or whether keys will be transmitted to other endpoints. Requiring multiple external API keys is proportionate to benchmarking, but the lack of declared/env guidance and storage instructions is a privacy/operational concern.
✓ Persistence & Permissions
The skill is not always-included and does not request system-level persistence. It does mention editing OpenClaw configuration (models.json) which is a normal and limited config change for integrating providers; there is no indication it modifies other skills or system-wide settings beyond provider config advice.
Security findings are tiered; review the code before running.
Runtime Dependencies
No special dependencies
Versions
latest · v0.1.0 · 2026/3/23
- Initial release of the model-benchmark skill for deep evaluation of models on OpenClaw.
- Supports multidimensional assessment: Chinese understanding, coding, reasoning, and tool-use evaluation.
- Includes a standardized test set and scoring rubrics for consistent benchmarking.
- Documents required APIs and configuration methods for adding new model providers.
- Provides a detailed report template for presenting model evaluation results.
● Harmless
Install Command
Official: npx clawhub@latest install model-benchmark
Mirror (CN accelerated): npx clawhub@latest install model-benchmark --registry https://cn.clawhub-mirror.com
Skill Documentation
Created: 2026-03-23
Goal: in-depth evaluation of each model's real-world performance on OpenClaw
Test Environment
- Platform: Matrix Agent (OpenClaw 2026.3.3)
- Current model: minimax/auto (200k context, MaxTokens 8192)
- Proxy: 127.0.0.1:8766 (MiniMax internal proxy)
- Thinking: disabled
Model Pool to Test
| Model | Provider | Status | Priority |
|---|---|---|---|
| MiniMax Auto | minimax | ✅ Tested | — |
| GLM-5 | 智谱/百炼 | 🔜 Pending | P1 |
| Qwen3-235B-A22B | 百炼 (MoE, 235B params) | 🔜 Pending | P1 |
| Claude Opus 4 (thinking-medium) | anthropic-via-proxy | 🔜 Pending | P1 |
| DeepSeek R1 | TBD | 🔜 Pending | P2 |
| GPT-4o | OpenAI | TBD | P2 |
API Key Requirements
- GLM-5: requires a 智谱 (Zhipu) API Key
- Qwen3-235B-A22B: requires an Alibaba Cloud 百炼 (Bailian) Key
- Test method: configure new providers via OpenClaw's models.json
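The document names models.json as the place to register new providers but does not show its schema. The following is a hypothetical sketch (all field names here are assumptions, not OpenClaw documentation) of what a 百炼-backed Qwen entry might look like; note the key is referenced through an environment variable rather than pasted inline, which addresses the credential-handling concern flagged by the security scan:

```json
{
  "providers": {
    "bailian": {
      "baseUrl": "https://dashscope.aliyuncs.com/compatible-mode/v1",
      "apiKeyEnv": "BAILIAN_API_KEY",
      "models": ["qwen3-235b-a22b"]
    }
  }
}
```

Whatever the real schema turns out to be, prefer env-var indirection over literal keys in the file, and restrict models.json permissions (e.g. chmod 600).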
Evaluation Dimensions
| Dimension | Weight | Test Content |
|---|---|---|
| Chinese comprehension | 25% | Explain a complex concept in language a primary-school student can understand |
| Coding | 25% | Python implementation; concise and runnable |
| Tool calling | 20% | Explain why tool calling matters for an Agent |
| Complex reasoning | 20% | Multi-step logical reasoning problems |
| Response speed | 10% | Time from sending the question to receiving the reply |
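The weighted total implied by the table can be sketched in Python. The weights come from the table above; the example scores are illustrative only, not real benchmark results:

```python
# Weights taken from the evaluation-dimensions table (they sum to 1.0).
WEIGHTS = {
    "chinese_comprehension": 0.25,
    "coding": 0.25,
    "tool_calling": 0.20,
    "complex_reasoning": 0.20,
    "response_speed": 0.10,
}

def weighted_total(scores: dict) -> float:
    """Combine per-dimension scores (each on a 0-10 scale) into one 0-10 total."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("score every dimension exactly once")
    return round(sum(WEIGHTS[d] * scores[d] for d in WEIGHTS), 2)

# Illustrative scores only.
example = {
    "chinese_comprehension": 9,
    "coding": 8,
    "tool_calling": 7,
    "complex_reasoning": 8,
    "response_speed": 6,
}
print(weighted_total(example))  # → 7.85
```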
Test Question Bank (Standard Questions)
Test 1: Chinese Comprehension & Creativity
In no more than 100 characters, explain "quantum entanglement" so that a primary-school student can understand it, with some literary flair.
Test 2: Coding Ability
Write a Python function that checks whether a string is a palindrome; the code should be concise, clearly commented, and directly runnable.
Test 3: Tool-Calling Ability
Explain why "tool-calling ability" is critical for AI Agents, grounded in real scenarios, in no more than 150 characters.
Test 4: Complex Reasoning
Zhang San is 3 years older than Li Si. Li Si is 2 years younger than Wang Wu. Wang Wu is 20. What is the sum of their three ages? Show the reasoning.
Report Format
# Model Evaluation Report: {model name}   Date: YYYY-MM-DD
Overall score: X/10
| Dimension | Score | Comments |
|---|---|---|
| Chinese comprehension | X/10 | ... |
| Coding | X/10 | ... |
| Tool calling | X/10 | ... |
| Complex reasoning | X/10 | ... |
| Response speed | X/10 | ... |
Highlights
Weaknesses
Conclusion
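Tests 2 and 4 in the question bank have objectively checkable answers. A minimal Python sketch of reference solutions for graders to compare against (these are not part of the original skill):

```python
def is_palindrome(s: str) -> bool:
    """Test 2 reference: True iff s reads the same forwards and backwards."""
    return s == s[::-1]

# Test 4 reference arithmetic: Wang Wu is 20; Li Si is 2 years younger
# than Wang Wu; Zhang San is 3 years older than Li Si.
wang_wu = 20
li_si = wang_wu - 2                     # 18
zhang_san = li_si + 3                   # 21
age_sum = zhang_san + li_si + wang_wu   # 59

print(is_palindrome("level"), is_palindrome("hello"), age_sum)  # → True False 59
```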
Data source: ClawHub · Chinese localization: 龙虾技能库