
Hle Benchmark Evolver — Skill Tool

v1.0.0

Runs HLE-oriented benchmark reward ingestion and curriculum generation for capability-evolver. Use when the user asks to optimize Humanity's Last Exam score,...

by @wanng-ide (WANGJUNJIE) · MIT-0
License
MIT-0
Last updated
2026/4/12
Security scan
VirusTotal: Suspicious
OpenClaw: Suspicious (medium confidence)
The skill generally does what its description says (ingest HLE reports and produce curriculum signals) but it relies on undeclared sibling modules and can run arbitrary shell commands with full environment access — behaviours that warrant caution before installing.
Evaluation Advice
This skill appears to implement HLE report ingestion and curriculum generation, but take these precautions before installing or running it:

  • Ensure the expected sibling modules exist: capability-evolver (or feishu-evolver-wrapper). Inspect their src/gep/benchmarkReward.js and index.js to confirm which state files they touch and which side effects they perform.
  • Avoid passing untrusted commands to --eval_cmd. The pipeline will run that command via the shell (and may execute it as a temporary script using a logi...
Detailed Analysis
Purpose and Capabilities
The code implements ingestion, reporting, and a pipeline that calls out to a 'capability-evolver' (or a 'feishu-evolver-wrapper') module and invokes that skill's index.js for evolve/solidify. That dependency is not declared in the SKILL.md or package metadata; the skill will fail or behave differently if those sibling modules are missing. Otherwise the requested capabilities (parse report → ingest → generate curriculum signals → optionally drive evolve/solidify) match the stated purpose.
Instruction Scope
SKILL.md and the scripts allow/encourage executing arbitrary evaluator commands via --eval_cmd which are run through the shell (runShell) and may be written to a temporary script and executed via 'bash -l'. This grants those commands full access to the process environment and filesystem that the agent runs with and can run arbitrary code, read files, or exfiltrate data. The instructions do not warn about that risk or restrict which commands may be executed.
Installation Mechanism
There is no network download or install spec — the skill is instruction + local JS files only. No external packages are fetched. That lowers install risk, but the skill expects local sibling modules to exist (capability-evolver or feishu-evolver-wrapper).
Credential Requirements
The skill declares no required env vars, which is consistent with its metadata, but at runtime it spawns child processes and passes the full process.env to them. Those child processes (eval_cmd or invoked index.js in capability-evolver) can access any environment secrets available to the agent. Also the skill reads/writes state files via the external benchmarkReward module — the path and contents of those files are not documented in SKILL.md.
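One possible mitigation for the environment leakage described above is to forward an allow-listed environment to child processes instead of process.env wholesale. This is a suggestion, not something the skill does; the function name and allow-list are illustrative.

```javascript
// Hypothetical mitigation: build a minimal environment for spawned
// children (eval_cmd, sibling index.js) instead of forwarding everything.
function filteredEnv(allowed, source = process.env) {
  const env = {};
  for (const key of allowed) {
    if (source[key] !== undefined) env[key] = source[key];
  }
  return env; // only the allow-listed variables survive
}
```

A child spawned with `env: filteredEnv(["PATH", "HOME"])` can no longer read API keys or tokens that happen to sit in the agent's environment.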
Persistence and Permissions
The skill declares always:false and performs no explicit persistent installation. It writes temporary shell scripts to the current working directory when executing complex commands and relies on state files under the capability-evolver module's state path. It does not modify other skills' configs directly, but it calls other skill code (capability-evolver) that could have broader effects; verify those sibling modules before use.
Security is a matter of degree; review the code before running it.

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

Runtime Dependencies

No special dependencies

Versions

latest · v1.0.0 · 2026/2/16

  • Initial release of hle-benchmark-evolver skill for OpenClaw.
  • Enables ingestion of HLE benchmark report JSONs to drive curriculum and evolution workflows.
  • Supports easy-first curriculum queues, focus area suggestion, and immediate result summaries.
  • Offers shell commands for both single-run and fully automated evolution-feedback loops.
  • Always outputs compact, structured JSON summarizing key progress metrics and curriculum focus.


Install Command

Official: npx clawhub@latest install hle-benchmark-evolver
Mirror: npx clawhub@latest install hle-benchmark-evolver --registry https://cn.clawhub-mirror.com

Skill Documentation

This skill operationalizes HLE score-driven evolution for OpenClaw.

When to Use

  • User asks to improve HLE score (for example target >= 60%).
  • User provides question-level benchmark output and wants it converted to reward.
  • User wants easy-first curriculum queue and next-focus questions.
  • User asks for an immediate benchmark result snapshot.

Inputs

  • Benchmark report JSON path (--report=/abs/path/report.json)
  • Optional benchmark id (default: cais/hle)

Workflow

  • Validate the report JSON exists and is parseable.
  • Ingest report into capability-evolver benchmark reward state.
  • Generate curriculum signals:
    - benchmark_*
    - curriculum_stage:*
    - focus_subject:*
    - focus_modality:*
    - question_focus:*
  • Return a compact result summary for this run.

Run

node skills/hle-benchmark-evolver/run_result.js --report=/absolute/path/hle_report.json

Full automatic loop (starts evolution cycle):

node skills/hle-benchmark-evolver/run_pipeline.js --report=/absolute/path/hle_report.json --cycles=1

If your evaluator can be called from shell, let pipeline generate the report each cycle:

node skills/hle-benchmark-evolver/run_pipeline.js \
  --report=/absolute/path/hle_report.json \
  --eval_cmd="python /path/to/eval_hle.py --out {{report}}" \
  --cycles=3 --interval_ms=2000
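Presumably the {{report}} placeholder in --eval_cmd is substituted with the concrete report path each cycle before the command runs. A sketch of that substitution, under that assumption:

```javascript
// Assumed behaviour: replace every {{report}} occurrence in the eval_cmd
// template with the per-cycle report path before shelling out.
function buildEvalCmd(template, reportPath) {
  return template.split("{{report}}").join(reportPath);
}
```

Note that because the resulting string is executed by a shell, a report path containing spaces or shell metacharacters would need quoting in the template.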

If no --report is provided, it defaults to:

skills/capability-evolver/assets/gep/hle_report.template.json

Output Contract

Always print JSON with these fields:

  • benchmark_id
  • run_id
  • accuracy
  • reward
  • trend
  • curriculum_stage
  • queue_size
  • focus_subjects
  • focus_modalities
  • next_questions

Notes

  • This skill handles reward/curriculum ingestion. It does not directly solve HLE questions.
  • run_pipeline.js links ingestion, evolve, and solidify into one executable loop.
Data source: ClawHub · Chinese localization: 龙虾技能库 (Lobster Skill Library)