Skylv Prompt Evaluation

v1.0.0

Evaluate and benchmark AI prompts for quality, consistency, and performance. Triggers: prompt evaluation, prompt testing, prompt quality, prompt benchmark, p...


by @sky-lv

Data and APIs, AI model access

Use case: use Skylv Prompt Evaluation for data and API tasks.


Runtime dependencies

No special dependencies

Install command


Official: npx clawhub@latest install skylv-prompt-evaluation

Mirror (available): npx clawhub@latest install skylv-prompt-evaluation --registry https://cn.longxiaskill.com

Localization notes

Installation notes for Skylv Prompt Evaluation: the install command is openclaw skills install skylv-prompt-evaluation. A mirror for mainland China is supported; pass --registry https://cn.longxiaskill.com to speed up the download.


Skill documentation

Prompt Evaluation

Evaluate and benchmark AI prompts for quality, consistency, and performance. Score, compare, and optimize your prompts systematically.

Overview

A prompt evaluation framework that helps agents measure prompt quality across multiple dimensions: clarity, specificity, robustness, cost-efficiency, and output consistency. Compare prompt variants and find the optimal version.

Capabilities

Quality Scoring

node evaluate.js score --prompt "Summarize the article" --dimensions clarity,specificity,robustness
node evaluate.js score --prompt-file ./prompts/ --output scores.json

Scores prompts on clarity (0-10), specificity (0-10), robustness (0-10), and cost-efficiency (0-10).

A/B Comparison

node evaluate.js compare --prompt-a "Summarize" --prompt-b "Write a 3-bullet summary" --trials 50
node evaluate.js compare --config ab-test-config.json

Runs statistical A/B tests between prompt variants with significance analysis.
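The listing does not say which statistical test evaluate.js applies. As one possible illustration, assuming each trial is scored pass/fail, a two-proportion z-test could compare the two variants at the 0.05 significance level that appears in the configuration. The function names and the sample counts below are hypothetical and are not taken from the skill.

```javascript
// Hypothetical sketch: two-proportion z-test for an A/B prompt comparison.
// Not taken from evaluate.js; assumes each trial is scored pass/fail.

// Error function via the Abramowitz-Stegun approximation (formula 7.1.26).
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Standard normal cumulative distribution function.
function normalCdf(z) {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Two-sided test of whether the pass rates of variants A and B differ.
function twoProportionZTest(successA, trialsA, successB, trialsB, alpha = 0.05) {
  const pA = successA / trialsA;
  const pB = successB / trialsB;
  const pooled = (successA + successB) / (trialsA + trialsB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / trialsA + 1 / trialsB));
  if (se === 0) {
    // All trials passed or all failed: the rates cannot be distinguished.
    return { z: 0, pValue: 1, significant: false };
  }
  const z = (pB - pA) / se;
  const pValue = 2 * (1 - normalCdf(Math.abs(z)));
  return { z, pValue, significant: pValue < alpha };
}

// Made-up counts: variant A passes 30 of 50 trials, variant B passes 42 of 50.
const abResult = twoProportionZTest(30, 50, 42, 50);
console.log(abResult);
```

A z-test is only a reasonable choice when outcomes are binary and trial counts are moderately large; graded scores would call for a different test.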

Consistency Check

node evaluate.js consistency --prompt "Translate to French" --runs 100 --variance-threshold 0.15
node evaluate.js consistency --temperature 0.7 --top-p 0.9

Measures output consistency across multiple runs to find the most stable prompts.
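How the tool computes variance is not documented here. A minimal sketch of one way to quantify it, assuming variation is measured as the mean pairwise Jaccard distance between the word sets of the outputs and compared against the 0.15 threshold, might look like this. The helper names and sample outputs are hypothetical.

```javascript
// Hypothetical sketch: output consistency as mean pairwise Jaccard distance.
// The metric is an assumption; the skill's actual variance measure is not documented.

// Lowercased set of whitespace-separated words.
function tokenSet(text) {
  return new Set(text.toLowerCase().split(/\s+/).filter(Boolean));
}

// 0 means identical word sets, 1 means no words in common.
function jaccardDistance(a, b) {
  const setA = tokenSet(a);
  const setB = tokenSet(b);
  let shared = 0;
  for (const tok of setA) {
    if (setB.has(tok)) shared++;
  }
  const union = setA.size + setB.size - shared;
  return union === 0 ? 0 : 1 - shared / union;
}

// Averages the distance over every pair of outputs and applies the threshold.
function consistencyReport(outputs, threshold = 0.15) {
  let total = 0;
  let pairs = 0;
  for (let i = 0; i < outputs.length; i++) {
    for (let j = i + 1; j < outputs.length; j++) {
      total += jaccardDistance(outputs[i], outputs[j]);
      pairs++;
    }
  }
  const meanDistance = pairs === 0 ? 0 : total / pairs;
  return { meanDistance, stable: meanDistance <= threshold };
}

// Made-up outputs from three runs of the same prompt.
const sampleOutputs = ["Bonjour le monde", "Bonjour le monde", "Salut le monde"];
const consistency = consistencyReport(sampleOutputs);
console.log(consistency);
```

Word-set overlap ignores word order and meaning, so an embedding-based similarity may be a better fit for free-form outputs.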

Regression Testing

node evaluate.js regression --baseline v1.0 --current v1.1 --test-suite golden-set.jsonl
node evaluate.js regression --fail-on-degradation 5%

Detects quality regressions between prompt versions using golden test sets.
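The degradation check can be sketched as follows, assuming each golden-set case already has a numeric quality score for both prompt versions and that degradation means the relative drop in the mean score. Both assumptions, along with the function names and sample scores, are illustrative rather than documented behavior.

```javascript
// Hypothetical sketch: fail a regression run when the mean score drops too far.
// Assumes per-case numeric scores; not taken from evaluate.js.

// Parses a threshold such as "5%" into the number 5.
function parsePercent(value) {
  const n = Number.parseFloat(String(value));
  if (!Number.isFinite(n)) {
    throw new Error(`Invalid percentage: ${value}`);
  }
  return n;
}

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Relative drop of the current mean versus the baseline mean, in percent.
function checkRegression(baselineScores, currentScores, threshold = "5%") {
  const base = mean(baselineScores);
  const current = mean(currentScores);
  const degradationPct = base === 0 ? 0 : ((base - current) / base) * 100;
  return { degradationPct, failed: degradationPct > parsePercent(threshold) };
}

// Made-up scores for four golden-set cases under each prompt version.
const regression = checkRegression([8, 9, 7, 8], [7, 8, 7, 7]);
console.log(regression);
```

In a CI job, a failed result would typically translate into a non-zero exit code so the pipeline stops on degradation.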

Cost Analysis

node evaluate.js cost --prompt "Long prompt..." --model gpt-4 --estimate-tokens
node evaluate.js cost --compare-prompts --output cost-report.csv

Estimates token usage and costs for different prompt variants and models.
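As a rough illustration of the idea, the sketch below estimates tokens with a four-characters-per-token heuristic and multiplies by a per-1,000-token price. The heuristic is only a coarse approximation for English text, the price is a placeholder rather than a real model rate, and none of the names come from the skill itself.

```javascript
// Hypothetical sketch: heuristic token and cost estimate for prompt variants.
// A real tokenizer would give exact counts; this is an approximation only.

const CHARS_PER_TOKEN = 4; // assumption: rough average for English text
const PLACEHOLDER_PRICE_PER_1K = 0.01; // assumption: not a real model price

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

function estimateCost(text, pricePer1kTokens) {
  const tokens = estimateTokens(text);
  return { tokens, cost: (tokens / 1000) * pricePer1kTokens };
}

// Compare the two prompt variants used in the A/B example.
const briefEstimate = estimateCost("Summarize", PLACEHOLDER_PRICE_PER_1K);
const detailedEstimate = estimateCost("Write a 3-bullet summary", PLACEHOLDER_PRICE_PER_1K);
console.log(briefEstimate, detailedEstimate);
```

This counts only the prompt; total cost per call also depends on the length of the model's response.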

Configuration

{
  "evaluation": {
    "dimensions": ["clarity", "specificity", "robustness", "cost"],
    "scoringModel": "gpt-4",
    "abTest": { "trials": 50, "significanceLevel": 0.05 },
    "consistency": { "runs": 100, "varianceThreshold": 0.15 },
    "regression": { "degradationThreshold": "5%", "goldenSet": "./golden-set.jsonl" }
  }
}

Use Cases

Prompt Engineering: Systematically improve prompt quality
Quality Assurance: Ensure prompts meet quality standards before production
Cost Optimization: Find prompts that achieve goals with fewer tokens
Version Control: Track prompt quality across versions
Agent Tuning: Optimize agent system prompts for consistency
