⚖️ LLM Evaluator — LLM Tool
v1.0.0 [AI-assisted] LLM-as-a-Judge evaluation system using Langfuse. Score AI outputs on relevance, accuracy, hallucination, and helpfulness. Backfill scoring on historical traces.
Runtime Dependencies
Version
- Initial release of the llm-evaluator skill.
- Provides an LLM-as-a-Judge system for evaluating AI outputs on relevance, accuracy, hallucination, and helpfulness.
- Integrates with Langfuse and uses GPT-5-nano for efficient automated judging.
- Enables batch backfill scoring of historical traces and real-time evaluation of new outputs.
- Command-line interface for testing, scoring specific traces, and running backfills.
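The backfill behavior described above can be sketched as follows. This is a minimal, self-contained illustration, not the skill's actual implementation: the `Trace` record, the `judge` callable, and the in-memory trace list are all hypothetical stand-ins for data the real skill would fetch from Langfuse.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory trace record; the real skill reads traces from Langfuse.
@dataclass
class Trace:
    trace_id: str
    output: str
    scores: dict = field(default_factory=dict)

def backfill(traces, judge, evaluator="relevance", limit=20):
    """Score up to `limit` traces that do not yet have a score for `evaluator`."""
    scored = []
    for trace in traces:
        if len(scored) >= limit:
            break
        if evaluator in trace.scores:
            continue  # skip traces that were already scored
        trace.scores[evaluator] = judge(trace.output)
        scored.append(trace.trace_id)
    return scored
```

For example, with three traces of which one already carries a relevance score, `backfill(traces, judge, limit=2)` scores only the two unscored traces and returns their ids.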
Install Command
Skill Documentation
LLM-as-a-Judge evaluation system powered by Langfuse. Uses GPT-5-nano to score AI outputs.
When to Use
- Evaluating the quality of search results or AI responses
- Scoring traces for relevance, accuracy, and hallucination detection
- Batch-scoring recent unscored traces
- Quality assurance on agent outputs
Usage
```bash
# Test with sample cases
python3 {baseDir}/scripts/evaluator.py test

# Score a specific Langfuse trace
python3 {baseDir}/scripts/evaluator.py score

# Score with a specific evaluator only
python3 {baseDir}/scripts/evaluator.py score --evaluators relevance

# Backfill scores on recent unscored traces
python3 {baseDir}/scripts/evaluator.py backfill --limit 20
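A single LLM-as-a-Judge evaluation boils down to prompting a judge model with a rubric and parsing its verdict into a 0–1 score. The sketch below is an assumption-laden illustration of that pattern: the rubric text, the `call_model` callable, and the stubbed model reply are all hypothetical, not the prompts or API calls used by `evaluator.py`.

```python
import json

# Hypothetical rubric prompt; literal JSON braces are escaped for str.format.
JUDGE_PROMPT = (
    "Rate the response's relevance to the query from 0.0 to 1.0.\n"
    'Reply with JSON: {{"score": <float>, "reason": "<short explanation>"}}\n'
    "Query: {query}\nResponse: {response}"
)

def judge_relevance(query, response, call_model):
    """Ask a judge model for a verdict and clamp its score to the 0-1 scale."""
    raw = call_model(JUDGE_PROMPT.format(query=query, response=response))
    verdict = json.loads(raw)
    score = max(0.0, min(1.0, float(verdict["score"])))
    return score, verdict.get("reason", "")

# Stubbed model call so the sketch runs without API credentials.
stub = lambda prompt: '{"score": 0.9, "reason": "Directly answers the query."}'
score, reason = judge_relevance(
    "What is Langfuse?", "An LLM observability platform.", stub
)
```

In production the stub would be replaced by a real GPT-5-nano call, and the resulting score attached to the trace in Langfuse.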
Evaluators
| Evaluator | Measures | Scale |
|---|---|---|
| relevance | Response relevance to query | 0–1 |
| accuracy | Factual correctness | 0–1 |
| hallucination | Made-up information detection | 0–1 |
| helpfulness | Overall usefulness | 0–1 |
Credits
Built by M. Abidi | agxntsix.ai | YouTube | GitHub
Part of the AgxntSix Skill Suite for OpenClaw agents.
📅 Need help setting up OpenClaw for your business? Book a free consultation.