pdf-extract-skill

v0.0.10

OpenClaw PDF extraction skill using OpenDataLoader. Use when the user wants to extract and process PDF content for RAG, embeddings, or coordinate-based citations.

by @secondport (Lucas Moyano)·MIT-0

Tags: Data analysis, Data visualization, File processing, CI/CD, DevOps

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

Runtime dependencies

No special dependencies

Install command

Official: npx clawhub@latest install pdf-extract-skill

Mirror: npx clawhub@latest install pdf-extract-skill --registry https://cn.longxiaskill.com

Skill documentation

Skill: OpenClaw PDF Supercharger with OpenDataLoader

0) Modular Map (.md)

To improve maintainability and allow targeted calls to specific .md files, this skill relies on helper documents:

CLI quickstart: docs/quickstart-cli.md
Security before install: docs/security-before-install.md
OpenClaw ready profiles: docs/profiles-OpenClaw.md
Hybrid + OCR: docs/hybrid-mode-ocr.md
RAG and bounding-box citations: docs/rag-citations.md
Troubleshooting: docs/troubleshooting.md

Usage rules:

If the task is setup/startup: load quickstart-cli.md
Before any installation: load security-before-install.md
If the task is command execution by scenario: load profiles-OpenClaw.md
If the task involves scanned or complex table PDFs: load hybrid-mode-ocr.md
If the task is RAG/citations: load rag-citations.md
If there are errors: load troubleshooting.md

1) Goal

This skill maximizes PDF reading quality for OpenClaw in ClawHub using OpenDataLoader PDF.

Pillars:

Local extraction (no cloud) for privacy.
High-quality reading order and structure (columns, tables, layout).
RAG and LLM-ready outputs (json + markdown).
Simple end-user flow (CLI, no MCP).

2) When to Use This Skill

Use this skill when the user needs to:

Extract clean text from PDFs.
Improve table and multi-column parsing.
Prepare data for RAG, embeddings, or coordinate-based citations.
Process scanned PDFs with OCR.
Describe images/charts to make them searchable.

Do not use this skill for:

OCR of standalone image files outside PDF workflows.
Cloud-only pipelines where local Java execution is not allowed.

3) Core Architecture Rule (No MCP)

Since an MCP integration does not exist yet, this skill must operate through the CLI only:

Client command: opendataloader-pdf
Hybrid backend command: opendataloader-pdf-hybrid

Do not create complex wrappers or intermediate services unless strictly needed.

4) Robust Prerequisites

Always verify before conversion:

Java 11+ in PATH.
Python 3.10+.
Package install policy:
  Do not use unpinned installs in production.
  Use isolated environments (venv/container/VM).
  Prefer pinned versions and verified sources.
  See: docs/security-before-install.md

Quick checks:

java -version
pip index versions opendataloader-pdf
pip show opendataloader-pdf
opendataloader-pdf --help

If Java fails on Windows, reopen the terminal and verify PATH.
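The quick checks can also be scripted. The following is a minimal sketch, assuming a POSIX shell and that the interpreter is exposed as python3 (on Windows it is often python). The helper names have_cmd, java_major, and preflight are hypothetical and are not part of OpenDataLoader.

```shell
# Succeeds when a command is reachable on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Extracts the major version from a Java version string.
# Handles the legacy "1.8.0_281" scheme as well as "11.0.20" or "17".
java_major() {
  case "$1" in
    1.*) jm_rest=${1#1.}; echo "${jm_rest%%.*}" ;;
    *)   echo "${1%%.*}" ;;
  esac
}

# Reports every missing tool instead of stopping at the first one.
preflight() {
  pf_status=0
  for pf_tool in java python3 opendataloader-pdf; do
    if ! have_cmd "$pf_tool"; then
      echo "missing: $pf_tool" >&2
      pf_status=1
    fi
  done
  return "$pf_status"
}

# Example use (java -version writes to stderr, hence 2>&1):
#   ver=$(java -version 2>&1 | awk -F'"' 'NR==1 {print $2}')
#   [ "$(java_major "$ver")" -ge 11 ] || echo "Java 11+ required" >&2
```

Keeping the version parsing separate from the tool lookup makes it possible to check the Java 11+ requirement against any version string, not only the one installed locally.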

5) Standard OpenClaw Operating Flow

Step A: Classify user intent

General reading/summary -> markdown
RAG with metadata and citations -> json,markdown
Complex tables or scanned PDF -> hybrid docling-fast
Charts with image descriptions -> hybrid + hybrid-mode full + enrich-picture-description

Step B: Run in batches (required)

Always process multiple files in a single invocation to avoid JVM startup overhead per call.

Recommended example: opendataloader-pdf file1.pdf file2.pdf ./folder/ -o ./output -f json,markdown
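Steps A and B can be combined in a few lines of shell. This is a sketch under stated assumptions: the intent labels (summary, rag, tables, charts) and the function names are hypothetical, the output format used for the hybrid cases is borrowed from the hybrid examples in section 7.2, and ODL_BIN is an invented override that allows a dry run (for example ODL_BIN=echo).

```shell
# Hybrid-related flags for a Step A intent (empty for the non-hybrid cases).
mode_for_intent() {
  case "$1" in
    summary|rag) echo "" ;;
    tables)      echo "--hybrid docling-fast" ;;
    charts)      echo "--hybrid docling-fast --hybrid-mode full" ;;
    *)           echo "unknown intent: $1" >&2; return 1 ;;
  esac
}

# Output format for a Step A intent.
format_for_intent() {
  case "$1" in
    summary) echo "markdown" ;;
    *)       echo "json,markdown" ;;
  esac
}

# Runs ONE invocation for all inputs so the JVM starts only once.
# Usage: convert_batch <intent> <output-dir> <file-or-folder>...
convert_batch() {
  cb_intent=$1; cb_out=$2; shift 2
  cb_mode=$(mode_for_intent "$cb_intent") || return 1
  cb_fmt=$(format_for_intent "$cb_intent")
  # $cb_mode is left unquoted on purpose: it may hold several flags or none.
  "${ODL_BIN:-opendataloader-pdf}" $cb_mode "$@" -o "$cb_out" -f "$cb_fmt"
}

# Anti-pattern that this replaces (one JVM start per file):
#   for f in ./pdfs/*.pdf; do opendataloader-pdf "$f" -o ./output -f json; done
```

For the charts intent, the backend still has to be started with --enrich-picture-description; the sketch only builds the client side of the call.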

Step C: Return a simple OpenClaw response format

Suggested response:

Status: ok or warning
Processed files
Output path
Generated formats
Suggested next action

Template: "Processing completed. N PDFs were converted to ./output with json,markdown format. If you want, I can now extract specific pages or enable OCR for scanned files."
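A small formatter keeps the Step C response consistent across runs. This is only an illustrative sketch; format_response is a hypothetical helper, and the wording of the next-action line is taken from the template above.

```shell
# Prints the five suggested response fields, one per line.
# Usage: format_response <status> <file-count> <output-dir> <formats>
format_response() {
  fr_status=$1; fr_count=$2; fr_out=$3; fr_formats=$4
  printf 'Status: %s\n' "$fr_status"
  printf 'Processed files: %s\n' "$fr_count"
  printf 'Output path: %s\n' "$fr_out"
  printf 'Generated formats: %s\n' "$fr_formats"
  printf 'Suggested next action: %s\n' \
    "extract specific pages or enable OCR for scanned files"
}

# Example:
#   format_response ok 3 ./output json,markdown
```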

6) Ready-to-Use CLI Profiles

Profile 1: Fast LLM reading

opendataloader-pdf ./pdfs/ -o ./output -f markdown

Profile 2: Recommended for RAG

opendataloader-pdf ./pdfs/ -o ./output -f json,markdown

Profile 3: Specific pages only

opendataloader-pdf report.pdf -o ./output -f json --pages "1,3,5-7"

Profile 4: Sensitive data sanitization

opendataloader-pdf report.pdf -o ./output -f markdown --sanitize

Profile 5: Preserve line breaks

opendataloader-pdf report.pdf -o ./output -f markdown --keep-line-breaks

Profile 6: Embedded or external images

opendataloader-pdf report.pdf -o ./output -f json --image-output external
opendataloader-pdf report.pdf -o ./output -f json --image-output embedded
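When profiles are selected by number, a lookup function avoids retyping the flags. The sketch below is an assumption-laden illustration: profile_flags is a hypothetical helper, and defaulting profile 6 to external images is a choice made here, not something the skill prescribes.

```shell
# Prints the flags for profiles 1 to 6 as listed above.
# Usage: profile_flags <number> [page-spec | external | embedded]
profile_flags() {
  pf_arg=${2:-}
  case "$1" in
    1) echo "-f markdown" ;;
    2) echo "-f json,markdown" ;;
    3) # A page specification such as 1,3,5-7 is required.
       [ -n "$pf_arg" ] || { echo "profile 3 needs a page spec" >&2; return 1; }
       echo "-f json --pages $pf_arg" ;;
    4) echo "-f markdown --sanitize" ;;
    5) echo "-f markdown --keep-line-breaks" ;;
    6) case "${pf_arg:-external}" in
         external|embedded) echo "-f json --image-output ${pf_arg:-external}" ;;
         *) echo "image output must be external or embedded" >&2; return 1 ;;
       esac ;;
    *) echo "unknown profile: $1" >&2; return 1 ;;
  esac
}

# Example:
#   opendataloader-pdf report.pdf -o ./output $(profile_flags 3 1,3,5-7)
```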

7) High-Precision Hybrid Mode

Use it when:

Tables are complex or borderless.
PDFs are scanned.
Multi-language OCR is required.
Image/chart descriptions are required.

7.1 Start backend

Standard: opendataloader-pdf-hybrid --port 5002

Forced OCR: opendataloader-pdf-hybrid --port 5002 --force-ocr

Multi-language OCR: opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "es,en"

With image descriptions: opendataloader-pdf-hybrid --port 5002 --enrich-picture-description

7.2 Use backend from client

Hybrid auto mode: opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf ./folder/ -o ./output -f json,markdown

With timeout and fallback: opendataloader-pdf --hybrid docling-fast --hybrid-timeout 120000 --hybrid-fallback file1.pdf ./folder/ -o ./output -f json

Image descriptions enabled (full required): opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf ./folder/ -o ./output -f json,markdown

Critical note: If the backend starts with --enrich-picture-description, the client must use --hybrid-mode full to include descriptions in the output.
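The pairing rule in the critical note is easy to get wrong when backend and client are started separately. One way to keep them in sync, sketched here with a hypothetical helper named client_flags_for_backend, is to derive the client flags from the backend flags.

```shell
# Prints the client-side hybrid flags that match the given backend flags.
# If the backend enables picture descriptions, the client needs --hybrid-mode full.
client_flags_for_backend() {
  case " $* " in
    *" --enrich-picture-description "*)
      echo "--hybrid docling-fast --hybrid-mode full" ;;
    *)
      echo "--hybrid docling-fast" ;;
  esac
}

# Example (commented out because it starts a long-running backend):
#   backend_flags="--port 5002 --enrich-picture-description"
#   opendataloader-pdf-hybrid $backend_flags &
#   opendataloader-pdf $(client_flags_for_backend $backend_flags) \
#     file1.pdf ./folder/ -o ./output -f json,markdown
```

Deriving one side from the other means a change to the backend flags cannot silently drop the descriptions from the client output.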

8) Key Robustness Parameters -f, --
