pdf-extract-skill
v0.0.10
OpenClaw PDF extraction skill using OpenDataLoader. Use when the user wants to extract and process PDF content for RAG, embeddings, or coordinate-based citations.
Skill: OpenClaw PDF Supercharger with OpenDataLoader
0) Modular Map (.md)
To improve maintainability and allow targeted calls to specific .md files, this skill relies on helper documents:
CLI quick start: docs/quickstart-cli.md
Security before install: docs/security-before-install.md
OpenClaw ready profiles: docs/profiles-OpenClaw.md
Hybrid + OCR: docs/hybrid-mode-ocr.md
RAG and bounding-box citations: docs/rag-citations.md
Troubleshooting: docs/troubleshooting.md
Usage rules:
If the task is setup/startup: load quickstart-cli.md
Before any installation: load security-before-install.md
If the task is command execution by scenario: load profiles-OpenClaw.md
If the task involves scanned or complex table PDFs: load hybrid-mode-ocr.md
If the task is RAG/citations: load rag-citations.md
If there are errors: load troubleshooting.md
1) Goal
This skill maximizes PDF reading quality for OpenClaw in ClawHub using OpenDataLoader PDF.
Pillars:
Local extraction (no cloud) for privacy.
High-quality reading order and structure (columns, tables, layout).
RAG and LLM-ready outputs (json + markdown).
Simple end-user flow (CLI, no MCP).
2) When to Use This Skill
Use this skill when the user needs to:
Extract clean text from PDFs.
Improve table and multi-column parsing.
Prepare data for RAG, embeddings, or coordinate-based citations.
Process scanned PDFs with OCR.
Describe images/charts to make them searchable.
Do not use this skill for:
OCR of standalone image files outside PDF workflows.
Cloud-only pipelines where local Java execution is not allowed.
3) Core Architecture Rule (No MCP)
Since no MCP exists yet, this skill must operate through the CLI only:
Client command: opendataloader-pdf
Hybrid backend command: opendataloader-pdf-hybrid
Do not create complex wrappers or intermediate services unless strictly needed.
4) Robust Prerequisites
Always verify before conversion:
Java 11+ in PATH.
Python 3.10+.
Package install policy:
Do not use unpinned installs in production.
Use isolated environments (venv/container/VM).
Prefer pinned versions and verified sources.
See: docs/security-before-install.md
Quick checks:
java -version
pip index versions opendataloader-pdf
pip show opendataloader-pdf
opendataloader-pdf --help
If Java fails on Windows, reopen the terminal and verify PATH.
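The Java 11+ requirement can also be checked mechanically. The sketch below parses the first line printed by `java -version` and extracts the major version; the function names `java_major` and `java_ok` are hypothetical helpers introduced here for illustration, not part of opendataloader-pdf.

```shell
#!/bin/sh
# Sketch: extract the Java major version from a `java -version` banner line.
# Handles both the legacy "1.8.0_292" scheme and the modern "17.0.2" scheme.
java_major() (
    # $1: a version line such as: openjdk version "17.0.2" 2022-01-18
    ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
    [ -n "$ver" ] || exit 1
    case "$ver" in
        1.*) ver=${ver#1.} ;;      # legacy scheme: 1.8.0_292 -> 8.0_292
    esac
    major=${ver%%[!0-9]*}          # keep only the leading digits
    [ -n "$major" ] || exit 1
    printf '%s\n' "$major"
)

# Hypothetical helper: succeeds when the detected major version is at least 11.
java_ok() (
    major=$(java_major "$1") || exit 1
    [ "$major" -ge 11 ]
)
```

A possible call is `java_ok "$(java -version 2>&1 | head -n 1)" || echo "Java 11+ not found"`. The `2>&1` is needed because `java -version` writes its banner to standard error.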
5) Standard OpenClaw Operating Flow
Step A: Classify user intent
General reading/summary -> markdown
RAG with metadata and citations -> json,markdown
Complex tables or scanned PDF -> hybrid docling-fast
Charts with image descriptions -> hybrid + hybrid-mode full + enrich-picture-description
Step B: Run in batches (required)
Always pass multiple files to a single invocation so the JVM startup cost is paid once rather than once per file.
Recommended example: opendataloader-pdf file1.pdf file2.pdf ./folder/ -o ./output -f json,markdown
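One way to keep every conversion in a single invocation is a thin function that forwards all inputs at once. This is a minimal sketch, not a required wrapper: `convert_batch` and the `ODL_BIN` override are hypothetical names introduced here, and `ODL_BIN` exists only so the command line can be previewed with `echo` before running it for real.

```shell
#!/bin/sh
# Sketch: forward all inputs to ONE opendataloader-pdf call (one JVM startup).
# ODL_BIN is a hypothetical override, e.g. ODL_BIN=echo for a dry run.
convert_batch() (
    out_dir=$1
    formats=$2
    shift 2
    # Refuse to run without at least one input file or folder.
    [ "$#" -gt 0 ] || { echo "no input files or folders given" >&2; exit 2; }
    "${ODL_BIN:-opendataloader-pdf}" "$@" -o "$out_dir" -f "$formats"
)
```

For example, `convert_batch ./output json,markdown file1.pdf file2.pdf ./folder/` runs the same command as the recommended example above, while a shell loop calling the CLI once per file would start the JVM repeatedly.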
Step C: Return a simple OpenClaw response format
Suggested response:
Status: ok or warning
Processed files
Output path
Generated formats
Suggested next action
Template: "Processing completed. N PDFs were converted to ./output with json,markdown format. If you want, I can now extract specific pages or enable OCR for scanned files."
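The suggested response fields can be emitted in a fixed order by a small formatter. The sketch below assumes a hypothetical function name `format_response` and a hard-coded next-action text taken from the template; the field labels follow the list above.

```shell
#!/bin/sh
# Sketch: print the five suggested response fields, one per line.
# format_response is a hypothetical helper, not part of opendataloader-pdf.
format_response() (
    state=$1      # "ok" or "warning"
    count=$2      # number of processed PDFs
    out_dir=$3    # output path
    formats=$4    # e.g. "json,markdown"
    printf 'Status: %s\n' "$state"
    printf 'Processed files: %s\n' "$count"
    printf 'Output path: %s\n' "$out_dir"
    printf 'Generated formats: %s\n' "$formats"
    printf 'Suggested next action: %s\n' \
        "extract specific pages or enable OCR for scanned files"
)
```

A call such as `format_response ok 3 ./output json,markdown` yields a five-line summary that can be shown to the user as is.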
6) Ready-to-Use CLI Profiles
Profile 1: Fast LLM reading
opendataloader-pdf ./pdfs/ -o ./output -f markdown
Profile 2: Recommended for RAG
opendataloader-pdf ./pdfs/ -o ./output -f json,markdown
Profile 3: Specific pages only
opendataloader-pdf report.pdf -o ./output -f json --pages "1,3,5-7"
Profile 4: Sensitive data sanitization
opendataloader-pdf report.pdf -o ./output -f markdown --sanitize
Profile 5: Preserve line breaks
opendataloader-pdf report.pdf -o ./output -f markdown --keep-line-breaks
Profile 6: Embedded or external images
opendataloader-pdf report.pdf -o ./output -f json --image-output external
opendataloader-pdf report.pdf -o ./output -f json --image-output embedded
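The profiles above can be collected in a small lookup so a scenario name resolves to its flags. This is a sketch under assumptions: the function `profile_flags` and the short profile names are hypothetical labels chosen here, while the flags themselves are copied from the profile commands. Profile 3 is left out because `--pages` needs a per-document argument.

```shell
#!/bin/sh
# Sketch: map a (hypothetical) profile name to the flags listed in section 6.
profile_flags() {
    case "$1" in
        fast-read)        printf '%s\n' "-f markdown" ;;
        rag)              printf '%s\n' "-f json,markdown" ;;
        sanitize)         printf '%s\n' "-f markdown --sanitize" ;;
        line-breaks)      printf '%s\n' "-f markdown --keep-line-breaks" ;;
        images-external)  printf '%s\n' "-f json --image-output external" ;;
        images-embedded)  printf '%s\n' "-f json --image-output embedded" ;;
        *)                echo "unknown profile: $1" >&2; return 1 ;;
    esac
}
```

Usage could look like `opendataloader-pdf ./pdfs/ -o ./output $(profile_flags rag)`, where the command substitution is deliberately unquoted so the flags are split into separate arguments.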
7) High-Precision Hybrid Mode
Use it when:
Tables are complex or borderless.
PDFs are scanned.
Multi-language OCR is required.
Image/chart descriptions are required.
7.1 Start backend
Standard: opendataloader-pdf-hybrid --port 5002
Forced OCR: opendataloader-pdf-hybrid --port 5002 --force-ocr
Multi-language OCR: opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "es,en"
With image descriptions: opendataloader-pdf-hybrid --port 5002 --enrich-picture-description
7.2 Use backend from client
Hybrid auto mode: opendataloader-pdf --hybrid docling-fast file1.pdf file2.pdf ./folder/ -o ./output -f json,markdown
With timeout and fallback: opendataloader-pdf --hybrid docling-fast --hybrid-timeout 120000 --hybrid-fallback file1.pdf ./folder/ -o ./output -f json
Image descriptions enabled (full required): opendataloader-pdf --hybrid docling-fast --hybrid-mode full file1.pdf ./folder/ -o ./output -f json,markdown
Critical note: If the backend is started with --enrich-picture-description, the client must use --hybrid-mode full, otherwise the descriptions are not included in the output.
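The critical note can be enforced by deriving the client flags from the flags the backend was started with. The sketch below assumes a hypothetical helper named `client_hybrid_flags`; it only inspects its arguments and does not contact the backend.

```shell
#!/bin/sh
# Sketch: choose client hybrid flags that match how the backend was started.
# Arguments: the flags passed to opendataloader-pdf-hybrid.
client_hybrid_flags() {
    case " $* " in
        *" --enrich-picture-description "*)
            # Descriptions only reach the output in full hybrid mode.
            printf '%s\n' "--hybrid docling-fast --hybrid-mode full" ;;
        *)
            printf '%s\n' "--hybrid docling-fast" ;;
    esac
}
```

For instance, `client_hybrid_flags --port 5002 --enrich-picture-description` prints the flags that include `--hybrid-mode full`, while a backend started with only `--port 5002 --force-ocr` resolves to plain `--hybrid docling-fast`.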
8) Key Robustness Parameters -f, --