PDF Extractor Skill

v1.0.0

Extracts text and LaTeX formulas from academic PDFs in English and Chinese, outputting structured Markdown with math, tables, and images preserved.

by @a851445115 (Rui Chen)·MIT-0

Categories: document tools, data and APIs, databases, file processing, image processing

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

Runtime dependencies

No special dependencies

Install command

Official: npx clawhub@latest install pdf-extractor-skill

Mirror: npx clawhub@latest install pdf-extractor-skill --registry https://cn.longxiaskill.com

Skill documentation

PDF Extractor Skill

Extract text and mathematical formulas from academic PDF papers. Supports both English and Chinese content.

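When to Use This Skill

Use this skill when:

- The user needs to extract text and LaTeX formulas from PDF papers
- The user mentions "PDF转文本" (PDF to text), "PDF提取公式" (extract formulas from PDF), or "论文OCR" (paper OCR)
- The user wants to convert academic papers to Markdown format

Tool Selection

| Tool | Best For | Languages | Math Quality |
|---|---|---|---|
| Marker (recommended) | Chinese and English papers, complex formulas | Chinese + English | Excellent |
| Nougat | English-only papers, arXiv | English only | Excellent |

Marker is recommended: it supports mixed Chinese and English text and recognizes formulas better.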
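The tool selection table above can be read as a simple rule: Marker for anything containing Chinese, Nougat only for English-only papers. The sketch below illustrates that rule on a short text sample such as a paper title. The function names `contains_cjk` and `choose_tool` are hypothetical and not part of this skill's scripts, and the check covers only the basic CJK Unified Ideographs block.

```python
def contains_cjk(text: str) -> bool:
    """Return True if the text has at least one basic CJK Unified Ideograph."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in text)


def choose_tool(sample_text: str) -> str:
    """Pick a tool name from a text sample (hypothetical helper).

    Chinese or mixed Chinese/English text maps to "marker";
    English-only text maps to "nougat", per the tool selection table.
    """
    return "marker" if contains_cjk(sample_text) else "nougat"


if __name__ == "__main__":
    print(choose_tool("基于深度学习的 PDF 公式识别"))
    print(choose_tool("Attention Is All You Need"))
```

Since Marker is the recommended default, defaulting to it whenever the language is uncertain is probably the safer choice.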

Environment Setup

Conda environment: pdf-extractor
Python path: D:\anaconda3\envs\pdf-extractor\python.exe

Key Dependencies

- PyTorch 2.10.0+cu128 (CUDA 12.8)
- marker-pdf (Surya OCR + Texify)
- nougat-ocr 0.1.17
- transformers

Important: Keep This Skill Self-Contained (No Extra Installs)

This skill is expected to run using ONLY the existing pdf-extractor conda environment and the scripts in scripts/.

Rules:

- Do NOT run pip install ... / conda install ... or download random libraries during extraction.
- If a dependency is missing (e.g., Nougat crashes due to missing torchvision), do NOT try to fix it by installing packages. Switch tools (prefer Marker) or report the environment issue.
- Slow runtime is normal for Marker (especially with --ark-code-latest). Prefer splitting the PDF rather than changing tools or adding dependencies.

Recommended approach for long PDFs:

- Use --page-range (0-based) to extract per page or in small page batches.
- Merge the resulting Markdown files afterward (simple concatenation is fine).
- Keep the combined file in the same folder as the per-page outputs so image links remain valid.

Example (per-page extraction with LLM mode):

D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --ark-code-latest --page-range "0" -o "out/page_01.md"
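The merge step of the long-PDF approach can be sketched with the standard library only, so nothing extra needs to be installed. The helper name `merge_pages` and the output name `combined.md` are assumptions for illustration. It expects zero-padded per-page files such as `page_01.md`, so that a plain sort gives page order, and it writes the combined file into the same folder so relative image links keep working.

```python
from pathlib import Path


def merge_pages(page_dir: str, combined_name: str = "combined.md") -> Path:
    """Concatenate page_*.md files in name order into one Markdown file.

    Hypothetical helper: assumes zero-padded names (page_01.md, page_02.md, ...)
    so lexicographic order equals page order.
    """
    folder = Path(page_dir)
    pages = sorted(folder.glob("page_*.md"))
    # Strip trailing whitespace so pages are separated by exactly one blank line.
    parts = [p.read_text(encoding="utf-8").rstrip() for p in pages]
    combined = folder / combined_name
    combined.write_text("\n\n".join(parts) + "\n", encoding="utf-8")
    return combined


if __name__ == "__main__":
    print(merge_pages("out"))
```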

Tool 1: Marker (recommended, supports Chinese and English)

Command Line

# Convert a Chinese paper (Chinese and English are supported by default)
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "论文.pdf"

# Specify the output path
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" -o "output.md"

# Force OCR (for scanned PDFs)
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "scanned.pdf" --force-ocr

# Use the Volcengine Ark Coding Plan (OpenAI-compatible) to improve conversion quality (more stable tables, formulas, and cross-page structure)
# Note: this uses ark-code-latest by default; the backend routes automatically to a suitable model
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --ark-code-latest

# Run only the first page as a quick check (0-based page index)
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --ark-code-latest --page-range "0" -o "out_first_page.md"

# To customize (not recommended), you can also set --openai-base-url/--openai-api-key/--openai-model manually

# Specify languages
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --languages Chinese English Japanese
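For per-page runs, the commands above differ only in the page index and output name, so they can be generated rather than typed. The following is a minimal sketch that builds argument lists without running anything. `build_page_commands` is a hypothetical helper, the page count is assumed to be known in advance, and only the flags shown in this document (`--page-range`, `-o`, `--ark-code-latest`) are used.

```python
from typing import List


def build_page_commands(
    python_exe: str,
    script: str,
    pdf: str,
    page_count: int,
    out_dir: str = "out",
    use_llm: bool = False,
) -> List[List[str]]:
    """Build one pdf2md_marker.py argument list per page (hypothetical helper).

    Page indices are 0-based; output names are 1-based and zero-padded,
    matching the page_01.md example for --page-range "0".
    """
    commands = []
    for page in range(page_count):
        cmd = [
            python_exe,
            script,
            pdf,
            "--page-range",
            str(page),
            "-o",
            f"{out_dir}/page_{page + 1:02d}.md",
        ]
        if use_llm:
            cmd.append("--ark-code-latest")
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    for cmd in build_page_commands("python", "pdf2md_marker.py", "paper.pdf", 3):
        print(" ".join(cmd))
```

Each list could then be passed to `subprocess.run(cmd, check=True)`, with the interpreter and script paths from the environment setup substituted in.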

Python API

import sys
sys.path.insert(0, r'C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts')
from pdf2md_marker import convert_pdf, convert_pdf_cli

# Simple usage
output_file = convert_pdf_cli('论文.pdf', 'output.md')

# Full API
markdown_text, metadata = convert_pdf(
    'paper.pdf',
    output_dir='./output',
    force_ocr=False,
    batch_multiplier=2,
    languages=['Chinese', 'English']
)
print(markdown_text)

Marker Options

| Option | Description |
|---|---|
| -o, --output | Output file (.md) or directory |
| --force-ocr | Force OCR even for text PDFs |
| --batch-multiplier | Batch size multiplier (default: 2) |
| --languages | Languages in document (default: Chinese English) |

Tool 2: Nougat (English-only papers)

Command Line

# Convert entire PDF
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf"

# Convert specific pages
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf" -p 0-5

# Custom output
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf" -o output.mmd

# Save each page separately
D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf" --per-page
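The page specifications accepted by -p ("0-5" above, and comma lists such as "1,3,5") can be expanded into explicit page indices when planning batches. This is a sketch of one plausible reading, not the parser used by pdf2latex.py: `parse_pages` is a hypothetical name, and treating ranges as inclusive at both ends is an assumption the source does not confirm.

```python
from typing import List


def parse_pages(spec: str) -> List[int]:
    """Expand a page spec like "0-5" or "1,3,5" into a sorted list of indices.

    Assumption: ranges are inclusive at both ends ("0-5" covers six pages).
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


if __name__ == "__main__":
    print(parse_pages("0-5"))
    print(parse_pages("1,3,5"))
```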

Python API

import sys
sys.path.insert(0, r'C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts')
from pdf2latex import load_model, process_pdf, save_results

# Load model (uses GPU if available)
model, device = load_model()

# Process PDF
results = process_pdf('paper.pdf', model, device)

# Save as single markdown file
save_results(results, 'output.mmd')

# Or save per page
save_results(results, 'output_pages/', format='pages')

Nougat Options

| Option | Description |
|---|---|
| -o, --output | Output file or directory |
| -p, --pages | Page range (e.g., "0-5" or "1,3,5") |
| -m, --model | Model tag (default: 0
