PDF Extractor Skill
v1.0.0. Extracts text and LaTeX formulas from academic PDFs in English and Chinese, outputting structured Markdown with math, tables, and images preserved.
Extract text and mathematical formulas from academic PDF papers. Supports both English and Chinese content.
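When to Use This Skill

Use this skill when:

- The user needs to extract text and LaTeX formulas from PDF papers
- The user mentions "PDF转文本" (PDF to text), "PDF提取公式" (extract formulas from PDF), or "论文OCR" (paper OCR)
- The user wants to convert academic papers to Markdown format

Tool Selection

| Tool | Best For | Languages | Math Quality |
| --- | --- | --- | --- |
| Marker (recommended) | Chinese and English papers, complex formulas | Chinese + English | Excellent |
| Nougat | English-only papers, arXiv | English only | Excellent |

Marker is recommended: it supports mixed Chinese and English text and recognizes formulas more reliably.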
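The selection table can be read as a small decision rule. The sketch below is only an illustration of that rule; `choose_tools` is a hypothetical helper and is not part of the skill's scripts.

```python
def choose_tools(languages):
    """Return candidate tools, preferred first, for the given document languages.

    `languages` is an iterable of language names such as ["Chinese", "English"].
    Hypothetical helper: it only encodes the selection table above.
    """
    normalized = {lang.strip().lower() for lang in languages}
    # Nougat is listed as English only, so it is a candidate only when
    # English is the sole language. Marker is the recommended default.
    if normalized == {"english"}:
        return ["marker", "nougat"]
    return ["marker"]
```

For a mixed Chinese and English paper this returns only Marker; for an English-only paper it returns both tools with Marker first.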
Environment Setup

Conda environment: pdf-extractor
Python path: D:\anaconda3\envs\pdf-extractor\python.exe

Key Dependencies

- PyTorch 2.10.0+cu128 (CUDA 12.8)
- marker-pdf (Surya OCR + Texify)
- nougat-ocr 0.1.17
- transformers

Important: Keep This Skill Self-Contained (No Extra Installs)
This skill is expected to run using ONLY the existing pdf-extractor conda environment and the scripts in scripts/.

Rules:

- Do NOT run pip install ... or conda install ..., and do not download random libraries during extraction.
- If a dependency is missing (e.g., Nougat crashes due to a missing torchvision), do NOT try to fix it by installing packages. Switch tools (prefer Marker) or report the environment issue.
- Slow runtime is normal for Marker (especially with --ark-code-latest). Prefer splitting the PDF over changing tools or adding dependencies.
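One way to follow the "report, do not install" rule is to check for importable modules before starting and surface any gap as a message. This is a minimal sketch using only the standard library; the helper names and the module names passed to them are assumptions, not part of the skill's scripts.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


def environment_report(module_names):
    """Describe missing modules without attempting to install anything."""
    missing = missing_modules(module_names)
    if not missing:
        return "OK"
    # Per the rules above: report the issue or switch tools, never install.
    return "Environment issue, missing: " + ", ".join(missing)
```

For example, `environment_report(["torchvision"])` could be called before choosing Nougat, falling back to Marker if it does not return "OK".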
Recommended approach for long PDFs:

- Use --page-range (0-based) to extract one page or a small batch of pages at a time.
- Merge the resulting Markdown files afterward (simple concatenation is fine).
- Keep the combined file in the same folder as the per-page outputs so image links remain valid.

Example (per-page extraction with LLM mode):

    D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --ark-code-latest --page-range "0" -o "out/page_01.md"
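The split-then-merge approach can be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, the page count is supplied by the caller, and the `page_NN.md` naming simply mirrors the example command (0-based page 0 becomes page_01.md).

```python
import subprocess
from pathlib import Path


def build_page_command(python_exe, script, pdf, page_index, out_dir):
    """Build the argument list for extracting one 0-based page."""
    out_file = Path(out_dir) / f"page_{page_index + 1:02d}.md"
    cmd = [str(python_exe), str(script), str(pdf),
           "--ark-code-latest", "--page-range", str(page_index),
           "-o", str(out_file)]
    return cmd, out_file


def merge_markdown(page_files, combined_path):
    """Concatenate per-page Markdown files in the given order."""
    parts = [Path(p).read_text(encoding="utf-8").rstrip() for p in page_files]
    combined = Path(combined_path)
    combined.write_text("\n\n".join(parts) + "\n", encoding="utf-8")
    return combined


def extract_pages(python_exe, script, pdf, page_count, out_dir):
    """Run the extraction once per page, then merge into out_dir/combined.md."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    outputs = []
    for index in range(page_count):
        cmd, out_file = build_page_command(python_exe, script, pdf, index, out_dir)
        subprocess.run(cmd, check=True)  # raises if the script exits non-zero
        outputs.append(out_file)
    # The combined file stays next to the per-page outputs so image links resolve.
    return merge_markdown(outputs, Path(out_dir) / "combined.md")
```

Writing the combined file into the same directory as the per-page files follows the note above about keeping image links valid.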
Tool 1: Marker (recommended, supports Chinese and English)

Command Line

    # Convert a Chinese paper (Chinese and English are supported by default)
    D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "论文.pdf"

    # Specify the output path
    D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" -o "output.md"

    # Force OCR (for scanned PDFs)
    D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "scanned.pdf" --force-ocr

    # Use the Volcengine Ark Coding Plan (OpenAI-compatible) to improve conversion quality
    # (more stable tables, formulas, and cross-page structure)
    # Note: this uses ark-code-latest by default; the backend routes to a suitable model automatically
    D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --ark-code-latest

    # Run only the first page as a quick check (0-based page index)
    D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --ark-code-latest --page-range "0" -o "out_first_page.md"

    # To customize (not recommended), you can also set --openai-base-url/--openai-api-key/--openai-model manually

    # Specify languages
    D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2md_marker.py "paper.pdf" --languages Chinese English Japanese
Python API

    import sys
    sys.path.insert(0, r'C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts')
    from pdf2md_marker import convert_pdf, convert_pdf_cli

    # Simple usage
    output_file = convert_pdf_cli('论文.pdf', 'output.md')

    # Full API
    markdown_text, metadata = convert_pdf(
        'paper.pdf',
        output_dir='./output',
        force_ocr=False,
        batch_multiplier=2,
        languages=['Chinese', 'English']
    )
    print(markdown_text)
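When several papers need converting, the simple two-argument call can be wrapped in a loop. The sketch below takes the converter as a parameter so it stays self-contained; `batch_convert` is a hypothetical helper, and the assumption that the converter accepts (input path, output path) is inferred only from the simple usage shown above.

```python
from pathlib import Path


def batch_convert(pdf_dir, out_dir, convert):
    """Convert every PDF in pdf_dir by calling convert(pdf_path, md_path).

    Returns a dict mapping each PDF file name to its output path, or to a
    "FAILED: ..." message if the converter raised.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        target = out_dir / (pdf.stem + ".md")
        try:
            convert(str(pdf), str(target))
            results[pdf.name] = str(target)
        except Exception as exc:
            # Keep going so one bad file does not stop the whole batch.
            results[pdf.name] = f"FAILED: {exc}"
    return results
```

With the skill's module on the path, a call might look like `batch_convert("papers", "out", convert_pdf_cli)`.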
Marker Options

| Option | Description |
| --- | --- |
| -o, --output | Output file (.md) or directory |
| --force-ocr | Force OCR even for text PDFs |
| --batch-multiplier | Batch size multiplier (default: 2) |
| --languages | Languages in document (default: Chinese English) |

Tool 2: Nougat (English-only papers)

Command Line

    # Convert entire PDF
    D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf"
    # Convert specific pages
    D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf" -p 0-5

    # Custom output
    D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf" -o output.mmd

    # Save each page separately
    D:\anaconda3\envs\pdf-extractor\python.exe C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts\pdf2latex.py "paper.pdf" --per-page
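The -p option accepts specs such as "0-5" or "1,3,5". How pdf2latex.py parses them is not shown here, so the sketch below is only one plausible reading: `parse_pages` is a hypothetical helper, and treating "0-5" as inclusive of both ends is an assumption.

```python
def parse_pages(spec):
    """Expand a page spec like "0-5" or "1,3,5" into a sorted list of page indices."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if end < start:
                raise ValueError(f"invalid range: {part!r}")
            # Assumption: ranges include both endpoints.
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```

Expanding the spec up front makes it easy to check how many pages a run will touch before starting a slow conversion.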
Python API

    import sys
    sys.path.insert(0, r'C:\Users\cr\.config\opencode\skills\pdf-extractor\scripts')
    from pdf2latex import load_model, process_pdf, save_results

    # Load model (uses GPU if available)
    model, device = load_model()

    # Process PDF
    results = process_pdf('paper.pdf', model, device)

    # Save as single markdown file
    save_results(results, 'output.mmd')

    # Or save per page
    save_results(results, 'output_pages/', format='pages')
Nougat Options

| Option | Description |
| --- | --- |
| -o, --output | Output file or directory |
| -p, --pages | Page range (e.g., "0-5" or "1,3,5") |
| -m, --model | Model tag (default: 0