xmind-doc-parser: Skill Tool
v1.0.1 [Auto-translated] Parse documents in 18+ formats using the Baidu API: extract text, tables, and layout, OCR scanned images, and produce document chunks for RAG.
Version
- Skill renamed from "xmind-doc-parser" to "baidu-doc-parser" to accurately reflect its functionality.
- Now parses documents using the Baidu Document Parser API, supporting 18+ formats (PDF, Word, Excel, PowerPoint, images, and more).
- Offers comprehensive extraction: text, tables, layout analysis, OCR for scanned documents, and document chunking for RAG.
- Enhanced documentation: details on API parameters, environment setup, file/format/language support, error codes, and usage examples.
- Adds a command-line script for easy testing and reference links to official resources.
Install Command
npx clawhub@latest install xmind-doc-parser
Skill Documentation
Parse documents using Baidu Intelligent Document Analysis Platform API.
Overview
This skill provides document parsing capabilities through Baidu's Document Parser API, supporting:
- 18+ document formats (PDF, Word, Excel, PowerPoint, images, etc.)
- Text extraction
- Table recognition and extraction
- Layout analysis (titles, paragraphs, headers/footers, etc.)
- OCR for scanned documents
- Document chunking for RAG applications
- Multi-language support (Chinese, English, Japanese, Korean, French, German, etc.)
When to Use
Use this skill when users need to:
- Parse PDF, Word, Excel, or other document formats
- Extract text content from documents
- Recognize and extract tables
- Analyze document structure (titles, sections, layout)
- Process scanned documents with OCR
- Chunk documents for RAG applications
API Configuration
Environment Variables (Required)
Set these before using the skill:
export BAIDU_DOC_AI_API_KEY="your_api_key"
export BAIDU_DOC_AI_SECRET_KEY="your_secret_key"
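A caller can check that both variables are set before making any request. The following is a minimal sketch; the helper name `load_credentials` is hypothetical and not part of the skill's scripts.

```python
import os


def load_credentials(env=os.environ):
    """Return (api_key, secret_key), or raise if either variable is unset or empty.

    Hypothetical helper: only the two variable names come from the documentation.
    """
    names = ("BAIDU_DOC_AI_API_KEY", "BAIDU_DOC_AI_SECRET_KEY")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env[names[0]], env[names[1]]
```

Failing early with the names of the missing variables is usually easier to diagnose than an authentication error returned by the API.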
Authentication
The skill obtains an access token automatically via OAuth 2.0. Each token is valid for 30 days.
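Because a token lasts 30 days, it can be cached and refreshed only when it is close to expiry. The sketch below illustrates that idea. `TokenCache` and the injected `fetch_token` callable are hypothetical, and the actual token request is deliberately left to the caller, since how the skill's script performs it is not shown here.

```python
import time

TOKEN_LIFETIME_SECONDS = 30 * 24 * 60 * 60  # 30 days, as stated above


class TokenCache:
    """Cache an access token and refresh it shortly before it expires."""

    def __init__(self, fetch_token, clock=time.time, safety_margin=3600):
        self._fetch_token = fetch_token  # hypothetical callable returning a token string
        self._clock = clock
        self._margin = safety_margin     # refresh this many seconds early (assumed value)
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token = self._fetch_token()
            self._expires_at = now + TOKEN_LIFETIME_SECONDS
        return self._token
```

Injecting the clock and the fetch function keeps the expiry logic testable without network access.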
Supported Formats
Documents: pdf, doc, docx, xls, xlsx, ppt, pptx, wps, et, dps, csv, txt, html, mhtml, ofd
Images: jpg, jpeg, png, bmp, tiff, tif
Total: 18+ formats
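A file can be screened by extension before it is submitted. This is a minimal sketch built from the extension lists above; `classify_file` is a hypothetical helper, and the API itself may inspect file content rather than the name.

```python
from pathlib import Path

# Extension lists copied from the documentation above.
DOCUMENT_EXTS = {"pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx", "wps",
                 "et", "dps", "csv", "txt", "html", "mhtml", "ofd"}
IMAGE_EXTS = {"jpg", "jpeg", "png", "bmp", "tiff", "tif"}


def classify_file(file_name):
    """Return "document", "image", or None for an unsupported extension."""
    ext = Path(file_name).suffix.lower().lstrip(".")
    if ext in DOCUMENT_EXTS:
        return "document"
    if ext in IMAGE_EXTS:
        return "image"
    return None
```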
Supported Languages
Chinese, English, Japanese, Korean, French, German, Italian, Portuguese, Spanish, Russian, Dutch, Swedish, Finnish, Danish, Norwegian, Hungarian, Turkish, Polish, Czech, Greek, and more (20+ languages)
Usage
Basic Usage
python3 scripts/baidu_doc_parser.py --file_data <base64-encoded file content>
python3 scripts/baidu_doc_parser.py --file_url <file URL>
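The value passed to `--file_data` is the base64 encoding of the file's bytes. One way to produce it is sketched below with the standard library; `encode_file` is a hypothetical helper name.

```python
import base64
from pathlib import Path


def encode_file(path):
    """Read a file and return its contents as a base64 ASCII string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")
```

For large files the encoded string may exceed the shell's argument length limit, in which case `--file_url` is likely the more practical option.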
API Parameters
File Parameters (Required, choose one)
- file_url (string): Document URL (must be publicly accessible)
- file_data (string): Base64-encoded file data
- file_name (string, required): File name with extension
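The rule above (exactly one of `file_url` or `file_data`, plus a mandatory `file_name`) can be enforced before a request is sent. A minimal sketch follows; `build_file_params` is a hypothetical helper, not a function from the skill's scripts.

```python
def build_file_params(file_name, file_url=None, file_data=None):
    """Return the file-related request parameters, enforcing the rules above."""
    if not file_name:
        raise ValueError("file_name is required")
    # Exactly one of the two sources must be given.
    if (file_url is None) == (file_data is None):
        raise ValueError("provide exactly one of file_url or file_data")
    params = {"file_name": file_name}
    if file_url is not None:
        params["file_url"] = file_url
    else:
        params["file_data"] = file_data
    return params
```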
Core Function Parameters
- recognize_formula (bool): Recognize formulas in documents (default: false)
- analysis_chart (bool): Parse statistical charts (default: false)
- angle_adjust (bool): Auto-rotate images (default: false)
- parse_image_layout (bool): Return image position info (default: false)
Language and Format Parameters
- language_type (string): Recognition language (default: "CHN_ENG")
- switch_digital_width (string): Convert number width (default: "auto")
- html_table_format (bool): Return tables in HTML format (default: true)
Advanced Parameters
- version (string): API version (default: "v2")
- need_inner_image_data (bool): Include internal image data
- merge_tables (bool): Merge related tables
- relevel_titles (bool): Restructure title hierarchy
- recognize_seal (bool): Recognize document seals/stamps
- return_span_boxes (bool): Return span bounding boxes
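The documented defaults can be collected in one place and overridden per request. The sketch below assumes nothing beyond the parameter names and defaults listed above. `build_options` is a hypothetical helper, and advanced flags without a documented default are only included when the caller sets them.

```python
# Defaults exactly as documented above.
DOCUMENTED_DEFAULTS = {
    "recognize_formula": False,
    "analysis_chart": False,
    "angle_adjust": False,
    "parse_image_layout": False,
    "language_type": "CHN_ENG",
    "switch_digital_width": "auto",
    "html_table_format": True,
    "version": "v2",
}

# Advanced flags listed without a documented default; sent only if set.
OPTIONAL_FLAGS = {"need_inner_image_data", "merge_tables", "relevel_titles",
                  "recognize_seal", "return_span_boxes"}


def build_options(**overrides):
    """Merge caller overrides into the documented defaults, rejecting typos."""
    unknown = set(overrides) - set(DOCUMENTED_DEFAULTS) - OPTIONAL_FLAGS
    if unknown:
        raise ValueError("unknown parameters: " + ", ".join(sorted(unknown)))
    return {**DOCUMENTED_DEFAULTS, **overrides}
```

For example, `build_options(recognize_formula=True, merge_tables=True)` returns the defaults with those two values changed or added.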
Document Chunking Parameters
return_doc_chunks (dict): Document chunking configuration
- switch (bool): Enable chunking (default: false)
- split_type (string): Chunking method - "chunk" (by size) or "mark" (by punctuation)
- separators (list): Punctuation marks for splitting (default: ['。', ';', '!', '?', ';', '!', '?'])
- chunk_size (int): Chunk size in characters (default: -1 for auto)
Return Structure
Page Object
Each page contains:
- page_id: Page identifier
- page_num: Page number
- text: All text content on the page
- layouts: Layout elements (titles, paragraphs, tables, images, etc.)
- tables: Extracted tables
- images: Extracted images
Layout Types
- title: Title (with sub_type: title_1, title_2, title_3, etc.)
- para: Paragraph
- table: Table
- image: Image
- head_tail: Header/footer
- contents: Table of contents
- seal: Seal/stamp
- formula: Mathematical formula
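Layout elements on a page can be grouped by type for downstream processing. The sketch below assumes each layout element carries its type under a `type` key. That key name and the sample page are illustrative assumptions and should be checked against references/api_reference.md.

```python
from collections import defaultdict


def group_layouts_by_type(page):
    """Group a page's layout elements by their (assumed) "type" field."""
    groups = defaultdict(list)
    for layout in page.get("layouts", []):
        groups[layout.get("type", "unknown")].append(layout)
    return dict(groups)


# Hypothetical page object, shaped after the field list above.
sample_page = {
    "page_id": "page-1",
    "page_num": 1,
    "text": "Intro\nHello",
    "layouts": [
        {"type": "title", "sub_type": "title_1", "text": "Intro"},
        {"type": "para", "text": "Hello"},
    ],
    "tables": [],
    "images": [],
}

groups = group_layouts_by_type(sample_page)
print(sorted(groups))  # ['para', 'title']
```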
Table Object
- layout_id: Table identifier
- markdown: Table content in Markdown format
- position: Bounding box [x, y, width, height]
- cells: Cell information
- matrix: Cell index matrix (for merged cells)
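Many drawing and cropping libraries expect corner coordinates rather than a width and height. Converting the `position` box is a one-liner; `bbox_to_corners` is a hypothetical helper, and the units of the coordinates are whatever the API returns.

```python
def bbox_to_corners(position):
    """Convert [x, y, width, height] to (x_min, y_min, x_max, y_max)."""
    x, y, width, height = position
    return (x, y, x + width, y + height)
```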
Chunk Object
- chunk_id: Chunk identifier
- content: Chunk content
- type: Chunk type ("text" or "table")
- meta: Metadata (titles, position, page number)
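For RAG use, a caller typically enables chunking in the request and then keeps the chunk contents from the response. The sketch below builds a `return_doc_chunks` dictionary from the documented fields and filters chunk objects by type. Both helper names are hypothetical, and how the dictionary is serialized on the wire is not covered here.

```python
def chunk_config(split_type, chunk_size=-1, separators=None):
    """Build a return_doc_chunks dict with chunking switched on."""
    if split_type not in ("chunk", "mark"):
        raise ValueError('split_type must be "chunk" or "mark"')
    config = {"switch": True, "split_type": split_type, "chunk_size": chunk_size}
    if separators is not None:
        # Omitted by default so the service-side default separators apply.
        config["separators"] = list(separators)
    return config


def chunk_contents(chunks, chunk_type="text"):
    """Return the content of every chunk object of the given type."""
    return [chunk["content"] for chunk in chunks if chunk.get("type") == chunk_type]
```

With chunk objects shaped as described above, `chunk_contents(chunks)` yields the text passages to embed, and `chunk_contents(chunks, "table")` yields the table chunks.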
API Characteristics
Asynchronous Processing
Document parsing is asynchronous:
- Submit request → get task_id
- Poll for results using task_id
Polling Recommendations
- Start polling 5-10 seconds after submission
- Polling interval: 5 seconds
- Maximum polling time: 300 seconds
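These recommendations translate into a simple polling loop. The sketch below takes the query function as an argument, so it makes no assumption about the actual query endpoint. `poll_for_result` is hypothetical, and the convention that `query` returns `None` while the task is still running is an assumption of this example.

```python
import time


def poll_for_result(query, task_id, initial_delay=5.0, interval=5.0,
                    timeout=300.0, sleep=time.sleep):
    """Call query(task_id) until it returns a non-None result or the budget runs out.

    Elapsed time is approximated by the sum of the sleeps, not wall-clock time.
    """
    sleep(initial_delay)
    waited = initial_delay
    while True:
        result = query(task_id)
        if result is not None:
            return result
        if waited + interval > timeout:
            raise TimeoutError("task %s not finished after %.0f seconds" % (task_id, waited))
        sleep(interval)
        waited += interval
```

Passing `sleep` as a parameter lets tests run the loop instantly with a stub.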
QPS Limits
- Submit request API: 2 QPS
- Query result API: 10 QPS
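A client can stay under these limits by spacing its calls. The sketch below enforces a minimum interval between calls (0.5 s for 2 QPS, 0.1 s for 10 QPS). `MinIntervalLimiter` is a hypothetical, single-threaded helper and is only one of several ways to rate-limit.

```python
import time


class MinIntervalLimiter:
    """Block so that consecutive calls are at least 1/max_qps seconds apart."""

    def __init__(self, max_qps, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / max_qps
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")  # first call never waits

    def wait(self):
        now = self._clock()
        if now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._interval


submit_limiter = MinIntervalLimiter(2)   # submit request API: 2 QPS
query_limiter = MinIntervalLimiter(10)   # query result API: 10 QPS
```

Calling `submit_limiter.wait()` before each submission and `query_limiter.wait()` before each poll keeps a single process within the documented rates.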
File Limits
- File size:
- Page limit: Up to 2000 pages for PDF, 200 for others
- Formats: 18+ supported formats
Error Handling
Common error codes:
| Code | Message | Solution |
|---|---|---|
| 110/111 | Access token invalid/expired | Re-obtain access token |
| 216200 | Empty file or URL | Provide file_data or file_url |
| 216201 | File format error | Check file format |
| 216202 | File size error | Reduce file size |
| 282000 | Internal error | Retry or contact support |
| 282003 | Missing parameters | Check required parameters |
| 282007 | Task not exist | Check task_id |
| 282018 | Service busy | Reduce request frequency |
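The solutions in the table fall into three groups: refresh the token, retry later, or fix the request. A minimal sketch of that mapping follows. `classify_error` and the action labels are hypothetical, and codes not listed in the table are treated as non-retryable by assumption.

```python
TOKEN_ERRORS = {110, 111}          # access token invalid or expired
RETRYABLE_ERRORS = {282000, 282018}  # internal error, service busy


def classify_error(code):
    """Map an error code to "refresh_token", "retry", or "fail"."""
    if code in TOKEN_ERRORS:
        return "refresh_token"
    if code in RETRYABLE_ERRORS:
        return "retry"
    # 216200, 216201, 216202, 282003, 282007 and unknown codes need a fixed request.
    return "fail"
```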
The full error code list is in references/error_codes.md.
Scripts
The skill includes Python scripts for document parsing:
- scripts/baidu_doc_parser.py: Main client library
- Command-line interface for quick testing
References
- references/api_reference.md: Complete API documentation
- references/error_codes.md: Full error code reference