Docx Toolkit
v1.0.0
Extract text, tables, and images from .docx and legacy .doc files. Handles large documents, CJK text, and complex table structures. Includes deduplication and filtering for extracted images.
DOCX Toolkit
A complete toolkit for processing Microsoft Word documents (.docx and legacy .doc formats).
Capabilities
- Text + Table Extraction (.docx)
Extracts all paragraphs and tables with structure preserved. Tables are formatted as pipe-delimited rows for easy parsing.
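
As a rough illustration of the paragraph and table extraction described above, here is a minimal sketch using python-docx. The function names `rows_to_pipe` and `extract_docx` are hypothetical and not taken from the toolkit's scripts. The sketch assumes that collecting paragraphs and tables as two separate lists is acceptable, so their relative order in the document is not preserved, and merged cells may appear more than once in a row.

```python
from typing import Iterable, List


def rows_to_pipe(rows: Iterable[Iterable[str]]) -> List[str]:
    """Format table rows as pipe-delimited lines, e.g. 'a | b | c'."""
    lines = []
    for row in rows:
        # Collapse internal whitespace and newlines so each row stays on one line.
        cells = [" ".join(str(cell).split()) for cell in row]
        lines.append(" | ".join(cells))
    return lines


def extract_docx(path: str) -> dict:
    """Return non-empty paragraphs and pipe-delimited tables from a .docx file."""
    from docx import Document  # python-docx, imported lazily

    doc = Document(path)
    paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]
    tables = [
        rows_to_pipe([cell.text for cell in row.cells] for row in table.rows)
        for table in doc.tables
    ]
    return {"paragraphs": paragraphs, "tables": tables}
```

Calling `extract_docx("report.docx")` (a placeholder file name) would return a dictionary with a `paragraphs` list and a `tables` list, one list of pipe-delimited lines per table.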
- Text Extraction (Legacy .doc)
Handles the legacy OLE2 .doc format using olefile. Extracts Unicode text from the WordDocument stream.
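
The following is a heuristic sketch of reading the WordDocument stream with olefile, not the toolkit's actual implementation. It assumes the text is stored as UTF-16LE and simply keeps runs of four or more non-control characters. It does not parse the Word file header or piece table, so binary data that happens to decode as characters can leak into the output. The names `utf16_text_runs` and `extract_doc_text` are hypothetical.

```python
import re

# A "text run" is 4+ characters that are not C0 controls (tab, LF, CR allowed)
# and not the replacement character produced by a failed decode.
_TEXT_RUN = re.compile(r"[^\x00-\x08\x0b\x0c\x0e-\x1f\x7f\ufffd]{4,}")


def utf16_text_runs(data: bytes) -> list:
    """Heuristically pull readable runs out of UTF-16LE bytes."""
    if len(data) % 2:
        data = data[:-1]  # UTF-16 needs an even number of bytes
    decoded = data.decode("utf-16-le", errors="replace")
    runs = []
    for match in _TEXT_RUN.findall(decoded):
        # Word ends paragraphs with a carriage return; normalize to newline.
        cleaned = match.replace("\r", "\n").strip()
        if cleaned:
            runs.append(cleaned)
    return runs


def extract_doc_text(path: str) -> str:
    """Read the WordDocument stream of a legacy .doc and return its text runs."""
    import olefile  # imported lazily

    if not olefile.isOleFile(path):
        raise ValueError("not an OLE2 file: %s" % path)
    ole = olefile.OleFileIO(path)
    try:
        if not ole.exists("WordDocument"):
            raise ValueError("no WordDocument stream in %s" % path)
        data = ole.openstream("WordDocument").read()
    finally:
        ole.close()
    return "\n".join(utf16_text_runs(data))
```

Because the whole stream is read into memory and then decoded, peak memory use is a multiple of the stream size, which is one reason very large .doc files can need a lot of RAM.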
- Image Extraction (.docx)
Extracts all embedded images with:
  - Automatic deduplication (MD5 hash comparison)
  - Size filtering (skips tiny icons <5KB by default)
  - Sequential renaming (img_001.png, img_002.jpg, etc.)
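
The three behaviours listed above (MD5 deduplication, size filtering, sequential renaming) could be sketched with the standard library alone, since a .docx file is a ZIP archive. This sketch assumes embedded images live under `word/media/` and that everything in that folder is an image. The function name `extract_images` and its parameters are hypothetical.

```python
import hashlib
import os
import zipfile


def extract_images(docx_path, out_dir, min_bytes=5 * 1024):
    """Copy unique images from word/media/ into out_dir; return the saved paths."""
    os.makedirs(out_dir, exist_ok=True)
    seen = set()   # MD5 digests of images already written
    saved = []
    with zipfile.ZipFile(docx_path) as zf:
        for name in sorted(zf.namelist()):
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            data = zf.read(name)
            if len(data) < min_bytes:
                continue  # size filter: skip tiny icons
            digest = hashlib.md5(data).hexdigest()
            if digest in seen:
                continue  # exact duplicate of an earlier image
            seen.add(digest)
            # Keep the original extension so the image format is preserved.
            ext = os.path.splitext(name)[1].lower() or ".bin"
            target = os.path.join(out_dir, "img_%03d%s" % (len(saved) + 1, ext))
            with open(target, "wb") as fh:
                fh.write(data)
            saved.append(target)
    return saved
```

Hashing the raw bytes only catches byte-identical files. The same picture saved at a different size or quality produces a different digest and is kept.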
- Image Compression
Batch resize/compress images for API processing (can save 50-70% on vision API costs).
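
A minimal sketch of the resize-and-compress step with Pillow follows. The 1024-pixel limit, the JPEG quality of 80, and the names `fit_within` and `compress_image` are illustrative assumptions, not the toolkit's actual defaults. Converting to RGB JPEG discards transparency, which may or may not be acceptable for a given workflow.

```python
def fit_within(width, height, max_side):
    """Scale (width, height) down so the longer side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough, never upscale
    new_w = max(1, round(width * max_side / longest))
    new_h = max(1, round(height * max_side / longest))
    return new_w, new_h


def compress_image(src, dst, max_side=1024, quality=80):
    """Resize src to fit within max_side and save it to dst as a JPEG."""
    from PIL import Image  # Pillow, imported lazily

    with Image.open(src) as img:
        size = fit_within(img.width, img.height, max_side)
        if size != img.size:
            img = img.resize(size, Image.LANCZOS)
        # JPEG has no alpha channel, so convert to RGB before saving.
        img.convert("RGB").save(dst, "JPEG", quality=quality, optimize=True)
```

For a batch run, `compress_image` can be called in a loop over the extracted image files. The actual saving depends on how a given vision API prices image size, so it is worth measuring on real inputs.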
Dependencies
- Python 3.6+
- python-docx: for .docx processing
- olefile: for legacy .doc processing
- Pillow: for image resizing (optional, only needed for the resize script)
Install:
pip3 install python-docx olefile Pillow
Use Cases
- Document analysis: Extract text for AI review/summarization
- Migration: Pull content from Word docs into other formats
- Image audit: Extract and review all embedded images
- Cost optimization: Compress images before sending to vision APIs
- Batch processing: Process multiple documents in a pipeline
Notes
- Large .doc files (>200MB) may require significant RAM for olefile processing
- Image extraction preserves the original format (png/jpg/gif/etc.)
- Deduplication catches exact duplicates; near-duplicates still pass through
- CJK (Chinese/Japanese/Korean) text is fully supported in both extractors