PDF Data Extractor

v1.0.0

Extract structured data from text-based PDFs into CSV, JSON, or Markdown tables with field mapping, validation notes, and privacy cautions. Use when the user needs invoice fields, statement rows, report tables, contract clauses, or other PDF data prepared for analysis or handoff.


by @harrylabsj (haidong)·MIT-0

Tags: document tools, data analysis, data visualization, file processing


License

MIT-0

Free to use, modify, and redistribute without attribution.


Runtime dependencies

No special dependencies

Version

latest: v1.0.0

Safety Boundaries

Install command

Official: npx clawhub@latest install pdf-data-extractor

Mirror: npx clawhub@latest install pdf-data-extractor --registry https://cn.longxiaskill.com


Skill documentation

PDF Data Extractor

Purpose

Turn data inside a PDF into a structured, reviewable output. The skill works best on text-based PDFs where text can be selected or copied. Scanned PDFs, photos, image-only pages, handwritten forms, or low-quality OCR require OCR or PDF tooling before reliable extraction.

The main deliverable is a data extraction packet:

- Extracted data in CSV, JSON, or Markdown table format.
- Field map that connects requested fields to source pages or sections.
- Data type and normalization notes.
- Exception log for missing, ambiguous, duplicated, or low-confidence values.
- Short extraction summary with assumptions and next steps.

User Scenario

A user has a PDF such as an invoice, bank statement, research report, contract, application form, or operations packet. They need the important rows or fields moved into a clean table or JSON object without losing traceability. They may also need the result prepared for Excel, a database import, a reconciliation task, or a teammate who will validate the data.

Privacy and Fit Check

Before extracting, protect the user and the data.

- Ask whether the PDF contains personal, financial, medical, legal, employee, student, customer, account, tax, or confidential business data.
- Encourage redaction or representative samples when full documents are not needed.
- Do not ask the user to upload sensitive PDFs to external services unless they understand and accept the privacy risk.
- State that this skill does not provide legal, accounting, compliance, or audit certification.
- For scanned PDFs, say clearly that OCR/tool support is required and that extraction confidence may be lower.

Best Inputs

Ask for only what is needed.

- PDF file path, accessible URL, or pasted text/table excerpt.
- Target fields or target table, such as invoice number, vendor, date, subtotal, tax, total, line items, transaction rows, study metrics, or clause names.
- Desired output format: CSV, JSON, Markdown table, or all three.
- Page range or section names when known.
- Locale and formatting rules for dates, currency, decimal separators, account masks, or IDs.
- Whether redaction is required in the final answer.

Install-First Success Path

For the first run after installing the skill:

1. Ask the user for a small, non-sensitive sample PDF, or a copied page excerpt if no PDF tool is available.
2. Confirm whether the document is text-based or scanned.
3. Ask for a narrow first extraction target, such as "invoice header fields" or "the first five transaction rows."
4. Return one small structured output plus a field map and exception log.
5. Ask the user to validate two or three source values before expanding to the full document.

Workflow

1. Scope the task. Identify document type, target fields, page range, output format, and downstream use.
2. Check privacy. Flag sensitive categories and recommend redaction, masking, or local-only processing when appropriate.
3. Check document type. Decide whether the PDF appears text-based, scanned, mixed, or unknown. If scanned, require OCR/tool support before claiming reliable extraction.
4. Inspect structure. Identify pages, headers, repeated table regions, footnotes, merged cells, multi-line rows, totals, forms, and page breaks.
5. Build a field map. Define each requested field, source location, expected type, normalization rule, and confidence basis.
6. Extract data. Capture rows or fields exactly first, then normalize into the requested schema.
7. Validate. Check totals, date ranges, duplicate rows, missing required fields, inconsistent currencies, malformed IDs, and row counts.
8. Mark exceptions. Do not silently guess. Log ambiguous, missing, low-confidence, or OCR-dependent values.
9. Deliver outputs. Provide the requested structured data plus summary, field map, exception log, and review checklist.

Output Contract

Return sections in this order unless the user asks for a different format.

Extraction Summary

Include document type, pages reviewed, target data, output format, privacy flags, document type confidence, and extraction confidence.

Structured Output

Use the requested format. If no format was specified, use a Markdown table for small outputs and JSON for nested outputs. For CSV, wrap the result in a fenced csv block.
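
The default-format rule above can be sketched in a few lines of Python. This is only an illustration: `choose_format` and `to_fenced_csv` are hypothetical helper names, not part of the skill, and because the text does not define what counts as a "small" output, the sketch only tests for nesting.

```python
import csv
import io


def choose_format(records, requested=None):
    """Pick an output format: honor the request, else JSON for nested data."""
    if requested:
        return requested
    nested = any(
        isinstance(value, (dict, list))
        for record in records
        for value in record.values()
    )
    # Assumption: anything flat falls back to a Markdown table.
    return "json" if nested else "markdown"


def to_fenced_csv(records, fieldnames):
    """Render flat records as CSV wrapped in a fenced csv block."""
    fence = "`" * 3  # built at runtime so this example stays a single block
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return f"{fence}csv\n{buffer.getvalue()}{fence}"


rows = [{"invoice_number": "INV-001", "total": "120.50"}]
print(choose_format(rows))
print(to_fenced_csv(rows, ["invoice_number", "total"]))
```

Using `csv.DictWriter` rather than joining strings by hand keeps quoting correct when a value contains a comma or a quote character.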

Field Map

| Field | Source page or section | Type | Normalization rule | Confidence |
| --- | --- | --- | --- | --- |
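
One way to keep field map rows consistent is to hold them in a small record type and render the table from it. The sketch below assumes Python; `FieldMapEntry` and `field_map_table` are illustrative names, and the sample values are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class FieldMapEntry:
    field: str
    source: str          # source page or section
    type: str
    normalization: str   # normalization rule applied to the raw value
    confidence: str


def field_map_table(entries):
    """Render entries as a Markdown table with the Field Map columns."""
    header = "| Field | Source page or section | Type | Normalization rule | Confidence |"
    divider = "| --- | --- | --- | --- | --- |"
    rows = [
        f"| {e.field} | {e.source} | {e.type} | {e.normalization} | {e.confidence} |"
        for e in entries
    ]
    return "\n".join([header, divider, *rows])


# Hypothetical sample entry, for illustration only.
entries = [
    FieldMapEntry("invoice_date", "page 1, header", "date",
                  "DD/MM/YYYY to ISO 8601", "high"),
]
print(field_map_table(entries))
```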

Exception Log

| Item | Issue | Impact | Recommended review |
| --- | --- | --- | --- |
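
The "do not silently guess" rule can be illustrated with a date normalizer that logs an exception instead of guessing. This is a hedged Python sketch: `log_exception` and `normalize_date` are hypothetical helpers, and the wording of the logged messages is an assumption.

```python
from datetime import datetime


def log_exception(log, item, issue, impact, review):
    """Append one row using the Exception Log columns."""
    log.append({
        "Item": item,
        "Issue": issue,
        "Impact": impact,
        "Recommended review": review,
    })


def normalize_date(raw, fmt, item, log):
    """Return an ISO 8601 date string, or None plus a log entry on failure."""
    if raw is None or not raw.strip():
        log_exception(log, item, "missing value", "field left empty",
                      "check the source page for this field")
        return None
    try:
        return datetime.strptime(raw.strip(), fmt).date().isoformat()
    except ValueError:
        # The value is kept out of the output rather than guessed at.
        log_exception(log, item, f"could not parse {raw!r} with {fmt}",
                      "field left empty",
                      "confirm the date format with the user")
        return None


exceptions = []
print(normalize_date("03/04/2024", "%d/%m/%Y", "invoice_date", exceptions))
print(normalize_date("April 2024", "%d/%m/%Y", "due_date", exceptions))
print(exceptions)
```

Passing the expected format explicitly, rather than trying several formats in turn, avoids quietly reading 03/04 as March 4 in one document and April 3 in another.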

Quality Checks

List the checks performed, such as total reconciliation, row count, required fields present, date format consistency, duplicate detection, and currency consistency.
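
A few of these checks are simple enough to sketch directly. The Python below is a minimal illustration under stated assumptions: `run_checks` is a hypothetical function, line items are assumed to be dicts with `description` and `amount` keys, and amounts are assumed to be already normalized to plain decimal strings.

```python
from decimal import Decimal


def run_checks(line_items, stated_total, required=("description", "amount")):
    """Run row count, required-field, duplicate, and total checks."""
    results = {"row_count": len(line_items)}

    # Required fields: every row must have a non-empty value for each key.
    results["required_fields_present"] = all(
        row.get(key) for row in line_items for key in required
    )

    # Duplicate detection: report indexes of rows that repeat an earlier row.
    seen, duplicates = set(), []
    for index, row in enumerate(line_items):
        key = tuple(row.get(k) for k in required)
        if key in seen:
            duplicates.append(index)
        seen.add(key)
    results["duplicate_rows"] = duplicates

    # Total reconciliation: Decimal avoids binary floating-point drift.
    computed = sum(
        (Decimal(row["amount"]) for row in line_items if row.get("amount")),
        Decimal("0"),
    )
    results["computed_total"] = str(computed)
    results["total_reconciles"] = computed == Decimal(stated_total)
    return results


items = [
    {"description": "Widget", "amount": "10.00"},
    {"description": "Gadget", "amount": "5.50"},
    {"description": "Widget", "amount": "10.00"},
]
print(run_checks(items, "25.50"))
```

A repeated row is reported, not dropped, because two identical line items can be legitimate; deciding that belongs in the exception log review.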

Next Step

Give one concrete next step, such as validate source values, provide page range, run OCR, export to CSV, or expand extraction to all pages.

Sample Prompts

- "Use $pdf-data-extractor to extract invoice number, vendor, invoice date, line items, tax, and total from this PDF into CSV."
- "Use $pdf-data-extractor on /path/to/statement.pdf and return transactions from pages 2-5 as JSON with date, description, amount, balance, and confidence notes."
- "Use $pdf-data-extractor to
