PDF Data Extractor
v1.0.0
Extract structured data from text-based PDFs into CSV, JSON, or Markdown tables with field mapping, validation notes, and privacy cautions. Use when the user needs invoice fields, statement rows, report tables, contract clauses, or other PDF data prepared for analysis or handoff.
Purpose
Turn data inside a PDF into a structured, reviewable output. The skill is best for text-based PDFs where text can be selected or copied. Scanned PDFs, photos, image-only pages, handwritten forms, and documents with low-quality OCR need OCR or PDF tooling before extraction can be reliable.
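As a rough illustration of the text-based versus scanned distinction, the sketch below labels a document from the text already extracted for each page. It assumes some PDF tool has produced one string per page; the function name `classify_pdf_text` and the `min_chars` threshold are hypothetical choices made for illustration, not part of the skill.

```python
def classify_pdf_text(page_texts, min_chars=20):
    """Label a document from per-page extracted text (a heuristic, not a guarantee)."""
    if not page_texts:
        # Nothing was extracted at all, so no judgment is possible.
        return "unknown"
    # Count pages that yielded a meaningful amount of selectable text.
    with_text = sum(1 for text in page_texts if len(text.strip()) >= min_chars)
    if with_text == len(page_texts):
        return "text-based"
    if with_text == 0:
        # No page has usable text: likely image-only, so OCR would be needed.
        return "scanned"
    return "mixed"


print(classify_pdf_text(["Invoice 0042 Total due 120.00 USD", ""]))
```

A "scanned" or "mixed" label only signals that OCR or other tooling is probably needed; a page with a poor OCR layer can still pass this check and should be spot-checked by a person.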
The main deliverable is a data extraction packet:
- Extracted data in CSV, JSON, or Markdown table format.
- Field map that connects requested fields to source pages or sections.
- Data type and normalization notes.
- Exception log for missing, ambiguous, duplicated, or low-confidence values.
- Short extraction summary with assumptions and next steps.
User Scenario
A user has a PDF such as an invoice, bank statement, research report, contract, application form, or operations packet. They need the important rows or fields moved into a clean table or JSON object without losing traceability. They may also need the result prepared for Excel, a database import, a reconciliation task, or a teammate who will verify the data.
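One way to keep that traceability is to carry the source page and the raw text alongside every normalized value. The sketch below is an assumption about how such a packet could be represented; `ExtractedField`, `to_packet`, and the key names are hypothetical, and the sample values are invented for illustration.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExtractedField:
    """One extracted value plus where it came from (hypothetical structure)."""
    name: str               # requested field, e.g. "invoice_number"
    raw_value: str          # text exactly as it appears in the PDF
    value: Optional[str]    # normalized value, or None if it could not be resolved
    source_page: int        # 1-based page number in the source PDF
    note: str = ""          # ambiguity or normalization remark


def to_packet(fields: List[ExtractedField]) -> dict:
    """Split fields into structured data, a field map, and an exception log."""
    data = {f.name: f.value for f in fields}
    field_map = [
        {"field": f.name, "page": f.source_page, "raw": f.raw_value} for f in fields
    ]
    # Anything unresolved or annotated is logged instead of silently guessed.
    exceptions = [
        {"field": f.name, "page": f.source_page, "issue": f.note or "missing value"}
        for f in fields
        if f.value is None or f.note
    ]
    return {"data": data, "field_map": field_map, "exceptions": exceptions}


# Illustrative values only; they do not come from a real document.
fields = [
    ExtractedField("invoice_number", "INV-0042", "INV-0042", 1),
    ExtractedField("invoice_date", "03/04/2024", None, 1, "day/month order is ambiguous"),
]
packet = to_packet(fields)
print(json.dumps(packet, indent=2))
```

Keeping the raw value next to the normalized one lets a reviewer compare the output against the PDF without re-running the extraction.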
Privacy and Fit Check
Before extracting, protect the user and the data.
- Ask whether the PDF contains personal, financial, medical, legal, employee, student, customer, account, tax, or confidential business data.
- Encourage redaction or representative samples when full documents are not needed.
- Do not ask the user to upload sensitive PDFs to external services unless they understand and accept the privacy risk.
- State that this skill does not provide legal, accounting, compliance, or audit certification.
- For scanned PDFs, say clearly that OCR/tool support is required and that extraction confidence may be lower.
Best Inputs
Ask for only what is needed.
- PDF file path, accessible URL, or pasted text/table excerpt.
- Target fields or target table, such as invoice number, vendor, date, subtotal, tax, total, line items, transaction rows, study metrics, or clause names.
- Desired output format: CSV, JSON, Markdown table, or all three.
- Page range or section names when known.
- Locale and formatting rules for dates, currency, decimal separators, account masks, or IDs.
- Whether redaction is required in the final answer.
Install-First Success Path
For the first run after installing the skill:
- Ask the user for a small, non-sensitive sample PDF, or a copied page excerpt if no PDF tool is available.
- Confirm whether the document is text-based or scanned.
- Ask for a narrow first extraction target, such as "invoice header fields" or "the first five transaction rows."
- Return one small structured output plus a field map and exception log.
- Ask the user to verify two or three source values before expanding to the full document.
Workflow
- Scope the task. Identify document type, target fields, page range, output format, and downstream use.
- Check privacy. Flag sensitive categories and recommend redaction, masking, or local-only processing when appropriate.
- Check document type. Decide whether the PDF appears text-based, scanned, mixed, or unknown. If scanned, require OCR/tool support before claiming reliable extraction.
- Inspect structure. Identify pages, headers, repeated table regions, footnotes, merged cells, multi-line rows, totals, forms, and page breaks.
- Build a field map. Define each requested field, source location, expected type, normalization rule, and confidence basis.
- Extract data. Capture rows or fields exactly first, then normalize into the requested schema.
- Validate. Check totals, date ranges, duplicate rows, missing required fields, inconsistent currencies, malformed IDs, and row counts.
- Mark exceptions. Do not silently guess. Log ambiguous, missing, low-confidence, or OCR-dependent values.
- Deliver outputs. Provide the requested structured data plus summary, field map, exception log, and review checklist.
Output Contract
Return sections in this order unless the user asks for a different format.
- Extraction Summary
Include document type, pages reviewed, target data, output format, privacy flags, document type confidence, and extraction confidence.
- Structured Output
Use the requested format. If no format was specified, use a Markdown table for small outputs and JSON for nested outputs. For CSV, wrap the result in a fenced csv block.
- Field Map
- Exception Log
- Quality Checks
List the checks performed, such as total reconciliation, row count, required fields present, date format consistency, duplicate detection, and currency consistency.
- Next Step
Give one concrete next step, such as verifying source values, providing a page range, running OCR, exporting to CSV, or expanding extraction to all pages.
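A few of the quality checks named above (required fields present, duplicate detection, total reconciliation) can be sketched as follows. This is a minimal illustration, not the skill's implementation: `run_quality_checks`, the row keys, and the sample rows are hypothetical, and amounts are assumed to be already normalized to plain decimal strings.

```python
from decimal import Decimal


def run_quality_checks(rows, stated_total, required=("date", "description", "amount")):
    """Return a list of human-readable issues; an empty list means all checks passed."""
    issues = []

    # Required fields present: every row needs a non-empty value for each field.
    for i, row in enumerate(rows, start=1):
        for name in required:
            if not row.get(name):
                issues.append(f"row {i}: missing {name}")

    # Duplicate detection: identical rows are flagged for review, not dropped.
    seen = set()
    for i, row in enumerate(rows, start=1):
        key = (row.get("date"), row.get("description"), row.get("amount"))
        if key in seen:
            issues.append(f"row {i}: possible duplicate")
        seen.add(key)

    # Total reconciliation: compare the row sum with the total printed in the PDF.
    # Decimal avoids the rounding drift that floats introduce for currency.
    line_sum = sum(
        (Decimal(row["amount"]) for row in rows if row.get("amount")), Decimal("0")
    )
    if line_sum != Decimal(stated_total):
        issues.append(
            f"total mismatch: rows sum to {line_sum}, document states {stated_total}"
        )

    return issues


# Illustrative rows only; they do not come from a real document.
rows = [
    {"date": "2024-01-05", "description": "Widget", "amount": "19.99"},
    {"date": "2024-01-05", "description": "Widget", "amount": "19.99"},
    {"date": "2024-01-06", "description": "", "amount": "5.00"},
]
issues = run_quality_checks(rows, "25.00")
for issue in issues:
    print(issue)
```

Each issue is a candidate entry for the Exception Log. The checks only flag rows and never repair them, which matches the rule against silent guessing.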
Sample Prompts
- "Use $pdf-data-extractor to extract invoice number, vendor, invoice date, line items, tax, and total from this PDF into CSV."
- "Use $pdf-data-extractor on /path/to/statement.pdf and return transactions from pages 2-5 as JSON with date, description, amount, balance, and confidence notes."
- "Use $pdf-data-extractor to