Pdf To Structured

Name: Pdf To Structured
Rating: 9

v2.0.0

Extract structured data from construction PDFs. Convert specifications, BOMs, schedules, and reports from PDF to Excel/CSV/JSON. Use OCR for scanned documents and pdfplumber for native PDFs.

by @datadrivenconstruction·MIT-0

Document tools · Data analysis · Data visualization · File processing · CI/CD

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

Runtime dependencies

No special dependencies

Install command

Official: npx clawhub@latest install pdf-to-structured

Mirror: npx clawhub@latest install pdf-to-structured --registry https://cn.longxiaskill.com

Skill documentation

PDF to Structured Data Conversion

Overview

Based on DDC methodology (Chapter 2.4), this skill transforms unstructured PDF documents into structured formats suitable for analysis and integration. Construction projects generate vast amounts of PDF documentation (specifications, BOMs, schedules, and reports) that needs to be extracted and processed.

Book Reference: "Преобразование данных в структурированную форму" / "Data Transformation to Structured Form"

"Transforming data from unstructured into structured form is both an art and a science. This process often takes up a significant part of a data engineer's work." (DDC Book, Chapter 2.4)

ETL Process Overview

The conversion follows the ETL pattern:

Extract: Load the PDF document
Transform: Parse and structure the content
Load: Save to CSV, Excel, or JSON

Quick Start

import pdfplumber
import pandas as pd

# Extract table from PDF
with pdfplumber.open("construction_spec.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()

# extract_table() returns None when no table is found on the page
if table:
    df = pd.DataFrame(table[1:], columns=table[0])
    df.to_excel("extracted_data.xlsx", index=False)
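The Quick Start writes Excel only, while the Load step also lists CSV and JSON. The following is a minimal sketch of those two outputs, assuming the table is already in a DataFrame. The helper name `save_structured` is hypothetical and not part of the skill.

```python
import pandas as pd

def save_structured(df, stem):
    """Write df to <stem>.csv and <stem>.json and return both paths (hypothetical helper)."""
    csv_path = f"{stem}.csv"
    json_path = f"{stem}.json"
    df.to_csv(csv_path, index=False)
    # orient="records" produces one JSON object per table row
    df.to_json(json_path, orient="records", force_ascii=False, indent=2)
    return csv_path, json_path
```

Calling `save_structured(df, "extracted_data")` would write `extracted_data.csv` and `extracted_data.json` to the working directory.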

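Installation

# Core libraries
pip install pdfplumber pandas openpyxl

# For scanned PDFs (OCR)
pip install pytesseract pdf2image
# Also install Tesseract OCR: https://github.com/tesseract-ocr/tesseract
# pdf2image additionally requires Poppler to be installed on the system

# For OCR table detection (provides the cv2 module)
pip install opencv-python

# For advanced PDF operations
pip install pypdf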

Native PDF Extraction (pdfplumber)

Extract All Tables from PDF

import pdfplumber
import pandas as pd

def extract_tables_from_pdf(pdf_path):
    """Extract all tables from a PDF file"""
    all_tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            tables = page.extract_tables()
            for table_num, table in enumerate(tables):
                if table and len(table) > 1:
                    # First row as header
                    df = pd.DataFrame(table[1:], columns=table[0])
                    df['_page'] = page_num + 1
                    df['_table'] = table_num + 1
                    all_tables.append(df)

    if all_tables:
        return pd.concat(all_tables, ignore_index=True)
    return pd.DataFrame()

# Usage
df = extract_tables_from_pdf("material_specification.pdf")
df.to_excel("materials.xlsx", index=False)
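Using the first row as the header can go wrong when header cells are empty, wrapped across lines, or repeated, since duplicate column names may make the later `pd.concat` fail. The sketch below shows one possible way to normalize a raw header row before building the DataFrame. The function name `clean_header` and the `col_N` naming scheme are assumptions for illustration, not part of pdfplumber or the skill.

```python
def clean_header(header):
    """Return unique, non-empty column names for a raw header row (hypothetical helper)."""
    seen = {}
    cleaned = []
    for i, cell in enumerate(header):
        # Empty cells may arrive as None or ""; wrapped headers contain line breaks
        name = " ".join(str(cell).split()) if cell else f"col_{i + 1}"
        count = seen.get(name, 0)
        seen[name] = count + 1
        # Suffix repeats so every column name stays unique
        cleaned.append(name if count == 0 else f"{name}_{count + 1}")
    return cleaned
```

It could be dropped into the function above as `pd.DataFrame(table[1:], columns=clean_header(table[0]))`.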

Extract Text with Layout

import pdfplumber

def extract_text_with_layout(pdf_path):
    """Extract text preserving layout structure"""
    full_text = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text()
            if text:
                full_text.append(text)

    return "\n\n--- Page Break ---\n\n".join(full_text)

# Usage
text = extract_text_with_layout("project_report.pdf")
with open("report_text.txt", "w", encoding="utf-8") as f:
    f.write(text)
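The joined text can be turned back into per-page records by splitting on the same page-break marker. This is a minimal sketch; `split_pages` is a hypothetical name. Because pages without text are skipped during extraction, the numbers below are positions among the extracted chunks, not necessarily the original PDF page numbers.

```python
PAGE_BREAK = "\n\n--- Page Break ---\n\n"  # same marker used when joining pages

def split_pages(full_text):
    """Split joined text into a list of {'chunk', 'text'} records (hypothetical helper)."""
    if not full_text:
        return []
    return [
        {"chunk": i + 1, "text": chunk}
        for i, chunk in enumerate(full_text.split(PAGE_BREAK))
    ]
```

The resulting list can be passed straight to `pd.DataFrame(...)` if a tabular form is preferred.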

Extract Specific Table by Position

import pdfplumber
import pandas as pd

def extract_table_from_area(pdf_path, page_num, bbox):
    """
    Extract table from specific area on page

    Args:
        pdf_path: Path to PDF file
        page_num: Page number (0-indexed)
        bbox: Bounding box (x0, top, x1, bottom) in points
    """
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num]
        cropped = page.within_bbox(bbox)
        table = cropped.extract_table()

    if table:
        return pd.DataFrame(table[1:], columns=table[0])
    return pd.DataFrame()

# Usage - extract table from specific area
# bbox format: (left, top, right, bottom) in points (1 inch = 72 points)
df = extract_table_from_area("drawing.pdf", 0, (50, 100, 550, 400))
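Bounding boxes are given in points, but positions on drawings are often measured in inches or millimetres. The sketch below converts either unit using the 72 points per inch relation stated above. The helper names `bbox_from_inches` and `bbox_from_mm` are hypothetical.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4

def bbox_from_inches(left, top, right, bottom):
    """Convert a bounding box measured in inches to PDF points."""
    return tuple(v * POINTS_PER_INCH for v in (left, top, right, bottom))

def bbox_from_mm(left, top, right, bottom):
    """Convert a bounding box measured in millimetres to PDF points."""
    return tuple(v / MM_PER_INCH * POINTS_PER_INCH for v in (left, top, right, bottom))
```

For instance, `bbox_from_inches(1, 2, 7.5, 5)` could be passed as the `bbox` argument of `extract_table_from_area`.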

Scanned PDF Processing (OCR)

Extract Text from Scanned PDF

import pytesseract
from pdf2image import convert_from_path
import pandas as pd

def ocr_scanned_pdf(pdf_path, language='eng'):
    """
    Extract text from scanned PDF using OCR

    Args:
        pdf_path: Path to scanned PDF
        language: Tesseract language code (eng, deu, rus, etc.)
    """
    # Convert PDF pages to images
    images = convert_from_path(pdf_path, dpi=300)

    extracted_text = []
    for i, image in enumerate(images):
        text = pytesseract.image_to_string(image, lang=language)
        extracted_text.append({
            'page': i + 1,
            'text': text
        })

    return pd.DataFrame(extracted_text)

# Usage
df = ocr_scanned_pdf("scanned_specification.pdf", language='eng')
df.to_csv("ocr_results.csv", index=False)
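Some documents mix native and scanned pages, so it can help to decide per page whether OCR is needed. One simple heuristic, sketched below, treats a page as scanned when pdfplumber's `page.extract_text()` yields little or no text. The function name `needs_ocr` and the 20-character threshold are assumptions for illustration and would need tuning on real documents.

```python
def needs_ocr(page_texts, min_chars=20):
    """
    Flag pages that probably need OCR (hypothetical helper).

    Args:
        page_texts: One entry per page, e.g. the result of page.extract_text();
            entries may be None or empty for image-only pages
        min_chars: Assumed minimum amount of text for a page to count as native
    """
    return [len((text or "").strip()) < min_chars for text in page_texts]
```

Pages flagged `True` could be routed to `ocr_scanned_pdf`-style processing, while the rest stay on the pdfplumber path.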

OCR Table Extraction

import pytesseract
from pdf2image import convert_from_path
import pandas as pd
import cv2
import numpy as np

def ocr_table_from_scanned_pdf(pdf_path, page_num=0):
    """Extract table from scanned PDF using OCR with table detection"""
    # Convert specific page to image
    images = convert_from_path(pdf_path, first_page=page_num+1, last_page=page_num+1, dpi=300)
    image = np.array(images[0])

    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)

    # Apply thresholding
    _, bina
