Pdf To Structured
v2.0.0
Extract structured data from construction PDFs. Convert specifications, BOMs, schedules, and reports from PDF to Excel/CSV/JSON. Use OCR for scanned documents and pdfplumber for native PDFs.
PDF to Structured Data Conversion
Overview
Based on DDC methodology (Chapter 2.4), this skill transforms unstructured PDF documents into structured formats suitable for analysis and integration. Construction projects generate vast amounts of PDF documentation (specifications, BOMs, schedules, and reports) that needs to be extracted and processed.
Book Reference: "Преобразование данных в структурированную форму" / "Data Transformation to Structured Form"
"Transforming data from an unstructured into a structured form is both an art and a science. This process often takes up a significant part of a data engineer's work." (DDC Book, Chapter 2.4)
ETL Process Overview
The conversion follows the ETL pattern:
Extract: Load the PDF document
Transform: Parse and structure the content
Load: Save to CSV, Excel, or JSON
Quick Start
import pdfplumber
import pandas as pd

# Extract table from PDF
with pdfplumber.open("construction_spec.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()  # returns None if no table is found

if table:
    df = pd.DataFrame(table[1:], columns=table[0])
    df.to_excel("extracted_data.xlsx", index=False)
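The three ETL steps above can be kept as separate functions so that each one can be tested on its own. The following is a minimal sketch rather than the DDC reference implementation: the function names (`extract`, `transform`, `load`) and the sample rows are hypothetical, and the load step uses only the standard library so it can be compared with the pandas version.

```python
import csv
import json
from pathlib import Path


def extract(pdf_path, page_index=0):
    """Extract: return the first table on a page as a list of rows (or [])."""
    import pdfplumber  # imported lazily so transform/load work without it

    with pdfplumber.open(pdf_path) as pdf:
        return pdf.pages[page_index].extract_table() or []


def transform(rows):
    """Transform: treat the first row as the header and build one dict per row."""
    if len(rows) < 2:
        return []
    header = [(name or "").strip() for name in rows[0]]
    records = []
    for row in rows[1:]:
        # pdfplumber reports empty cells as None; normalise them to ""
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def load(records, out_dir, stem):
    """Load: write the records to <stem>.csv and <stem>.json in out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / f"{stem}.csv"
    json_path = out_dir / f"{stem}.json"
    fieldnames = list(records[0]) if records else []
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
    json_path.write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return csv_path, json_path


# Hypothetical rows in the shape page.extract_table() returns
sample_rows = [
    ["Item", "Qty", "Unit"],
    ["Concrete C30", "120", "m3"],
    ["Rebar 12 mm", None, "t"],
]
records = transform(sample_rows)
```

With a real file, the chain would read `load(transform(extract("construction_spec.pdf")), "out", "spec")`. Note that `transform` assumes unique header names; repeated names would overwrite each other in the dict.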
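Installation
# Core libraries
pip install pdfplumber pandas openpyxl

# For scanned PDFs (OCR)
pip install pytesseract pdf2image
# Also install Tesseract OCR: https://github.com/tesseract-ocr/tesseract
# pdf2image additionally requires Poppler to be installed on the system

# For OCR table extraction (cv2, numpy)
pip install opencv-python numpy

# For advanced PDF operations
pip install pypdf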
Native PDF Extraction (pdfplumber)
Extract All Tables from PDF
import pdfplumber
import pandas as pd

def extract_tables_from_pdf(pdf_path):
    """Extract all tables from a PDF file"""
    all_tables = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            tables = page.extract_tables()
            for table_num, table in enumerate(tables):
                if table and len(table) > 1:
                    # First row as header
                    df = pd.DataFrame(table[1:], columns=table[0])
                    df['_page'] = page_num + 1
                    df['_table'] = table_num + 1
                    all_tables.append(df)

    if all_tables:
        return pd.concat(all_tables, ignore_index=True)
    return pd.DataFrame()

# Usage
df = extract_tables_from_pdf("material_specification.pdf")
df.to_excel("materials.xlsx", index=False)
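One caveat with using the first row as the header: PDF tables often have empty or repeated header cells, and pandas can refuse to concatenate frames whose column labels are not unique. A possible workaround is to normalise the header row before building each DataFrame. The helper name `clean_header` and the `col_N` naming scheme below are assumptions for illustration, not part of pdfplumber or pandas.

```python
def clean_header(header):
    """Return unique, non-empty column names for a raw header row."""
    seen = {}
    cleaned = []
    for position, cell in enumerate(header, start=1):
        # Empty cells come back as None; collapse line breaks inside a cell
        name = " ".join((cell or "").split()) or f"col_{position}"
        count = seen.get(name, 0) + 1
        seen[name] = count
        # Suffix repeats: "Qty", "Qty_2", "Qty_3", ...
        cleaned.append(name if count == 1 else f"{name}_{count}")
    return cleaned


# Hypothetical header row in the shape pdfplumber returns
raw_header = ["Item", None, "Qty", "Qty", "Unit\nprice"]
print(clean_header(raw_header))
# ['Item', 'col_2', 'Qty', 'Qty_2', 'Unit price']
```

Inside the function above, this would mean passing `columns=clean_header(table[0])` instead of `columns=table[0]`.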
Extract Text with Layout
import pdfplumber

def extract_text_with_layout(pdf_path):
    """Extract text preserving layout structure"""
    full_text = []

    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Newer pdfplumber releases also accept layout=True here
            # to approximate the visual layout of the page
            text = page.extract_text()
            if text:
                full_text.append(text)

    return "\n\n--- Page Break ---\n\n".join(full_text)

# Usage
text = extract_text_with_layout("project_report.pdf")
with open("report_text.txt", "w", encoding="utf-8") as f:
    f.write(text)
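Because the pages are joined with a fixed separator, the saved text can be split back into pages later, for example to find which pages mention a given term. This is a small sketch under that assumption; `PAGE_BREAK`, `split_pages`, `find_pages`, and the sample text are hypothetical names and data. The positions count only pages that produced text, since the function above skips empty pages.

```python
PAGE_BREAK = "\n\n--- Page Break ---\n\n"  # same separator as used when joining


def split_pages(full_text):
    """Split joined text back into a list with one string per page."""
    return full_text.split(PAGE_BREAK) if full_text else []


def find_pages(full_text, term):
    """Return 1-based positions of pages whose text contains term (case-insensitive)."""
    needle = term.lower()
    return [
        number
        for number, page_text in enumerate(split_pages(full_text), start=1)
        if needle in page_text.lower()
    ]


# Hypothetical three-page document
sample = PAGE_BREAK.join([
    "Section 03 30 00 Cast-in-Place Concrete",
    "Section 05 12 00 Structural Steel Framing",
    "Concrete curing requirements",
])
print(find_pages(sample, "concrete"))
# [1, 3]
```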
Extract Specific Table by Position
import pdfplumber
import pandas as pd

def extract_table_from_area(pdf_path, page_num, bbox):
    """
    Extract table from specific area on page

    Args:
        pdf_path: Path to PDF file
        page_num: Page number (0-indexed)
        bbox: Bounding box (x0, top, x1, bottom) in points
    """
    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_num]
        cropped = page.within_bbox(bbox)
        table = cropped.extract_table()

    if table:
        return pd.DataFrame(table[1:], columns=table[0])
    return pd.DataFrame()

# Usage - extract table from specific area
# bbox format: (left, top, right, bottom) in points (1 inch = 72 points)
df = extract_table_from_area("drawing.pdf", 0, (50, 100, 550, 400))
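Since pdfplumber measures bounding boxes in points from the top-left corner of the page, it can be easier to specify the area in millimetres and convert. The helper below is an illustrative sketch (the name `bbox_from_mm` is hypothetical); it relies only on the 72 points per inch stated above and on 25.4 mm per inch.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def bbox_from_mm(left, top, right, bottom):
    """Convert a (left, top, right, bottom) box in millimetres to PDF points."""
    if right <= left or bottom <= top:
        raise ValueError("bbox must satisfy left < right and top < bottom")
    scale = POINTS_PER_INCH / MM_PER_INCH
    return tuple(round(value * scale, 2) for value in (left, top, right, bottom))


# A 100 mm x 50 mm area whose top-left corner sits 25.4 mm (1 inch)
# from the top and left edges of the page
bbox = bbox_from_mm(25.4, 25.4, 125.4, 75.4)
print(bbox)
```

The result could then be passed straight to the function above, for example `extract_table_from_area("drawing.pdf", 0, bbox)`.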
Scanned PDF Processing (OCR)
Extract Text from Scanned PDF
import pytesseract
from pdf2image import convert_from_path
import pandas as pd

def ocr_scanned_pdf(pdf_path, language='eng'):
    """
    Extract text from scanned PDF using OCR

    Args:
        pdf_path: Path to scanned PDF
        language: Tesseract language code (eng, deu, rus, etc.)
    """
    # Convert PDF pages to images
    images = convert_from_path(pdf_path, dpi=300)

    extracted_text = []
    for i, image in enumerate(images):
        text = pytesseract.image_to_string(image, lang=language)
        extracted_text.append({
            'page': i + 1,
            'text': text
        })

    return pd.DataFrame(extracted_text)

# Usage
df = ocr_scanned_pdf("scanned_specification.pdf", language='eng')
df.to_csv("ocr_results.csv", index=False)
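Choosing between the native and OCR paths can be automated with a rough heuristic: a page whose text layer yields almost no characters is probably a scanned image. The sketch below is one possible heuristic, not a pdfplumber or Tesseract feature; the names `needs_ocr` and `plan_extraction` and the 20-character threshold are assumptions that should be tuned on real documents.

```python
def needs_ocr(page_text, min_chars=20):
    """Guess whether a page is scanned from the text its text layer produced.

    page_text is what page.extract_text() returned (may be None or empty).
    """
    if not page_text:
        return True
    # Count only letters and digits so stray whitespace or marks do not count
    meaningful = sum(1 for ch in page_text if ch.isalnum())
    return meaningful < min_chars


def plan_extraction(page_texts, min_chars=20):
    """Map 1-based page numbers to 'ocr' or 'native' for a list of page texts."""
    return {
        number: "ocr" if needs_ocr(text, min_chars) else "native"
        for number, text in enumerate(page_texts, start=1)
    }


# Hypothetical results of page.extract_text() for a three-page document
texts = ["Section 03 30 00 Cast-in-Place Concrete, Part 1 General", None, " \n . "]
print(plan_extraction(texts))
# {1: 'native', 2: 'ocr', 3: 'ocr'}
```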
OCR Table Extraction
import pytesseract
from pdf2image import convert_from_path
import pandas as pd
import cv2
import numpy as np

def ocr_table_from_scanned_pdf(pdf_path, page_num=0):
    """Extract table from scanned PDF using OCR with table detection"""
    # Convert specific page to image
    images = convert_from_path(pdf_path, first_page=page_num+1, last_page=page_num+1, dpi=300)
    image = np.array(images[0])

    # Convert to grayscale
    gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)

    # Apply thresholding
    _, bina