Sn Da Large File Analysis
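v1.0.0
A high-performance analysis engine for Excel datasets of 10,000+ rows. It provides openpyxl read_only streaming reads (iter_rows handles 100,000+ rows), Parquet conversion for faster access, memory optimization, chunked processing, and large-file write modes. **Use this skill proactively whenever any of the following applies**: (1) the data has ≥ 10k rows (triggered by the row-count assessment step of sn-da-excel-workflow); (2) the user uses a trigger phrase: 大文件 (large file) / 大数据量 (large data volume) / 性能优化 (performance optimization) / 内存不足 (out of memory) / OOM / 百万行 (million rows) / 十万行 (hundred thousand rows) / 流式读取 (streaming read) / Parquet / 分块处理 (chunked processing) / large file / big data / streaming read / chunked processing; (3) calling pd.read_excel() directly has caused a timeout or memory overflow; (4) the user explicitly asks for high-performance processing of a large dataset. Do not use it for routine Excel analysis under 10k rows (sn-da-excel-workflow is sufficient there).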
Large-Scale Excel Analysis Skill
Mandatory Rules
When total rows >= 10,000, you MUST use the methods in this skill.
| Data Scale | Read Strategy | Reason |
| --- | --- | --- |
| < 10k rows | pd.read_excel() directly | No memory pressure |
| 10k–100k rows | pd.read_excel() → convert to Parquet → pd.read_parquet() for analysis | Avoid repeated slow reads |
| 100k–1M rows | openpyxl read_only + iter_rows streaming → Parquet | pd.read_excel() will OOM or time out |
| > 1M rows | Streaming read + multi-sheet split (Excel max 1,048,576 rows per sheet) | Must chunk |
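The table above amounts to a dispatch on row count. The sketch below is illustrative only: `choose_read_strategy` and `plan_sheet_split` are hypothetical helper names, not part of this skill. The thresholds come from the table, and reserving one row per sheet for a header is an assumption of the sketch.

```python
EXCEL_MAX_ROWS = 1_048_576  # per-sheet row limit quoted in the table above


def choose_read_strategy(total_rows: int) -> str:
    """Map a row count to the strategy named in the table (hypothetical helper)."""
    if total_rows < 10_000:
        return "pd.read_excel"
    if total_rows < 100_000:
        return "pd.read_excel -> Parquet"
    if total_rows <= 1_000_000:
        return "openpyxl streaming -> Parquet"
    return "streaming + multi-sheet split"


def plan_sheet_split(total_rows: int, rows_per_sheet: int = EXCEL_MAX_ROWS - 1):
    """Return half-open (start, end) data-row ranges, one per output sheet.

    Assumption: one row per sheet is reserved for the header, so each sheet
    holds at most EXCEL_MAX_ROWS - 1 data rows.
    """
    return [
        (start, min(start + rows_per_sheet, total_rows))
        for start in range(0, total_rows, rows_per_sheet)
    ]


strategy = choose_read_strategy(2_500_000)
ranges = plan_sheet_split(2_500_000)
```

For 2,500,000 data rows this plans three sheets, the last one only partly filled.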
Prohibited:
- Do NOT use pd.read_excel() to fully load 100k+ row files
- Do NOT search for fonts with fc-list, find ... fonts, or install packages with pip install
- Do NOT use df.iterrows() on large DataFrames (use itertuples() or vectorized ops)
- Do NOT use df.apply(lambda...) for operations that can be vectorized

Environment Setup
import pandas as pd
import numpy as np
import os
import gc
pd.options.mode.copy_on_write = True
# CJK font setup (fixed paths; do NOT search for fonts)
# ⚠️ Copy this block as-is. Do NOT use fc-list, find, subprocess, or glob to locate fonts.
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

_FONT_PATHS = [
    '/mnt/afs_agents/SimHei.ttf',
    '/mnt/afs_agents/mnt/data/SimHei.ttf',
    os.path.expanduser('~/.fonts/SimHei.ttf'),
    '/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc',
    '/usr/share/fonts/SimHei.ttf',
]
# Register the first font file that exists and make it the default family.
for _p in _FONT_PATHS:
    if os.path.exists(_p):
        fm.fontManager.addfont(_p)
        matplotlib.rcParams['font.family'] = fm.FontProperties(fname=_p).get_name()
        break
matplotlib.rcParams['axes.unicode_minus'] = False
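To make the iterrows() and apply(lambda...) entries in the Prohibited list concrete, here is a minimal sketch on a made-up DataFrame (the column names `price` and `qty` are hypothetical). It shows a vectorized computation, a vectorized conditional, and an itertuples() loop for the rare case where a row loop cannot be avoided.

```python
import numpy as np
import pandas as pd

# Made-up sample data; column names are illustrative only.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Vectorized arithmetic: one array multiplication instead of a Python-level loop.
df["amount"] = df["price"] * df["qty"]

# Vectorized conditional, replacing df.apply(lambda r: ..., axis=1).
df["tier"] = np.where(df["amount"] >= 40, "high", "low")

# If a row loop is truly unavoidable, itertuples() yields lightweight
# namedtuples rather than building a Series per row as iterrows() does.
total = 0.0
for row in df.itertuples(index=False):
    total += row.price * row.qty
```

On large frames the vectorized forms usually run far faster than row-wise Python code, but the exact gain depends on the data and the operation.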
Core Method 1: Inspect File Structure (Without Loading Data)
Before any operation on a large file, inspect sheets and row counts without loading data into memory:
import openpyxl

def inspect_excel(file_path):
    """Stream-inspect Excel structure. Returns {sheet_name: {rows, columns}}."""
    wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True)
    info = {}
    for name in wb.sheetnames:
        ws = wb[name]
        row_count = 0
        header = []  # stays empty for a sheet with no rows
        for i, row in enumerate(ws.iter_rows(values_only=True)):
            if i == 0:
                # First row is the header; name blank cells by position.
                header = [str(c) if c is not None else f"Col_{j}" for j, c in enumerate(row)]
            else:
                row_count += 1
        info[name] = {"rows": row_count, "columns": header}
    wb.close()
    return info

# Usage
file_info = inspect_excel(file_path)
for sheet, meta in file_info.items():
    print(f"Sheet '{sheet}': {meta['rows']} rows, {len(meta['columns'])} cols")
    print(f"  Columns: {meta['columns'][:10]}...")
total_rows = sum(m['rows'] for m in file_info.values())
print(f"Total rows: {total_rows}")
Core Method 2: Streaming Read → Parquet (100k+ Rows)
For 100k+ row files, never use pd.read_excel(). Use openpyxl streaming → Parquet:
import gc
import openpyxl
import pyarrow as pa
import pyarrow.parquet as pq

def stream_excel_to_parquet(excel_path, parquet_path, sheet_name=None, chunk_size=50000):
    """Stream Excel rows to Parquet with constant memory usage.

    All columns are cast to string to avoid cross-chunk schema mismatches
    (Excel mixed-type columns may be all-None in some chunks, causing PyArrow
    to infer null type instead of string). Convert numeric columns after
    loading Parquet with pd.to_numeric() as needed.
    """
    wb = openpyxl.load_workbook(excel_path, read_only=True, data_only=True)
    ws = wb[sheet_name] if sheet_name else wb.active

    header = None
    writer = None
    chunk_rows = []
    total_written = 0

    def _flush(rows):
        # Build one all-string table per chunk and append it to the file.
        nonlocal writer
        table = pa.table({
            col: pa.array(
                [str(r[idx]) if r[idx] is not None else None for r in rows],
                type=pa.string(),
            )
            for idx, col in enumerate(header)
        })
        if writer is None:
            writer = pq.ParquetWriter(parquet_path, table.schema)
        writer.write_table(table)

    for i, row in enumerate(ws.iter_rows(values_only=True)):
        if i == 0:
            header = [str(c) if c is not None else f"Col_{j}" for j, c in enumerate(row)]
            continue

        chunk_rows.append(list(row))

        if len(chunk_rows) >= chunk_size:
            _flush(chunk_rows)
            total_written += len(chunk_rows)
            print(f"  Written {total_written:,} rows...")
            chunk_rows = []
            gc.collect()

    # Write the final partial chunk, if any.
    if chunk_rows:
        _flush(chunk_rows)
        total_written += len(chunk_rows)

    if writer:
        writer.close()
    wb.close()
    print(f"Done: {total_written:,} rows -> {parquet_path}")
    return total_written
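Because the streaming writer stores every column as a string, numeric columns have to be converted back after loading, as the docstring notes. Below is a minimal sketch of that step. The helper name `restore_numeric_columns` is hypothetical, and so is its rule: a column is converted only when every non-null value parses as a number.

```python
import pandas as pd


def restore_numeric_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Convert string columns to numeric where every non-null value parses."""
    out = df.copy()
    for col in out.columns:
        converted = pd.to_numeric(out[col], errors="coerce")
        # Keep the conversion only if coercion introduced no new missing values.
        if converted.notna().sum() == out[col].notna().sum():
            out[col] = converted
    return out


# Stand-in for pd.read_parquet(parquet_path): every value is a string or None.
raw = pd.DataFrame({
    "amount": ["1.5", "2", None],
    "city": ["Beijing", "Shanghai", None],
})
clean = restore_numeric_columns(raw)
```

In practice `raw` would come from `pd.read_parquet(parquet_path)`. Under this rule, a column that is entirely null also ends up numeric (all NaN), which may or may not be what a given analysis wants.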
Core Method 3: Medium File Parquet Conversion (10k–100k Rows)
For 10k–100k rows, pd.read_excel() won't OOM, but Parquet is much faster for repeated analysis:
def convert_excel_to_parquet(excel_path, parquet_path, sheet_name=0):
    """Medium file: pd.read_excel -> Parquet cache."""
    if os.path.exist