Sn Da Large File Analysis

v1.0.0

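High-performance analysis engine for Excel datasets of 10,000+ rows. Provides openpyxl read_only streaming reads (iter_rows handles 100k+ rows), Parquet conversion for faster access, memory optimization, chunked processing, and large-file write modes. **Use this skill proactively in any of the following cases**: (1) the data has ≥ 10k rows (triggered by the row-count assessment step of sn-da-excel-workflow); (2) the user uses a trigger term: large file (大文件) / large data volume (大数据量) / performance optimization (性能优化) / out of memory (内存不足) / OOM / million rows (百万行) / hundred thousand rows (十万行) / streaming read (流式读取) / Parquet / chunked processing (分块处理) / large file / big data / streaming read / chunked processing; (3) calling pd.read_excel() directly causes a timeout or an out-of-memory error; (4) the user explicitly asks for high-performance processing of a large dataset. Not for: routine Excel analysis under 10k rows (sn-da-excel-workflow is sufficient).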

by @tsunamiblue (Tsunami Planeptune)·MIT-0

Data analysis · Data visualization · File processing · Video processing

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

Runtime dependencies

No special dependencies

Install command

Official: npx clawhub@latest install sn-da-large-file-analysis

Mirror: npx clawhub@latest install sn-da-large-file-analysis --registry https://cn.longxiaskill.com

Skill documentation

Large Scale Excel Analysis Skill

Mandatory Rules

When total rows >= 10,000, you MUST use the methods in this skill.

| Data Scale | Read Strategy | Reason |
| --- | --- | --- |
| < 10k rows | pd.read_excel() directly | No memory pressure |
| 10k–100k rows | pd.read_excel() → convert to Parquet → pd.read_parquet() for analysis | Avoid repeated slow reads |
| 100k–1M rows | openpyxl read_only + iter_rows streaming → Parquet | pd.read_excel() will OOM or timeout |
| > 1M rows | Streaming read + multi-sheet split (Excel max 1,048,576 rows per sheet) | Must chunk |
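The thresholds in the table can be sketched as a small dispatcher. This is only an illustration: the function names, the returned labels, and the handling of exact boundary values (100,000 rows goes to streaming, 1,000,000 rows stays in the streaming tier) are assumptions, not part of the skill itself.

```python
# Hypothetical helper names; thresholds follow the strategy table.
EXCEL_MAX_ROWS_PER_SHEET = 1_048_576  # Excel's per-sheet limit, header row included


def choose_read_strategy(total_rows: int) -> str:
    """Map a total row count to one of the four strategies in the table."""
    if total_rows < 10_000:
        return "read_excel"
    if total_rows < 100_000:
        return "read_excel_then_parquet"
    if total_rows <= 1_000_000:
        return "stream_to_parquet"
    return "stream_and_split_sheets"


def plan_sheet_splits(total_rows: int, header_rows: int = 1) -> list:
    """Return half-open (start, end) data-row ranges, one per output sheet."""
    per_sheet = EXCEL_MAX_ROWS_PER_SHEET - header_rows  # leave room for the header
    return [
        (start, min(start + per_sheet, total_rows))
        for start in range(0, total_rows, per_sheet)
    ]


print(choose_read_strategy(250_000))
print(plan_sheet_splits(2_000_000))
```

Reserving one row per sheet for the header is a design choice in this sketch; pass header_rows=0 if the output sheets carry no header.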

Prohibited:

- Do NOT use pd.read_excel() to fully load 100k+ row files
- Do NOT search for fonts with fc-list, find ... fonts, or install packages with pip install
- Do NOT use df.iterrows() on large DataFrames (use itertuples() or vectorized ops)
- Do NOT use df.apply(lambda...) for operations that can be vectorized

Environment Setup

import pandas as pd
import numpy as np
import os
import gc

pd.options.mode.copy_on_write = True

# CJK font setup (fixed paths: do NOT search for fonts)
# ⚠️ Copy this block as-is. Do NOT use fc-list, find, subprocess, or glob to locate fonts.
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

_FONT_PATHS = [
    '/mnt/afs_agents/SimHei.ttf',
    '/mnt/afs_agents/mnt/data/SimHei.ttf',
    os.path.expanduser('~/.fonts/SimHei.ttf'),
    '/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc',
    '/usr/share/fonts/SimHei.ttf',
]
# Register the first font file that exists and make it the default family
for _p in _FONT_PATHS:
    if os.path.exists(_p):
        fm.fontManager.addfont(_p)
        matplotlib.rcParams['font.family'] = fm.FontProperties(fname=_p).get_name()
        break
matplotlib.rcParams['axes.unicode_minus'] = False
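The last two Prohibited rules (avoid df.iterrows() and avoid df.apply(lambda...) where a vectorized form exists) can be illustrated with a small sketch. The DataFrame and its column names (price, qty, amount, tier) are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy data; column names are illustrative only.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Vectorized arithmetic instead of df.apply(lambda r: r.price * r.qty, axis=1)
df["amount"] = df["price"] * df["qty"]

# Vectorized conditional instead of a per-row lambda
df["tier"] = np.where(df["amount"] >= 50, "high", "low")

# If a row loop is truly unavoidable, itertuples() avoids building
# a Series object per row, which is what makes iterrows() slow.
total = 0.0
for row in df.itertuples(index=False):
    total += row.amount

print(df)
print(total)
```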

Core Method 1: Inspect File Structure (Without Loading Data)

Before any operation on a large file, inspect sheets and row counts without loading data into memory:

import openpyxl

def inspect_excel(file_path):
    """Stream-inspect Excel structure. Returns {sheet_name: {rows, columns}}."""
    wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True)
    info = {}
    for name in wb.sheetnames:
        ws = wb[name]
        row_count = 0
        header = []  # stays empty for a sheet with no rows, so len() below is safe
        for i, row in enumerate(ws.iter_rows(values_only=True)):
            if i == 0:
                # First row is the header; name blank cells Col_<index>
                header = [str(c) if c is not None else f"Col_{j}" for j, c in enumerate(row)]
            else:
                row_count += 1
        info[name] = {"rows": row_count, "columns": header}
    wb.close()
    return info

# Usage
file_info = inspect_excel(file_path)
for sheet, meta in file_info.items():
    print(f"Sheet '{sheet}': {meta['rows']} rows, {len(meta['columns'])} cols")
    print(f"  Columns: {meta['columns'][:10]}...")
total_rows = sum(m['rows'] for m in file_info.values())
print(f"Total rows: {total_rows}")

Core Method 2: Streaming Read → Parquet (100k+ Rows)

For 100k+ row files, never use pd.read_excel(). Use openpyxl streaming → Parquet:

import gc

import openpyxl
import pyarrow as pa
import pyarrow.parquet as pq

def stream_excel_to_parquet(excel_path, parquet_path, sheet_name=None, chunk_size=50000):
    """Stream Excel rows to Parquet with constant memory usage.

    All columns are cast to string to avoid cross-chunk schema mismatches
    (Excel mixed-type columns may be all-None in some chunks, causing PyArrow
    to infer null type instead of string). Convert numeric columns after
    loading Parquet with pd.to_numeric() as needed.
    """
    wb = openpyxl.load_workbook(excel_path, read_only=True, data_only=True)
    ws = wb[sheet_name] if sheet_name else wb.active

    header = None
    writer = None
    chunk_rows = []
    total_written = 0

    def _flush(rows):
        # Build one all-string Arrow table per chunk and append it to the file
        nonlocal writer
        table = pa.table({
            col: pa.array(
                [str(r[idx]) if r[idx] is not None else None for r in rows],
                type=pa.string(),
            )
            for idx, col in enumerate(header)
        })
        if writer is None:
            # Create the writer lazily so it takes the schema of the first chunk
            writer = pq.ParquetWriter(parquet_path, table.schema)
        writer.write_table(table)

    for i, row in enumerate(ws.iter_rows(values_only=True)):
        if i == 0:
            header = [str(c) if c is not None else f"Col_{j}" for j, c in enumerate(row)]
            continue

        chunk_rows.append(list(row))

        if len(chunk_rows) >= chunk_size:
            _flush(chunk_rows)
            total_written += len(chunk_rows)
            print(f"  Written {total_written:,} rows...")
            chunk_rows = []
            gc.collect()

    # Write the final partial chunk, if any
    if chunk_rows:
        _flush(chunk_rows)
        total_written += len(chunk_rows)

    if writer:
        writer.close()
    wb.close()
    print(f"Done: {total_written:,} rows -> {parquet_path}")
    return total_written
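Because the streaming writer stores every column as a string, numeric columns need to be converted after loading, as the docstring above notes. A minimal sketch of that step follows; the helper name restore_numeric_columns and its min_valid_ratio parameter are hypothetical, and the sample DataFrame stands in for the result of pd.read_parquet(parquet_path).

```python
import pandas as pd


def restore_numeric_columns(df: pd.DataFrame, min_valid_ratio: float = 1.0) -> pd.DataFrame:
    """Convert a string column to numeric when enough of its non-null values parse."""
    out = df.copy()
    for col in out.columns:
        non_null = out[col].notna().sum()
        if non_null == 0:
            continue  # all-null column: nothing to infer, leave unchanged
        converted = pd.to_numeric(out[col], errors="coerce")
        # Accept the conversion only if it did not turn real values into NaN
        if converted.notna().sum() >= min_valid_ratio * non_null:
            out[col] = converted
    return out


# Stand-in for pd.read_parquet(parquet_path): every column arrives as strings.
raw = pd.DataFrame({
    "id": ["1", "2", "3"],
    "price": ["1.5", None, "2.5"],
    "name": ["a", "b", "c"],
})
clean = restore_numeric_columns(raw)
print(clean.dtypes)
```

With the default min_valid_ratio of 1.0, a column is converted only if every non-null value parses as a number; lowering it trades strictness for tolerance of stray text cells, which then become NaN.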

Core Method 3: Medium File Parquet Conversion (10k–100k Rows)

For 10k–100k rows, pd.read_excel() won't OOM, but Parquet is much faster for repeated analysis:

def convert_excel_to_parquet(excel_path, parquet_path, sheet_name=0):
    """Medium file: pd.read_excel -> Parquet cache."""
    if os.path.exist
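The listing is cut off at this point, so the original caching logic is not shown. As a separate, hedged sketch of one way a Parquet cache check could work, the helper below treats the cache as valid when the Parquet file exists and is at least as new as the Excel source. The function name and the modification-time rule are assumptions, not the skill's own implementation.

```python
import os


def parquet_cache_is_fresh(excel_path: str, parquet_path: str) -> bool:
    """Return True if the Parquet file exists and is not older than the Excel file."""
    if not os.path.exists(parquet_path):
        return False
    return os.path.getmtime(parquet_path) >= os.path.getmtime(excel_path)


# Possible usage (requires pandas, openpyxl, and pyarrow to be installed):
# if not parquet_cache_is_fresh(excel_path, parquet_path):
#     pd.read_excel(excel_path, sheet_name=0).to_parquet(parquet_path, index=False)
# df = pd.read_parquet(parquet_path)
```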
