Sn Da Excel Workflow
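v1.0.0
A multi-step orchestrator for Excel data analysis. It covers: (1) reading multi-sheet Excel files and counting rows, (2) large-file detection (files with ≥10k rows are automatically optimized via Parquet), (3) data cleaning (missing values, text normalization, invalid characters), (4) conditional filtering and category extraction, (5) cross-sheet statistical aggregation, and (6) exporting to Excel/CSV with a download link. It spans the full flow from reading data to generating a report, orchestrating capability sub-skills step by step. **Use this skill proactively in any of the following cases instead of answering with a few ad hoc lines of pandas**: (1) the user uses a trigger phrase (in Chinese or English): Excel analysis / spreadsheet analysis / data analysis / data cleaning / data statistics / data filtering / data visualization / data export / summary statistics / pivot table / grouped statistics / cross analysis / trend analysis / comparative analysis / outlier detection / deduplication / missing-value handling / Excel report / generate report / analyze Excel; (2) the user uploads or specifies an .xlsx / .xls / .csv file and asks for analysis, cleaning, statistics, or visualization; (3) the task involves any of multi-sheet reading, conditional filtering, grouped summaries, or chart generation; (4) the user asks for a formatted Excel report or a download link. Do not use it for: plain-text processing that involves no tabular data, image analysis (use sn-da-image-caption), or simple Q&A about a single formula calculation.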
Excel Data Analysis Workflow
An end-to-end workflow for structured Excel analysis. Each step maps to a capability sub-skill that can be loaded for detailed patterns.
Workflow
Step 1: Count rows across all sheets (lightweight, no full load)
Count rows per sheet without loading data into memory. Use openpyxl's read_only mode, which streams rows and therefore works for any file size.
    import gc
    import openpyxl

    # read_only=True streams rows instead of loading the whole workbook
    wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True)
    total_rows = 0
    sheet_info = {}
    for name in wb.sheetnames:
        ws = wb[name]
        # min_row=2 skips the header row
        row_count = sum(1 for _ in ws.iter_rows(min_row=2, values_only=True))
        total_rows += row_count
        sheet_info[name] = row_count
        print(f"Sheet '{name}': {row_count} rows")
    wb.close()
    print(f"Total rows = {total_rows}")
⚠️ Do NOT use pd.read_excel() to count rows — it loads all data into memory, which will OOM on large files.
→ capability: excel-reading/multi-sheet-reading
Step 2: Large file gate (CRITICAL: choose strategy by row count)

| total_rows | Strategy | What to do |
| --- | --- | --- |
| < 10k | Direct read | df = pd.read_excel(file_path, sheet_name=target_sheet) |
| 10k – 100k | Parquet cache | pd.read_excel() once → df.to_parquet() → all later reads from Parquet |
| >= 100k | Stop. Load the sn-da-large-file-analysis skill | Read its SKILL.md, then follow its streaming read + Parquet pattern. Do NOT use pd.read_excel() at all; it will run out of memory or time out on 100k+ rows. |
For >= 100k rows:
    read_file(path="<skills_base>/sn-da-large-file-analysis/SKILL.md")
Then use stream_excel_to_parquet() from that skill. It reads via openpyxl iter_rows in 50k-row chunks with constant memory.
For 10k – 100k rows (only):
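    import gc
    import pandas as pd

    parquet_path = "/tmp/_auto_parquet.parquet"
    # Pay the slow Excel parse once, then cache as Parquet
    df = pd.read_excel(file_path, sheet_name=target_sheet)
    df.to_parquet(parquet_path, engine="pyarrow")
    # Free the original frame before re-reading from the cache
    del df
    gc.collect()
    df = pd.read_parquet(parquet_path)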
→ capability: excel-reading/large-excel-reading
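The three-way gate in the Step 2 table can be sketched as a small pure function. The function name and the returned labels are hypothetical and are not part of the skill; only the 10k and 100k thresholds come from the table.

```python
def choose_read_strategy(total_rows: int) -> str:
    """Map a total row count to one of the three Step 2 strategies.

    The returned labels are illustrative names, not identifiers
    defined by the skill itself.
    """
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if total_rows < 10_000:
        return "direct_read"            # pd.read_excel() is fine
    if total_rows < 100_000:
        return "parquet_cache"          # read once, then reuse Parquet
    return "load_large_file_skill"      # hand off to sn-da-large-file-analysis


if __name__ == "__main__":
    for n in (500, 10_000, 250_000):
        print(n, "->", choose_read_strategy(n))
```

Feeding it the total_rows value computed in Step 1 keeps the decision in one place, so later steps can branch on the label rather than re-checking thresholds.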
Step 3: Inspect schema & data types
Preview the target sheet's structure. For large files (>= 10k rows), read only a small sample; never do a full load just to inspect.
    # For any file size: read only the first N rows for inspection
    df_head = pd.read_excel(file_path, sheet_name=target_sheet, nrows=20)
    print(f"Columns: {df_head.columns.tolist()}")
    print(f"Dtypes:\n{df_head.dtypes}")
    print(df_head.head(10))
→ capability: excel-reading/range-reading
Step 4: Data cleaning
Handle missing values, normalize text, and clean invalid characters.
    # Missing values
    null_count = df[col].isna().sum()

    # Text cleaning: keep only Chinese characters
    import re

    def clean_text(val):
        # Leave missing values untouched
        if pd.isna(val):
            return val
        return "".join(re.findall(r"[\u4e00-\u9fff]", str(val))) or ""

    df[col] = df[col].apply(clean_text)
⚠️ Large file rule: When total_rows >= 100k, do NOT use df.apply(lambda...). Use vectorized operations or np.where() instead. See the sn-da-large-file-analysis skill for the vectorized cheat sheet.
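As an illustration of the large file rule, the per-value cleaning function from this step can be expressed with pandas string methods instead of apply. This is a minimal sketch, not the cheat sheet from sn-da-large-file-analysis; the function name clean_text_vectorized and the constant NON_CJK are hypothetical.

```python
import pandas as pd

# Character class matching everything outside the CJK range U+4E00..U+9FFF.
# A non-raw string is used so the pattern contains the literal characters
# and does not depend on the regex engine understanding \u escapes.
NON_CJK = "[^\u4e00-\u9fff]"


def clean_text_vectorized(series: pd.Series) -> pd.Series:
    """Keep only Chinese characters, leaving missing values untouched."""
    cleaned = series.astype(object)      # astype returns a new Series
    mask = series.notna()
    # Convert non-missing values to text, then strip non-CJK characters
    cleaned.loc[mask] = (
        series[mask].astype(str).str.replace(NON_CJK, "", regex=True)
    )
    return cleaned


if __name__ == "__main__":
    demo = pd.Series(["abc中文123", None, 42])
    print(clean_text_vectorized(demo))
```

The mask keeps missing values missing, mirroring the pd.isna() early return in the per-value version.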
→ capabilities:
- excel-data-cleaning/missing-value-handling
- excel-data-cleaning/invalid-data-cleaning
- excel-data-cleaning/text-normalization
Step 5: Filter & extract
Apply condition or category filters and aggregate the results.
    # Condition filter
    mask = df[col].astype(str).str.strip() == target_value
    filtered = df[mask]

    # Category extraction (for headerless layouts)
    df_raw = pd.read_excel(file_path, sheet_name=sheet, header=None)
    # Walk rows to find category markers, collect items until next marker
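The "walk rows" comment above can be sketched in plain Python. The source does not say what a category marker looks like, so this example assumes a marker row has text in its first cell and nothing else; the helper names are hypothetical, and a different layout needs a different is_marker function.

```python
from typing import Any, Callable, Dict, Iterable, List, Sequence, Tuple


def _is_empty(cell: Any) -> bool:
    """True for None, NaN (NaN != NaN), or a blank string."""
    if cell is None or cell != cell:
        return True
    return isinstance(cell, str) and not cell.strip()


def default_is_marker(row: Sequence[Any]) -> bool:
    """Assumed layout: a marker row fills only its first cell."""
    return (not _is_empty(row[0])) and all(_is_empty(c) for c in row[1:])


def group_rows_by_marker(
    rows: Iterable[Sequence[Any]],
    is_marker: Callable[[Sequence[Any]], bool] = default_is_marker,
) -> Dict[str, List[Tuple[Any, ...]]]:
    """Collect the rows that follow each marker until the next marker."""
    groups: Dict[str, List[Tuple[Any, ...]]] = {}
    current = None
    for row in rows:
        if is_marker(row):
            current = str(row[0]).strip()
            groups.setdefault(current, [])
        elif current is not None and not all(_is_empty(c) for c in row):
            groups[current].append(tuple(row))
    return groups


if __name__ == "__main__":
    sample = [
        ("Fruit", None, None),
        ("apple", 3, 1.5),
        ("Dairy", None, None),
        ("milk", 1, 0.9),
    ]
    print(group_rows_by_marker(sample))
```

With the headerless frame from this step, the rows could be supplied as df_raw.itertuples(index=False, name=None), which avoids iterrows().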
→ capabilities:
- excel-data-filtering/condition-filtering
- excel-data-filtering/category-filtering
- excel-data-filtering/threshold-filtering
Step 6: Export results
Save the filtered/cleaned data as Excel or CSV and provide a download link.
    output_path = "/mnt/data/result.xlsx"
    result_df.to_excel(output_path, index=False)
    print(f"Download: {output_path}")
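The snippet above covers only the Excel case. A minimal sketch of a helper that handles both formats by file extension might look like the following; export_result is a hypothetical name, and the utf-8-sig encoding for CSV is an assumption made here so that Excel detects UTF-8 when opening files with Chinese text.

```python
from pathlib import Path

import pandas as pd


def export_result(df: pd.DataFrame, output_path: str) -> str:
    """Write df to .xlsx or .csv based on the extension; return the path."""
    path = Path(output_path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        # utf-8-sig prepends a BOM, which helps Excel detect the encoding
        df.to_csv(path, index=False, encoding="utf-8-sig")
    elif suffix == ".xlsx":
        # Requires an Excel writer engine such as openpyxl to be installed
        df.to_excel(path, index=False)
    else:
        raise ValueError(f"Unsupported export format: {suffix!r}")
    return str(path)
```

Returning the path lets the caller build the download link from a single value, whichever format was chosen.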
→ capabilities:
- excel-result-export/single-sheet-export
- excel-result-export/formatted-export
Key rules
- Always count rows first: gate large-file logic on the 10k threshold.
- >= 100k rows → MUST load the sn-da-large-file-analysis skill; do not attempt to handle the file with pd.read_excel().
- Column names may contain spaces (e.g. '是否通 过'); use exact string indexing.
- Headerless sheets: use header=None and positional indexing.
- Prohibited on large files (>= 100k rows):
  - pd.read_excel() for a full load (use streaming read → Parquet)
  - df.apply(lambda...) or df.iterrows() (use vectorized ops or itertuples())
  - fc-list, find ... fonts, subprocess to search for fonts, or pip install (use the fixed font paths below)
  - Printing all unique values or full DataFrames (use .head(), .value_counts().head())
CJK Font setup (mandatory for charts)
When generating charts with matplotlib, copy this block as-is. Do NOT search for fonts.
    import os
    import matplotlib
    import matplotlib.pyplot as plt
    import matplotlib.font_manager as fm

    _FONT_PATHS = [
        '/mnt/afs_agents/SimHei.ttf',
        '/mnt/afs_agents/mnt/data/SimHei.ttf',
        os.path.expanduser('~/.fonts/SimHei.ttf'),
        '/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc',
        '/usr/share/fonts/SimHei.ttf',
    ]
    for _p in _FONT_PATHS:
        if