Sn Da Excel Workflow

v1.0.0

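Multi-step orchestrator for Excel data analysis. Covers: (1) reading multi-sheet Excel files and counting rows, (2) large-file detection (automatic Parquet optimization at ≥10k rows), (3) data cleaning (missing values, text normalization, invalid characters), (4) conditional filtering and category extraction, (5) cross-sheet statistical aggregation, (6) exporting Excel/CSV with a download link. It spans the full flow from reading data to generating reports and orchestrates capability sub-skills step by step. **Use this skill proactively in any of the following situations instead of answering with a few ad hoc lines of pandas**: ① the user uses a trigger phrase (Chinese or English): Excel 分析 / 表格分析 / 数据分析 / 数据清洗 / 数据统计 / 数据筛选 / 数据可视化 / 数据导出 / 汇总统计 / 透视表 / 分组统计 / 交叉分析 / 趋势分析 / 对比分析 / 异常值检测 / 去重 / 缺失值处理 / Excel 报告 / 生成报表 / analyze Excel / data analysis / data cleaning / pivot table; ② the user uploads or points to an .xlsx / .xls / .csv file and asks for analysis, cleaning, statistics, or visualization; ③ the task involves any of multi-sheet reading, conditional filtering, categorized summaries, or chart generation; ④ the user asks for a formatted Excel report or a download link. Do not use it for: pure text processing that involves no tabular data, image analysis (use sn-da-image-caption), or simple Q&A about a single formula calculation.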

by @tsunamiblue (Tsunami Planeptune)·MIT-0

Data analysis · Data visualization · File processing · Image processing

License

MIT-0

Free to use, modify, and redistribute; no attribution required.

Runtime dependencies

No special dependencies

Install command

Official: npx clawhub@latest install sn-da-excel-workflow

Mirror: npx clawhub@latest install sn-da-excel-workflow --registry https://cn.longxiaskill.com (mirror sync in progress)

Skill documentation

Excel Data Analysis Workflow

End-to-end workflow for structured Excel analysis. Each step maps to a capability sub-skill that can be loaded for detailed patterns.

Workflow

Step 1 — Count rows across all sheets (lightweight, no full load)

Count rows per sheet without loading data into memory. Use openpyxl read_only mode — this works for any file size.

import gc
import openpyxl

wb = openpyxl.load_workbook(file_path, read_only=True, data_only=True)
total_rows = 0
sheet_info = {}
for name in wb.sheetnames:
    ws = wb[name]
    # Start at row 2 so the header row is not counted
    row_count = sum(1 for _ in ws.iter_rows(min_row=2, values_only=True))
    total_rows += row_count
    sheet_info[name] = row_count
    print(f"Sheet '{name}': {row_count} rows")
wb.close()
print(f"Total rows={total_rows}")

⚠️ Do NOT use pd.read_excel() to count rows — it loads all data into memory, which will OOM on large files.

→ capability: excel-reading/multi-sheet-reading

Step 2 — Large file gate (CRITICAL — choose strategy by row count)

| total_rows | Strategy | What to do |
| --- | --- | --- |
| < 10k | Direct read | df = pd.read_excel(file_path, sheet_name=target_sheet) |
| 10k – 100k | Parquet cache | pd.read_excel() once → df.to_parquet() → all later reads from Parquet |
| >= 100k | Stop. Load the sn-da-large-file-analysis skill | Read its SKILL.md, then follow its streaming read + Parquet pattern. Do NOT use pd.read_excel() at all: it will OOM or time out on 100k+ rows. |

For >= 100k rows:

read_file(path="<skills_base>/sn-da-large-file-analysis/SKILL.md")

Then use stream_excel_to_parquet() from that skill; it reads via openpyxl iter_rows in 50k-row chunks with constant memory.

For 10k – 100k rows (only):

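import gc
import pandas as pd

parquet_path = "/tmp/_auto_parquet.parquet"
# Read the sheet once, cache it as Parquet, then free the original frame
df = pd.read_excel(file_path, sheet_name=target_sheet)
df.to_parquet(parquet_path, engine="pyarrow")
del df
gc.collect()
# All later reads come from the Parquet cache
df = pd.read_parquet(parquet_path)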

→ capability: excel-reading/large-excel-reading
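
The row-count gate in Step 2 can be written as a small helper. This is a minimal sketch: the function name choose_read_strategy and the returned labels are hypothetical and not part of the skill; only the 10k and 100k thresholds come from the table above.

```python
def choose_read_strategy(total_rows: int) -> str:
    """Map a row count to one of the three Step 2 strategies.

    The labels returned here are illustrative names, not skill identifiers.
    """
    if total_rows < 0:
        raise ValueError("total_rows must be non-negative")
    if total_rows < 10_000:
        return "direct_read"      # pd.read_excel() straight into memory
    if total_rows < 100_000:
        return "parquet_cache"    # read once, cache as Parquet
    return "streaming"            # hand off to sn-da-large-file-analysis
```

Calling it with the total_rows value from Step 1 gives a single branch point, so the later steps never have to re-check the thresholds.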

Step 3 — Inspect schema & data types

Preview the target sheet structure. For large files (>= 10k rows), read only a small sample; never do a full load just to inspect.

# For any file size: read only the first N rows for inspection
df_head = pd.read_excel(file_path, sheet_name=target_sheet, nrows=20)
print(f"Columns: {df_head.columns.tolist()}")
print(f"Dtypes:\n{df_head.dtypes}")
print(df_head.head(10))

→ capability: excel-reading/range-reading

Step 4 — Data cleaning

Handle missing values, normalize text, clean invalid characters.

# Missing values
null_count = df[col].isna().sum()

# Text cleaning: keep only Chinese characters
import re

def clean_text(val):
    # Leave missing values untouched
    if pd.isna(val):
        return val
    return "".join(re.findall(r"[\u4e00-\u9fff]", str(val)))

df[col] = df[col].apply(clean_text)

⚠️ Large file rule: When total_rows >= 100k, do NOT use df.apply(lambda...). Use vectorized operations or np.where() instead. See the sn-da-large-file-analysis skill for the vectorized cheat sheet.
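
As an illustration of the large file rule, the per-value cleaning function in this step can be replaced by a single vectorized string operation. This is a sketch, not the cheat sheet from sn-da-large-file-analysis: the helper name clean_text_vectorized is hypothetical, and it assumes the goal is the same as above (keep only characters in the U+4E00 to U+9FFF range, leave missing values missing).

```python
import pandas as pd

def clean_text_vectorized(series: pd.Series) -> pd.Series:
    """Keep only characters in U+4E00..U+9FFF without a Python-level apply."""
    # Remove every character outside the range in one vectorized regex pass
    cleaned = series.astype(str).str.replace(r"[^\u4e00-\u9fff]", "", regex=True)
    # Restore the original missing values, which astype(str) may have stringified
    return cleaned.where(series.notna(), series)

sample = pd.Series(["abc中文123", None, "测试 ok"])
result = clean_text_vectorized(sample)
```

The regex is the complement of the character class used with re.findall, so both versions keep the same characters.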

→ capabilities:

excel-data-cleaning/missing-value-handling
excel-data-cleaning/invalid-data-cleaning
excel-data-cleaning/text-normalization

Step 5 — Filter & extract

Apply condition or category filters, aggregate results.

# Condition filter
mask = df[col].astype(str).str.strip() == target_value
filtered = df[mask]

# Category extraction (for headerless layouts)
df_raw = pd.read_excel(file_path, sheet_name=sheet, header=None)
# Walk rows to find category markers, collect items until next marker
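
The row walk described in the comment above could look like the following. This is a sketch under an assumed layout that the source does not specify: a category marker is a row with text in column 0 and nothing in the remaining columns, and every other non-blank row is an item. The function name extract_by_category is hypothetical.

```python
import pandas as pd

def extract_by_category(df_raw: pd.DataFrame) -> dict:
    """Group item rows under the most recent category marker row.

    Assumed layout (hypothetical): a marker row has a value in column 0 and
    only missing values elsewhere; all other non-blank rows are items.
    """
    groups = {}
    current = None
    # itertuples() avoids the per-row Series overhead of iterrows()
    for row in df_raw.itertuples(index=False, name=None):
        first, rest = row[0], row[1:]
        if pd.isna(first):
            continue  # skip blank rows
        if all(pd.isna(v) for v in rest):
            current = str(first).strip()  # new category marker
            groups.setdefault(current, [])
        elif current is not None:
            groups[current].append(row)   # item under the current category
    return groups

demo_raw = pd.DataFrame(
    [["Fruit", None], ["apple", 3], ["pear", 5], [None, None], ["Veg", None], ["kale", 2]]
)
demo_groups = extract_by_category(demo_raw)
```

Rows that appear before the first marker are dropped in this sketch; a real sheet may need a different marker test, such as a fixed prefix or a bold cell.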

→ capabilities:

excel-data-filtering/condition-filtering
excel-data-filtering/category-filtering
excel-data-filtering/threshold-filtering

Step 6 — Export results

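Save filtered/cleaned data as Excel or CSV. Provide a download link.

output_path = "/mnt/data/result.xlsx"
result_df.to_excel(output_path, index=False)
# Print the saved location so it can be offered as the download link
print(f"Download: {output_path}")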
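
Step 6 mentions CSV but only shows the Excel path. A minimal CSV sketch follows; the helper name export_csv and the demo data are hypothetical. Writing with the utf-8-sig encoding is a choice made here, not something the skill prescribes: it adds a UTF-8 byte order mark, which Excel uses to detect the encoding when a CSV contains CJK text.

```python
import os
import tempfile

import pandas as pd

def export_csv(result_df: pd.DataFrame, output_path: str) -> str:
    """Write result_df as CSV with a UTF-8 BOM and return the path."""
    result_df.to_csv(output_path, index=False, encoding="utf-8-sig")
    return output_path

# Usage with a throwaway directory (the path is illustrative)
demo_df = pd.DataFrame({"名称": ["甲", "乙"], "数量": [1, 2]})
demo_path = export_csv(demo_df, os.path.join(tempfile.mkdtemp(), "result.csv"))
```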

→ capabilities:

excel-result-export/single-sheet-export
excel-result-export/formatted-export

Key rules

- Always count rows first; gate large-file logic on the 10k threshold.
- >= 100k rows → MUST load the sn-da-large-file-analysis skill; do not attempt to handle with pd.read_excel().
- Column names may contain spaces (e.g. '是否通过'); use exact string indexing.
- Headerless sheets: use header=None and positional indexing.
- Prohibited on large files (>= 100k rows):
  - pd.read_excel() for full load (use streaming read → Parquet)
  - df.apply(lambda...) or df.iterrows() (use vectorized ops or itertuples())
  - fc-list, find ... fonts, subprocess to search fonts, or pip install (use fixed font paths below)
  - Printing all unique values or full DataFrames (use .head(), .value_counts().head())

CJK Font Setup (mandatory for charts)

When generating charts with matplotlib, copy this block as-is. Do NOT search for fonts.

import os
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm

_FONT_PATHS = [
    '/mnt/afs_agents/SimHei.ttf',
    '/mnt/afs_agents/mnt/data/SimHei.ttf',
    os.path.expanduser('~/.fonts/SimHei.ttf'),
    '/usr/share/fonts/truetype/wqy/wqy-zenhei.ttc',
    '/usr/share/fonts/SimHei.ttf',
]
for _p in _FONT_PATHS:
    if
