📦 Upstage Document Classification

v1.0.0

Classify documents into user-defined categories with the Upstage Document Classification API. Also supports splitting multi-document PDFs. Suitable for...

by @upstage-deployment (Upstage Deployment)

Runtime dependencies

No special dependencies

Install command

Official: npx clawhub@latest install upstage-document-classification

Mirror: npx clawhub@latest install upstage-document-classification --registry https://cn.longxiaskill.com

Skill documentation

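Document Classification

Classifies documents into user-defined categories and returns a confidence score for each result. Also supports document splitting, which separates a multi-document PDF into individual documents.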

Quick Start

import os

from openai import OpenAI

# Read the API key from the environment; never hard-code it.
client = OpenAI(
    api_key=os.environ["UPSTAGE_API_KEY"],
    base_url="https://api.upstage.ai/v1/document-classification"
)

# Send the document as a single user message containing an image_url part.
response = client.chat.completions.create(
    model="document-classify",
    messages=[{
        "role": "user",
        "content": [{
            "type": "image_url",
            "image_url": {
                "url": "https://example.com/document.pdf"
            }
        }]
    }],
    # Categories are declared as oneOf/const pairs on a root schema of type "string".
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "document-classify",
            "schema": {
                "type": "string",
                "oneOf": [
                    {
                        "const": "invoice",
                        "description": "Commercial invoice with itemized charges"
                    },
                    {
                        "const": "receipt",
                        "description": "Payment receipt"
                    },
                    {
                        "const": "contract",
                        "description": "Legal contract or agreement"
                    },
                    {
                        "const": "resume",
                        "description": "Personal resume or CV"
                    }
                ]
            }
        }
    }
)

# Print the predicted category name, e.g. "invoice".
print(response.choices[0].message.content)

API key: Always read it from os.environ["UPSTAGE_API_KEY"]. Get your key at console.upstage.ai.

Endpoint: POST https://api.upstage.ai/v1/document-classification

OpenAI SDK compatible: set base_url to https://api.upstage.ai/v1/document-classification.

Parameters

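Parameter	Type	Required	Description
model	string	Yes	`document-classify` or `document-classify-nightly`
messages	array	Yes	A single user message containing an `image_url`
response_format	object	Yes	JSON schema that defines the classification categories
split	boolean	No	Enables multi-document splitting (default: `false`)
split_criteria	array	No	Additional splitting criteria (used together with `split=true`)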
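The parameters in the table could be assembled for an OpenAI SDK call roughly as sketched below. The helper name build_classify_request is hypothetical and not part of any Upstage or OpenAI library. The sketch assumes split is passed through the SDK's extra_body argument, as the document splitting section describes. It leaves split_criteria out because its element format is documented only in references/document-split.md.

```python
from typing import Any


def build_classify_request(document_url: str,
                           response_format: dict[str, Any],
                           split: bool = False,
                           model: str = "document-classify") -> dict[str, Any]:
    """Assemble keyword arguments for client.chat.completions.create().

    Hypothetical helper for illustration only.
    """
    kwargs: dict[str, Any] = {
        "model": model,
        # A single user message that carries the document as an image_url part.
        "messages": [{
            "role": "user",
            "content": [{
                "type": "image_url",
                "image_url": {"url": document_url},
            }],
        }],
        "response_format": response_format,
    }
    if split:
        # split is not a standard OpenAI SDK argument, so it is sent via extra_body.
        kwargs["extra_body"] = {"split": True}
    return kwargs
```

The result can be unpacked into the call, for example client.chat.completions.create(**build_classify_request(url, response_format, split=True)). Keeping the request a plain dict makes it easy to inspect before anything is sent.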

Schema definition (important)

Categories are defined with oneOf and const values. The root schema type must be "string", not "object".

{
    "type": "json_schema",
    "json_schema": {
        "name": "document-classify",
        "schema": {
            "type": "string",
            "oneOf": [
                {
                    "const": "invoice",
                    "description": "Commercial invoice"
                },
                {
                    "const": "receipt",
                    "description": "Payment receipt"
                },
                {
                    "const": "other",
                    "description": "Other document types"
                }
            ]
        }
    }
}
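Writing the oneOf list by hand gets repetitive as the number of categories grows. A minimal sketch of a builder that produces the same shape from a name-to-description mapping follows. The function name build_response_format is hypothetical, and the "document-classify" schema name is copied from the example above.

```python
from typing import Any


def build_response_format(categories: dict[str, str]) -> dict[str, Any]:
    """Build a response_format whose root schema is a string with oneOf/const pairs.

    Hypothetical helper; `categories` maps each category name to its description.
    """
    if not categories:
        raise ValueError("at least one category is required")
    return {
        "type": "json_schema",
        "json_schema": {
            "name": "document-classify",
            "schema": {
                # The root type stays "string"; categories go in oneOf, not enum.
                "type": "string",
                "oneOf": [
                    {"const": name, "description": description}
                    for name, description in categories.items()
                ],
            },
        },
    }


response_format = build_response_format({
    "invoice": "Commercial invoice",
    "receipt": "Payment receipt",
    "other": "Other document types",
})
```

Because dicts preserve insertion order, the categories appear in oneOf in the order they were declared.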

Important: Using enum or an object-based schema returns a 400 error. The classification API requires oneOf with const/description pairs.

Response structure

{
    "choices": [
        {
            "message": {
                "content": "invoice",
                "tool_calls": [
                    {
                        "function": {
                            "arguments": {
                                "document_type": {
                                    "_value": "invoice",
                                    "confidence_score": 0.99
                                },
                                "pages": [1, 2]
                            }
                        }
                    }
                ]
            }
        }
    ]
}
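A response shaped like the JSON above could be unpacked as sketched below. The function name parse_choices is hypothetical. The sketch assumes the response is a plain dict, for example the result of response.model_dump() when using the OpenAI SDK. It also assumes arguments may arrive either as an object, as shown above, or as a JSON-encoded string, so it handles both.

```python
import json
from typing import Any


def parse_choices(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Collect label, confidence, and pages from every tool call in every choice."""
    results: list[dict[str, Any]] = []
    for choice in response.get("choices", []):
        message = choice["message"]
        for call in message.get("tool_calls") or []:
            arguments = call["function"]["arguments"]
            # Assumption: some clients deliver arguments as a JSON string.
            if isinstance(arguments, str):
                arguments = json.loads(arguments)
            document_type = arguments["document_type"]
            results.append({
                "label": document_type["_value"],
                "confidence": document_type["confidence_score"],
                "pages": arguments.get("pages", []),
            })
    return results


sample = {
    "choices": [{
        "message": {
            "content": "invoice",
            "tool_calls": [{
                "function": {
                    "arguments": {
                        "document_type": {"_value": "invoice", "confidence_score": 0.99},
                        "pages": [1, 2],
                    }
                }
            }],
        }
    }]
}
parsed = parse_choices(sample)
```

In split mode each detected document is returned as its own choices entry, so the same loop yields one result per group.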

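content: the classification category name
tool_calls.function.arguments.document_type._value: the classification value
tool_calls.function.arguments.document_type.confidence_score: 0.0–1.0
tool_calls.function.arguments.pages: the page range (most useful in split mode)

Document splitting (multi-document PDFs)

For PDFs that contain multiple document types, set extra_body={"split": True} to split them into groups. Each group is returned as a separate choices entry. See references/document-split.md for the full splitting workflow, including the optional split_criteria.

Output files

Default (classification only): /.classified.json (e.g. /tmp/contract.classified.json)
Default (split mode): directory /.split/, with one file per detected document (e.g. page-001.invoice.pdf)
Override: if the user specifies an output path, use it.

Always print the resolved absolute path (or paths) in the response so the user can find the file (or files).

Tips

Include "other" in the categories to handle documents that match no other category.
split is useful as the first step of a document processing pipeline when handling scanned files that mix several documents.
Common pattern: classify first → then apply a category-specific schema with upstage-information-extraction.
Use confidence_score to flag low-confidence documents for manual review.

Detailed references

references/document-split.md: document splitting (basic and with criteria), curl examples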
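The tip about flagging low-confidence documents could be applied as sketched below. The function name partition_by_confidence, the 0.8 threshold, and the input shape (dicts with "label" and "confidence" keys) are all assumptions made for illustration. The source does not specify a recommended cutoff.

```python
from typing import Any


def partition_by_confidence(
    results: list[dict[str, Any]],
    threshold: float = 0.8,  # hypothetical cutoff; tune it on your own documents
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Split results into (accepted, needs_review) by confidence score."""
    accepted: list[dict[str, Any]] = []
    needs_review: list[dict[str, Any]] = []
    for result in results:
        if result["confidence"] >= threshold:
            accepted.append(result)
        else:
            needs_review.append(result)
    return accepted, needs_review


accepted, needs_review = partition_by_confidence([
    {"label": "invoice", "confidence": 0.99},
    {"label": "other", "confidence": 0.41},
])
```

Documents in needs_review can be routed to a human reviewer, while the accepted ones continue to the next pipeline step.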

Data source: ClawHub · Chinese localization: 龙虾技能库