Upstage Information Extraction

v1.0.0

Extract specific named fields from documents using the Upstage Information Extraction API with custom JSON schemas (sync/async) or prebuilt models for receipts,...

by @upstage-deployment (Upstage Deployment)

Runtime dependencies

No special dependencies

Install command

Official: npx clawhub@latest install upstage-information-extraction

Mirror: npx clawhub@latest install upstage-information-extraction --registry https://cn.longxiaskill.com

Skill documentation

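Information Extraction

Extract structured data from documents using custom JSON schemas. Prebuilt models for receipts, invoices, and trade documents are also supported.

Quick start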

import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["UPSTAGE_API_KEY"],
    base_url="https://api.upstage.ai/v1/information-extraction"
)
response = client.chat.completions.create(
    model="information-extract",
    messages=[{
        "role": "user",
        "content": [{
            "type": "image_url",
            "image_url": {
                "url": "https://example.com/invoice.pdf"
            }
        }]
    }],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "invoice_schema",
            "schema": {
                "type": "object",
                "properties": {
                    "invoice_number": {
                        "type": "string",
                        "description": "Invoice ID"
                    },
                    "total_amount": {
                        "type": "string",
                        "description": "Total amount, including currency"
                    },
                    "date": {
                        "type": "string",
                        "description": "Invoice date (YYYY-MM-DD)"
                    }
                }
            }
        }
    }
)
print(response.choices[0].message.content)
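Before sending a schema like the one above, it can help to check it locally against the schema rules and synchronous limits this document lists (allowed top-level types, no nested arrays, property count, name length, schema size). The following is a minimal sketch, not part of the Upstage API: check_schema is a hypothetical helper, and it assumes that only top-level properties are counted and that "schema characters" means the length of the JSON-serialized schema.

```python
import json

# Synchronous-endpoint limits as listed in this document.
MAX_PROPERTIES = 100
MAX_SCHEMA_CHARS = 15_000
MAX_PROPERTY_NAME_CHARS = 10_000
ALLOWED_TOP_LEVEL_TYPES = {"string", "integer", "number", "array"}


def check_schema(schema: dict) -> list:
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    properties = schema.get("properties", {})

    for name, spec in properties.items():
        prop_type = spec.get("type")
        if prop_type not in ALLOWED_TOP_LEVEL_TYPES:
            problems.append(f"{name}: top-level type {prop_type!r} is not allowed")
        # An array whose items are themselves arrays counts as a nested array.
        if prop_type == "array" and spec.get("items", {}).get("type") == "array":
            problems.append(f"{name}: nested arrays are not allowed")

    # Assumption: only top-level properties count toward these limits.
    if len(properties) > MAX_PROPERTIES:
        problems.append(f"too many properties: {len(properties)}")
    if sum(len(name) for name in properties) >= MAX_PROPERTY_NAME_CHARS:
        problems.append("total property-name length must be under 10,000")

    # Assumption: schema size is measured on the serialized JSON text.
    if len(json.dumps(schema, ensure_ascii=False)) > MAX_SCHEMA_CHARS:
        problems.append("schema exceeds 15,000 characters")

    return problems


invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "string"},
        "date": {"type": "string"},
    },
}
print(check_schema(invoice_schema))
```

For the asynchronous endpoint the same checks would apply with the larger limits from the limits table.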

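API key: always use os.environ["UPSTAGE_API_KEY"]. Get your key at console.upstage.ai.

Endpoints

Mode | Endpoint | Sync/Async
---|---|---
Information Extraction | https://api.upstage.ai/v1/information-extraction | Sync
Information Extraction | https://api.upstage.ai/v1/information-extraction/async | Async
Job Status | https://api.upstage.ai/v1/information-extraction/jobs/{job_id} | GET

OpenAI SDK compatible: set base_url to https://api.upstage.ai/v1/information-extraction

Parameters

Parameter | Type | Required | Description
---|---|---|---
model | string | Yes | information-extract or information-extract-nightly
messages | array | Yes | A single user message containing image_url
response_format | object | Yes | Extraction schema (JSON Schema format)
mode | string | No | standard (default) or enhanced
location | boolean | No | Return coordinates (default: false)
confidence | boolean | No | Return confidence scores (default: false)
split | boolean | No | Split multi-document files (default: false)

Limits

Item | Sync | Async
---|---|---
Max pages | 100 | 1,000
Max properties | 100 | 5,000
Max schema characters | 15,000 | 120,000

Schema rules

- Top-level properties: only string, integer, number, and array are allowed (object is not allowed)
- Nested arrays are not allowed
- The total character length of all property names must be less than 10,000
- Automatic schema generation: use the upstage-schema-generation skill

Response structure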

{
    "choices": [
        {
            "message": {
                "content": "{\"invoice_number\": \"INV-001\", \"total_amount\": \"$1,234.56\", \"date\": \"2026-01-15\"}"
            }
        }
    ],
    "usage": {
        "prompt_tokens": 500,
        "completion_tokens": 50
    }
}
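The content field in the response above holds the extracted fields as a JSON-encoded string rather than a nested object. A minimal sketch of decoding it follows; parse_extraction is a hypothetical helper name, and the sample string is copied from the response above.

```python
import json


def parse_extraction(content: str) -> dict:
    """Decode the JSON string found in choices[0].message.content."""
    fields = json.loads(content)
    if not isinstance(fields, dict):
        raise ValueError("expected a JSON object of extracted fields")
    return fields


# Sample content taken from the response structure shown above.
sample = '{"invoice_number": "INV-001", "total_amount": "$1,234.56", "date": "2026-01-15"}'
fields = parse_extraction(sample)
print(fields["invoice_number"], fields["total_amount"])
```

With the OpenAI SDK client, the same helper would be called as parse_extraction(response.choices[0].message.content).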

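The content field is a JSON string. Parse it with json.loads().

Prebuilt models

Ready-to-use models that require no schema definition.

Model | Document type
---|---
receipt-extraction | Receipts
air-waybill-extraction | Air waybills
bill-of-lading-and-shipping-request-extraction | Bills of lading / shipping requests
commercial-invoice-and-packing-list-extraction | Commercial invoices / packing lists
kr-export-declaration-certificate-extraction | Korean export declaration certificates

Prebuilt model usage example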

import os
from openai import OpenAI
client = OpenAI(
    api_key=os.environ["UPSTAGE_API_KEY"],
    base_url="https://api.upstage.ai/v1/information-extraction"
)
response = client.chat.completions.create(
    model="receipt-extraction",
    messages=[{
        "role": "user",
        "content": [{
            "type": "image_url",
            "image_url": {
                "url": "https://example.com/receipt.jpg"
            }
        }]
    }]
)
print(response.choices[0].message.content)

Prebuilt models do not require response_format.

Async processing (large documents)

import os
import time
import requests
api_key = os.environ["UPSTAGE_API_KEY"]
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json"
}
# 1. Submit the async job
response = requests.post(
    "https://api.upstage.ai/v1/information-extraction/async",
    headers=headers,
    json={
        "model": "information-extract",
        "messages": [{
            "role": "user",
            "content": [{
                "type": "image_url",
                "image_url": {
                    "url": "FILE_URL"
                }
            }]
        }],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "schema",
                "schema": {...}  # placeholder: replace with your JSON schema
            }
        }
    }
)
job_id = response.json()["id"]
# 2. Poll for the result
while True:
    status = requests.get(
        f"https://api.upstage.ai/v1/information-extraction/jobs/{job_id}",
        headers=headers
    ).json()
    if status["status"] == "completed":
        print(status["choices"][0]["message"]["content"])
        break
    time.sleep(5)
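The polling loop above runs until the job reports completed, so it never exits if the job fails or stalls. A minimal sketch of a bounded variant follows. wait_for_job is a hypothetical helper, the "failed" status value is an assumption (this document only confirms "completed"), and the status fetch is passed in as a function so the loop can be exercised without network access.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], dict],
    interval: float = 5.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_status until the job completes, fails, or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        state = status.get("status")
        if state == "completed":
            return status
        # Assumption: a failed job reports status "failed".
        if state == "failed":
            raise RuntimeError(f"job failed: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job not completed after {timeout} seconds")
        sleep(interval)
```

In the script above this could be called as wait_for_job(lambda: requests.get(job_url, headers=headers).json()), where job_url is the jobs/{job_id} URL already built in the loop.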

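Output files

- Default: write the extracted JSON to <system temp dir>/<input filename>.extracted.json (for example /tmp/invoice.extracted.json). Use tempfile.gettempdir() for cross-platform code.
- Override: if the user specifies an output path, use it.
- Always print the resolved absolute path in the response so the user can find the file.

Tips

- Enhanced mode can improve accuracy on complex tables and images, but it is slower.
- Set confidence: true to get a confidence score for each field, which can be used for quality filtering.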
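The output-file convention described above (a default path in the temp directory, an optional override, and printing the absolute path) can be sketched as follows. write_extracted is a hypothetical helper, and dropping the input file's extension is an assumption based on the /tmp/invoice.extracted.json example.

```python
import json
import os
import tempfile
from typing import Optional


def write_extracted(fields: dict, input_name: str,
                    output_path: Optional[str] = None) -> str:
    """Write extracted fields as JSON and return the absolute output path."""
    if output_path is None:
        # Assumption: "invoice.pdf" maps to "invoice.extracted.json".
        stem = os.path.splitext(os.path.basename(input_name))[0]
        output_path = os.path.join(tempfile.gettempdir(), f"{stem}.extracted.json")
    output_path = os.path.abspath(output_path)
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(fields, handle, ensure_ascii=False, indent=2)
    return output_path


path = write_extracted({"invoice_number": "INV-001"}, "invoice.pdf")
print(f"Extracted JSON written to {path}")
```

Passing output_path explicitly covers the override case; the returned path is always absolute, so it can be printed as-is.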

Source: ClawHub · Chinese localization: 龙虾技能库