详细分析 ▾
运行时依赖
版本
test
安装命令 点击复制
技能文档
Control the desktop visually: screenshot → AI vision analysis → execute actions → loop until done.
Quick Start
1. Setup (one-time)
Detect platform and install dependencies:
bash scripts/setup/setup-linux.sh --headless # Linux server (no desktop)
bash scripts/setup/setup-linux.sh --desktop # Linux with desktop
bash scripts/setup/setup-mac.sh # macOS
python scripts/setup/setup-win.py # Windows
2. Configure API
Copy config.example.json to config.json and fill in your vision API credentials.
You must set baseUrl, apiKey, and model — supports any OpenAI-compatible API.
{
"vision": {
"baseUrl": "https://api.siliconflow.cn/v1",
"apiKey": "sk-your-key",
"model": "Qwen/Qwen3-VL-32B"
}
}
Environment variables also work: SV_VISION_API_KEY, SV_VISION_BASE_URL, SV_VISION_MODEL.
See references/API_CONFIG.md for all supported providers and detailed setup.
3. Usage
The skill operates through a screenshot-analyze-action loop:
- Take screenshot →
bash scripts/platform/screenshot.sh [output_path] [display] - Analyze with AI →
python3 scripts/vision/analyze.py --image--task " " - Execute action →
python3 scripts/platform/execute.py --action[options] - Full task loop →
python3 scripts/core/run_task.py --task ""
Architecture
User task → run_task.py (orchestrator)
├── screenshot.sh (capture screen)
├── diff_check.py (detect changes, skip if unchanged → saves tokens)
├── analyze.py (send screenshot + task to vision API)
├── safety_check.py (block dangerous operations)
├── execute.py (xdotool/cliclick/pyautogui)
└── loop until done or timeout
Platform Tools
| Platform | Screenshot | Mouse/Keyboard | Notes |
|---|---|---|---|
| Linux | scrot | xdotool | Headless: XFCE4 + VNC |
| macOS | screencapture | cliclick | Needs Accessibility permission |
| Windows | pyautogui | pyautogui | No extra setup needed |
Vision Providers
Supports any OpenAI-compatible vision API. You choose the provider and model.
Recommended Models
| Model | Provider | Cost/Task | Quality |
|---|---|---|---|
| Qwen3-VL-32B | SiliconFlow | Low | ★★★★ |
| GLM-4V-Plus | Zhipu BigModel | Low | ★★★★ |
| GPT-5.4-Mini | OpenAI / relays | Medium | ★★★★★ |
| GPT-5.4 CUA | OpenAI | High | ★★★★★ |
| Llama 3.2 Vision | Ollama (local) | Free | ★★ |
No defaults are hardcoded — you must configure your own API credentials before use.
Action Types
click— Click at (x, y). Supports left/right/double-click.type— Type text string.key— Press a key (Return, Tab, Escape, etc.).scroll— Scroll up or down.drag— Drag from (x1,y1) to (x2,y2).wait— Wait for screen to update.done— Task complete.failed— Cannot complete task.
Safety
- Blocked: rm -rf, format disk, shutdown, drop database, etc.
- Confirmation required: delete, sudo, payment-related operations
- Limits: max 5 minutes, max 100 actions per task
- Logging: all screenshots saved to
/tmp/screen-vision/logs/ - Auto-stop on error or API failure
Examples
See references/EXAMPLES.md for usage examples.
Config
| Variable | Default | Description |
|---|---|---|
SV_VISION_API_KEY | — | Vision API key |
SV_VISION_BASE_URL | — | API endpoint (required) |
SV_VISION_MODEL | — | Vision model name (required) |
SV_DISPLAY | :1 | X11 display (Linux) |
SV_MAX_DURATION | 5 | Max task duration (min) |
SV_MAX_ACTIONS | 100 | Max actions per task |
SV_SCREENSHOT_INTERVAL | 1.0 | Seconds between screenshots |
免费技能或插件可能存在安全风险,如需更匹配、更安全的方案,建议联系付费定制