GPU Container Setup Skill
This skill automates multi-vendor GPU container setup for PyTorch workloads.
Supported GPU Vendors

| Vendor | PyTorch Backend | Detection |
| --- | --- | --- |
| NVIDIA | CUDA | nvidia-smi |
| AMD | ROCm (HIP) | rocm-smi, /opt/rocm |
| Ascend | torch_npu | npu-smi, /usr/local/Ascend |
| Metax | torch_musa | mx-smi, /opt/metax |
| Iluvatar | torch_corex | ixsmi, /opt/iluvatar |

Execution Flow
When invoked, follow these steps:
Step 1: Parse Arguments
Check whether the user provided:
--vendor - Force specific vendor (skip detection)
--image - Force specific container image
--data - Force specific data mount path
--name - Container name (default: pytorch-gpu)
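The four overrides above map naturally onto a small argument parser. The following is a minimal sketch, not the skill's actual parsing code: the program name and the lowercase vendor keys are assumptions (only "ascend" appears verbatim in the sample detection output).

```python
import argparse

# Lowercase vendor keys are an assumed spelling of the vendors in the table above.
VENDORS = ["nvidia", "amd", "ascend", "metax", "iluvatar"]


def parse_args(argv=None):
    """Parse the optional overrides listed in Step 1."""
    # "gpu-container-setup" is a hypothetical program name used for help output.
    parser = argparse.ArgumentParser(prog="gpu-container-setup")
    parser.add_argument("--vendor", choices=VENDORS,
                        help="force a specific vendor and skip detection")
    parser.add_argument("--image", help="force a specific container image")
    parser.add_argument("--data", help="force a specific data mount path")
    parser.add_argument("--name", default="pytorch-gpu", help="container name")
    return parser.parse_args(argv)
```

Any option left unset comes back as None (or the default name), so later steps can test each value to decide whether detection is still needed.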
Step 2: Detect GPU Vendor
Run the detection script:
python3 .claude/skills/gpu-container-setup/scripts/detect_gpu.py
Expected output:
{"vendor": "ascend", "devices": ["Ascend 910B"], "count": 8}
If detection fails and no --vendor flag was provided, ask the user which vendor to use.
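To illustrate how the detection hints in the vendor table could be checked, here is a rough sketch. It is not the contents of the detection script: it only reports a vendor key (no device names or counts), and the lowercase keys and the first-match ordering are assumptions.

```python
import os
import shutil

# (vendor key, CLI tool, install directory) from the vendor table.
# None means the table lists no directory for that vendor.
DETECTION_HINTS = [
    ("nvidia", "nvidia-smi", None),
    ("amd", "rocm-smi", "/opt/rocm"),
    ("ascend", "npu-smi", "/usr/local/Ascend"),
    ("metax", "mx-smi", "/opt/metax"),
    ("iluvatar", "ixsmi", "/opt/iluvatar"),
]


def detect_vendor(which=shutil.which, exists=os.path.isdir):
    """Return the first vendor whose tool is on PATH or whose directory exists."""
    for vendor, tool, directory in DETECTION_HINTS:
        if which(tool) or (directory and exists(directory)):
            return vendor
    return None  # caller should fall back to asking the user
```

The `which` and `exists` parameters exist only so the function can be exercised without real hardware.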
Step 3: Find Data Disk
Run the data disk detection:
python3 .claude/skills/gpu-container-setup/scripts/find_data_disk.py
Expected output:
{"data_disk": "/mnt/data", "found": true, "size": "2.0T", "available": "1.5T"}
If no suitable disk is found, ask the user for a data mount path.
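One way to combine the script's JSON output with the --data override is sketched below. The function name is hypothetical, and the rule that an explicit override always wins is an assumption based on the "Force" wording in Step 1.

```python
import json


def resolve_data_path(script_output, override=None):
    """Return the host path to mount at /data, or None if the user must be asked."""
    if override:  # an explicit --data value takes precedence over detection
        return override
    try:
        result = json.loads(script_output)
    except json.JSONDecodeError:
        return None  # unreadable output is treated like "not found"
    if isinstance(result, dict) and result.get("found") and result.get("data_disk"):
        return result["data_disk"]
    return None
```

A None result corresponds to the "ask the user" branch above.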
Step 4: Find Container Image
Follow strict priority order (only proceed to the next source if the current one fails):
1. Primary Vendor Hub (hardcoded) → 2. BAAI Harbor → 3. Web Search → 4. Local Images → 5. Ask User
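The priority order can be read as a fallback chain: try each source in turn and stop at the first image that passes a smoke test. A minimal sketch follows. The `sources` list and the `smoke_test` callable are hypothetical stand-ins for the lookups and the pull-and-import test described in the following steps.

```python
def find_image(sources, smoke_test):
    """Try each (name, lookup) pair in order; return the first image that passes.

    `lookup` is a zero-argument callable returning an image reference or None.
    `smoke_test` takes an image reference and returns True if it is usable.
    """
    for name, lookup in sources:
        try:
            image = lookup()
        except Exception:
            image = None  # an unreachable source counts as a failure
        if image and smoke_test(image):
            return name, image
    return None, None  # every source failed: ask the user
```

Returning the source name alongside the image lets the caller tell whether the image came from a web search, which matters for the self-improvement step.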
Step 4.1: Primary Vendor Hub (hardcoded URLs)

| Vendor | Registry | API/Query |
| --- | --- | --- |
| NVIDIA | nvcr.io | https://api.ngc.nvidia.com/v2/repos/nvidia/pytorch/tags |
| Ascend | ascendhub.huawei.com | Portal: https://ascendhub.huawei.com |
| Metax | registry.metax-tech.com | https://registry.metax-tech.com/v2/pytorch/metax-pytorch/tags/list |
| Iluvatar | hub.iluvatar.com | https://hub.iluvatar.com/v2/pytorch/iluvatar-pytorch/tags/list |
| AMD | docker.io (rocm/pytorch) | https://hub.docker.com/v2/repositories/rocm/pytorch/tags |

# Example: Query NGC for latest NVIDIA PyTorch
TAG=$(curl -s "https://api.ngc.nvidia.com/v2/repos/nvidia/pytorch/tags" | jq -r '.tags[].name' | grep -E '^[0-9]{2}\.[0-9]{2}-py3$' | sort -rV | head -1)
IMAGE="nvcr.io/nvidia/pytorch:${TAG}"
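The NGC example keeps only tags of the form YY.MM-py3 and takes the highest version. The same selection can be sketched in Python, assuming the tag names have already been extracted from the API response. The tag values in the usage line are made up for illustration.

```python
import re

# Mirrors the grep pattern in the shell example: two digits, a dot, two digits, "-py3".
TAG_PATTERN = re.compile(r"([0-9]{2})\.([0-9]{2})-py3")


def latest_release_tag(tags):
    """Return the newest YY.MM-py3 tag, or None if no tag matches."""
    versions = []
    for tag in tags:
        match = TAG_PATTERN.fullmatch(tag)
        if match:
            # Compare numerically on (year, month) rather than as strings.
            versions.append((int(match.group(1)), int(match.group(2)), tag))
    return max(versions)[2] if versions else None


# Illustrative input only; these are not real query results.
print(latest_release_tag(["23.12-py3", "24.03-py3", "24.03-py3-igpu", "latest"]))  # 24.03-py3
```

A None result means this source produced no usable tag, which should trigger the next fallback.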
Step 4.2: BAAI Harbor (fallback)
Only if Step 4.1 fails (unreachable, no image, pull fails).
# Query BAAI Harbor
curl -s "https://harbor.baai.ac.cn/api/v2.0/projects/flagrelease-public/repositories?page_size=100" | jq -r '.[].name' | grep "flagrelease-"
Step 4.3: Web Search (fallback)
Only if Steps 4.1 and 4.2 fail. Search for " pytorch docker official".
Step 4.4: Local Images (fallback)
Only if Steps 4.1-4.3 fail. Check docker images | grep pytorch.
Test Before Use
docker pull "${IMAGE}" && docker run --rm "${IMAGE}" python -c "import torch; print(torch.__version__)"
If the test fails, try the next source. If all sources fail, ask the user for an image.
Step 4.5: Update Skill (self-improvement)
IMPORTANT: If an image found via web search (Step 4.3) passes all tests, update references/image-sources.md to add the newly discovered vendor hub as a primary source. This makes future lookups faster.
# After successful web search discovery:
# 1. Verify image works (pull + pytorch test + GPU test)
# 2. Extract registry URL pattern
# 3. Update references/image-sources.md Step 1 section with new vendor hub
Step 5: Build Docker Command
Refer to references/mount-requirements.md for vendor-specific requirements.
NVIDIA:
docker run -d --gpus all \
--name pytorch-gpu \
--shm-size=16g \
-v :/data \
sleep infinity
AMD/ROCm:
docker run -d \
--device=/dev/kfd --device=/dev/dri \
--group-add video --group-add render \
--name pytorch-gpu \
--shm-size=16g \
-v :/data \
sleep infinity
Ascend:
docker run -d \
--device=/dev/davinci0 --device=/dev/davinci1 ... \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend:/usr/local/Ascend:ro \
-v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi:ro \
--name pytorch-gpu \
--shm-size=16g \
-v :/data \
sleep infinity
Metax:
docker run -d \
--device=/dev/mx0 --device=/dev/mx1 ... \
-v /opt/metax:/opt/metax:ro \
--name pytorch-gpu \
--shm-size=16g \
-v :/data \
sleep infinity
Iluvatar:
docker run -d \
--device=/dev/bi0 --device=/dev/bi1 ... \
-v /opt/iluvatar:/opt/iluvatar:ro \
--name pytorch-gpu \
--shm-size=16g \
-v :/data \
sleep infinity
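The five templates differ only in their device and mount flags, so they can be assembled from one table. The sketch below makes several assumptions that the templates do not state: the host data path goes before ":/data", the image reference is placed immediately before "sleep infinity" (the position `docker run` expects for IMAGE), and the "..." in the per-device lists means one flag per device index from 0 up to the detected count.

```python
# Device-node prefixes for vendors that expose one node per accelerator.
PER_DEVICE_PREFIX = {"ascend": "/dev/davinci", "metax": "/dev/mx", "iluvatar": "/dev/bi"}

# Flags that do not depend on the device count, copied from the templates above.
FIXED_FLAGS = {
    "nvidia": ["--gpus", "all"],
    "amd": ["--device=/dev/kfd", "--device=/dev/dri",
            "--group-add", "video", "--group-add", "render"],
    "ascend": ["--device=/dev/davinci_manager", "--device=/dev/devmm_svm",
               "--device=/dev/hisi_hdc",
               "-v", "/usr/local/Ascend:/usr/local/Ascend:ro",
               "-v", "/usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi:ro"],
    "metax": ["-v", "/opt/metax:/opt/metax:ro"],
    "iluvatar": ["-v", "/opt/iluvatar:/opt/iluvatar:ro"],
}


def build_run_command(vendor, image, data_path, device_count=0, name="pytorch-gpu"):
    """Return the docker run argument list for one vendor (sketch, see assumptions)."""
    if vendor not in FIXED_FLAGS:
        raise ValueError(f"unsupported vendor: {vendor}")
    cmd = ["docker", "run", "-d"]
    prefix = PER_DEVICE_PREFIX.get(vendor)
    if prefix:
        # Assumed expansion of "...": one --device flag per index.
        cmd += [f"--device={prefix}{i}" for i in range(device_count)]
    cmd += FIXED_FLAGS[vendor]
    cmd += ["--name", name, "--shm-size=16g", "-v", f"{data_path}:/data"]
    cmd += [image, "sleep", "infinity"]  # assumed image position
    return cmd
```

Building a list rather than a shell string lets the result be passed straight to `subprocess.run` without quoting concerns. Check references/mount-requirements.md before relying on any vendor's flag set.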
Step 6: Start Container
Execute the docker run command. If a container with the same name exists:
Check if it's running - offer to use the existing container or replace it
If stopped - offer to restart or replace it
Step 7: Verify PyTorch GPU
Copy and run the verification script inside the container:
docker cp .claude/skills/gpu-container-setup/scripts/verify_pytorch.py pytorch-gpu:/tmp/
docker exec pytorch-gpu python3 /tmp/verify_pytorch.py
Expected output:
{ "status": "PASS", "backend": "npu", "device_count": 8, "device_names": ["Ascend 910B", ...], "te