K8s Cost Optimizer
v1.0.0
Find and rank Kubernetes cost-saving opportunities from kubectl, metrics-server, kube-state-metrics, and cloud billing. Identifies overprovisioned CPU/memory requests and limits, idle namespaces and workloads, oversized PersistentVolumes, unused LoadBalancer services, expensive node types, missing HorizontalPodAutoscalers, and clusters that haven't adopted spot/preemptible/Graviton nodes. Outputs a ranked list of recommendations with $/month savings estimates and ready-to-apply YAML patches. Covers EKS, GKE, and AKS specifics including instance pricing, savings plans, committed-use discounts, and reservation strategies. Use when asked to cut a Kubernetes cloud bill, right-size workloads, plan a spot migration, build a FinOps report, or tune HPA settings. Triggers on "kubernetes cost", "k8s cost", "eks cost", "gke cost", "aks cost", "right-size", "rightsize", "kubecost", "opencost", "vpa", "hpa", "spot instances", "preemptible", "savings plan", "node pool", "pod requests", "finops".
Kubernetes Cost Optimizer
Audit a Kubernetes cluster (or fleet) and produce a ranked list of cost-saving actions with concrete dollar estimates. Looks at requests/limits vs actual usage, idle workloads, expensive node types, missing autoscaling, public LBs, oversized PVs, and unused capacity. Acts as a senior FinOps engineer who has cut six- and seven-figure cloud bills without breaking workloads.
Usage
Invoke this skill when a Kubernetes bill is too high, when a quarterly FinOps review is due, or when leadership has asked for "30% off the cloud."
Basic invocation:
Audit my EKS cluster for cost savings
Cut my GKE bill: here's kubectl top + node list
What's the highest-ROI optimization I can ship this week?
With context:
Here's metrics-server data for 30 days, the node list, and the AWS bill
I have 14 namespaces. Which ones are idle?
We're 100% on-demand m5 nodes. What's the spot migration plan?
The agent produces a ranked recommendation list (highest $/month savings first), per-recommendation YAML patches or commands, and a four-week implementation plan that respects production safety.
How It Works
Step 1: Data Collection
Cost optimization without data is guesswork. The agent collects from four sources and joins them:
| Source | What It Provides | How To Pull |
| --- | --- | --- |
| kubectl + metrics-server | Real CPU/memory usage per pod, per node | kubectl top pods -A, kubectl top nodes |
| kube-state-metrics / Prometheus | Requests, limits, replicas, deployment-level history | PromQL: kube_pod_container_resource_requests, 30-day window |
| Cloud billing | $/node-hour, instance type, region, sustained-use | AWS Cost Explorer, GCP billing export, Azure Cost Management |
| Cluster object inventory | Namespaces, services, PVCs, ingress, jobs, cronjobs | kubectl get all,pvc,svc -A -o json |
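As a minimal sketch of the first source, the snippet below turns the text printed by `kubectl top pods -A --no-headers` into numeric cores and MiB so it can be joined with the other sources. The function names and the sample rows are illustrative assumptions, not part of the skill, and the parser assumes the usual four columns (namespace, pod, CPU, memory).

```python
def parse_cpu_cores(value: str) -> float:
    """Convert a CPU quantity such as '250m' or '2' to cores."""
    if value.endswith("m"):
        return int(value[:-1]) / 1000
    return float(value)


# Binary suffixes mapped to a MiB multiplier.
_MIB_PER_UNIT = {"Ki": 1 / 1024, "Mi": 1.0, "Gi": 1024.0}


def parse_memory_mib(value: str) -> float:
    """Convert a memory quantity such as '512Mi' or '3Gi' to MiB."""
    for suffix, factor in _MIB_PER_UNIT.items():
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    return float(value) / (1024 * 1024)  # no suffix: assume plain bytes


def parse_top_pods(text: str) -> list[dict]:
    """Parse 'NAMESPACE NAME CPU MEMORY' rows into dictionaries."""
    rows = []
    for line in text.strip().splitlines():
        namespace, pod, cpu, memory = line.split()
        rows.append(
            {
                "namespace": namespace,
                "pod": pod,
                "cpu_cores": parse_cpu_cores(cpu),
                "mem_mib": parse_memory_mib(memory),
            }
        )
    return rows


# Hypothetical sample in the shape of `kubectl top pods -A --no-headers`.
SAMPLE = """\
default   web-7d9f   250m   512Mi
batch     etl-x2k    2      3Gi
"""

for row in parse_top_pods(SAMPLE):
    print(row)
```

A single `kubectl top` call is only a point-in-time snapshot, so a sketch like this would have to be sampled repeatedly (or replaced by Prometheus queries) to build the multi-day window the audit needs.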
Data window matters. The agent prefers 30 days, uses 7 days for fast-moving clusters, and uses 90 days for capacity planning. Anything under 7 days is too short: it cannot cover a full weekly cycle, so diurnal and weekly patterns are indistinguishable from noise.
If Kubecost or OpenCost is installed, the agent uses the cluster's per-namespace cost allocation directly. Otherwise it computes allocations from node price × pod-share-of-node.
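The fallback allocation (node price × pod-share-of-node) could look roughly like the sketch below. The source does not say how the share is computed, so this assumes an equal-weight average of the pod's CPU and memory request shares; the class names, the 30-day month, and the node price are all illustrative.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month


@dataclass
class Node:
    price_per_hour: float  # on-demand $/hour, taken from the cloud bill
    cpu_cores: float
    mem_gib: float


@dataclass
class Pod:
    name: str
    cpu_request: float  # cores
    mem_request_gib: float


def pod_share(pod: Pod, node: Node) -> float:
    """Fraction of the node attributed to the pod (assumed blend)."""
    cpu_share = pod.cpu_request / node.cpu_cores
    mem_share = pod.mem_request_gib / node.mem_gib
    return (cpu_share + mem_share) / 2


def pod_monthly_cost(pod: Pod, node: Node) -> float:
    """Node price x pod share, scaled to a month."""
    return node.price_per_hour * HOURS_PER_MONTH * pod_share(pod, node)


# Illustrative numbers only: a 4-core / 16 GiB node at $0.20/hour.
node = Node(price_per_hour=0.20, cpu_cores=4, mem_gib=16)
pod = Pod("web", cpu_request=1.0, mem_request_gib=4.0)
print(f"{pod.name}: ${pod_monthly_cost(pod, node):.2f}/month")
```

Other share definitions (for example taking the larger of the CPU and memory shares) are equally defensible; the point is only that the shares of all pods on a node should account for the node's full price once idle capacity is assigned somewhere.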
Step 2: The Cost Recommendation Catalog
The agent runs the cluster against a fixed set of recommendation types, each with a detection rule and a savings formula.
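One way to picture this structure is a list of rule functions whose findings are merged and sorted by estimated savings. The sketch below is an assumption about shape only: the `Recommendation` type, the rule functions, and the layout of the `cluster` dictionary are hypothetical, and the real detection logic is replaced by precomputed findings.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Recommendation:
    rule_id: str            # e.g. "C1"
    target: str             # workload, namespace, or resource name
    monthly_savings: float  # estimated $/month


# A rule inspects the collected cluster data and yields recommendations.
Rule = Callable[[dict], Iterable[Recommendation]]


def overprovisioned_cpu(cluster: dict) -> Iterable[Recommendation]:
    """Stand-in for C1: reads precomputed (name, savings) pairs."""
    for name, savings in cluster.get("cpu_findings", []):
        yield Recommendation("C1", name, savings)


def idle_namespace(cluster: dict) -> Iterable[Recommendation]:
    """Stand-in for C4: reads precomputed (name, savings) pairs."""
    for name, savings in cluster.get("idle_namespaces", []):
        yield Recommendation("C4", name, savings)


def run_catalog(cluster: dict, rules: list[Rule]) -> list[Recommendation]:
    """Apply every rule, then rank by savings, highest first."""
    found = [rec for rule in rules for rec in rule(cluster)]
    return sorted(found, key=lambda rec: rec.monthly_savings, reverse=True)


# Hypothetical input; the numbers are illustrative.
cluster = {
    "cpu_findings": [("shop/web", 69.12)],
    "idle_namespaces": [("legacy", 410.0)],
}
for rec in run_catalog(cluster, [overprovisioned_cpu, idle_namespace]):
    print(rec.rule_id, rec.target, rec.monthly_savings)
```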
C1. Overprovisioned CPU requests
Detection: for each container,
  p99(cpu_usage over 30d) < 0.50 × cpu_request
  AND container has >7 days of data
  AND deployment is not a known-bursty type (cron, batch, init)
Savings estimate: ($/cpu-hour for the node pool) × (request - p99_usage) × 24 × 30 × replicas
Action: patch container.resources.requests.cpu down to ceil(p95 × 1.3)
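The C1 rule, savings formula, and action can be sketched together as below. This is an illustration under assumptions: usage samples are integer millicores, percentiles use the nearest-rank method, `ceil` rounds up to a whole millicore, and the $/cpu-hour price in the example is made up.

```python
import math

HOURS_PER_MONTH = 24 * 30


def percentile(samples: list[int], pct: int) -> int:
    """Nearest-rank percentile over integer samples (pct in 1..100)."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def cpu_recommendation(usage_m, request_m, replicas, price_per_cpu_hour,
                       days_of_data, bursty=False):
    """Return a C1 recommendation, or None when the rule does not fire.

    usage_m and request_m are in millicores.
    """
    p95 = percentile(usage_m, 95)
    p99 = percentile(usage_m, 99)
    flagged = p99 < 0.50 * request_m and days_of_data > 7 and not bursty
    if not flagged:
        return None
    new_request_m = math.ceil(p95 * 1.3)     # action: p95 plus 30% headroom
    freed_cores = (request_m - p99) / 1000   # savings formula uses p99
    savings = price_per_cpu_hour * freed_cores * HOURS_PER_MONTH * replicas
    return {"new_request_m": new_request_m,
            "monthly_savings": round(savings, 2)}


# Illustrative workload: mostly 100m with occasional 200m peaks, requesting 1 core.
usage = [100] * 95 + [200] * 5
print(cpu_recommendation(usage, request_m=1000, replicas=3,
                         price_per_cpu_hour=0.04, days_of_data=30))
```

Note that the savings estimate is based on p99 while the new request is based on p95, so the realized saving can differ from the estimate depending on how far apart the two percentiles are.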
C2. Overprovisioned memory requests
Detection: p99(memory_working_set over 30d) < 0.50 × memory_request
Savings: ($/GiB-hour for the node pool) × (request - p99_usage) × 24 × 30 × replicas
Action: patch container.resources.requests.memory down to ceil(p99 × 1.25)
NOTE: never set requests below working-set p99. OOMKills kill the savings.
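A hedged sketch of how the C2 action might be turned into a ready-to-apply patch: the function below builds a strategic-merge patch document for a Deployment's pod template and refuses to go below the working-set p99. The function name, the container name, and the choice of whole-MiB rounding are assumptions for illustration.

```python
import json
import math


def memory_patch(container: str, request_mib: float, p99_mib: float,
                 headroom: float = 1.25):
    """Return a patch dict lowering the memory request, or None.

    The rule fires only when p99 working set < 50% of the request.
    """
    if not p99_mib < 0.50 * request_mib:
        return None
    # Guard from the NOTE above: never go below the working-set p99.
    new_mib = max(math.ceil(p99_mib * headroom), math.ceil(p99_mib))
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container,
                            "resources": {
                                "requests": {"memory": f"{new_mib}Mi"}
                            },
                        }
                    ]
                }
            }
        }
    }


# Illustrative values: 1 GiB requested, 200 MiB p99 working set.
patch = memory_patch("app", request_mib=1024, p99_mib=200)
print(json.dumps(patch))
```

The printed JSON could then be passed to `kubectl patch deployment <name> -n <namespace> --patch '<json>'`, with the deployment and namespace filled in by hand; reviewing the patch before applying it is the safer path in production.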
C3. Limits == requests (no burst)
Detection: cpu_limit == cpu_request for stateless workloads (typical anti-pattern: "treat limits as guaranteed quota")
Savings: none directly, but C1 savings dominate once limits are unblocked
Action: raise or remove CPU limits; keep memory limits
C4. Idle namespace
Detection: sum(p95 cpu over 30d) across all pods in ns < 0.05 cores
  AND sum(p95 memory) < 200 MiB
  AND no recent kubectl apply (last_modified > 30 days)
Savings: all allocated capacity (request × node $)
Action: warn → tag → archive (Helm release deleted, namespace archived)
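The three C4 conditions combine into a simple predicate. The sketch below assumes the per-pod p95 values and the last-modified timestamp have already been collected; the `NamespaceUsage` type and the sample namespaces are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class NamespaceUsage:
    name: str
    pod_p95_cpu_cores: list[float]  # one 30-day p95 value per pod
    pod_p95_mem_mib: list[float]
    last_modified: datetime         # most recent apply in the namespace


def is_idle(ns: NamespaceUsage, now: datetime) -> bool:
    """All three C4 conditions must hold."""
    return (
        sum(ns.pod_p95_cpu_cores) < 0.05
        and sum(ns.pod_p95_mem_mib) < 200
        and now - ns.last_modified > timedelta(days=30)
    )


# Fixed reference time so the example is reproducible.
NOW = datetime(2024, 6, 1, tzinfo=timezone.utc)
legacy = NamespaceUsage("legacy", [0.01, 0.02], [40.0, 60.0],
                        NOW - timedelta(days=90))
shop = NamespaceUsage("shop", [0.40, 0.35], [900.0, 750.0],
                      NOW - timedelta(days=2))

print([ns.name for ns in (legacy, shop) if is_idle(ns, NOW)])
```

Requiring the last-modified condition keeps a freshly created but still quiet namespace from being flagged, which matters because the action ends in archiving.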
C5. Idle Deployment / StatefulSet
Detection: replicas > 0
  AND p99(cpu) < 0.02 cores
  AND request_count == 0 over 30d (request_count from ingress-controller or service mesh)
Savings: replicas × pod_cost / month
Action: scale to zero (KEDA cron, or just kubectl scale --replicas=0)
C6. Oversized PersistentVolume
Detection: for each PVC, kubelet_volume_stats_used / capacity < 0.3 AND age > 30 days
Savings: ($/GB-month for storage class) × (capacity - used × 1.5)
Action:
  - On EKS gp3: shrink not supported. Migrate via snapshot → smaller PV.
  - On GKE pd-balanced: same, snapshot migration.
  - On AKS managed-disks: same. Plan downtime.
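A worked instance of the C6 formula, as a sketch: the target size is taken to be used × 1.5, which is what the savings formula implies, and the $/GB-month price is an illustrative placeholder rather than a quoted rate.

```python
def pv_recommendation(capacity_gb: float, used_gb: float, age_days: int,
                      price_per_gb_month: float):
    """Return target size and monthly savings, or None if the rule does not fire."""
    if not (used_gb / capacity_gb < 0.3 and age_days > 30):
        return None
    target_gb = used_gb * 1.5  # keep 50% headroom over current usage
    savings = price_per_gb_month * (capacity_gb - target_gb)
    return {"target_gb": target_gb, "monthly_savings": round(savings, 2)}


# Illustrative volume: 500 GB provisioned, 100 GB used, 90 days old.
print(pv_recommendation(capacity_gb=500, used_gb=100, age_days=90,
                        price_per_gb_month=0.08))
```

Because shrinking means a migration with planned downtime, small per-volume savings may not justify the work; ranking these findings against the other rule types helps decide which volumes are worth touching.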
C7. Unused LoadBalancer service
Detection: Service type=LoadBalancer AND no NetworkPolicy hits AND no ingress traffic in 30d (cloud LB metrics)
Savings:
  AWS NLB: ~$22/mo + $0.006/LCU-hr → $25-50/mo typical
  GCP LB: ~$18/mo per forwarding rule
  Azure LB: ~$25/mo standard tier
Action: delete the Service or convert it to ClusterIP behind a shared ingress
C8. Expensive node type
Detection: node pool uses x86 on a workload that's arch-independent
  AND no GPU/specialized requirement
  AND newer-gen / Graviton / Tau alternative is cheaper per CPU-hour
Savings:
  AWS: m5 → m7g (Graviton) ~20% cheaper, similar perf
  GCP: n2 → t2d (Tau AMD) ~28%