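Alongside kubectl logs and kubectl get events, kubectl describe is an essential command for troubleshooting Kubernetes. It plays a role similar to docker inspect: its output is more detailed than kubectl get and easier to read than raw YAML. This article is long, but all of it is practical material.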
# Show full details for a Pod (the form opsnot uses most)
kubectl describe pod opsnot-postgresql
# Specify a namespace
kubectl describe pod my-pod -n opsnot-postgresql
# Describe every Pod in the current namespace (opsnot.com warning: use with care, the output is huge)
kubectl describe pods
# Describe a Deployment (used often at opsnot.com)
kubectl describe deployment opsnot-mariadb
# Describe a Service
kubectl describe service opsnot-service
# Describe a Node
kubectl describe node node-mariadb
# Describe a ConfigMap
kubectl describe configmap mariadb-config
# Describe a Secret (values are hidden; only their sizes are shown)
kubectl describe secret mariadb-secret
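# Describe the Pod and focus on the Events section at the very bottom
kubectl describe pod ops-elasticsearch -n prod
# opsnot.com experience: Events are the key to troubleshooting.
# The cause of ImagePullBackOff or CrashLoopBackOff usually shows up there.
Worked example: a Pod is stuck in Pending, and the Events section of describe shows:
Warning  FailedScheduling  node didn't have enough memory
No node has enough unreserved memory to satisfy the Pod's memory requests. Lower the requests (or add node capacity) and the Pod will schedule.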
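Since Events is always the last section of the output, it can be isolated without guessing a line count for grep -A. The following is a minimal sketch; the function name events_section and the trimmed sample text are illustrative assumptions, not real cluster output.

```shell
# Print only the Events section of `kubectl describe` output read from stdin.
# Events is the last section, so everything from the "Events:" line onward is kept.
events_section() {
    awk '/^Events:/ { found = 1 } found'
}

# Hypothetical, trimmed sample standing in for real `kubectl describe pod` output.
sample_describe() {
    cat <<'EOF'
Name:         my-pod
Status:       Pending
Events:
  Type     Reason            Message
  ----     ------            -------
  Warning  FailedScheduling  node didn't have enough memory
EOF
}

sample_describe | events_section
```

Against a real cluster the same function would be used as: kubectl describe pod my-pod | events_section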
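# The Pod describe output shows, for each container:
 - State (Running/Waiting/Terminated)
 - Restart count
 - Exit code
 - Reason for the last termination
kubectl describe pod my-elasticsearch | grep -A 10 "State:"
kubectl describe pod my-elasticsearch | grep "Restart Count"
Worked example: a container keeps restarting, and describe shows:
Last State: Terminated
  Reason: Error
  Exit Code: 137
Exit code 137 is 128 + 9, meaning the container was killed with SIGKILL. The most common cause is the OOM killer (a kill at the container's own memory limit is reported as Reason: OOMKilled), so check memory usage first and raise the memory limit if the workload really needs more.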
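The arithmetic behind exit code 137 can be sketched as a small helper. The function name explain_exit_code is an illustrative assumption; the 128 + signal convention is the standard one for processes terminated by a signal.

```shell
# Translate a container exit code into a short explanation.
# Codes above 128 follow the convention 128 + signal number.
explain_exit_code() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "exited normally"
    elif [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
    else
        echo "application error (exit code $code)"
    fi
}

explain_exit_code 137   # 137 - 128 = 9, i.e. SIGKILL
explain_exit_code 143   # 143 - 128 = 15, i.e. SIGTERM
```

Signal 9 alone does not prove an out-of-memory kill, so it is still worth confirming with the Reason field or the node's kernel log.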
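# describe shows each container's requests and limits
kubectl describe pod opsnot-redis | grep -A 5 "Limits:"
kubectl describe pod opsnot-redis | grep -A 5 "Requests:"
# Show the QoS class derived from them (Guaranteed, Burstable, or BestEffort)
kubectl describe pod ops-not-kafka | grep "QoS Class"
Worked example: when cluster resources are tight, these commands show which Pods carry large requests. describe reports only the configured values, so compare them with actual usage from kubectl top pod to find Pods that request far more than they use, then lower those requests to free up capacity.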
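Comparing a configured request with measured usage comes down to a unit conversion and a ratio. This is a rough sketch under stated assumptions: the helper names are hypothetical, and only CPU quantities written as millicores ("250m") or whole cores ("2") are handled.

```shell
# Convert a Kubernetes CPU quantity ("250m" or "2") to millicores.
# Only these two forms are handled; fractional cores such as "0.5" are not.
cpu_to_millicores() {
    case $1 in
        *m) echo "${1%m}" ;;
        *)  echo $(( $1 * 1000 )) ;;
    esac
}

# Percentage of the requested CPU that is actually in use (integer, rounded down).
# The request must be non-zero.
request_utilization() {
    req=$(cpu_to_millicores "$1")
    used=$(cpu_to_millicores "$2")
    echo $(( used * 100 / req ))
}

request_utilization 1000m 50m   # request 1000m, usage 50m
request_utilization 2 500m      # request 2 cores, usage 500m
```

A Pod that stays at a few percent over a representative period is a candidate for a smaller request; a single sample from kubectl top is not enough to decide.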
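# Describe a Node
kubectl describe node worker-node-1
# Focus on the Conditions section (recommended by opsnot.com)
kubectl describe node worker-node-1 | grep -A 10 "Conditions:"
# Common conditions:
# Ready: True/False (True is healthy)
# MemoryPressure: True/False (False is healthy)
# DiskPressure: True/False (False is healthy)
# PIDPressure: True/False (False is healthy)
Worked example: Pods cannot be scheduled onto a node, and describe node shows:
DiskPressure: True
Message: kubelet has disk pressure
The node is running low on disk space. Free up space and the condition clears; in most cases runaway logs are to blame.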
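The rule for reading Conditions (Ready should be True, every pressure condition should be False) can be applied mechanically. Below is a sketch that assumes the Conditions table starts each row with the Type and Status columns; the function name and the trimmed sample (timestamp columns omitted) are illustrative.

```shell
# Read `kubectl describe node` output on stdin and print the conditions that
# indicate a problem: Ready must be True, every other condition must be False.
unhealthy_conditions() {
    awk '
        /^Conditions:/     { in_cond = 1; next }
        in_cond && /^[^ ]/ { in_cond = 0 }   # the next top-level section ends the table
        in_cond && NF >= 2 && $1 != "Type" && $1 !~ /^-+$/ {
            if (($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 == "True"))
                print $1 "=" $2
        }
    '
}

# Hypothetical, trimmed sample; the real table also has timestamp columns.
sample_node() {
    cat <<'EOF'
Name:               worker-node-1
Conditions:
  Type             Status  Reason                       Message
  ----             ------  ------                       -------
  MemoryPressure   False   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     True    KubeletHasDiskPressure       kubelet has disk pressure
  PIDPressure      False   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    KubeletReady                 kubelet is posting ready status
Addresses:
EOF
}

sample_node | unhealthy_conditions
```

An empty result means every listed condition looks healthy.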
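# Show how much of the Node's capacity is already requested (recommended by 加班哥)
# Note: these are the summed requests and limits of the Pods on the node, not live usage
kubectl describe node worker-node-1 | grep -A 20 "Allocated resources:"
# The section looks roughly like this (illustrative values, Limits column omitted;
# percentages are relative to the node's allocatable capacity):
#   Resource  Requests
#   --------  --------
#   cpu       1200m (60%)
#   memory    4Gi (50%)
# describe node also lists every Pod on the node with its requests and limits (strongly recommended by 加班哥: very handy)
kubectl describe node worker-node-1 | grep -A 50 "Non-terminated Pods:"
# Typical output: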
Non-terminated Pods:          (10 in total)
  Namespace                   Name                CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                ------------  ----------  ---------------  -------------  ---
  default                     nginx-opsnot-abc12  100m (2%)     200m (5%)   128Mi (1%)       256Mi (3%)     5d
  default                     redis-xyz34         50m (1%)      100m (2%)   64Mi (0%)        128Mi (1%)     3d
  kube-system                 kube-proxy-5678     100m (2%)     0 (0%)      64Mi (0%)        0 (0%)         15d
  kube-system                 coredns-1234        100m (2%)     200m (5%)   70Mi (0%)        170Mi (2%)     15d
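The per-Pod table can also be totalled, which is a quick cross-check against the Allocated resources section. A minimal sketch, assuming the column layout shown above and CPU values written either in millicores ("100m") or whole cores ("1"); the function name is hypothetical.

```shell
# Sum the "CPU Requests" column of the "Non-terminated Pods" table, in millicores.
# Values ending in "m" are millicores; anything else is treated as whole cores.
sum_cpu_requests() {
    awk '
        /^Non-terminated Pods:/ { in_table = 1; next }
        in_table && /^[^ ]/     { in_table = 0 }   # the next top-level section ends the table
        in_table && NF >= 3 && $1 != "Namespace" && $1 !~ /^-+$/ {
            value = $3
            if (sub(/m$/, "", value)) total += value
            else total += value * 1000
        }
        END { printf "%dm\n", total }
    '
}

# Sample rows copied from the table above.
sample_pods() {
    cat <<'EOF'
Non-terminated Pods:          (4 in total)
  Namespace                   Name                CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                ------------  ----------  ---------------  -------------  ---
  default                     nginx-opsnot-abc12  100m (2%)     200m (5%)   128Mi (1%)       256Mi (3%)     5d
  default                     redis-xyz34         50m (1%)      100m (2%)   64Mi (0%)        128Mi (1%)     3d
  kube-system                 kube-proxy-5678     100m (2%)     0 (0%)      64Mi (0%)        0 (0%)         15d
  kube-system                 coredns-1234        100m (2%)     200m (5%)   70Mi (0%)        170Mi (2%)     15d
Allocated resources:
EOF
}

sample_pods | sum_cpu_requests   # 100 + 50 + 100 + 100 = 350m
```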
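# Describe a Service
kubectl describe service ops-not-service
# Focus on:
# - Selector: which Pods the Service matches
# - Endpoints: the Pod IPs actually behind the Service
# - Port configuration
kubectl describe svc my-service | grep Selector
kubectl describe svc my-service | grep Endpoints
Worked example: a Service is unreachable, and describe shows an empty Endpoints field:
Endpoints: <none>
This usually means the Selector does not match the Pods' labels (often a typo); fix the Selector or the labels. Pods that match but are not Ready are also left out of Endpoints.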
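A Service selects a Pod only when every key=value pair in its Selector is present among the Pod's labels. That check can be sketched in plain shell; the function name selector_matches is hypothetical, and both arguments are assumed to be comma-separated lists such as those printed by kubectl describe svc and kubectl get pods --show-labels.

```shell
# Return success when every key=value pair in the selector (first argument)
# also appears in the Pod's label list (second argument).
selector_matches() {
    selector=$1
    labels=",$2,"
    old_ifs=$IFS
    IFS=,
    for pair in $selector; do
        case $labels in
            *",$pair,"*) ;;                       # this pair is present, keep checking
            *) IFS=$old_ifs; return 1 ;;          # one missing pair means no match
        esac
    done
    IFS=$old_ifs
    return 0
}

if selector_matches "app=opsnot-blog" "app=opsnot-blog,pod-template-hash=abc12"; then
    echo "match"
else
    echo "no match"
fi
```

A one-character typo in the selector makes the function fail, which is exactly the situation that leaves Endpoints empty.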
# Show how the Service is exposed
kubectl describe svc ops-not-service | grep Type
# Type: ClusterIP / NodePort / LoadBalancer
# Show the port mappings (matches Port, TargetPort, and NodePort)
kubectl describe svc ops-not-service | grep Port
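# Describe a Deployment
kubectl describe deployment opsnot-rabbitmq
# Focus on:
# - Replicas: desired count vs. updated and available counts
# - Conditions: rollout status
# - Events: scaling history of the rolling update
kubectl describe deploy opsnot-rabbitmq | grep Replicas
kubectl describe deploy opsnot-rabbitmq | grep -A 5 "Conditions:"
Worked example: after a release only some Pods were updated, and describe shows:
Replicas: 3 desired | 2 updated | 3 total | 2 available
Conditions:
  Progressing: False
  Reason: ProgressDeadlineExceeded
This almost always means the new image is broken and its Pods cannot start. Roll back right away with kubectl rollout undo deployment opsnot-rabbitmq, then investigate.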
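The Replicas line can be checked automatically, for example in a post-release script. The sketch below assumes the "N desired | N updated | N total | N available" layout shown above; the function name rollout_status is hypothetical.

```shell
# Read `kubectl describe deployment` output on stdin and report whether every
# desired replica has been updated and is available.
rollout_status() {
    awk '
        /^Replicas:/ {
            # Each count is followed by its label, e.g. "3 desired".
            for (i = 2; i < NF; i++) {
                if ($(i + 1) == "desired")   desired = $i
                if ($(i + 1) == "updated")   updated = $i
                if ($(i + 1) == "available") available = $i
            }
            if (updated == desired && available == desired)
                print "complete"
            else
                printf "incomplete: %d/%d updated, %d/%d available\n", updated, desired, available, desired
        }
    '
}

echo "Replicas: 3 desired | 2 updated | 3 total | 2 available" | rollout_status
```

For live rollouts, kubectl rollout status deployment opsnot-rabbitmq gives the same answer directly; the parser is mainly useful on saved describe output.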
# Show the update strategy
kubectl describe deploy opsnot-rabbitmq | grep -A 3 "StrategyType:"
# Example output:
# StrategyType:           RollingUpdate
# MinReadySeconds:        0
# RollingUpdateStrategy:  25% max unavailable, 25% max surge
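# Describe a PVC
kubectl describe pvc opsnot-pvc
# Focus on:
# - Status: Bound/Pending
# - Volume: name of the bound PV
# - Capacity: actual capacity
# - Events: why binding or provisioning failed
Worked example: a PVC is stuck in Pending, and describe shows:
Events:
  Warning  ProvisioningFailed  no volume plugin matched
This usually points to a misconfigured StorageClass (for example a wrong name or provisioner); correct it and the claim will bind.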
# Describe a PV
kubectl describe pv opsnot-pv-name
# Focus on:
# - Status: Available/Bound/Released
# - Claim: which PVC is using it
# - Reclaim Policy: Delete/Retain
# - Access Modes: ReadWriteOnce/ReadWriteMany (printed in short form as RWO/RWX)
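# Describe a ConfigMap
kubectl describe configmap my-config
# Shows the key-value pairs (opsnot.com note: with a lot of data the view can look cut off; grep -A 20 itself stops after 20 lines)
kubectl describe cm my-config | grep -A 20 "Data"
# Why is describe output not always complete?
kubectl describe produces a human-readable summary rather than a full dump of the object, which keeps large objects from flooding the terminal.
# What if the ConfigMap content looks cut off?
Read the YAML instead, which is always complete:
kubectl get configmap my-config -o yaml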
# Describe a Secret (values are hidden)
kubectl describe secret my-secret
# Example output:
# Data
# ====
# password: 16 bytes
# username: 8 bytes
# opsnot tip: the real values are never printed, only their sizes
# Describe an Ingress
kubectl describe ingress my-ingress
# Focus on:
# - Rules: routing rules
# - Backend: backend Services
# - Events: configuration sync history
kubectl describe ing my-ingress | grep -A 20 "Rules:"
# Show the address assigned to the Ingress
kubectl describe ingress opsnot-ingress | grep Address
# Show the TLS configuration
kubectl describe ing opsnot-ingress | grep -A 5 "TLS:"
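# opsnot.com troubleshooting script
#!/bin/bash
# Usage: pass the namespace as the first argument (defaults to "default")
NS=${1:-default}
echo "=== Looking for abnormal Pods in namespace $NS ==="
# Make sure the namespace exists
if ! kubectl get ns "$NS" &> /dev/null; then
    echo "Error: namespace $NS does not exist!"
    exit 1
fi
# List Pods whose phase is neither Running nor Succeeded.
# Note: a Pod in CrashLoopBackOff can still report phase Running, so it is not caught here.
PODS=$(kubectl get pods -n "$NS" --field-selector=status.phase!=Running,status.phase!=Succeeded -o name 2>/dev/null)
if [ -z "$PODS" ]; then
    echo "No abnormal Pods found"
    exit 0
fi
for pod in $PODS; do
    echo "--- $pod ---"
    kubectl describe "$pod" -n "$NS" | grep -A 15 "Events:"
    echo "================================"
done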
# Check the conditions of every Node (-A 10 so that the Ready row at the end of the table is included)
for node in $(kubectl get nodes -o name); do
    echo "=== $node ==="
    kubectl describe "$node" | grep -A 10 "Conditions:"
done
# Check the Endpoints of every Service in a namespace
kubectl get svc -n prod -o name | while read -r svc; do
    echo "$svc:"
    kubectl describe "$svc" -n prod | grep Endpoints
done
# The Pod's most recent events (sorted by time)
kubectl describe pod ops-not-pod | grep -A 50 "Events:" | tail -20
# Events for all resources in the namespace, sorted by creation time
kubectl get events --sort-by=.metadata.creationTimestamp
# opsnot experience: read describe and events together
kubectl describe pod ops-not-pod && kubectl get events --field-selector involvedObject.name=ops-not-pod
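# Export the full description of a Pod
kubectl describe pod ops-not-pod > pod-describe.txt
# Export the common workload resources in a namespace
# (note: "all" covers Pods, Services, Deployments, and similar types, but not ConfigMaps, Secrets, or Ingresses)
kubectl describe all -n production > cluster-info.txt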
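# 1. Check the Pod status first
kubectl get pod ops-not-mysql
# 2. Use describe to read the Events
kubectl describe pod ops-not-mysql
# Common causes:
 - ImagePullBackOff: the image could not be pulled
 - CrashLoopBackOff: the container keeps exiting shortly after it starts
 - Pending: insufficient resources or a scheduling failure
 - Error: the container exited with a non-zero code, often because of bad configuration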
# 1. Check the Service's Endpoints
kubectl describe svc opsnot-nginx-service | grep Endpoints
# 2. If Endpoints is empty, compare the Selector with the Pod labels
kubectl describe svc opsnot-nginx-service | grep Selector
kubectl get pods --show-labels
# 3. Check that the matching Pods are Ready
kubectl get pods -l app=opsnot-blog
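# 1. Check the Pod IP
kubectl describe pod my-pod | grep "IP:"
# 2. Check the Service ClusterIP
kubectl describe svc my-service | grep "IP:"
# 3. Check DNS (describe pod does not print DNS settings, so read resolv.conf inside the container; assumes the image ships cat)
kubectl exec my-pod -- cat /etc/resolv.conf
# 加班哥's troubleshooting order:
# Pod -> Service -> Ingress, one layer at a time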
# 1. Check how much of each Node is already requested
kubectl describe nodes | grep -A 10 "Allocated resources:"
# 2. Check the Pod's resource configuration
kubectl describe pod opsnot-pod | grep -A 5 "Limits:"
# 3. Look for insufficient-resource warnings in the Events
kubectl describe pod opsnot-pod | grep "Insufficient"
# describe + logs
kubectl describe pod my-pod && kubectl logs my-pod --tail=50
# describe + exec
kubectl describe pod my-pod
kubectl exec -it my-pod -- sh
# describe + top to see actual resource usage (kubectl top requires Metrics Server)
kubectl describe pod my-pod
kubectl top pod my-pod
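# Watch a Pod's events in near real time (加班哥 uses this often)
watch -n 2 'kubectl describe pod opsnot-pod | grep -A 10 "Events:"'
# Watch a Node's allocated resources (quote the whole pipeline so that watch reruns grep as well)
watch 'kubectl describe node worker-1 | grep -A 20 "Allocated resources:"'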
# opsnot.com in-house script: show a Pod's key information at a glance
#!/bin/bash
# Usage: pass the Pod name, then optionally the namespace (defaults to "default")
POD=${1:?Pod name required}
NS=${2:-default}
echo "=== Basics ==="
kubectl describe pod "$POD" -n "$NS" | grep -E "Status:|IP:|Node:"
echo -e "\n=== Container state ==="
kubectl describe pod "$POD" -n "$NS" | grep -A 3 "State:"
echo -e "\n=== Restarts ==="
kubectl describe pod "$POD" -n "$NS" | grep "Restart Count"
echo -e "\n=== Recent events ==="
kubectl describe pod "$POD" -n "$NS" | grep -A 15 "Events:" | tail -10
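The script above calls kubectl describe once per section. One possible variation, sketched here as an assumption rather than the author's method, reads the output a single time and filters it locally, so all sections come from the same snapshot. The function name summarize_pod and the sample text are hypothetical.

```shell
# Summarize `kubectl describe pod` output that is read once from stdin,
# instead of calling kubectl separately for every section.
summarize_pod() {
    desc=$(cat)
    echo "=== Basics ==="
    printf '%s\n' "$desc" | grep -E '^(Status|IP|Node):'
    echo "=== Container state ==="
    printf '%s\n' "$desc" | grep -E 'State:|Restart Count:'
    echo "=== Recent events ==="
    printf '%s\n' "$desc" | awk '/^Events:/ { found = 1 } found' | tail -n 10
}

# Hypothetical, trimmed sample standing in for real output.
sample_pod() {
    cat <<'EOF'
Name:         opsnot-pod
Node:         worker-node-1/10.0.0.11
Status:       Running
IP:           10.244.1.23
Containers:
  app:
    State:          Running
    Restart Count:  3
Events:       <none>
EOF
}

sample_pod | summarize_pod
```

Against a cluster it would be used as: kubectl describe pod "$POD" -n "$NS" | summarize_pod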
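# describe prints a lot, so filter for the key fields with a pipe
kubectl describe pod ops-not-pod | grep -E "Status:|Events:|State:|Restart"
# Limit describe to one namespace instead of searching everything
kubectl describe pods -n opsnot-namespace
# Use get -o wide for a quick overview, then describe for detail
kubectl get pods -o wide
kubectl describe pod ops-not-pod
# opsnot.com advice: locate the problem quickly with get, then dig in with describe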
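kubectl describe output is extremely rich, which makes it one of the most useful troubleshooting tools. The examples above cover the most common describe scenarios.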
This article was compiled by opsnot.com; please credit the source when republishing.