Monitoring

kube-prometheus-stack

The standard monitoring stack: Prometheus (metrics), Grafana (dashboards), Alertmanager (alerts).

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  -f monitoring-values.yaml

Key monitoring-values.yaml settings

prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 50Gi
    # Select all ServiceMonitors and PodMonitors cluster-wide,
    # not only those labeled for this Helm release
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false

grafana:
  adminPassword: <from-secret>
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - grafana.internal.example.com
  persistence:
    enabled: true
    storageClassName: gp3
    size: 5Gi

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 2Gi
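
The `<from-secret>` placeholder above is left unspecified. One possible way to keep the password out of the values file, sketched here as an assumption rather than the setup this guide prescribes, is the Grafana subchart's `admin.existingSecret` setting. The Secret name `grafana-admin` is hypothetical, and the Secret would need to exist in the `monitoring` namespace with both keys before the chart is installed.

```yaml
grafana:
  admin:
    # Hypothetical Secret name; create it in the monitoring namespace beforehand
    existingSecret: grafana-admin
    # Keys inside that Secret that hold the admin credentials
    userKey: admin-user
    passwordKey: admin-password
```

With this in place, `adminPassword` would be omitted from the values file.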

Useful queries

# Pod restart rate over 5m
rate(kube_pod_container_status_restarts_total[5m]) > 0

# Percentage of CFS periods in which a container was throttled (above 25%)
rate(container_cpu_cfs_throttled_periods_total[5m])
  / rate(container_cpu_cfs_periods_total[5m]) * 100 > 25

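# Memory usage as a fraction of the request (1.0 = at request).
# container!="" drops the pod-level aggregate series; matching on
# namespace avoids collisions between same-named pods.
container_memory_working_set_bytes{container!=""}
  / on(namespace, pod, container) kube_pod_container_resource_requests{resource="memory"}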

# Node disk pressure
kube_node_status_condition{condition="DiskPressure", status="true"} == 1
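
Any of these expressions can be turned into an alert that Alertmanager delivers. The following is a minimal sketch of a PrometheusRule built on the restart query; the rule name, group name, `for` duration, and severity label are illustrative assumptions. Because the values file relaxes only the ServiceMonitor and PodMonitor selectors, the sketch assumes rules are still selected by the `release` label.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-restarts                  # hypothetical name
  namespace: monitoring
  labels:
    release: kube-prometheus-stack    # assumed to match the chart's default rule selector
spec:
  groups:
    - name: pod-health                # hypothetical group name
      rules:
        - alert: PodRestarting
          expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
          # Fire only if restarts keep happening for 10 minutes,
          # so a single restart does not page anyone
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Container {{ $labels.container }} in {{ $labels.namespace }}/{{ $labels.pod }} is restarting"
```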

ServiceMonitor

Scrape custom application metrics by pointing a ServiceMonitor at the application's Service:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-server
  namespace: production
  labels:
    release: kube-prometheus-stack
spec:
  selector:
    matchLabels:
      app: api-server
  endpoints:
    - port: metrics
      path: /metrics
      interval: 30s
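
The `port: metrics` entry refers to a named port on a Service, not a port number, and `selector.matchLabels` is matched against the Service's labels rather than the Pods'. A Service that this ServiceMonitor could select might look like the following sketch; the port number 9090 and the Pod selector are assumptions.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: api-server
  namespace: production      # same namespace as the ServiceMonitor
  labels:
    app: api-server          # matched by the ServiceMonitor's spec.selector.matchLabels
spec:
  selector:
    app: api-server          # assumed Pod label
  ports:
    - name: metrics          # referenced by endpoints[].port in the ServiceMonitor
      port: 9090             # hypothetical port number
      targetPort: 9090
```

If the target does not show up in Prometheus, a mismatch between the port name or the Service labels and the ServiceMonitor is a likely first thing to check.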