Monitor SNO Cluster Health

Automatically check the health of a single-node OpenShift (SNO) cluster and send alerts when issues are detected.

Pipeline

input:
  generate:
    interval: 60s
    mapping: |
      # The payload is replaced by the command processors below, so the
      # cluster identity and check time are carried as metadata instead
      meta check_time = now()
      meta cluster = env("CLUSTER_NAME")
      root = ""

pipeline:
  processors:
    # Check node status
    - command:
        name: oc
        args_mapping: '["get", "nodes", "-o", "json"]'

    - mapping: |
        let nodes = content().parse_json().items
        meta node_ready = $nodes.all(n ->
          n.status.conditions.any(c -> c.type == "Ready" && c.status == "True")
        )
        meta node_name = $nodes.index(0).metadata.name

    # Check cluster operators
    - command:
        name: oc
        args_mapping: '["get", "clusteroperators", "-o", "json"]'

    - mapping: |
        let operators = content().parse_json().items
        let degraded = $operators.filter(op ->
          op.status.conditions.any(c -> c.type == "Degraded" && c.status == "True")
        ).map_each(op -> op.metadata.name)

        meta degraded_operators = $degraded
        meta all_operators_healthy = $degraded.length() == 0

    # Check pod status across namespaces
    - command:
        name: oc
        args_mapping: '["get", "pods", "--all-namespaces", "-o", "json"]'

    - mapping: |
        let pods = content().parse_json().items
        meta total_pods = $pods.length()
        meta running_pods = $pods.filter(p -> p.status.phase == "Running").length()

        # CrashLoopBackOff is reported as a container waiting reason, not a pod phase
        meta failed_pods = $pods.map_each(p -> {
          "namespace": p.metadata.namespace,
          "name": p.metadata.name,
          "phase": if p.status.containerStatuses.or([]).any(cs ->
            cs.state.waiting.reason.or("") == "CrashLoopBackOff"
          ) { "CrashLoopBackOff" } else { p.status.phase }
        }).filter(p -> p.phase == "Failed" || p.phase == "CrashLoopBackOff")

    # Aggregate health status
    - mapping: |
        root.health_report = {
          "cluster": @cluster,
          "location": env("LOCATION"),
          "timestamp": @check_time,
          "node_ready": @node_ready,
          "operators_healthy": @all_operators_healthy,
          "degraded_operators": @degraded_operators,
          "total_pods": @total_pods,
          "running_pods": @running_pods,
          "failed_pods": @failed_pods,
          "cluster_healthy": @node_ready && @all_operators_healthy && @failed_pods.length() == 0
        }

output:
  switch:
    cases:
      # Alert if unhealthy
      - check: '!this.health_report.cluster_healthy'
        output:
          broker:
            pattern: fan_out
            outputs:
              # Send alert
              - http_client:
                  url: https://alerts.company.com/sno-health
                  verb: POST
                  headers:
                    Content-Type: application/json
              # Log alert
              - aws_s3:
                  bucket: sno-health-alerts
                  path: 'alerts/${! env("CLUSTER_NAME") }/${! timestamp_unix() }.json'

      # Normal health metrics
      - output:
          http_client:
            url: https://metrics.company.com/sno-health
            verb: POST
            batching:
              count: 10
              period: 5m

What This Does

  • Checks every 60 seconds: Generates a health-check trigger message every minute
  • Node status: Verifies that the single node is Ready
  • Operator health: Checks whether any cluster operators are degraded
  • Pod health: Counts running versus failed pods across all namespaces
  • Conditional alerting: Sends an immediate alert if the cluster is unhealthy, otherwise batches metrics
  • Detailed failure info: Includes the list of degraded operators and failed pods

Health Criteria

Healthy cluster:

  • Node status is Ready
  • No cluster operators are degraded
  • No failed or crash-looping pods

Unhealthy cluster (triggers alert):

  • Node is not Ready
  • Any operator is degraded
  • Any pods are in Failed or CrashLoopBackOff state

Example Health Report

{
  "cluster": "sno-retail-001",
  "location": "store-chicago-north",
  "timestamp": "2024-11-12T10:30:00Z",
  "node_ready": true,
  "operators_healthy": true,
  "degraded_operators": [],
  "total_pods": 145,
  "running_pods": 145,
  "failed_pods": [],
  "cluster_healthy": true
}

Alert Example

When the cluster is unhealthy, the alert includes the specific failures:

{
  "cluster": "sno-retail-001",
  "cluster_healthy": false,
  "degraded_operators": ["authentication", "console"],
  "failed_pods": [
    {"namespace": "production", "name": "web-app-7d8f9c", "phase": "CrashLoopBackOff"}
  ]
}

Customization

Change check frequency: Adjust interval in input (e.g., interval: 5m for every 5 minutes)
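
For example, a minimal sketch that keeps the same mapping and only changes the trigger interval:

input:
  generate:
    interval: 5m
    mapping: |
      meta check_time = now()
      meta cluster = env("CLUSTER_NAME")
      root = ""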

Alert thresholds: Modify health criteria in the aggregation mapping
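
For instance, a sketch of a looser criterion that tolerates a small number of failed pods before raising an alert (the threshold of 2 is an arbitrary illustration); only the last field of the aggregation mapping changes:

          "cluster_healthy": @node_ready && @all_operators_healthy && @failed_pods.length() <= 2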

Additional checks: Add more command processors to check storage, networking, etc.
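
As a sketch, a storage check for PersistentVolumeClaims that are not Bound could be appended to the processor list; the unbound_pvcs metadata key is illustrative and would still need to be added to the health report and health criteria:

    # Illustrative extra check: PVCs that are not Bound
    - command:
        name: oc
        args_mapping: '["get", "pvc", "--all-namespaces", "-o", "json"]'

    - mapping: |
        let pvcs = content().parse_json().items
        meta unbound_pvcs = $pvcs.filter(p ->
          p.status.phase != "Bound"
        ).map_each(p -> p.metadata.namespace + "/" + p.metadata.name)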

Next Steps