# Monitor SNO Cluster Health
Automatically check SNO cluster health and send alerts when issues are detected.
## Pipeline
```yaml
input:
  generate:
    interval: 60s
    mapping: |
      # Carry the check time and cluster name as metadata so they survive
      # the command processors below, which replace the message payload.
      meta check_time = now()
      meta cluster = env("CLUSTER_NAME")
      root = ""  # payload is unused; the first command processor overwrites it

pipeline:
  processors:
    # Check node status
    - command:
        name: oc
        args_mapping: '["get", "nodes", "-o", "json"]'
    - mapping: |
        let nodes = content().parse_json().items
        meta node_ready = $nodes.all(n ->
          n.status.conditions.any(c -> c.type == "Ready" && c.status == "True")
        )
        meta node_name = $nodes.index(0).metadata.name

    # Check cluster operators
    - command:
        name: oc
        args_mapping: '["get", "clusteroperators", "-o", "json"]'
    - mapping: |
        let degraded = content().parse_json().items.filter(op ->
          op.status.conditions.any(c -> c.type == "Degraded" && c.status == "True")
        ).map_each(op -> op.metadata.name)
        meta degraded_operators = $degraded
        meta all_operators_healthy = $degraded.length() == 0

    # Check pod status across namespaces
    - command:
        name: oc
        args_mapping: '["get", "pods", "--all-namespaces", "-o", "json"]'
    - mapping: |
        let pods = content().parse_json().items
        meta total_pods = $pods.length()
        meta running_pods = $pods.filter(p -> p.status.phase == "Running").length()
        # CrashLoopBackOff is a container waiting reason, not a pod phase,
        # so inspect container statuses as well as the pod phase.
        meta failed_pods = $pods.filter(p ->
          p.status.phase == "Failed" ||
          p.status.containerStatuses.or([]).any(cs ->
            cs.state.waiting.reason.or("") == "CrashLoopBackOff"
          )
        ).map_each(p -> {
          "namespace": p.metadata.namespace,
          "name": p.metadata.name,
          "phase": p.status.phase,
          "reason": p.status.containerStatuses.index(0).state.waiting.reason.or(p.status.phase)
        })

    # Aggregate health status
    - mapping: |
        root = {
          "cluster": @cluster,
          "location": env("LOCATION"),
          "timestamp": @check_time,
          "node_ready": @node_ready,
          "operators_healthy": @all_operators_healthy,
          "degraded_operators": @degraded_operators,
          "total_pods": @total_pods,
          "running_pods": @running_pods,
          "failed_pods": @failed_pods,
          "cluster_healthy": @node_ready && @all_operators_healthy && @failed_pods.length() == 0
        }

output:
  switch:
    cases:
      # Alert if unhealthy
      - check: '!this.cluster_healthy'
        output:
          broker:
            pattern: fan_out
            outputs:
              # Send alert
              - http_client:
                  url: https://alerts.company.com/sno-health
                  verb: POST
                  headers:
                    Content-Type: application/json
              # Log alert
              - aws_s3:
                  bucket: sno-health-alerts
                  path: 'alerts/${! env("CLUSTER_NAME") }/${! timestamp_unix() }.json'

      # Normal health metrics
      - output:
          http_client:
            url: https://metrics.company.com/sno-health
            verb: POST
            batching:
              count: 10
              period: 5m
```
## What This Does

- Checks every 60 seconds: Generates a health-check trigger every minute
- Node status: Verifies that the single node is Ready
- Operator health: Checks whether any cluster operators are degraded
- Pod health: Counts running versus failed or crash-looping pods across all namespaces
- Conditional alerting: Sends an immediate alert if the cluster is unhealthy, otherwise batches metrics
- Detailed failure info: Includes the list of degraded operators and failed pods
## Health Criteria

Healthy cluster:

- Node status is Ready (a stricter variant that also treats pressure conditions as unhealthy is sketched below)
- No cluster operators are degraded
- No failed or crash-looping pods

Unhealthy cluster (triggers an alert):

- The node is not Ready
- Any operator is degraded
- Any pod is in the Failed phase or has a container in CrashLoopBackOff
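The node check above only looks at the Ready condition. If memory, disk, or PID pressure should also count as unhealthy, the node-status mapping can be tightened. A minimal sketch of a stricter variant (the pressure check is an addition, not part of the base pipeline):

```yaml
    # Stricter node check: Ready must be True and no pressure condition may be True
    - mapping: |
        let nodes = content().parse_json().items
        meta node_ready = $nodes.all(n ->
          n.status.conditions.any(c -> c.type == "Ready" && c.status == "True") &&
          !n.status.conditions.any(c -> c.type.contains("Pressure") && c.status == "True")
        )
        meta node_name = $nodes.index(0).metadata.name
```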
## Example Health Report

```json
{
  "cluster": "sno-retail-001",
  "location": "store-chicago-north",
  "timestamp": "2024-11-12T10:30:00Z",
  "node_ready": true,
  "operators_healthy": true,
  "degraded_operators": [],
  "total_pods": 145,
  "running_pods": 145,
  "failed_pods": [],
  "cluster_healthy": true
}
```
## Alert Example

When the cluster is unhealthy, the report flags the specific failures (other fields omitted for brevity):

```json
{
  "cluster": "sno-retail-001",
  "cluster_healthy": false,
  "degraded_operators": ["authentication", "console"],
  "failed_pods": [
    {"namespace": "production", "name": "web-app-7d8f9c", "phase": "Running", "reason": "CrashLoopBackOff"}
  ]
}
```
## Customization
Change check frequency: Adjust `interval` in the input (e.g., `interval: 5m` for every 5 minutes)
Alert thresholds: Modify the health criteria in the aggregation mapping
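For example, a minimal sketch of a relaxed rule that tolerates up to two failed pods before alerting (the threshold of two is illustrative, not part of the base pipeline):

```yaml
    # Aggregate health status (relaxed variant, drop-in for the aggregation mapping)
    - mapping: |
        root = {
          "cluster": @cluster,
          "location": env("LOCATION"),
          "timestamp": @check_time,
          "node_ready": @node_ready,
          "operators_healthy": @all_operators_healthy,
          "degraded_operators": @degraded_operators,
          "total_pods": @total_pods,
          "running_pods": @running_pods,
          "failed_pods": @failed_pods,
          "cluster_healthy": @node_ready && @all_operators_healthy && @failed_pods.length() <= 2
        }
```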
Additional checks: Add more `command` processors to check storage, networking, etc., as sketched below
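For example, a minimal sketch of a storage check that flags PersistentVolumeClaims that are not Bound. The `unbound_pvcs` and `pvcs_bound` metadata keys are new here and would also need to be added to the aggregation mapping and the `cluster_healthy` expression:

```yaml
    # Check persistent volume claims (add under pipeline.processors)
    - command:
        name: oc
        args_mapping: '["get", "pvc", "--all-namespaces", "-o", "json"]'
    - mapping: |
        let unbound = content().parse_json().items.filter(pvc ->
          pvc.status.phase != "Bound"
        ).map_each(pvc -> pvc.metadata.namespace + "/" + pvc.metadata.name)
        meta unbound_pvcs = $unbound
        meta pvcs_bound = $unbound.length() == 0
```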
## Next Steps
- Resource Monitoring: Track CPU, memory, and storage usage
- Collect Logs: Correlate health issues with logs
- Best Practices: Optimize monitoring for SNO