Monitor SNO Resource Usage
Track CPU and memory usage on a single-node OpenShift (SNO) node, with automatic alerting when the node comes under resource pressure.
Pipeline
```yaml
input:
  generate:
    interval: 60s
    mapping: 'root = {}'

pipeline:
  processors:
    # Get node resource usage
    - command:
        name: oc
        args_mapping: '["adm", "top", "node", "--no-headers"]'
    - mapping: |
        # Parse: node-name CPU(cores) CPU% MEMORY(bytes) MEMORY%
        let parts = content().string().trim().re_replace_all("\\s+", " ").split(" ")
        root.node_name = $parts.index(0)
        root.cpu_cores = $parts.index(1)
        root.cpu_percent = $parts.index(2).trim("%").number()
        root.memory_bytes = $parts.index(3)
        root.memory_percent = $parts.index(4).trim("%").number()
        root.cluster = env("CLUSTER_NAME")
        root.timestamp = now()
    # Get pod resource usage (run in a branch so the node metrics mapped above are kept)
    - branch:
        processors:
          - command:
              name: oc
              args_mapping: '["adm", "top", "pods", "--all-namespaces", "--no-headers"]'
        result_map: |
          # Parse each line: namespace pod CPU(cores) MEMORY(bytes)
          let pods = content().string().trim().split("\n").filter(l -> l.trim() != "").map_each(line -> line.trim().re_replace_all("\\s+", " ").split(" ")).map_each(parts -> {
            "namespace": parts.index(0),
            "pod": parts.index(1),
            "cpu": parts.index(2),
            "memory": parts.index(3)
          })
          root.pod_metrics = $pods
          # Aggregate pod counts by namespace
          root.namespace_usage = $pods.fold({}, item -> item.tally.assign({item.value.namespace: item.tally.get(item.value.namespace).or(0) + 1}))
    # Check for resource pressure
    - mapping: |
        root = this
        root.resource_alert = {
          "high_cpu": this.cpu_percent > 80,
          "high_memory": this.memory_percent > 85,
          "cluster": this.cluster,
          "timestamp": this.timestamp
        }
output:
  broker:
    pattern: fan_out
    outputs:
      # Send metrics
      - http_client:
          url: https://metrics.company.com/sno-resources
          verb: POST
          batching:
            count: 20
            period: 5m
      # Alert on high resource usage
      - switch:
          cases:
            - check: 'this.resource_alert.high_cpu || this.resource_alert.high_memory'
              output:
                http_client:
                  url: https://alerts.company.com/sno-resources
                  verb: POST
```
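A minimal way to run this config, assuming a Redpanda Connect (or self-hosted Benthos) binary on a host that has oc on its PATH and a logged-in kubeconfig; the file name and cluster name below are placeholders:

```bash
export CLUSTER_NAME=sno-retail-001
rpk connect run ./sno-resources.yaml
# or, with a standalone Benthos binary:
# benthos -c ./sno-resources.yaml
```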
What This Does
- Checks every 60 seconds: Generates a resource-check trigger message every minute
- Node metrics: Collects CPU and memory usage from oc adm top node (sample output shown below)
- Pod metrics: Lists resource usage for all pods across namespaces
- Namespace aggregation: Counts pods per namespace
- Alert thresholds: Triggers alert if CPU > 80% or memory > 85%
- Dual output: Sends all metrics batched, alerts immediately
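The two mappings parse the plain-text tables that oc adm top prints: whitespace-separated columns with headers suppressed. The values here are illustrative:

```console
$ oc adm top node --no-headers
sno-retail-001   2500m   65%   12Gi   72%

$ oc adm top pods --all-namespaces --no-headers
production    web-app-7d8f9c   100m   256Mi
kube-system   etcd-sno         50m    512Mi
```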
Example Output
```json
{
  "node_name": "sno-retail-001",
  "cpu_cores": "2500m",
  "cpu_percent": 65.5,
  "memory_bytes": "12Gi",
  "memory_percent": 72.3,
  "cluster": "sno-retail-001",
  "timestamp": "2024-11-12T10:30:00Z",
  "pod_metrics": [
    {"namespace": "production", "pod": "web-app-7d8f9c", "cpu": "100m", "memory": "256Mi"},
    {"namespace": "kube-system", "pod": "etcd-sno", "cpu": "50m", "memory": "512Mi"}
  ],
  "namespace_usage": {
    "production": 12,
    "kube-system": 45,
    "openshift-monitoring": 18
  },
  "resource_alert": {
    "high_cpu": false,
    "high_memory": false,
    "cluster": "sno-retail-001",
    "timestamp": "2024-11-12T10:30:00Z"
  }
}
```
Alert Thresholds
Current thresholds:
- CPU > 80%: High CPU alert
- Memory > 85%: High memory alert
Customize thresholds:
```yaml
    - mapping: |
        root = this
        root.resource_alert = {
          "high_cpu": this.cpu_percent > 75,        # Lower to 75%
          "high_memory": this.memory_percent > 90,  # Raise to 90%
          "cluster": this.cluster,
          "timestamp": this.timestamp
        }
```
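If you add a new condition to resource_alert, it also has to appear in the switch output's check, or it will never trigger an alert. A sketch; the high_pod_count field and the 150-pod limit are purely illustrative:

```yaml
    # In the pipeline section:
    - mapping: |
        root = this
        root.resource_alert = {
          "high_cpu": this.cpu_percent > 80,
          "high_memory": this.memory_percent > 85,
          # Illustrative extra condition: any single namespace running more than 150 pods
          "high_pod_count": this.namespace_usage.values().any(c -> c > 150),
          "cluster": this.cluster,
          "timestamp": this.timestamp
        }

    # ...and the corresponding case in the switch output:
    - check: 'this.resource_alert.high_cpu || this.resource_alert.high_memory || this.resource_alert.high_pod_count'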
Metrics Collected
Node-level:
- CPU cores used
- CPU percentage
- Memory bytes used
- Memory percentage
Pod-level (per namespace):
- Pod count by namespace
- Individual pod CPU/memory usage
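The node-level values arrive as strings with unit suffixes ("2500m" cores, "12Gi" memory). If your metrics backend expects plain numbers, an extra mapping step can normalize them before the output; this is an optional sketch, and the cpu_cores_num field name is illustrative:

```yaml
    - mapping: |
        root = this
        # "2500m" millicores -> 2.5 cores
        root.cpu_cores_num = this.cpu_cores.trim("m").number() / 1000
```

Memory values could be converted the same way by stripping the Gi/Mi suffix and scaling.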
Prerequisites
The oc adm top commands require the cluster metrics API, which the OpenShift monitoring stack provides by default.
Verify metrics are available:
```bash
oc adm top node
oc adm top pods --all-namespaces
```
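If those commands report that metrics are not available, check that the metrics API is registered and the monitoring pods are healthy:

```bash
oc get apiservice v1beta1.metrics.k8s.io
oc get pods -n openshift-monitoring
```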
Next Steps
- Cluster Health: Combine with health monitoring
- Offline-Resilient: Add buffering for metrics during outages
- Best Practices: Optimize monitoring intervals for SNO