
Monitor SNO Resource Usage

Track CPU and memory usage on the SNO node, with automatic alerting for resource pressure.

Pipeline

input:
  generate:
    interval: 60s
    mapping: 'root = {}'

pipeline:
  processors:
    # Get node resource usage
    - command:
        name: oc
        args_mapping: '["adm", "top", "node", "--no-headers"]'

    - mapping: |
        # Parse: node-name CPU(cores) CPU% MEMORY(bytes) MEMORY%
        let parts = content().string().re_find_all("\\S+")
        root.node_name = $parts.index(0)
        root.cpu_cores = $parts.index(1)
        root.cpu_percent = $parts.index(2).trim("%").number()
        root.memory_bytes = $parts.index(3)
        root.memory_percent = $parts.index(4).trim("%").number()
        root.cluster = env("CLUSTER_NAME")
        root.timestamp = now()

    # Get pod resource usage (branched so the node metrics above are preserved)
    - branch:
        processors:
          - command:
              name: oc
              args_mapping: '["adm", "top", "pods", "--all-namespaces", "--no-headers"]'

          - mapping: |
              # Parse pod metrics: namespace pod CPU(cores) MEMORY(bytes)
              let pods = content().string().split("\n").filter(l -> l != "").map_each(line -> line.re_find_all("\\S+")).map_each(parts -> {
                "namespace": parts.index(0),
                "pod": parts.index(1),
                "cpu": parts.index(2),
                "memory": parts.index(3)
              })
              root.pod_metrics = $pods

              # Aggregate pod counts by namespace
              root.namespace_usage = $pods.fold({}, p -> p.tally.assign({
                p.value.namespace: p.tally.get(p.value.namespace).or(0) + 1
              }))
        result_map: |
          root.pod_metrics = this.pod_metrics
          root.namespace_usage = this.namespace_usage

    # Check for resource pressure
    - mapping: |
        root = this
        root.resource_alert = {
          "high_cpu": this.cpu_percent > 80,
          "high_memory": this.memory_percent > 85,
          "cluster": this.cluster,
          "timestamp": this.timestamp
        }

output:
  broker:
    pattern: fan_out
    outputs:
      # Send metrics
      - http_client:
          url: https://metrics.company.com/sno-resources
          verb: POST
          batching:
            count: 20
            period: 5m

      # Alert on high resource usage
      - switch:
          cases:
            - check: 'this.resource_alert.high_cpu || this.resource_alert.high_memory'
              output:
                http_client:
                  url: https://alerts.company.com/sno-resources
                  verb: POST
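
Before pointing these outputs at real endpoints, a quick way to sanity-check collection and parsing is to temporarily swap the whole output section for one that prints each document instead; a minimal sketch:

output:
  stdout: {}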

What This Does

  • Checks every 60 seconds: Generates a resource-check trigger every minute (the interval is adjustable, as shown after this list)
  • Node metrics: Collects CPU and memory usage from oc adm top node
  • Pod metrics: Lists resource usage for all pods across namespaces
  • Namespace aggregation: Counts pods per namespace
  • Alert thresholds: Triggers alert if CPU > 80% or memory > 85%
  • Dual output: Sends all metrics batched, alerts immediately
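
The check frequency comes from the generate input's interval field. For example, to poll every five minutes instead of every minute, only the input block changes:

input:
  generate:
    interval: 5m
    mapping: 'root = {}'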

Example Output

{
  "node_name": "sno-retail-001",
  "cpu_cores": "2500m",
  "cpu_percent": 65.5,
  "memory_bytes": "12Gi",
  "memory_percent": 72.3,
  "cluster": "sno-retail-001",
  "timestamp": "2024-11-12T10:30:00Z",
  "pod_metrics": [
    {"namespace": "production", "pod": "web-app-7d8f9c", "cpu": "100m", "memory": "256Mi"},
    {"namespace": "kube-system", "pod": "etcd-sno", "cpu": "50m", "memory": "512Mi"}
  ],
  "namespace_usage": {
    "production": 12,
    "kube-system": 45,
    "openshift-monitoring": 18
  },
  "resource_alert": {
    "high_cpu": false,
    "high_memory": false,
    "cluster": "sno-retail-001",
    "timestamp": "2024-11-12T10:30:00Z"
  }
}

Alert Thresholds

Current thresholds:

  • CPU > 80%: High CPU alert
  • Memory > 85%: High memory alert

Customize thresholds:

- mapping: |
    root = this
    root.resource_alert = {
      "high_cpu": this.cpu_percent > 75,        # Lower to 75%
      "high_memory": this.memory_percent > 90,  # Raise to 90%
      "cluster": this.cluster,
      "timestamp": this.timestamp
    }
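
If you add a new alert field, include it in the output switch's check as well, otherwise it will never reach the alert endpoint. A sketch using a hypothetical high_pod_count field:

# high_pod_count is a hypothetical extra alert field
- check: 'this.resource_alert.high_cpu || this.resource_alert.high_memory || this.resource_alert.high_pod_count'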

Metrics Collected

Node-level:

  • CPU cores used
  • CPU percentage
  • Memory bytes used
  • Memory percentage

Pod-level (per namespace):

  • Pod count by namespace
  • Individual pod CPU/memory usage
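
Pod metrics are collected from every namespace by default. On a busy node you can narrow the command with a label selector to keep the payload small; a sketch assuming your workloads carry an app label (keeping --all-namespaces so the namespace column, and therefore the parsing mapping, stays the same):

- command:
    name: oc
    # "app=web" is a placeholder label selector
    args_mapping: '["adm", "top", "pods", "--all-namespaces", "-l", "app=web", "--no-headers"]'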

Prerequisites

The oc adm top commands require the cluster's resource metrics API to be available. OpenShift's built-in monitoring stack provides this by default.

Verify metrics are available:

oc adm top node
oc adm top pods --all-namespaces
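
If an oc call fails (for example while the metrics API is still starting up after a reboot), the affected cycle can be dropped rather than shipped as partial data. A minimal sketch, assuming the failed command marks the message as errored, added after the command processors in the pipeline:

- catch:
    # Drop errored messages instead of forwarding partial data
    - mapping: 'root = deleted()'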

Next Steps