Monitor Your Edge Fleet
You've deployed jobs to edge nodes, and they're processing data across your distributed infrastructure. But how do you know everything's working? When a node goes offline, when a job fails, or when throughput drops—you need to know immediately.
In this tutorial, you'll build a complete monitoring stack for Expanso using Prometheus and Grafana. You'll set up metrics collection, create dashboards to visualize fleet health, configure alerts for critical issues, and learn how to debug problems using observability data. By the end, you'll have production-grade monitoring that keeps you informed about your edge infrastructure.
This tutorial takes about 45-60 minutes to complete.
What You'll Learn
- How to configure Expanso to export metrics via OpenTelemetry
- How to set up Prometheus to collect metrics from orchestrator and edge nodes
- How to create Grafana dashboards for job and node health visualization
- How to monitor job execution states and throughput
- How to track node connectivity, heartbeats, and resources
- How to configure alerts for node disconnections, job failures, and performance issues
- How to correlate logs with metrics for debugging
- How to identify and troubleshoot common edge infrastructure problems
Prerequisites
Before starting, make sure you have:
- A running Expanso deployment with orchestrator and at least one edge node (from the First Edge Deployment tutorial)
- At least one deployed job actively processing data
- Docker or Podman installed for running Prometheus and Grafana
- Basic familiarity with Prometheus query language (PromQL) is helpful but not required
This tutorial uses Docker Compose for quick setup. For production deployments, you'll want to run Prometheus and Grafana on dedicated infrastructure with persistent storage, high availability, and proper security configurations.
Step 1: Understand Expanso's Metrics Architecture
Before diving into configuration, let's understand what metrics Expanso exposes and how they flow through the observability stack.
Expanso uses OpenTelemetry for metrics export, which means it can send data to any OTLP-compatible collector. In this tutorial, the orchestrator and edge nodes push metrics over OTLP to an OpenTelemetry Collector, Prometheus scrapes and stores those metrics, and Grafana queries Prometheus for dashboards and alerts.
Key metric categories:
- Process metrics (CPU, memory, file descriptors) - Always enabled
- Go runtime metrics (GC, goroutines, heap) - Optional, disabled by default
- Job execution metrics (state, throughput, errors) - Application-specific
- Node connection metrics (heartbeats, session state) - Network health
- Deployment metrics (rollout progress, health checks) - Job lifecycle
For this tutorial, we'll run the OpenTelemetry Collector with a Prometheus exporter: the collector receives OTLP metrics from Expanso components and exposes them on an endpoint that Prometheus scrapes and stores.
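If it helps to see the shape of that pipeline before Step 4, here is a minimal collector configuration sketch for this pattern. The file you download in Step 4 is authoritative; the 8889 exporter port below is an assumption.
# Minimal sketch: receive OTLP from Expanso, expose a Prometheus scrape endpoint
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # orchestrator and edge nodes push metrics here
exporters:
  prometheus:
    endpoint: 0.0.0.0:8889       # Prometheus scrapes this endpoint (port is an assumption)
service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]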
Step 2: Configure Orchestrator to Export Metrics
Let's configure the Expanso orchestrator to export metrics via OpenTelemetry.
Update your orchestrator configuration:
Add the telemetry section to /etc/expanso/orchestrator-config.yaml. Download the complete configuration:
curl -O https://docs.expanso.io/examples/monitoring/orchestrator-telemetry-config.yaml
Or view the configuration file
Key settings:
- endpoint: OTLP collector address (localhost:4317 for a co-located collector)
- export_interval: How often to export metrics (15s recommended)
- include_go_metrics: Enable detailed Go runtime metrics
- resource_attributes: Add custom labels for filtering in Prometheus
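As a reference, the telemetry section looks roughly like the sketch below. Treat the downloaded orchestrator-telemetry-config.yaml as authoritative, since exact key nesting may differ; the environment and region labels are illustrative assumptions.
telemetry:
  endpoint: "localhost:4317"      # OTLP collector address (gRPC)
  export_interval: "15s"          # how often metrics are exported
  include_go_metrics: true        # enable Go runtime metrics
  resource_attributes:            # custom labels attached to every metric
    environment: "production"     # illustrative label
    region: "us-east"             # illustrative label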
Restart the orchestrator to apply the configuration:
sudo systemctl restart expanso-orchestrator
# Verify telemetry is initialized
sudo journalctl -u expanso-orchestrator -n 50 | grep -i telemetry
You should see log entries indicating telemetry initialization:
INFO Telemetry enabled: endpoint=localhost:4317 protocol=grpc
INFO Process metrics collector initialized: interval=15s
INFO Go metrics collector initialized: interval=15s
Until you start the OTLP collector in Step 4, you'll see connection errors in the logs. This is expected; the orchestrator retries the connection automatically.
Step 3: Configure Edge Nodes to Export Metrics
Now let's configure edge nodes to send metrics to the same collector.
Update each edge node's configuration:
Add the telemetry section to /etc/expanso/edge-config.yaml. Download the configuration:
curl -O https://docs.expanso.io/examples/monitoring/edge-telemetry-config.yaml
Or view the configuration file
Important: Update the endpoint to point to your OTLP collector (typically the orchestrator hostname).
Restart edge nodes:
# On each edge node
sudo systemctl restart expanso-edge
# Verify telemetry configuration
sudo journalctl -u expanso-edge -n 20 | grep -i telemetry
In this tutorial, edge nodes send metrics directly to a central OTLP collector. For large-scale deployments, you can run a local collector on each edge site and aggregate to a central Prometheus instance. This reduces network traffic and improves resilience.
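For reference, a per-site collector in that topology is usually just an OTLP receiver paired with an OTLP exporter that forwards to the central collector. A minimal sketch, assuming a plaintext connection inside a trusted network and a placeholder central hostname:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317                      # local edge nodes push here
processors:
  batch:
    timeout: 30s                                    # batch to reduce WAN traffic
exporters:
  otlp:
    endpoint: central-collector.example.com:4317    # placeholder central collector address
    tls:
      insecure: true                                # assumption: no TLS inside a trusted network
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]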
Step 4: Set Up the OpenTelemetry Collector and Prometheus
Now let's deploy the observability stack using Docker Compose.
Create a directory for monitoring configuration:
mkdir -p ~/expanso-monitoring/{prometheus,grafana,otel-collector}
cd ~/expanso-monitoring
Create the OpenTelemetry Collector configuration:
Download the configuration file:
curl -o otel-collector/config.yaml https://docs.expanso.io/examples/monitoring/otel-collector-config.yaml
Or view the configuration file
Create the Prometheus configuration:
Download the configuration file:
curl -o prometheus/prometheus.yml https://docs.expanso.io/examples/monitoring/prometheus.yml
Or view the configuration file
Create the Docker Compose file:
Download the complete stack configuration:
curl -o docker-compose.yml https://docs.expanso.io/examples/monitoring/docker-compose.yml
Or view the configuration file
This sets up:
- OTLP Collector on ports 4317 (gRPC) and 4318 (HTTP)
- Prometheus on port 9091 (web UI)
- Grafana on port 3000
Start the monitoring stack:
docker compose up -d
# Verify all containers are running
docker compose ps
You should see all three containers running:
NAME STATUS PORTS
expanso-otel-collector Up 10 seconds 0.0.0.0:4317-4318->4317-4318/tcp
expanso-prometheus Up 10 seconds 0.0.0.0:9091->9090/tcp
expanso-grafana Up 10 seconds 0.0.0.0:3000->3000/tcp
Verify metrics are flowing:
# Check OTLP collector logs
docker compose logs -f otel-collector
# You should see messages about receiving metrics
If your orchestrator and edge nodes are on different machines, make sure port 4317 (OTLP gRPC) is open on the collector host. Use sudo ufw allow 4317/tcp or equivalent for your firewall.
Step 5: Verify Metrics in Prometheus
Let's verify that Prometheus is collecting metrics from Expanso components.
Open the Prometheus web UI:
Navigate to http://localhost:9091 in your browser.
Run a basic query to verify data collection:
In the query box, enter:
expanso_process_cpu_utilization_ratio
Click "Execute" and switch to the "Graph" tab. You should see CPU utilization metrics for both the orchestrator and any connected edge nodes.
Check available metrics:
Use the autocomplete in the query box (or the metrics explorer next to it) to browse the available metrics. You should see:
Process Metrics:
- expanso_process_cpu_utilization_ratio
- expanso_process_memory_usage_bytes
- expanso_process_memory_virtual_bytes
- expanso_process_open_file_descriptors_ratio
- expanso_process_max_file_descriptors_ratio
Go Runtime Metrics (if enabled):
- expanso_go_memory_heap_alloc_bytes
- expanso_go_memory_heap_sys_bytes
- expanso_go_gc_cycles_total
- expanso_go_gc_pause_last_seconds
- expanso_go_goroutines_count
- expanso_go_max_procs
Query metrics by service:
To see only orchestrator metrics:
expanso_process_memory_usage_bytes{service_name="expanso-orchestrator"}
To see only edge node metrics:
expanso_process_memory_usage_bytes{service_name="expanso-edge"}
Check metric cardinality:
Count how many unique service instances are reporting:
count by (service_name, node_id) (expanso_process_cpu_utilization_ratio)
You should see one entry for the orchestrator and one for each connected edge node.
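You can run the same check from the command line through the Prometheus HTTP API (using the 9091 port mapped in the compose file):
curl -sG 'http://localhost:9091/api/v1/query' \
  --data-urlencode 'query=count by (service_name, node_id) (expanso_process_cpu_utilization_ratio)' \
  | jq '.data.result'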
If no expanso_ metrics appear, see the Metrics Not Appearing in Prometheus troubleshooting section below.
Step 6: Create Grafana Dashboards for Fleet Monitoring
Now let's create comprehensive dashboards to visualize your edge fleet's health.
Log into Grafana:
Navigate to http://localhost:3000 and log in with:
- Username: admin
- Password: admin
You'll be prompted to change the password on first login.
Add Prometheus as a data source:
- Click the gear icon (⚙️) in the left sidebar → "Data sources"
- Click "Add data source"
- Select "Prometheus"
- Configure:
  - Name: Expanso Prometheus
  - URL: http://prometheus:9090
  - Access: Server (default)
- Click "Save & Test"
You should see a success message: "Successfully queried the Prometheus API."
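If you rebuild this stack often, you can skip the UI clicks and provision the data source from a file using Grafana's standard provisioning format. A sketch; the mount path assumes you add a provisioning volume to the Grafana service in docker-compose.yml:
# grafana/provisioning/datasources/expanso.yaml (mount path is an assumption)
apiVersion: 1
datasources:
  - name: Expanso Prometheus
    type: prometheus
    access: proxy                # equivalent to "Server" access in the UI
    url: http://prometheus:9090
    isDefault: true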
Create the Fleet Overview Dashboard:
- Click the "+" icon → "Create Dashboard"
- Click "Add visualization"
- Select "Expanso Prometheus" as the data source
Let's create several panels:
Panel 1: Node Connection Status
This panel shows which nodes are connected and healthy.
# Query
count by (node_id, node_hostname) (
expanso_process_cpu_utilization_ratio{service_name="expanso-edge"}
)
- Panel type: Stat
- Title: Connected Edge Nodes
- Description: "Number of edge nodes currently reporting metrics"
- Value options: Last (not null)
- Color scheme: Green-Yellow-Red (by value)
- Thresholds:
- Green: > 0
- Red: 0
Panel 2: CPU Utilization by Node
Shows CPU usage across all edge nodes and orchestrator.
# Query
expanso_process_cpu_utilization_ratio * 100
- Panel type: Time series
- Title: CPU Utilization (%)
- Legend: {{service_name}} - {{node_hostname}}
- Unit: Percent (0-100)
- Y-axis min: 0
- Y-axis max: 100
Panel 3: Memory Usage by Node
Tracks memory consumption across the fleet.
# Query (convert to MB)
expanso_process_memory_usage_bytes / 1024 / 1024
- Panel type: Time series
- Title: Memory Usage (MB)
- Legend: {{service_name}} - {{node_hostname}}
- Unit: megabytes
Panel 4: Goroutine Count (Go Runtime)
Monitor goroutine growth to detect leaks.
# Query
expanso_go_goroutines_count
- Panel type: Time series
- Title: Goroutine Count
- Legend: {{service_name}}
- Y-axis min: 0
Panel 5: Garbage Collection Frequency
Track GC activity across services.
# Query: GC cycles per minute
rate(expanso_go_gc_cycles_total[5m]) * 60
- Panel type: Time series
- Title: GC Cycles per Minute
- Legend: {{service_name}}
- Unit: ops/min
Panel 6: File Descriptor Usage
Monitor file descriptor utilization to prevent exhaustion.
# Query: FD usage percentage
(expanso_process_open_file_descriptors_ratio /
expanso_process_max_file_descriptors_ratio) * 100
- Panel type: Gauge
- Title: File Descriptor Usage (%)
- Legend: {{service_name}}
- Unit: Percent (0-100)
- Thresholds:
- Green: < 50
- Yellow: 50-80
- Red: > 80
Save the dashboard:
Click the save icon (💾) at the top right, name it "Expanso Fleet Overview", and click "Save".
For production dashboards, add variables to filter by environment, region, or specific nodes. In dashboard settings, add variables like $environment or $node_id to make dashboards interactive.
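For example, a node_id variable can be defined as a query-type variable backed by the expression below; panel queries can then filter with {node_id=~"$node_id"}:
label_values(expanso_process_cpu_utilization_ratio, node_id)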
Step 7: Monitor Job Execution Health
Let's create a dashboard focused on job execution state and throughput.
Create a new dashboard: Click "+" → "Create Dashboard"
Panel 1: Jobs by Execution State
This requires metrics from the orchestrator's job controller. While Expanso doesn't expose execution state as Prometheus metrics by default (it's accessible via the API), we can infer health from other signals.
For now, let's monitor job execution indirectly through resource usage patterns.
Panel 2: Edge Node Heartbeat Latency
Monitor the time between heartbeats to detect connectivity issues.
# Query: Time since last heartbeat (approximation)
time() - timestamp(expanso_process_cpu_utilization_ratio{service_name="expanso-edge"})
- Panel type: Time series
- Title: Heartbeat Freshness (seconds)
- Legend: {{node_hostname}}
- Unit: seconds
- Alert threshold: > 90 seconds (3 missed heartbeats)
Panel 3: Process Memory Growth Rate
Detect memory leaks by tracking growth rate.
# Query: Memory growth in MB per hour (deriv() works on gauges; rate() is only for counters)
deriv(expanso_process_memory_usage_bytes[1h]) * 3600 / 1024 / 1024
- Panel type: Time series
- Title: Memory Growth Rate (MB/hour)
- Legend: {{service_name}} - {{node_hostname}}
- Unit: MB/h
Panel 4: GC Pause Times
Monitor garbage collection impact on performance.
# Query: Last GC pause in milliseconds
expanso_go_gc_pause_last_seconds * 1000
- Panel type: Time series
- Title: GC Pause Time (ms)
- Legend: {{service_name}}
- Unit: milliseconds
- Alert threshold: > 100ms
Panel 5: CPU Utilization Heatmap
Visualize CPU patterns across all nodes.
# Query
expanso_process_cpu_utilization_ratio * 100
- Panel type: Heatmap
- Title: CPU Utilization Heatmap
- Legend: {{node_hostname}}
- Color scheme: Green-Yellow-Red
Save this dashboard as "Expanso Job & Node Health".
Step 8: Set Up Alerts for Critical Issues
Let's configure Prometheus alerting rules for common edge infrastructure problems.
Create the alerts directory:
mkdir -p ~/expanso-monitoring/prometheus/alerts
Create alert rules for node health:
Download the alert rules into the directory you just created:
curl -o prometheus/alerts/node-health.yml https://docs.expanso.io/examples/monitoring/node-health-alerts.yml
This includes alerts for:
- Edge node disconnections (>2 minutes of missed metrics)
- Orchestrator downtime
- High CPU usage (>85% for 5 minutes)
- High memory usage (>1GB for 5 minutes)
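As a reference for what such a rule looks like, here is a sketch of a node-disconnection alert built on the heartbeat-freshness expression from Step 7. The downloaded node-health-alerts.yml is authoritative and its thresholds may differ.
groups:
  - name: node_health_sketch
    rules:
      - alert: EdgeNodeDisconnected
        expr: |
          (time() - timestamp(
            expanso_process_cpu_utilization_ratio{service_name="expanso-edge"}
          )) > 120
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Edge node {{ $labels.node_hostname }} has stopped reporting metrics"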
Create alert rules for resource exhaustion:
Download the resource limit alerts:
curl -o prometheus/alerts/resource-limits.yml https://docs.expanso.io/examples/monitoring/resource-limits-alerts.yml
This includes alerts for:
- High file descriptor usage (>80%)
- Memory leak detection (>10MB/s growth)
- Goroutine leaks (>5000 goroutines)
- Excessive GC activity (>2 cycles/second)
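For comparison, a file-descriptor alert in that file might look roughly like this sketch (again, the downloaded rules are authoritative):
groups:
  - name: resource_limits_sketch
    rules:
      - alert: HighFileDescriptorUsage
        expr: |
          (expanso_process_open_file_descriptors_ratio
            / expanso_process_max_file_descriptors_ratio) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.service_name }} on {{ $labels.node_hostname }} is above 80% file descriptor usage"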
Reload Prometheus configuration:
# Reload config without restarting
curl -X POST http://localhost:9091/-/reload
Verify alerts are loaded:
Navigate to http://localhost:9091/alerts in Prometheus UI. You should see all configured alerts listed.
Check alert states:
The alerts will show as:
- Green (Inactive): No issues detected
- Yellow (Pending): Issue detected but waiting for the for duration to elapse
- Red (Firing): Alert is active
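You can also inspect alert states from the command line through the Prometheus API:
curl -s http://localhost:9091/api/v1/alerts \
  | jq '.data.alerts[] | {alert: .labels.alertname, state: .state}'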
Step 9: Access and Correlate Logs
Metrics tell you when something's wrong, but logs tell you why. Let's integrate log access into your monitoring workflow.
Accessing orchestrator logs:
# View recent orchestrator logs
expanso orchestrator logs --tail=100
# Follow logs in real-time
expanso orchestrator logs --follow
# Filter for errors
expanso orchestrator logs --tail=500 | grep -i error
# Filter for specific node
expanso orchestrator logs --tail=500 | grep node-a1b2c3d4e5f6
Accessing edge node logs:
# On the edge node machine
sudo journalctl -u expanso-edge -n 100
# Follow in real-time
sudo journalctl -u expanso-edge -f
# Filter by time range
sudo journalctl -u expanso-edge --since "10 minutes ago"
# Export to file for analysis
sudo journalctl -u expanso-edge --since "1 hour ago" > edge-debug.log
Accessing job execution logs:
# List executions for a job
expanso job executions syslog-processor
# Get logs from a specific execution
expanso execution logs exec-xyz789 --tail=200
# Follow execution logs in real-time
expanso execution logs exec-xyz789 --follow
Correlating metrics with logs:
When you see an alert fire in Grafana, here's how to debug:
- Note the timestamp from the Grafana panel
- Identify the affected node from the metric labels
- Query logs around that time:
# If orchestrator CPU spiked at 15:30
expanso orchestrator logs --since "15:25" --until "15:35" | grep -i cpu
# If edge node disconnected
ssh user@edge-node
sudo journalctl -u expanso-edge --since "15:25" --until "15:35"
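To pin down the exact window of a spike before grepping logs, you can also pull the raw samples for that period from the Prometheus HTTP API (the timestamps below are placeholders):
curl -sG 'http://localhost:9091/api/v1/query_range' \
  --data-urlencode 'query=expanso_process_cpu_utilization_ratio{service_name="expanso-orchestrator"}' \
  --data-urlencode 'start=2024-01-01T15:20:00Z' \
  --data-urlencode 'end=2024-01-01T15:40:00Z' \
  --data-urlencode 'step=15s' | jq '.data.result'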
Common log patterns to look for:
# Node connection issues
grep "connection refused\|timeout\|network error" edge-debug.log
# Job execution failures
grep "execution failed\|pipeline error\|fatal" orchestrator.log
# Resource exhaustion
grep "out of memory\|too many open files\|disk full" *.log
# Authentication problems
grep "unauthorized\|authentication failed\|invalid token" *.log
Step 10: Debug Common Issues Using Metrics
Let's walk through real debugging scenarios using the monitoring stack you've built.
Scenario 1: Edge Node Shows Intermittent Disconnections
Symptom: Grafana shows gaps in metrics for a specific edge node.
Investigation:
- Check the heartbeat freshness panel—you see periodic spikes to 120+ seconds
- Query Prometheus for network quality:
# Samples collected in the last hour; with a 15s export interval you expect
# roughly 240, so a lower count indicates gaps in metrics collection
count_over_time(expanso_process_cpu_utilization_ratio{node_id="node-abc123"}[1h])
- SSH to the edge node and check network:
# Ping orchestrator
ping -c 10 orchestrator.example.com
# Check packet loss
mtr orchestrator.example.com
# Review network errors
sudo journalctl -u expanso-edge -n 200 | grep -i "connection\|network"
Common causes:
- Unstable WiFi/cellular connection (increase heartbeat interval)
- Firewall intermittently blocking the OTLP collector port (4317) or the orchestrator connection
- Network congestion during peak hours
Solution: Increase heartbeat tolerance or improve network infrastructure.
Scenario 2: Orchestrator Memory Usage Growing Continuously
Symptom: Memory growth rate panel shows consistent positive trend.
Investigation:
- Check Go heap metrics in Grafana:
# Heap allocation over time
expanso_go_memory_heap_alloc_bytes{service_name="expanso-orchestrator"}
# Memory growth rate (deriv() works on gauges such as heap allocation)
deriv(expanso_go_memory_heap_alloc_bytes[1h])
- Compare heap vs. system memory:
# If heap is stable but RSS grows, it's a different issue
expanso_process_memory_usage_bytes{service_name="expanso-orchestrator"} -
expanso_go_memory_heap_alloc_bytes{service_name="expanso-orchestrator"}
- Check goroutine count for leaks:
expanso_go_goroutines_count{service_name="expanso-orchestrator"}
- If goroutines are growing, get a goroutine dump:
# Enable pprof endpoint (if configured)
curl 'http://localhost:6060/debug/pprof/goroutine?debug=1' > goroutines.txt
# Analyze the dump for stuck goroutines
grep -A 5 "created by" goroutines.txt | sort | uniq -c | sort -rn
Common causes:
- Goroutine leaks from unclosed connections
- Cached data not being evicted
- Large number of historical records in state store
Solution: Review the goroutine dump, fix leaks, or implement cache size limits.
Scenario 3: Job Execution Not Processing Data
Symptom: Job is marked as "running" but no output is produced.
Investigation:
- Check if the edge node is connected:
expanso node list
- Get execution details:
expanso execution describe exec-xyz789
- Check execution logs for errors:
expanso execution logs exec-xyz789 --tail=100
- Look for resource constraints on the edge node:
# CPU at limits?
expanso_process_cpu_utilization_ratio{node_id="node-abc123"} > 0.95
# Memory at limits?
expanso_process_memory_usage_bytes{node_id="node-abc123"}
- SSH to edge node and check local pipeline status:
# Check if pipeline process is running
expanso-edge status
# Check local buffer for backed-up data
ls -lh /var/lib/expanso/buffer/
Common causes:
- Input source (file, network) is not producing data
- Pipeline filter is dropping all messages
- Output destination is unreachable (network partition)
- Resource limits preventing processing
Solution: Review pipeline configuration, check input/output connectivity, verify resource allocation.
Step 11: Track Deployment Progress
When you roll out job updates, monitoring deployment progress prevents issues from affecting your entire fleet.
Monitor rolling deployment progress:
# Start a rolling deployment
expanso job deploy updated-job.yaml
# Watch deployment status
watch -n 2 'expanso deployment list'
In Grafana, a deployment-tracking panel (for example, a count of nodes by execution version) would require custom metrics that the orchestrator doesn't expose yet, so for now monitor rollout progress via the API or CLI as shown above.
Best practices for deployment monitoring:
- Watch for failures during rollout:
# Monitor execution health during deployment
watch -n 5 'expanso job executions my-job | grep -E "failed|error"'
- Verify new version before proceeding:
# Check health of new version executions
expanso job executions my-job --version=2 | grep healthy
- Roll back if issues detected:
# If new version shows problems
expanso job rerun my-job --version=1
Create a deployment dashboard panel:
Add to your Grafana dashboard to show deployment activity:
# Recent process restarts (an approximation of deployment activity):
# a reset of the Go GC cycle counter means the process restarted
# (requires include_go_metrics: true)
resets(expanso_go_gc_cycles_total[15m])
Step 12: Create Custom Alerts for Your Workload
Generic alerts are a start, but you'll want workload-specific alerts based on your edge computing use case.
Example: Log Processing Pipeline Alerts
If you're processing syslog data:
groups:
  - name: syslog_pipeline
    interval: 30s
    rules:
      # Alert if log processing throughput drops
      - alert: LowLogThroughput
        expr: |
          # This requires custom metrics from your pipeline
          # Placeholder: monitor CPU as a proxy for activity (avg_over_time, since this is a gauge)
          avg(avg_over_time(expanso_process_cpu_utilization_ratio{
            service_name="expanso-edge"
          }[5m])) < 0.05
        for: 10m
        labels:
          severity: warning
          workload: syslog
        annotations:
          summary: "Log processing throughput is low"
          description: "Average edge node CPU usage is below 5%, indicating low log volume or processing issues"
Example: IoT Data Collection Alerts
If you're collecting sensor data:
groups:
  - name: iot_data_collection
    interval: 30s
    rules:
      # Alert if edge nodes stop collecting data
      - alert: DataCollectionStalled
        expr: |
          # Monitor memory growth as a proxy for data buffering (delta(), since this is a gauge)
          delta(expanso_process_memory_usage_bytes{
            service_name="expanso-edge"
          }[10m]) == 0
        for: 15m
        labels:
          severity: warning
          workload: iot
        annotations:
          summary: "Data collection may have stalled on {{ $labels.node_hostname }}"
          description: "Edge node memory is not growing, which may indicate no data is being buffered"
Reload Prometheus to apply custom alerts:
curl -X POST http://localhost:9091/-/reload
Verification Checklist
Let's verify your complete monitoring setup:
- ✅ Orchestrator configured to export metrics via OTLP
- ✅ Edge nodes configured to export metrics via OTLP
- ✅ OpenTelemetry Collector receiving metrics from all components
- ✅ Prometheus scraping metrics from OTLP Collector
- ✅ Grafana connected to Prometheus data source
- ✅ Fleet Overview dashboard created with key health metrics
- ✅ Job & Node Health dashboard created
- ✅ Alert rules configured for node disconnections
- ✅ Alert rules configured for resource exhaustion
- ✅ Alerts visible in Prometheus UI
- ✅ Log access configured for orchestrator and edge nodes
- ✅ Able to correlate metrics with logs for debugging
- ✅ Tested at least one debugging scenario
- ✅ Custom workload-specific alerts configured
If all items are checked, congratulations! You have production-grade monitoring for your edge fleet.
What You Learned
You've built a comprehensive monitoring stack for Expanso:
- ✅ Configured OpenTelemetry metrics export from orchestrator and edge nodes
- ✅ Deployed and configured Prometheus for metrics collection
- ✅ Created Grafana dashboards for fleet health visualization
- ✅ Set up process metrics (CPU, memory, file descriptors)
- ✅ Enabled Go runtime metrics (GC, goroutines, heap)
- ✅ Configured alerts for critical issues (node disconnections, resource exhaustion)
- ✅ Integrated log access for debugging
- ✅ Practiced debugging common edge infrastructure problems
- ✅ Monitored deployment progress and health
- ✅ Created custom workload-specific alerts
Key Concepts
OpenTelemetry (OTLP): Industry-standard protocol for exporting telemetry data (metrics, traces, logs). Expanso uses OTLP to send metrics to collectors, making it compatible with any OTLP-compatible backend.
Process Metrics: OS-level metrics about running processes including CPU utilization, memory usage (RSS and virtual), and file descriptor counts. Always enabled in Expanso.
Go Runtime Metrics: Go-specific metrics about garbage collection, goroutine counts, heap allocation, and memory management. Optional in Expanso, enabled via include_go_metrics: true.
Heartbeats: Periodic health reports from edge nodes to the orchestrator (every 30 seconds by default). Missing heartbeats indicate connectivity issues or node failures.
Metric Cardinality: The number of unique time series in Prometheus. High cardinality (many unique label combinations) increases storage and query costs. Monitor cardinality in production deployments.
PromQL (Prometheus Query Language): Query language for retrieving and aggregating metrics from Prometheus. Supports functions like rate(), increase(), histogram_quantile() for analysis.
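For instance, both of these expressions work against the GC counter exposed in this tutorial:
# Per-second GC rate over the last 5 minutes
rate(expanso_go_gc_cycles_total[5m])
# Total GC cycles in the last hour
increase(expanso_go_gc_cycles_total[1h])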
Alert Rules: Prometheus expressions that evaluate to true/false. When an alert rule evaluates to true for the specified for duration, the alert fires.
Observability Pillars: Metrics (what's happening), logs (why it's happening), traces (how it's happening). This tutorial covered metrics and logs; tracing is covered in advanced guides.
Next Steps
Now that you have comprehensive monitoring, here's what to explore next:
Advanced Monitoring:
Alerting and Incident Response:
Performance Optimization:
Production Operations:
Architecture Understanding:
Troubleshooting
Metrics Not Appearing in Prometheus
Symptom: Prometheus UI shows no metrics with expanso_ prefix.
Diagnosis:
# Check OTLP collector is receiving data
docker compose logs otel-collector | grep -i "received"
# Check Prometheus is scraping successfully
curl http://localhost:9091/api/v1/targets | jq
Common causes:
- OTLP collector not reachable: Verify network connectivity from orchestrator/edge nodes
- Telemetry disabled: Check do_not_track: false in configs
- Wrong endpoint: Verify endpoint matches the OTLP collector address
- Firewall blocking: Ensure port 4317 (gRPC) or 4318 (HTTP) is open
Solution:
# Test connectivity from orchestrator
nc -zv localhost 4317
# Check telemetry logs
sudo journalctl -u expanso-orchestrator | grep -i telemetry
# Restart with debug logging
# In config: logging.level = "debug"
sudo systemctl restart expanso-orchestrator
Alerts Not Firing When Expected
Symptom: Alert conditions are met but alerts don't show as firing.
Diagnosis:
# Check alert rule syntax
curl http://localhost:9091/api/v1/rules | jq
# Manually evaluate alert expression
# In Prometheus UI, run the alert's expr query
Common causes:
- for duration not elapsed: Alert is pending; wait for the full duration
- Syntax error in rule: Check YAML indentation and PromQL syntax
- Label mismatch: Alert expr doesn't match any actual metrics
- Prometheus not reloaded: Alert rules weren't reloaded after changes
Solution:
# Validate alert rule syntax
promtool check rules prometheus/alerts/*.yml
# Reload Prometheus config
curl -X POST http://localhost:9091/-/reload
# Check Prometheus logs for errors
docker compose logs prometheus | grep -i error
Grafana Shows "No Data"
Symptom: Grafana panels show "No data" despite Prometheus having metrics.
Diagnosis:
- Test the query in Prometheus UI first
- Check the time range in Grafana (top-right)
- Verify data source is correctly configured
Common causes:
- Query syntax error: PromQL syntax differs between Prometheus and Grafana
- Time range mismatch: Data exists but not in selected time range
- Data source issue: Grafana can't reach Prometheus
- Legend template error: Invalid label syntax in legend
Solution:
# Test Prometheus data source in Grafana
# Settings → Data Sources → Expanso Prometheus → Save & Test
# Check Grafana logs
docker compose logs grafana | grep -i error
# Verify Prometheus is reachable from Grafana container
docker compose exec grafana curl 'http://prometheus:9090/api/v1/query?query=up'
High Metric Cardinality Warning
Symptom: Prometheus logs warn about high cardinality or memory usage.
Diagnosis:
# Count unique time series
count({__name__=~"expanso_.*"})
# Find metrics with most series
topk(10, count by (__name__) ({__name__=~"expanso_.*"}))
# Identify high-cardinality labels
count by (node_id, service_name) ({__name__=~"expanso_.*"})
Common causes:
- Many unique node IDs: Normal for large fleets, consider sampling
- High-cardinality labels: Labels with many unique values (timestamps, IDs)
- Metrics explosion: New metrics added without considering cardinality impact
Solution:
# In Prometheus config, add metric relabeling to drop high-cardinality labels
scrape_configs:
  - job_name: 'expanso-metrics'
    metric_relabel_configs:
      # Drop the execution_id label if it is not needed (labeldrop matches label names by regex)
      - regex: execution_id
        action: labeldrop
Edge Node Metrics Delayed or Inconsistent
Symptom: Some edge nodes show stale metrics or gaps in data.
Diagnosis:
# Check edge node telemetry logs
ssh user@edge-node
sudo journalctl -u expanso-edge | grep -i "telemetry\|export"
# Check network latency to collector
ping -c 10 orchestrator.example.com
Common causes:
- Network latency: High latency to OTLP collector
- Export interval too short: Increase export_interval for edge nodes
- Collector overload: OTLP collector can't handle the export rate
- Batch size too small: Increase batch size in OTLP collector config
Solution:
# In the edge node config, increase the export interval for unstable networks
telemetry:
  export_interval: "60s"   # Increase from 15s to 60s

# In the OTLP collector config, increase the batch size
processors:
  batch:
    timeout: 30s
    send_batch_size: 2048  # Increase from 1024
Need More Help?
If you're still experiencing issues:
- Enable debug logging in all components:
  logging:
    level: debug
- Collect diagnostic information:
  # Orchestrator diagnostics
  expanso orchestrator logs --tail=500 > orch-debug.log
  # Prometheus diagnostics
  curl http://localhost:9091/api/v1/status/tsdb > prom-status.json
  # OTLP collector diagnostics
  docker compose logs otel-collector > otel-debug.log
- Check the documentation
- Community support:
  - Search GitHub Issues
  - Ask on Discord