
Monitor Your Edge Fleet

You've deployed jobs to edge nodes, and they're processing data across your distributed infrastructure. But how do you know everything's working? When a node goes offline, when a job fails, or when throughput drops—you need to know immediately.

In this tutorial, you'll build a complete monitoring stack for Expanso using Prometheus and Grafana. You'll set up metrics collection, create dashboards to visualize fleet health, configure alerts for critical issues, and learn how to debug problems using observability data. By the end, you'll have production-grade monitoring that keeps you informed about your edge infrastructure.

This tutorial takes about 45-60 minutes to complete.

What You'll Learn

  • How to configure Expanso to export metrics via OpenTelemetry
  • How to set up Prometheus to collect metrics from orchestrator and edge nodes
  • How to create Grafana dashboards for job and node health visualization
  • How to monitor job execution states and throughput
  • How to track node connectivity, heartbeats, and resources
  • How to configure alerts for node disconnections, job failures, and performance issues
  • How to correlate logs with metrics for debugging
  • How to identify and troubleshoot common edge infrastructure problems

Prerequisites

Before starting, make sure you have:

  • A running Expanso deployment with orchestrator and at least one edge node (from the First Edge Deployment tutorial)
  • At least one deployed job actively processing data
  • Docker or Podman installed for running Prometheus and Grafana
  • Basic familiarity with Prometheus query language (PromQL) is helpful but not required
Production vs. Development

This tutorial uses Docker Compose for quick setup. For production deployments, you'll want to run Prometheus and Grafana on dedicated infrastructure with persistent storage, high availability, and proper security configurations.

Step 1: Understand Expanso's Metrics Architecture

Before diving into configuration, let's understand what metrics Expanso exposes and how they flow through the observability stack.

Expanso uses OpenTelemetry for metrics export, which means it can send data to any OTLP-compatible collector.
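
At a high level, the flow used in this tutorial is: the orchestrator and edge nodes push OTLP metrics to an OpenTelemetry Collector, Prometheus scrapes the Collector, and Grafana queries Prometheus. As a rough sketch (ports match the defaults used later in this tutorial):

orchestrator ---+
                +--> OpenTelemetry Collector (:4317 gRPC / :4318 HTTP) --> Prometheus (:9091) --> Grafana (:3000)
edge nodes   ---+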

Key metric categories:

  1. Process metrics (CPU, memory, file descriptors) - Always enabled
  2. Go runtime metrics (GC, goroutines, heap) - Optional, disabled by default
  3. Job execution metrics (state, throughput, errors) - Application-specific
  4. Node connection metrics (heartbeats, session state) - Network health
  5. Deployment metrics (rollout progress, health checks) - Job lifecycle
  6. Disk metrics (available space, directory sizes) - Always enabled

For this tutorial, the OpenTelemetry Collector receives OTLP metrics from Expanso components and exposes them on a Prometheus-compatible endpoint, which Prometheus scrapes and stores.

Step 2: Configure Orchestrator to Export Metrics

Let's configure the Expanso orchestrator to export metrics via OpenTelemetry.

Update your orchestrator configuration:

Add the telemetry section to /etc/expanso/orchestrator-config.yaml. Download the complete configuration:

curl -O https://docs.expanso.io/examples/monitoring/orchestrator-telemetry-config.yaml

Or view the configuration file

Key settings:

  • endpoint: OTLP collector address (localhost:4317 for co-located collector)
  • export_interval: How often to export metrics (15s recommended)
  • include_go_metrics: Enable detailed Go runtime metrics
  • resource_attributes: Add custom labels for filtering in Prometheus
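
For orientation, here is a minimal sketch of what such a telemetry section might look like, using the settings listed above. The resource attribute values are illustrative placeholders, and the downloadable file above is the source of truth for the exact schema:

telemetry:
  endpoint: "localhost:4317"      # OTLP gRPC collector address (co-located collector)
  export_interval: "15s"          # how often metrics are exported
  include_go_metrics: true        # enable detailed Go runtime metrics
  resource_attributes:            # custom labels that appear in Prometheus
    environment: "production"     # placeholder value
    region: "us-east-1"           # placeholder value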

Restart the orchestrator to apply the configuration:

sudo systemctl restart expanso-orchestrator

# Verify telemetry is initialized
sudo journalctl -u expanso-orchestrator -n 50 | grep -i telemetry

You should see log entries indicating telemetry initialization:

INFO Telemetry enabled: endpoint=localhost:4317 protocol=grpc
INFO Process metrics collector initialized: interval=15s
INFO Go metrics collector initialized: interval=15s
Connection Errors Are Normal

Until we start the OTLP collector in the next step, you'll see connection errors in the logs. This is expected—the orchestrator will retry connecting automatically.

Step 3: Configure Edge Nodes to Export Metrics

Now let's configure edge nodes to send metrics to the same collector.

Update each edge node's configuration:

Add the telemetry section to /etc/expanso/edge-config.yaml. Download the configuration:

curl -O https://docs.expanso.io/examples/monitoring/edge-telemetry-config.yaml

Or view the configuration file

Important: Update the endpoint to point to your OTLP collector (typically the orchestrator hostname).
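
As a sketch (again, defer to the downloaded file for the exact schema), the edge telemetry section differs mainly in the endpoint, which must point at the collector host rather than localhost; the hostname below is a placeholder:

telemetry:
  endpoint: "orchestrator.example.com:4317"   # OTLP collector on the orchestrator host (placeholder hostname)
  export_interval: "15s"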

Restart edge nodes:

# On each edge node
sudo systemctl restart expanso-edge

# Verify telemetry configuration
sudo journalctl -u expanso-edge -n 20 | grep -i telemetry
Centralized vs. Distributed Collection

In this tutorial, edge nodes send metrics directly to a central OTLP collector. For large-scale deployments, you can run a local collector on each edge site and aggregate to a central Prometheus instance. This reduces network traffic and improves resilience.
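
A hedged sketch of a per-site collector configuration for that topology: it receives OTLP locally and forwards metrics to a central Prometheus via remote write. The prometheusremotewrite exporter requires the collector's contrib distribution, the central Prometheus must have its remote-write receiver enabled, and the hostname is a placeholder:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  prometheusremotewrite:
    endpoint: https://prometheus.example.com/api/v1/write   # placeholder central Prometheus

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]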

Step 4: Set Up the OpenTelemetry Collector and Prometheus

Now let's deploy the observability stack using Docker Compose.

Create a directory for monitoring configuration:

mkdir -p ~/expanso-monitoring/{prometheus,grafana,otel-collector}
cd ~/expanso-monitoring

Create the OpenTelemetry Collector configuration:

Download the configuration file:

curl -o otel-collector/config.yaml https://docs.expanso.io/examples/monitoring/otel-collector-config.yaml

Or view the configuration file
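
If you want to know what to expect before downloading, here is a hedged sketch of a typical collector configuration for this setup: an OTLP receiver plus a Prometheus exporter endpoint for Prometheus to scrape. The exporter port 8889 and the use of the collector's contrib distribution are assumptions; the downloaded file may differ:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  prometheus:                  # pull-based exporter (contrib distribution)
    endpoint: 0.0.0.0:8889     # scrape target for Prometheus (assumed port)

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]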

Create the Prometheus configuration:

Download the configuration file:

curl -o prometheus/prometheus.yml https://docs.expanso.io/examples/monitoring/prometheus.yml

Or view the configuration file
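
A sketch of the scrape configuration, assuming the collector's Prometheus exporter listens on port 8889 (as in the sketch above) and is reachable as otel-collector inside the Compose network; the downloaded file is authoritative:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'expanso-metrics'
    static_configs:
      - targets: ['otel-collector:8889']   # collector's Prometheus exporter endpoint (assumed port)

rule_files:
  - /etc/prometheus/alerts/*.yml           # alert rules added in Step 8 (assumed mount path)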

Create the Docker Compose file:

Download the complete stack configuration:

curl -o docker-compose.yml https://docs.expanso.io/examples/monitoring/docker-compose.yml

Or view the configuration file

This sets up:

  • OTLP Collector on ports 4317 (gRPC) and 4318 (HTTP)
  • Prometheus on port 9091 (web UI)
  • Grafana on port 3000
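
The downloadable Compose file wires the three services listed above together roughly like this sketch; image tags, volume paths, and command flags are assumptions:

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    container_name: expanso-otel-collector
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector/config.yaml:/etc/otelcol/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP

  prometheus:
    image: prom/prometheus:latest
    container_name: expanso-prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --web.enable-lifecycle             # allows the /-/reload endpoint used later
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts:/etc/prometheus/alerts
    ports:
      - "9091:9090"   # host 9091 -> container 9090

  grafana:
    image: grafana/grafana:latest
    container_name: expanso-grafana
    ports:
      - "3000:3000"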

Start the monitoring stack:

docker compose up -d

# Verify all containers are running
docker compose ps

You should see all three containers running:

NAME                      STATUS          PORTS
expanso-otel-collector    Up 10 seconds   0.0.0.0:4317-4318->4317-4318/tcp
expanso-prometheus        Up 10 seconds   0.0.0.0:9091->9090/tcp
expanso-grafana           Up 10 seconds   0.0.0.0:3000->3000/tcp

Verify metrics are flowing:

# Check OTLP collector logs
docker compose logs -f otel-collector

# You should see messages about receiving metrics
Firewall Configuration

If your orchestrator and edge nodes are on different machines, make sure port 4317 (OTLP gRPC) is open on the collector host. Use sudo ufw allow 4317/tcp or equivalent for your firewall.

Step 5: Verify Metrics in Prometheus

Let's verify that Prometheus is collecting metrics from Expanso components.

Open the Prometheus web UI:

Navigate to http://localhost:9091 in your browser.

Run a basic query to verify data collection:

In the query box, enter:

expanso_process_cpu_utilization_ratio

Click "Execute" and switch to the "Graph" tab. You should see CPU utilization metrics for both the orchestrator and any connected edge nodes.

Check available metrics:

Click the "Graph" dropdown and explore the available metrics. You should see:

Process Metrics:
- expanso_process_cpu_utilization_ratio
- expanso_process_memory_usage_bytes
- expanso_process_memory_virtual_bytes
- expanso_process_open_file_descriptors_ratio
- expanso_process_max_file_descriptors_ratio

Go Runtime Metrics (if enabled):
- expanso_go_memory_heap_alloc_bytes
- expanso_go_memory_heap_sys_bytes
- expanso_go_gc_cycles_total
- expanso_go_gc_pause_last_seconds
- expanso_go_goroutines_count
- expanso_go_max_procs

Log Streaming Metrics (orchestrator only):
- websocket_proxy_connections_active
- websocket_proxy_connections_total
- websocket_proxy_messages_client_to_server
- websocket_proxy_messages_server_to_client
- websocket_proxy_disconnections_total
- websocket_proxy_connections_duration (histogram)
- logstream_consumers_active
- logstream_consumers_connects
- logstream_consumers_disconnects
- logstream_streams_enabled
- logstream_streams_disabled
- logstream_cleanup_runs
- logstream_cleanup_duration (histogram)
- logstream_cleanup_errors

NATS Transport Metrics:
- nats_connection_status (1=connected, 0=disconnected) - available on both orchestrators and edge nodes

Disk Metrics:
- disk_available (available free space on data directory filesystem)
- disk_state_size (total size of the state directory)
- disk_executions_size (total size of the executions directory, edge only)
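
For example, a quick check for nodes running low on space, assuming disk_available is reported in bytes (verify the unit in your deployment):

disk_available < 5 * 1024 * 1024 * 1024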

Query metrics by service:

To see only orchestrator metrics:

expanso_process_memory_usage_bytes{service_name="expanso-orchestrator"}

To see only edge node metrics:

expanso_process_memory_usage_bytes{service_name="expanso-edge"}

Check metric cardinality:

Count how many unique service instances are reporting:

count by (service_name, node_id) (expanso_process_cpu_utilization_ratio)

You should see one entry for the orchestrator and one for each connected edge node.

No metrics appearing?

See the Metrics Not Appearing in Prometheus troubleshooting section below.

Step 6: Create Grafana Dashboards for Fleet Monitoring

Now let's create comprehensive dashboards to visualize your edge fleet's health.

Log into Grafana:

Navigate to http://localhost:3000 and log in with:

  • Username: admin
  • Password: admin

You'll be prompted to change the password on first login.

Add Prometheus as a data source:

  1. Click the gear icon (⚙️) in the left sidebar → "Data sources"
  2. Click "Add data source"
  3. Select "Prometheus"
  4. Configure:
    • Name: Expanso Prometheus
    • URL: http://prometheus:9090
    • Access: Server (default)
  5. Click "Save & Test"

You should see a success message: "Successfully queried the Prometheus API."

Create the Fleet Overview Dashboard:

  1. Click the "+" icon → "Create Dashboard"
  2. Click "Add visualization"
  3. Select "Expanso Prometheus" as the data source

Let's create several panels:

Panel 1: Node Connection Status

This panel shows which nodes are connected and healthy.

# Query
count by (node_id, node_hostname) (
expanso_process_cpu_utilization_ratio{service_name="expanso-edge"}
)
  • Panel type: Stat
  • Title: Connected Edge Nodes
  • Description: "Number of edge nodes currently reporting metrics"
  • Value options: Last (not null)
  • Color scheme: Green-Yellow-Red (by value)
  • Thresholds:
    • Green: > 0
    • Red: 0

Panel 2: CPU Utilization by Node

Shows CPU usage across all edge nodes and orchestrator.

# Query
expanso_process_cpu_utilization_ratio * 100
  • Panel type: Time series
  • Title: CPU Utilization (%)
  • Legend: {{service_name}} - {{node_hostname}}
  • Unit: Percent (0-100)
  • Y-axis min: 0
  • Y-axis max: 100

Panel 3: Memory Usage by Node

Tracks memory consumption across the fleet.

# Query (convert to MB)
expanso_process_memory_usage_bytes / 1024 / 1024
  • Panel type: Time series
  • Title: Memory Usage (MB)
  • Legend: {{service_name}} - {{node_hostname}}
  • Unit: megabytes

Panel 4: Goroutine Count (Go Runtime)

Monitor goroutine growth to detect leaks.

# Query
expanso_go_goroutines_count
  • Panel type: Time series
  • Title: Goroutine Count
  • Legend: {{service_name}}
  • Y-axis min: 0

Panel 5: Garbage Collection Frequency

Track GC activity across services.

# Query: GC cycles per minute
rate(expanso_go_gc_cycles_total[5m]) * 60
  • Panel type: Time series
  • Title: GC Cycles per Minute
  • Legend: {{service_name}}
  • Unit: ops/min

Panel 6: File Descriptor Usage

Monitor file descriptor utilization to prevent exhaustion.

# Query: FD usage percentage
(expanso_process_open_file_descriptors_ratio /
expanso_process_max_file_descriptors_ratio) * 100
  • Panel type: Gauge
  • Title: File Descriptor Usage (%)
  • Legend: {{service_name}}
  • Unit: Percent (0-100)
  • Thresholds:
    • Green: < 50
    • Yellow: 50-80
    • Red: > 80

Save the dashboard:

Click the save icon (💾) at the top right, name it "Expanso Fleet Overview", and click "Save".

Dashboard Variables

For production dashboards, add variables to filter by environment, region, or specific nodes. In dashboard settings, add variables like $environment or $node_id to make dashboards interactive.
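
For instance, a node_id variable can be populated from Prometheus labels with a query such as the one below; label_values is a helper provided by Grafana's Prometheus data source, and the metric is one already used in this tutorial:

label_values(expanso_process_cpu_utilization_ratio, node_id)

Panel queries can then filter on the selection with a matcher like {node_id=~"$node_id"}.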

Import the Ops Overview Dashboard

Start with the Ops Overview dashboard—it gives you a cross-environment view of your entire system and is the best starting point for investigating system health.

Download the dashboard:

# Clone or navigate to the Expanso repository
git clone https://github.com/expanso-io/expanso.git
cd expanso/docs/grafana/dashboards/

  • Ops Overview (ops-overview.json): Cross-environment health overview with composite status, error rates, latency, alert counts, and edge node counts

Import into Grafana:

  1. Click the "+" icon → "Import"
  2. Click "Upload JSON file" and select ops-overview.json
  3. Select your Prometheus data source when prompted
  4. Click "Import"

The dashboard includes:

  • Environment Health Table: Health status per namespace (Healthy/Degraded/Critical), error rates, p95 latency, alert counts, and edge node counts
  • Error Rate Trends: Error rates by environment over time
  • p95 Latency Trends: Latency percentiles by environment over time
  • Connected Edge Nodes: Edge node counts over time

Use the navigation links at the top of Ops Overview to drill down into specific orchestrator or edge dashboards.

Alternative: Import Pre-Built Orchestrator Dashboards

Skip manual dashboard creation by importing pre-built dashboards from the Expanso repository.

Download the dashboards:

The dashboards are in the docs/grafana/dashboards/orchestrator/ directory:

# Clone or navigate to the Expanso repository
git clone https://github.com/expanso-io/expanso.git
cd expanso/docs/grafana/dashboards/orchestrator/

Available dashboards:

  • Golden Signals (golden-signals.json): The four golden signals: latency, traffic, errors, and saturation
  • Process Metrics (process-metrics.json): CPU, memory, file descriptors, goroutines, GC, and database operations
  • Planner & Scheduler (planner-scheduler.json): Scheduler duration/rates, execution tracking, node selection, planner performance
  • Log Streaming (log-streaming-observability.json): WebSocket connections, log stream consumers, state changes, cleanup operations
  • HTTP API Metrics (http-api-metrics.json): Request rates, latency percentiles, error rates, authentication failures, and validation errors

Import into Grafana:

  1. Click the "+" icon → "Import"
  2. Click "Upload JSON file" and select a dashboard JSON file
  3. Select your Prometheus data source when prompted
  4. Click "Import"

All dashboards use service_namespace for filtering. Configure it using the dropdown at the top of each dashboard.

Dashboard details:

Golden Signals — High-level service health based on Google's SRE framework:

  • Latency (request duration percentiles)
  • Traffic (request rate and active requests)
  • Errors (4xx/5xx error rates)
  • Saturation (CPU, memory, goroutines, file descriptors)

Process Metrics — Orchestrator health:

  • Process stats (CPU, memory, goroutines, file descriptors)
  • CPU/memory time series by instance
  • File descriptor tracking and instance summary
  • Database operations (duration, rates, throughput, store size)
  • Go runtime (heap, stack, GC metrics)

Planner & Scheduler — Job scheduling behavior:

  • Processing and execution rate stats
  • Scheduler duration percentiles (p50/p90/p99)
  • Processing rate by status and sub-operations
  • Execution creation vs loss tracking
  • Node selection (matched/rejected counts, rejection reasons)

Log Streaming — Log streaming infrastructure:

  • WebSocket proxy (connections, message rates, disconnections)
  • Log stream manager (consumer counts, state changes)
  • Cleanup operations
  • Per-job metrics for top consumers and connections

Alternative: Import Pre-Built Edge Node Dashboards

Pre-built dashboards for edge node monitoring are available in the Expanso repository.

Download the dashboards:

# Clone or navigate to the Expanso repository
git clone https://github.com/expanso-io/expanso.git
cd expanso/docs/grafana/dashboards/edge/

Available dashboards:

  • Golden Signals (golden-signals.json): The four golden signals: latency, traffic, errors, and saturation
  • Process Metrics (process-metrics.json): CPU, memory, goroutines, file descriptors, and Go runtime metrics
  • Pipeline Performance (pipeline-performance.json): Pipeline throughput, latency, errors, and data volume
  • NATS Communication (nats-communication.json): NATS messaging metrics, connection status, and message rates
  • Database Operations (database-operations.json): Database operation latency, counts, and data read/write metrics
  • HTTP API Metrics (http-api-metrics.json): Request rates, latency percentiles, error rates, authentication failures, and validation errors

All dashboards filter on service_namespace and service_instance_id, targeting service_name="expanso-edge" for edge-specific metrics.

Import into Grafana:

  1. Click the "+" icon → "Import"
  2. Click "Upload JSON file" and select a dashboard JSON file
  3. Select your Prometheus data source when prompted
  4. Click "Import"

Step 7: Monitor Job Execution Health

Let's create a dashboard focused on job execution state and throughput.

Create a new dashboard: Click "+" → "Create Dashboard"

Panel 1: Jobs by Execution State

This requires metrics from the orchestrator's job controller. While Expanso doesn't expose execution state as Prometheus metrics by default (it's accessible via the API), we can infer health from other signals.

For now, let's monitor job execution indirectly through resource usage patterns.

Panel 2: Edge Node Heartbeat Latency

Monitor the time between heartbeats to detect connectivity issues.

# Query: Time since last heartbeat (approximation)
time() - timestamp(expanso_process_cpu_utilization_ratio{service_name="expanso-edge"})
  • Panel type: Time series
  • Title: Heartbeat Freshness (seconds)
  • Legend: {{node_hostname}}
  • Unit: seconds
  • Alert threshold: > 90 seconds (Disconnected state)

Panel 3: Process Memory Growth Rate

Detect memory leaks by tracking growth rate.

# Query: Memory growth in MB per hour
rate(expanso_process_memory_usage_bytes[1h]) * 3600 / 1024 / 1024
  • Panel type: Time series
  • Title: Memory Growth Rate (MB/hour)
  • Legend: {{service_name}} - {{node_hostname}}
  • Unit: MB/h

Panel 4: GC Pause Times

Monitor garbage collection impact on performance.

# Query: Last GC pause in milliseconds
expanso_go_gc_pause_last_seconds * 1000
  • Panel type: Time series
  • Title: GC Pause Time (ms)
  • Legend: {{service_name}}
  • Unit: milliseconds
  • Alert threshold: > 100ms

Panel 5: CPU Utilization Heatmap

Visualize CPU patterns across all nodes.

# Query
expanso_process_cpu_utilization_ratio * 100
  • Panel type: Heatmap
  • Title: CPU Utilization Heatmap
  • Legend: {{node_hostname}}
  • Color scheme: Green-Yellow-Red

Save this dashboard as "Expanso Job & Node Health".

Step 8: Set Up Alerts for Critical Issues

Let's configure Prometheus alerting rules for common edge infrastructure problems.

Create the alerts directory:

mkdir -p ~/expanso-monitoring/prometheus/alerts

Create alert rules for node health:

Download the alert rules:

mkdir -p prometheus/alerts
curl -o prometheus/alerts/node-health.yml https://docs.expanso.io/examples/monitoring/node-health-alerts.yml

Or view the alert rules

This includes alerts for:

  • Edge node disconnections (>2 minutes of missed metrics)
  • Orchestrator downtime
  • High CPU usage (>85% for 5 minutes)
  • High memory usage (>1GB for 5 minutes)
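
As a sketch of what one of these rules looks like (the threshold mirrors the CPU item above; label and annotation names are illustrative, and the downloaded file remains authoritative):

groups:
  - name: node_health
    rules:
      - alert: EdgeNodeHighCPU
        expr: expanso_process_cpu_utilization_ratio{service_name="expanso-edge"} > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.node_hostname }}"
          description: "Edge node CPU has been above 85% for 5 minutes."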

Create alert rules for resource exhaustion:

Download the resource limit alerts:

curl -o prometheus/alerts/resource-limits.yml https://docs.expanso.io/examples/monitoring/resource-limits-alerts.yml

Or view the alert rules

This includes alerts for:

  • High file descriptor usage (>80%)
  • Memory leak detection (>10MB/s growth)
  • Goroutine leaks (>5000 goroutines)
  • Excessive GC activity (>2 cycles/second)
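
For example, the file descriptor alert can be expressed with the same ratio used in the Grafana panel earlier. This is a sketch of a single rule that would sit under a rules: list like the one in the previous sketch, not the exact downloaded rule:

      - alert: HighFileDescriptorUsage
        expr: |
          (expanso_process_open_file_descriptors_ratio /
           expanso_process_max_file_descriptors_ratio) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "File descriptor usage above 80% on {{ $labels.node_hostname }}"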

Reload Prometheus configuration:

# Reload config without restarting
curl -X POST http://localhost:9091/-/reload

Verify alerts are loaded:

Navigate to http://localhost:9091/alerts in Prometheus UI. You should see all configured alerts listed.

Check alert states:

The alerts will show as:

  • Green (Inactive): No issues detected
  • Yellow (Pending): Issue detected; waiting for the rule's for duration to elapse
  • Red (Firing): Alert is active

Alternative: Import Pre-Built Critical Alerts

If you're using Grafana's unified alerting system instead of Prometheus Alertmanager, pre-built alert rules are available in the Expanso repository:

# Clone or navigate to the Expanso repository
git clone https://github.com/expanso-io/expanso.git
cd expanso/docs/grafana/alerts/

The critical-alerts-prometheus.yaml file includes 12 production-ready alerts for:

  • High error rates (orchestrator and edge)
  • High latency
  • Memory and file descriptor exhaustion
  • Pipeline and scheduler failures
  • Database operation failures
  • Service availability (orchestrator metrics reporting, NATS connectivity)
  • WebSocket proxy errors
  • Sentry error spikes

Import via Alerting → Alert rules → Import in Grafana, select "Prometheus-compatible YAML file", then configure notification policies for severity=critical and severity=warning labels.

Incident Response Runbooks

When alerts fire, they include a direct link to the relevant runbook via the runbook_url annotation. You can also access runbooks directly from the table below.

Available runbooks:

  • High Error Rate (Orchestrator): high-error-rate-orchestrator.md
  • High Error Rate (Edge): high-error-rate-edge.md
  • High Latency (Orchestrator): high-latency-orchestrator.md
  • High Memory (Orchestrator): high-memory-orchestrator.md
  • High File Descriptor Usage: high-file-descriptor-usage.md
  • Database Operation Failures: database-operation-failures.md
  • Pipeline Execution Failures: pipeline-execution-failures.md
  • Scheduler Processing Errors: scheduler-and-placement-issues.md
  • Job Placement Failures: scheduler-and-placement-issues.md
  • Orchestrator Not Reporting Metrics: orchestrator-not-reporting-metrics.md
  • Orchestrator NATS Disconnected: orchestrator-nats-disconnected.md
  • WebSocket Proxy Error Rate: websocket-proxy-error-rate.md
  • Sentry Error Spike: sentry-error-spike.md

Each runbook includes:

  • Quick checklist for initial triage
  • Diagnostic steps with PromQL queries and CLI commands
  • Remediation procedures for common root causes
  • Escalation paths when issues persist
Custom Runbooks

Use RUNBOOK_TEMPLATE.md to create runbooks for your own custom alerts.

Step 9: Access and Correlate Logs

Metrics tell you when something's wrong, but logs tell you why. Let's integrate log access into your monitoring workflow.

Accessing orchestrator logs:

# View recent orchestrator logs
expanso orchestrator logs --tail=100

# Follow logs in real-time
expanso orchestrator logs --follow

# Filter for errors
expanso orchestrator logs --tail=500 | grep -i error

# Filter for specific node
expanso orchestrator logs --tail=500 | grep node-a1b2c3d4e5f6

Accessing edge node logs:

# On the edge node machine
sudo journalctl -u expanso-edge -n 100

# Follow in real-time
sudo journalctl -u expanso-edge -f

# Filter by time range
sudo journalctl -u expanso-edge --since "10 minutes ago"

# Export to file for analysis
sudo journalctl -u expanso-edge --since "1 hour ago" > edge-debug.log

Accessing job execution logs:

# List executions for a job
expanso job executions syslog-processor

# Get logs from a specific execution
expanso execution logs exec-xyz789 --tail=200

# Follow execution logs in real-time
expanso execution logs exec-xyz789 --follow

Correlating metrics with logs:

When you see an alert fire in Grafana, here's how to debug:

  1. Note the timestamp from the Grafana panel
  2. Identify the affected node from the metric labels
  3. Query logs around that time:
# If orchestrator CPU spiked at 15:30
expanso orchestrator logs --since "15:25" --until "15:35" | grep -i cpu

# If edge node disconnected
ssh user@edge-node
sudo journalctl -u expanso-edge --since "15:25" --until "15:35"

Common log patterns to look for:

# Node connection issues
grep "connection refused\|timeout\|network error" edge-debug.log

# Job execution failures
grep "execution failed\|pipeline error\|fatal" orchestrator.log

# Resource exhaustion
grep "out of memory\|too many open files\|disk full" *.log

# Authentication problems
grep "unauthorized\|authentication failed\|invalid token" *.log

Correlating Stream IDs with Loki Logs

When you stream logs using the CLI, the orchestrator assigns a unique stream ID to that session. You'll see this ID in the CLI output, and it's automatically added as a stream_id label to all logs sent to Loki—making it easy to correlate your log requests with the actual logs in your observability platform.

Start streaming and you'll see the stream ID in the output:

$ expanso-cli job logs my-pipeline

Streaming logs for job `my-pipeline` on node `edge-node-abc123` (stream: stream-xyz789)...

That stream ID in parentheses (stream-xyz789) is now a label on every log from this stream.

Querying Logs by Stream ID:

Use the stream_id label to filter logs in Loki and see only the logs from a specific streaming session:

{stream_id="stream-xyz789"}

This is useful when you need to:

  • Filter out noise and see only the logs you requested from a specific session
  • Match the stream ID from an API response to logs in Loki
  • Verify that logs from a specific stream are reaching Loki
  • Track which logs were streamed to which users or systems

Example Workflow:

Start streaming logs and note the stream ID:

$ expanso-cli job logs my-pipeline
Streaming logs for job `my-pipeline` on node `edge-node-abc123` (stream: stream-xyz789)...

Query Loki for logs from that specific stream:

{stream_id="stream-xyz789"}

Combine with other labels for more precise filtering:

{stream_id="stream-xyz789", level="error"}

The stream_id label ensures that logs in Loki can always be traced back to the specific log streaming request that generated them.

Historical Log Streaming

Live log tailing is great for monitoring what's happening right now, but what about investigating issues that happened earlier? Maybe a job failed at 3am and you need to see what went wrong. Or you're doing compliance auditing and need to extract a specific time window of logs. That's where historical log queries come in.

Expanso lets you query logs from rotated log files, giving you access to past events for debugging, auditing, or analysis.

Three Streaming Modes

The log streaming API gives you flexibility in how you access logs. Depending on which start and end timestamp parameters you provide, you get different behavior:

  • Start not set, end not set: Live tailing (default). Stream logs from the current time forward.
  • Start set, end not set: Historical → live. Replay logs from a past timestamp, then continue live.
  • Start set, end set: Bounded historical. Query logs within a specific time window.

Use Cases

Live monitoring (default mode):

  • Watch logs as they happen in real-time
  • Monitor active jobs for errors or performance issues
  • Debug problems that are happening right now

Historical queries (bounded mode):

  • Investigate past incidents: "What happened at 3am when the job failed?"
  • Compliance auditing: Extract logs for a specific time period
  • Log analysis: Fetch historical data for trend analysis or troubleshooting

Historical replay → live (hybrid mode):

  • Catch up on missed logs after a network outage
  • Review recent history before continuing to monitor live

API Examples

Live tailing (default)

This is the simplest mode—just watch logs as they're written. Perfect for monitoring an active job:

curl -X POST https://api.expanso.io/api/v1/jobs/my-job/logs \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "node_id": "node-123"
  }'

Query last hour of logs

When you need to look at a specific time window, provide both start and end timestamps. This example pulls logs from exactly 1pm to 2pm:

curl -X POST https://api.expanso.io/api/v1/jobs/my-job/logs \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "node_id": "node-123",
    "start": "2025-01-09T13:00:00Z",
    "end": "2025-01-09T14:00:00Z"
  }'

Replay from 1 hour ago, then live tail

This hybrid mode is useful when you want to catch up on what you missed and then keep watching. It starts from a past timestamp and then transitions to live tailing:

curl -X POST https://api.expanso.io/api/v1/jobs/my-job/logs \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "node_id": "node-123",
    "start": "2025-01-09T13:00:00Z"
  }'

Filter by log level

You can combine time ranges with other filters like log level. This example pulls only error logs from a specific hour:

curl -X POST https://api.expanso.io/api/v1/jobs/my-job/logs \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "node_id": "node-123",
    "start": "2025-01-09T13:00:00Z",
    "end": "2025-01-09T14:00:00Z",
    "log_level": "error"
  }'

Time Range Validation

When you're querying historical logs, the API validates your time range to make sure it makes sense. If you provide both start and end, the end must be after start—otherwise you'll get a 400 Bad Request error.

Zero timestamps (when you omit start or end) are perfectly valid and tell the system to use default behavior. You can also span from the past into the future, or even provide just an end cutoff (though that's unusual).

Invalid Time Range

If you accidentally swap your start and end times, you'll get an error:

{
  "node_id": "node-123",
  "start": "2025-01-09T14:00:00Z",
  "end": "2025-01-09T13:00:00Z"
}

Returns 400 Bad Request: "end time must be after start time"

How It Works

Behind the scenes, Expanso uses different mechanisms depending on which mode you're using:

Live mode tails the current log file from the end, streaming new logs as they're written. It's like running tail -f on the log file.

Historical mode reads from rotated log files (like job.log.1, job.log.2) that match your time range. The system searches these files by timestamp and streams logs in chronological order.

Historical→live mode starts by reading historical logs from rotated files, then seamlessly transitions to live tailing of the current file once it catches up.

Step 10: Debug Common Issues Using Metrics

Let's walk through real debugging scenarios using the monitoring stack you've built.

Scenario 1: Edge Node Shows Intermittent Disconnections

Symptom: Grafana shows gaps in metrics for a specific edge node.

Investigation:

  1. Check the heartbeat freshness panel—you see periodic spikes to 120+ seconds
  2. Query Prometheus for network quality:
# Samples received in the last hour (fewer than expected indicates gaps in collection)
count_over_time(expanso_process_cpu_utilization_ratio{node_id="node-abc123"}[1h])
  1. SSH to the edge node and check network:
# Ping orchestrator
ping -c 10 orchestrator.example.com

# Check packet loss
mtr orchestrator.example.com

# Review network errors
sudo journalctl -u expanso-edge -n 200 | grep -i "connection\|network"

Common causes:

  • Unstable WiFi/cellular connection (increase heartbeat interval)
  • Firewall intermittently blocking the OTLP collector port (4317)
  • Network congestion during peak hours

Solution: Increase heartbeat tolerance or improve network infrastructure.

Scenario 2: Orchestrator Memory Usage Growing Continuously

Symptom: Memory growth rate panel shows consistent positive trend.

Investigation:

  1. Check Go heap metrics in Grafana:
# Heap allocation over time
expanso_go_memory_heap_alloc_bytes{service_name="expanso-orchestrator"}

# Memory growth rate
rate(expanso_go_memory_heap_alloc_bytes[1h])
  1. Compare heap vs. system memory:
# If heap is stable but RSS grows, it's a different issue
expanso_process_memory_usage_bytes{service_name="expanso-orchestrator"} -
expanso_go_memory_heap_alloc_bytes{service_name="expanso-orchestrator"}
  1. Check goroutine count for leaks:
expanso_go_goroutines_count{service_name="expanso-orchestrator"}
  1. If goroutines are growing, get a goroutine dump:
# Enable pprof endpoint (if configured)
curl http://localhost:6060/debug/pprof/goroutine?debug=1 > goroutines.txt

# Analyze the dump for stuck goroutines
grep -A 5 "created by" goroutines.txt | sort | uniq -c | sort -rn

Common causes:

  • Goroutine leaks from unclosed connections
  • Cached data not being evicted
  • Large number of historical records in state store

Solution: Review the goroutine dump, fix leaks, or implement cache size limits.

Scenario 3: Job Execution Not Processing Data

Symptom: Job is marked as "running" but no output is produced.

Investigation:

  1. Check if the edge node is connected:
expanso node list
  1. Get execution details:
expanso execution describe exec-xyz789
  1. Check execution logs for errors:
expanso execution logs exec-xyz789 --tail=100
  1. Look for resource constraints on the edge node:
# CPU at limits?
expanso_process_cpu_utilization_ratio{node_id="node-abc123"} > 0.95

# Memory at limits?
expanso_process_memory_usage_bytes{node_id="node-abc123"}
  1. SSH to edge node and check local pipeline status:
# Check if pipeline process is running
expanso-edge status

# Check local buffer for backed-up data
ls -lh /var/lib/expanso/buffer/

Common causes:

  • Input source (file, network) is not producing data
  • Pipeline filter is dropping all messages
  • Output destination is unreachable (network partition)
  • Resource limits preventing processing

Solution: Review pipeline configuration, check input/output connectivity, verify resource allocation.

Step 11: Track Deployment Progress

When you roll out job updates, monitoring deployment progress prevents issues from affecting your entire fleet.

Monitor rolling deployment progress:

# Start a rolling deployment
expanso job deploy updated-job.yaml

# Watch deployment status
watch -n 2 'expanso deployment list'

In Grafana, create a deployment tracking panel:

# Count nodes by execution version
count by (version) (
# This requires custom metrics from the orchestrator
# For now, monitor via API or CLI
)

Best practices for deployment monitoring:

  1. Watch for failures during rollout:
# Monitor execution health during deployment
watch -n 5 'expanso job executions my-job | grep -E "failed|error"'
  1. Verify new version before proceeding:
# Check health of new version executions
expanso job executions my-job --version=2 | grep healthy
  1. Roll back if issues detected:
# If new version shows problems, roll back to previous version
expanso-cli job rollback my-job

Create a deployment dashboard panel:

Add to your Grafana dashboard to show deployment activity:

# Recent configuration changes (approximation via process restarts)
changes(expanso_process_cpu_utilization_ratio[5m])

Step 12: Create Custom Alerts for Your Workload

Generic alerts are a start, but you'll want workload-specific alerts based on your edge computing use case.

Example: Log Processing Pipeline Alerts

If you're processing syslog data:

prometheus/alerts/syslog-pipeline.yml
groups:
  - name: syslog_pipeline
    interval: 30s
    rules:
      # Alert if log processing throughput drops
      - alert: LowLogThroughput
        expr: |
          # This requires custom metrics from your pipeline
          # Placeholder: monitor average CPU as a proxy for activity
          avg(avg_over_time(expanso_process_cpu_utilization_ratio{
            service_name="expanso-edge"
          }[5m])) < 0.05
        for: 10m
        labels:
          severity: warning
          workload: syslog
        annotations:
          summary: "Log processing throughput is low"
          description: "Average edge node CPU usage is below 5%, indicating low log volume or processing issues"

Example: IoT Data Collection Alerts

If you're collecting sensor data:

prometheus/alerts/iot-collection.yml
groups:
  - name: iot_data_collection
    interval: 30s
    rules:
      # Alert if edge nodes stop collecting data
      - alert: DataCollectionStalled
        expr: |
          # Monitor memory growth as a proxy for data buffering
          delta(expanso_process_memory_usage_bytes{
            service_name="expanso-edge"
          }[10m]) == 0
        for: 15m
        labels:
          severity: warning
          workload: iot
        annotations:
          summary: "Data collection may have stalled on {{ $labels.node_hostname }}"
          description: "Edge node memory is not growing, which may indicate no data is being buffered"

Reload Prometheus to apply custom alerts:

curl -X POST http://localhost:9091/-/reload

Verification Checklist

Let's verify your complete monitoring setup:

  • ✅ Orchestrator configured to export metrics via OTLP
  • ✅ Edge nodes configured to export metrics via OTLP
  • ✅ OpenTelemetry Collector receiving metrics from all components
  • ✅ Prometheus scraping metrics from OTLP Collector
  • ✅ Grafana connected to Prometheus data source
  • ✅ Fleet Overview dashboard created with key health metrics
  • ✅ Job & Node Health dashboard created
  • ✅ Alert rules configured for node disconnections
  • ✅ Alert rules configured for resource exhaustion
  • ✅ Alerts visible in Prometheus UI
  • ✅ Log access configured for orchestrator and edge nodes
  • ✅ Able to correlate metrics with logs for debugging
  • ✅ Tested at least one debugging scenario
  • ✅ Custom workload-specific alerts configured

If all items are checked, congratulations! You have production-grade monitoring for your edge fleet.

What You Learned

You've built a comprehensive monitoring stack for Expanso:

  • ✅ Configured OpenTelemetry metrics export from orchestrator and edge nodes
  • ✅ Deployed and configured Prometheus for metrics collection
  • ✅ Created Grafana dashboards for fleet health visualization
  • ✅ Set up process metrics (CPU, memory, file descriptors)
  • ✅ Enabled Go runtime metrics (GC, goroutines, heap)
  • ✅ Configured alerts for critical issues (node disconnections, resource exhaustion)
  • ✅ Integrated log access for debugging
  • ✅ Practiced debugging common edge infrastructure problems
  • ✅ Monitored deployment progress and health
  • ✅ Created custom workload-specific alerts

Key Concepts

OpenTelemetry (OTLP): Industry-standard protocol for exporting telemetry data (metrics, traces, logs). Expanso uses OTLP to send metrics to collectors, making it compatible with any OTLP-compatible backend.

Process Metrics: OS-level metrics about running processes including CPU utilization, memory usage (RSS and virtual), and file descriptor counts. Always enabled in Expanso.

Go Runtime Metrics: Go-specific metrics about garbage collection, goroutine counts, heap allocation, and memory management. Optional in Expanso, enabled via include_go_metrics: true.

Heartbeats: Periodic health reports from edge nodes to the orchestrator (every 30 seconds by default). Missing heartbeats indicate connectivity issues or node failures.

Metric Cardinality: The number of unique time series in Prometheus. High cardinality (many unique label combinations) increases storage and query costs. Monitor cardinality in production deployments.

PromQL (Prometheus Query Language): Query language for retrieving and aggregating metrics from Prometheus. Supports functions like rate(), increase(), histogram_quantile() for analysis.
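
For example, assuming the websocket_proxy_connections_duration histogram follows the standard Prometheus _bucket naming convention, the p95 connection duration over the last five minutes can be computed like this:

histogram_quantile(0.95, sum by (le) (rate(websocket_proxy_connections_duration_bucket[5m])))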

Alert Rules: Prometheus expressions that evaluate to true/false. When an alert rule evaluates to true for the specified for duration, the alert fires.

Observability Pillars: Metrics (what's happening), logs (why it's happening), traces (how it's happening). This tutorial covered metrics and logs; tracing is covered in advanced guides.

Next Steps

Now that you have comprehensive monitoring, here's what to explore next:

  • Advanced Monitoring
  • Alerting and Incident Response
  • Performance Optimization
  • Production Operations
  • Architecture Understanding

Node Disconnection and Recovery

Edge nodes can lose connectivity due to network issues, maintenance, or resource problems. Expanso handles this automatically.

Connection States

Expanso tracks four connection states:

  • Connecting: Node completed handshake and is establishing a stable connection
  • Connected: Node is sending heartbeats and can receive new work
  • Disconnected: Node missed heartbeats for 90 seconds but may recover
  • Lost: Node has been offline for 1 hour and is considered gone

New work only goes to Connected nodes. Nodes in the Connecting state will automatically transition to Connected once they've established a stable connection.

What Happens to Jobs

Daemon jobs (long-running pipelines):

  • Running executions stay as-is and may recover when the node reconnects
  • Pending executions are marked Failed immediately
  • After 1 hour offline, all executions are marked Failed and rescheduled to healthy nodes

Ops jobs (one-shot tasks):

  • Executions are marked Failed after 90 seconds
  • New executions are scheduled automatically when the node reconnects

Automatic Recovery

When a disconnected node reconnects:

  1. The node resumes sending heartbeats
  2. It transitions back to Connected state
  3. For daemon jobs, running executions continue processing
  4. For ops jobs, new executions are scheduled automatically

You don't need to redeploy jobs—the system handles recovery automatically.

Verify Node Reconnection

After a node comes back online, verify it's receiving work:

# Check node connection state
expanso-cli node list
expanso-cli node describe <node-id>

Look for ConnectionState: Connected in the output.

Check that executions are running:

# List executions on the node
expanso-cli node executions <node-id>

You should see executions progressing from Pending → Starting → Running.

Stuck Pending Executions

Rarely, executions get stuck in Pending state if a node disconnects at the wrong moment during message delivery.

Signs of stuck executions:

  • Executions show Pending state for more than 2 minutes
  • Jobs show "deploying" state for extended periods

Check for stuck executions:

expanso-cli job executions <job-name>
expanso-cli execution describe <execution-id>

Workaround:

Redeploy the job:

expanso-cli job stop <job-name>
expanso-cli job delete <job-name>
# Wait 30 seconds
expanso-cli job deploy my-pipeline.yaml

Troubleshoot Reconnection Issues

If a node reconnects but pipelines aren't running:

Verify the node is connected:

expanso-cli node describe <node-id>
# Look for ConnectionState: Connected and recent LastHeartbeat

If disconnected, check network connectivity and that the edge agent is running:

systemctl status expanso-edge

Check for executions:

expanso-cli job executions <job-name>
expanso-cli node executions <node-id>

Common issues:

  • Resource constraints: Node doesn't have enough CPU/memory
  • Bootstrap token expired: Re-bootstrap the node
  • Label mismatch: Node labels don't match job selector

Troubleshooting

Metrics Not Appearing in Prometheus

Symptom: Prometheus UI shows no metrics with expanso_ prefix.

Diagnosis:

# Check OTLP collector is receiving data
docker compose logs otel-collector | grep -i "received"

# Check Prometheus is scraping successfully
curl http://localhost:9091/api/v1/targets | jq

Common causes:

  1. OTLP collector not reachable: Verify network connectivity from orchestrator/edge nodes
  2. Telemetry disabled: Check do_not_track: false in configs
  3. Wrong endpoint: Verify endpoint matches OTLP collector address
  4. Firewall blocking: Ensure port 4317 (gRPC) or 4318 (HTTP) is open

Solution:

# Test connectivity from orchestrator
nc -zv localhost 4317

# Check telemetry logs
sudo journalctl -u expanso-orchestrator | grep -i telemetry

# Restart with debug logging
# In config: logging.level = "debug"
sudo systemctl restart expanso-orchestrator

Alerts Not Firing When Expected

Symptom: Alert conditions are met but alerts don't show as firing.

Diagnosis:

# Check alert rule syntax
curl http://localhost:9091/api/v1/rules | jq

# Manually evaluate alert expression
# In Prometheus UI, run the alert's expr query

Common causes:

  1. for duration not elapsed: Alert is pending, wait for full duration
  2. Syntax error in rule: Check YAML indentation and PromQL syntax
  3. Label mismatch: Alert expr doesn't match any actual metrics
  4. Prometheus not reloaded: Alert rules weren't reloaded after changes

Solution:

# Validate alert rule syntax
promtool check rules prometheus/alerts/*.yml

# Reload Prometheus config
curl -X POST http://localhost:9091/-/reload

# Check Prometheus logs for errors
docker compose logs prometheus | grep -i error

Grafana Shows "No Data"

Symptom: Grafana panels show "No data" despite Prometheus having metrics.

Diagnosis:

  1. Test the query in Prometheus UI first
  2. Check the time range in Grafana (top-right)
  3. Verify data source is correctly configured

Common causes:

  1. Query syntax error: PromQL syntax differs between Prometheus and Grafana
  2. Time range mismatch: Data exists but not in selected time range
  3. Data source issue: Grafana can't reach Prometheus
  4. Legend template error: Invalid label syntax in legend

Solution:

# Test Prometheus data source in Grafana
# Settings → Data Sources → Expanso Prometheus → Save & Test

# Check Grafana logs
docker compose logs grafana | grep -i error

# Verify Prometheus is reachable from Grafana container
docker compose exec grafana curl http://prometheus:9090/api/v1/query?query=up

High Metric Cardinality Warning

Symptom: Prometheus logs warn about high cardinality or memory usage.

Diagnosis:

# Count unique time series
count({__name__=~"expanso_.*"})

# Find metrics with most series
topk(10, count by (__name__) ({__name__=~"expanso_.*"}))

# Identify high-cardinality labels
count by (node_id, service_name) ({__name__=~"expanso_.*"})

Common causes:

  1. Many unique node IDs: Normal for large fleets, consider sampling
  2. High-cardinality labels: Labels with many unique values (timestamps, IDs)
  3. Metrics explosion: New metrics added without considering cardinality impact

Solution:

# In Prometheus config, add metric relabeling to drop high-cardinality labels
scrape_configs:
  - job_name: 'expanso-metrics'
    metric_relabel_configs:
      # Drop the execution_id label if it is not needed (labeldrop matches label names by regex)
      - regex: execution_id
        action: labeldrop

Edge Node Metrics Delayed or Inconsistent

Symptom: Some edge nodes show stale metrics or gaps in data.

Diagnosis:

# Check edge node telemetry logs
ssh user@edge-node
sudo journalctl -u expanso-edge | grep -i "telemetry\|export"

# Check network latency to collector
ping -c 10 orchestrator.example.com

Common causes:

  1. Network latency: High latency to OTLP collector
  2. Export interval too short: Increase export_interval for edge nodes
  3. Collector overload: OTLP collector can't handle export rate
  4. Batch size too small: Increase batch size in OTLP collector config

Solution:

# In the edge node config, increase the export interval for unstable networks
telemetry:
  export_interval: "60s"    # increase from 15s to 60s

# In the OTLP collector config, increase the batch size
processors:
  batch:
    timeout: 30s
    send_batch_size: 2048   # increase from 1024

Need More Help?

If you're still experiencing issues:

  1. Enable debug logging in all components:

    logging:
      level: debug
  2. Collect diagnostic information:

    # Orchestrator diagnostics
    expanso orchestrator logs --tail=500 > orch-debug.log

    # Prometheus diagnostics
    curl http://localhost:9091/api/v1/status/tsdb > prom-status.json

    # OTLP collector diagnostics
    docker compose logs otel-collector > otel-debug.log
  3. Check the documentation:

  4. Community support: