OpenTelemetry Metrics Export
Monitor your Expanso Edge nodes by exporting metrics to your observability platform using OpenTelemetry (OTLP). Track pipeline performance, resource usage, and system health across your entire edge fleet.
Architecture
Expanso Edge uses OpenTelemetry Protocol (OTLP) to push metrics to a collector, which then exports to your monitoring backend:
┌─────────────────┐      OTLP       ┌──────────────────┐
│  Expanso Edge   │ ──────────────> │  OpenTelemetry   │
│      Node       │  (push, gRPC)   │    Collector     │
└─────────────────┘                 └──────────────────┘
                                              │
                        ┌─────────────────────┼─────────────────────┐
                        │                     │                     │
                        ▼                     ▼                     ▼
                  ┌──────────┐          ┌─────────┐           ┌──────────┐
                  │Prometheus│          │ Grafana │           │ Datadog  │
                  │          │          │  Cloud  │           │          │
                  └──────────┘          └─────────┘           └──────────┘
Why OTLP instead of Prometheus scraping?
- ✅ Works through firewalls/NAT (push, not pull)
- ✅ Single metrics pipeline for multiple backends
- ✅ No inbound ports required on edge nodes
- ✅ Centralized collector for filtering/routing
Quick Start
1. Configure Edge Node
Enable telemetry export in your edge configuration:
name: edge-node-1
data_dir: /var/lib/expanso-edge

# Enable telemetry export
telemetry:
  # OpenTelemetry Collector endpoint
  endpoint: "otel-collector.example.com:4317"
  protocol: grpc
  export_interval: 30s

  # Include Go runtime metrics
  include_go_metrics: true
  process_metrics_interval: 15s

  # Tag all metrics with these attributes
  resource_attributes:
    service.name: "expanso-edge"
    environment: "production"
    region: "us-west-2"
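To pick up the new settings, restart the node and watch its logs for export activity. A minimal sketch, assuming the systemd unit name used under Troubleshooting below; adjust the command for your install method:

sudo systemctl restart expanso-edge
journalctl -u expanso-edge -f | grep -i telemetry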
2. Deploy OpenTelemetry Collector
Create a collector config that receives OTLP from edge nodes and forwards metrics to your backend:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

exporters:
  # Export to Prometheus via remote write
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
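For a quick local test, the collector can run as a container with this config mounted in. A sketch using the same image as the Docker Compose example later in this page:

docker run -d --name otel-collector \
  -p 4317:4317 -p 4318:4318 \
  -v "$PWD/otel-collector-config.yaml:/etc/otel-collector-config.yaml" \
  otel/opentelemetry-collector-contrib:latest \
  --config=/etc/otel-collector-config.yaml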
3. Query Metrics
Once in Prometheus, query edge node metrics:
# Pipeline readiness time
histogram_quantile(0.95, rate(pipeline_readiness_duration_bucket[5m]))
# Memory usage in MB
process_memory_usage / 1024 / 1024
# Pipeline errors
rate(pipeline_orchestration_errors_total[5m])
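The same queries work against the Prometheus HTTP API, which is handy for scripted checks. A sketch, assuming Prometheus is reachable on localhost:9090:

# Instant query for the pipeline error rate
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(pipeline_orchestration_errors_total[5m])'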
Available Metrics
Pipeline Metrics
Metrics from data pipeline execution:
| Metric | Type | Description |
|---|---|---|
| pipeline.readiness.duration | Histogram | Time for pipeline to become ready during startup |
| pipeline.orchestration.errors | Counter | Number of pipeline orchestration errors |
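The error counter is a natural alerting input. A Prometheus alerting-rule sketch using the Prometheus-style names from Quick Start; the threshold and duration are illustrative, not recommendations:

groups:
  - name: expanso-edge-pipelines
    rules:
      - alert: PipelineOrchestrationErrors
        expr: rate(pipeline_orchestration_errors_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pipeline orchestration errors on {{ $labels.instance }}"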
Process Metrics
System resource metrics (collected automatically):
| Metric | Type | Description |
|---|---|---|
| process.cpu.time | Counter | CPU time consumed by the process |
| process.memory.usage | Gauge | Resident memory size (RSS) |
| process.memory.virtual | Gauge | Virtual memory size |
| process.open_file_descriptors | Gauge | Number of open file descriptors (Unix/Linux) |
Platform support: Linux, macOS, Windows
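The descriptor gauge is useful for catching leaks before a node hits its limit. A hedged example, assuming the common 1024 soft limit (check ulimit -n on your nodes):

# Nodes using more than ~90% of an assumed 1024 fd soft limit
process_open_file_descriptors > 900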
Go Runtime Metrics
When include_go_metrics: true:
| Metric | Type | Description |
|---|---|---|
| runtime.go.goroutines | Gauge | Number of active goroutines |
| runtime.go.gc.count | Counter | Garbage collection count |
| runtime.go.gc.pause_ns | Histogram | GC pause duration |
| runtime.go.mem.heap_alloc | Gauge | Bytes allocated on heap |
| runtime.go.mem.heap_sys | Gauge | Heap system memory |
| runtime.go.mem.heap_idle | Gauge | Idle heap memory |
| runtime.go.mem.heap_inuse | Gauge | In-use heap memory |
| runtime.go.mem.heap_released | Gauge | Released heap memory |
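These are most useful when chasing memory leaks or GC stalls. Example Prometheus queries, assuming the usual dot-to-underscore name rewrite (see Metrics Have Wrong Names under Troubleshooting):

# p99 GC pause over the last 5 minutes
histogram_quantile(0.99, rate(runtime_go_gc_pause_ns_bucket[5m]))

# In-use heap, MB
runtime_go_mem_heap_inuse / 1024 / 1024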
Configuration Reference
Complete Telemetry Config
telemetry:
  # Disable all telemetry (default: false)
  do_not_track: false

  # Collector endpoint (required)
  endpoint: "collector.example.com:4317"

  # Optional path under endpoint (e.g., "/v1/metrics")
  endpoint_path: ""

  # Protocol: "grpc" (recommended, port 4317) or "http" (port 4318)
  protocol: grpc

  # Skip TLS verification (NOT recommended for production)
  insecure: false

  # How often to export metrics (default: 30s)
  export_interval: 30s

  # Custom headers for authentication
  headers:
    Authorization: "Bearer your-api-token"
    X-Custom-Header: "value"

  # Resource attributes (tags/labels applied to all metrics)
  resource_attributes:
    service.name: "expanso-edge"
    service.version: "1.0.0"
    deployment.environment: "production"
    cloud.region: "us-west-2"
    cloud.availability_zone: "us-west-2a"
    host.name: "${HOSTNAME}"

  # Include Go runtime metrics (default: false)
  include_go_metrics: true

  # Process metrics collection interval (default: 15s)
  process_metrics_interval: 15s

  # Alternative authentication config
  authentication:
    type: "Bearer"  # or "Basic"
    token: "your-bearer-token"
    namespace: "production"
    tenant: "acme-corp"
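The ${HOSTNAME} reference above implies these values can come from the process environment. A sketch for supplying them under systemd (unit name assumed from Troubleshooting; the drop-in path is the standard systemd layout, and real tokens should not live in world-readable files):

# /etc/systemd/system/expanso-edge.service.d/telemetry.conf
[Service]
Environment=HOSTNAME=edge-node-1
Environment=OTEL_TOKEN=your-api-token

Then reload and restart: sudo systemctl daemon-reload && sudo systemctl restart expanso-edge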
Authentication
Method 1: Headers (recommended)
telemetry:
  endpoint: "collector.example.com:4317"
  protocol: grpc
  headers:
    Authorization: "Bearer ${OTEL_TOKEN}"
Method 2: Authentication Config
telemetry:
  endpoint: "collector.example.com:4317"
  protocol: grpc
  authentication:
    type: "Bearer"
    token: "${OTEL_TOKEN}"
    namespace: "production"
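On the collector side, the matching check can be enforced with the bearertokenauth extension from the collector-contrib distribution, which rejects OTLP requests whose bearer token doesn't match. A sketch (exporters as configured in Quick Start):

extensions:
  bearertokenauth:
    token: "${OTEL_TOKEN}"

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
        auth:
          authenticator: bearertokenauth

service:
  extensions: [bearertokenauth]
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]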
Monitoring Backend Setup
Prometheus + Grafana
OpenTelemetry Collector Config:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"

exporters:
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"
    external_labels:
      cluster: "edge-fleet-1"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
Prometheus Config:
global:
  scrape_interval: 15s

# Enable the remote write receiver by starting Prometheus with:
#   --web.enable-remote-write-receiver
Grafana Dashboard Queries:
# Pipeline readiness (p95)
histogram_quantile(0.95,
  rate(pipeline_readiness_duration_bucket{service_name="expanso-edge"}[5m])
)
# Memory usage per node
process_memory_usage{service_name="expanso-edge"} / 1024 / 1024
# Pipeline error rate
rate(pipeline_orchestration_errors_total[5m])
# CPU usage percentage
rate(process_cpu_time{service_name="expanso-edge"}[5m]) * 100
# Goroutines (if Go metrics enabled)
runtime_go_goroutines{service_name="expanso-edge"}
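To avoid clicking through the UI, the Prometheus data source can be provisioned when Grafana starts. A minimal sketch using Grafana's standard provisioning layout (file path is an assumption for your deployment):

# grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true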
Grafana Cloud
exporters:
  prometheusremotewrite:
    endpoint: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
    headers:
      Authorization: "Bearer ${GRAFANA_CLOUD_API_KEY}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
Datadog
exporters:
  datadog:
    api:
      key: "${DD_API_KEY}"
      site: datadoghq.com

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [datadog]
Elastic (ELK Stack)
exporters:
  otlp/elastic:
    endpoint: "https://elastic-apm-server:8200"
    headers:
      Authorization: "Bearer ${ELASTIC_APM_TOKEN}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp/elastic]
New Relic
exporters:
  otlp/newrelic:
    endpoint: "https://otlp.nr-data.net:4317"
    headers:
      api-key: "${NEW_RELIC_LICENSE_KEY}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp/newrelic]
Honeycomb
exporters:
  otlp/honeycomb:
    endpoint: "api.honeycomb.io:443"
    headers:
      x-honeycomb-team: "${HONEYCOMB_API_KEY}"
      # Honeycomb routes metrics by dataset; set one if your account requires it
      # (dataset name below is a placeholder)
      x-honeycomb-dataset: "expanso-edge-metrics"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp/honeycomb]
Docker Compose Example
Complete monitoring stack with edge node, collector, Prometheus, and Grafana:
version: '3.8'

services:
  expanso-edge:
    image: ghcr.io/expanso-io/expanso-edge:latest
    environment:
      - EXPANSO_EDGE_NAME=edge-docker-1
      - HOSTNAME=edge-docker-1
    volumes:
      - ./edge-config.yaml:/etc/expanso/config.yaml
      - edge-data:/var/lib/expanso-edge
    depends_on:
      - otel-collector

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    ports:
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml

  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-remote-write-receiver'
    ports:
      - "9090:9090"
    volumes:
      - prometheus-data:/prometheus

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  edge-data:
  prometheus-data:
  grafana-data:
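Bring the stack up and confirm each piece is listening; ports follow the mappings above:

docker compose up -d
# Grafana:    http://localhost:3000 (admin / admin)
# Prometheus: http://localhost:9090
# Collector:  OTLP on localhost:4317 (gRPC) and localhost:4318 (HTTP)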
Troubleshooting
No Metrics Appearing
- Enable debug logging on edge node:
log:
  level: debug
  format: json
Check logs for telemetry export attempts:
journalctl -u expanso-edge | grep -i telemetry
# or
docker logs expanso-edge | grep -i telemetry
- Verify collector is receiving:
docker logs otel-collector
# Look for: "OTLP receiver started"
- Test connectivity:
# From edge node, verify collector is reachable
telnet collector.example.com 4317
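If metrics reach the collector but never appear downstream, a temporary debug exporter makes the collector print everything it receives. A sketch for recent collector-contrib releases (older releases call this exporter logging):

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [debug]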
Authentication Errors
Look for authentication failures in logs:
"error": "failed to export metrics: rpc error: code = Unauthenticated"
Verify:
- Token/API key is correct
- Headers are properly formatted
- Authentication type matches collector config
Metrics Have Wrong Names
Backends often rewrite OTLP metric names. Prometheus, for example, replaces dots with underscores, so pipeline.readiness.duration is queried as pipeline_readiness_duration (as in the examples above). Check your monitoring backend's OTLP documentation for its exact transformations.
High Memory Usage
If Go metrics are enabled and memory usage is high:
telemetry:
  include_go_metrics: false      # Disable Go runtime metrics
  process_metrics_interval: 30s  # Reduce collection frequency
Security Best Practices
- Always use TLS in production:

  telemetry:
    endpoint: "collector.example.com:4317"
    insecure: false  # Verify TLS certificates

- Use authentication:

  telemetry:
    headers:
      Authorization: "Bearer ${OTEL_TOKEN}"

- Network isolation:
  - Keep the collector on a private network
  - Use firewall rules to restrict access to ports 4317/4318

- Rotate credentials regularly:
  - Use environment variables for tokens
  - Implement a token rotation policy

- Limit resource attributes:
  - Don't include sensitive data in attributes
  - Keep cardinality reasonable
Performance Considerations
Export Interval
telemetry:
  # Lower = more frequent updates, higher overhead
  # Higher = less overhead, delayed metrics
  export_interval: 30s  # Good default

  # High-frequency (more overhead):
  # export_interval: 10s

  # Low-frequency (less overhead):
  # export_interval: 60s
Process Metrics Interval
telemetry:
  # How often to collect process metrics
  process_metrics_interval: 15s  # Default

  # Reduce overhead with less frequent collection:
  # process_metrics_interval: 30s
Go Metrics
telemetry:
  # Disable if not needed (reduces overhead):
  include_go_metrics: false

  # Enable only when debugging memory/GC issues:
  # include_go_metrics: true