OpenTelemetry Metrics Export

Monitor your Expanso Edge nodes by exporting metrics to your observability platform using OpenTelemetry (OTLP). Track pipeline performance, resource usage, and system health across your entire edge fleet.

Architecture

Expanso Edge uses OpenTelemetry Protocol (OTLP) to push metrics to a collector, which then exports to your monitoring backend:

┌─────────────────┐    OTLP (push, gRPC)    ┌──────────────────┐
│  Expanso Edge   │ ──────────────────────> │  OpenTelemetry   │
│      Node       │                         │    Collector     │
└─────────────────┘                         └──────────────────┘
                                                     │
                             ┌───────────────────────┼───────────────────────┐
                             │                       │                       │
                             ▼                       ▼                       ▼
                        ┌──────────┐           ┌─────────┐            ┌──────────┐
                        │Prometheus│           │ Grafana │            │ Datadog  │
                        │          │           │  Cloud  │            │          │
                        └──────────┘           └─────────┘            └──────────┘

Why OTLP instead of Prometheus scraping?

  • ✅ Works through firewalls/NAT (push, not pull)
  • ✅ Single metrics pipeline for multiple backends (see the sketch below)
  • ✅ No inbound ports required on edge nodes
  • ✅ Centralized collector for filtering/routing
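
For example, "single metrics pipeline for multiple backends" means one collector pipeline can fan metrics out to several exporters at once. A minimal sketch of the collector's service section (exporter configuration omitted; both exporters are covered later in this guide):

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite, datadog]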

Quick Start

1. Configure Edge Node

Enable telemetry export in your edge configuration:

edge-config.yaml
name: edge-node-1
data_dir: /var/lib/expanso-edge

# Enable telemetry export
telemetry:
  # OpenTelemetry Collector endpoint
  endpoint: "otel-collector.example.com:4317"
  protocol: grpc
  export_interval: 30s

  # Include Go runtime metrics
  include_go_metrics: true
  process_metrics_interval: 15s

  # Tag all metrics with these attributes
  resource_attributes:
    service.name: "expanso-edge"
    environment: "production"
    region: "us-west-2"
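
After editing the config, restart the edge node so the telemetry settings take effect. A minimal sketch, assuming the node runs as the expanso-edge systemd unit referenced in the Troubleshooting section:

sudo systemctl restart expanso-edge

# confirm the exporter is running
journalctl -u expanso-edge | grep -i telemetry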

2. Deploy OpenTelemetry Collector

otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

exporters:
  # Export to Prometheus via remote write
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
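
To try the collector locally before a full deployment, you can run the contrib image (the same one used in the Docker Compose example below) with this config:

docker run -d --name otel-collector \
  -p 4317:4317 -p 4318:4318 \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otel-collector-config.yaml" \
  otel/opentelemetry-collector-contrib:latest \
  --config=/etc/otel-collector-config.yaml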

3. Query Metrics

Once in Prometheus, query edge node metrics:

# Pipeline readiness time
histogram_quantile(0.95, rate(pipeline_readiness_duration_bucket[5m]))

# Memory usage in MB
process_memory_usage / 1024 / 1024

# Pipeline errors
rate(pipeline_orchestration_errors_total[5m])
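
The same queries work against the Prometheus HTTP API, which is useful for scripted checks; for example, assuming Prometheus is reachable on localhost:9090:

curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(pipeline_orchestration_errors_total[5m])'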

Available Metrics

Pipeline Metrics

Metrics from data pipeline execution:

Metric                         Type       Description
pipeline.readiness.duration    Histogram  Time for pipeline to become ready during startup
pipeline.orchestration.errors  Counter    Number of pipeline orchestration errors

Process Metrics

System resource metrics (collected automatically):

Metric                         Type     Description
process.cpu.time               Counter  CPU time consumed by the process
process.memory.usage           Gauge    Resident memory size (RSS)
process.memory.virtual         Gauge    Virtual memory size
process.open_file_descriptors  Gauge    Number of open file descriptors (Unix/Linux)

Platform support: Linux, macOS, Windows

Go Runtime Metrics

When include_go_metrics: true:

Metric                        Type       Description
runtime.go.goroutines         Gauge      Number of active goroutines
runtime.go.gc.count           Counter    Garbage collection count
runtime.go.gc.pause_ns        Histogram  GC pause duration
runtime.go.mem.heap_alloc     Gauge      Bytes allocated on heap
runtime.go.mem.heap_sys       Gauge      Heap system memory
runtime.go.mem.heap_idle      Gauge      Idle heap memory
runtime.go.mem.heap_inuse     Gauge      In-use heap memory
runtime.go.mem.heap_released  Gauge      Released heap memory
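
These can be queried like the pipeline metrics above, assuming the same dot-to-underscore name translation shown in the Quick Start queries; for example:

# GC pause p99 (if Go metrics enabled)
histogram_quantile(0.99, rate(runtime_go_gc_pause_ns_bucket[5m]))

# Goroutines per node
runtime_go_goroutines{service_name="expanso-edge"}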

Configuration Reference

Complete Telemetry Config

edge-config.yaml
telemetry:
  # Disable all telemetry (default: false)
  do_not_track: false

  # Collector endpoint (required)
  endpoint: "collector.example.com:4317"

  # Optional path under endpoint (e.g., "/v1/metrics")
  endpoint_path: ""

  # Protocol: "grpc" (recommended, port 4317) or "http" (port 4318)
  protocol: grpc

  # Skip TLS verification (NOT recommended for production)
  insecure: false

  # How often to export metrics (default: 30s)
  export_interval: 30s

  # Custom headers for authentication
  headers:
    Authorization: "Bearer your-api-token"
    X-Custom-Header: "value"

  # Resource attributes (tags/labels applied to all metrics)
  resource_attributes:
    service.name: "expanso-edge"
    service.version: "1.0.0"
    deployment.environment: "production"
    cloud.region: "us-west-2"
    cloud.availability_zone: "us-west-2a"
    host.name: "${HOSTNAME}"

  # Include Go runtime metrics (default: false)
  include_go_metrics: true

  # Process metrics collection interval (default: 15s)
  process_metrics_interval: 15s

  # Alternative authentication config
  authentication:
    type: "Bearer"  # or "Basic"
    token: "your-bearer-token"
    namespace: "production"
    tenant: "acme-corp"

Authentication

Method 1: Headers (recommended)

telemetry:
  endpoint: "collector.example.com:4317"
  protocol: grpc
  headers:
    Authorization: "Bearer ${OTEL_TOKEN}"

Method 2: Authentication Config

telemetry:
  endpoint: "collector.example.com:4317"
  protocol: grpc
  authentication:
    type: "Bearer"
    token: "${OTEL_TOKEN}"
    namespace: "production"
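
With either method, ${OTEL_TOKEN} keeps the credential out of the config file. If the edge node runs under Docker Compose (as in the example below), the token can be passed through from the host environment; a sketch:

# docker-compose.yml (fragment)
services:
  expanso-edge:
    environment:
      # read from the host environment at `docker compose up` time
      - OTEL_TOKEN=${OTEL_TOKEN}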

Monitoring Backend Setup

Prometheus + Grafana

OpenTelemetry Collector Config:

otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"

exporters:
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"
    external_labels:
      cluster: "edge-fleet-1"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]

Prometheus Config:

prometheus.yml
global:
  scrape_interval: 15s

# Enable the remote write receiver by starting Prometheus with:
#   --web.enable-remote-write-receiver
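
For example, when running Prometheus directly rather than through Docker Compose:

prometheus \
  --config.file=prometheus.yml \
  --web.enable-remote-write-receiver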

Grafana Dashboard Queries:

# Pipeline readiness (p95)
histogram_quantile(0.95,
  rate(pipeline_readiness_duration_bucket{service_name="expanso-edge"}[5m])
)

# Memory usage per node
process_memory_usage{service_name="expanso-edge"} / 1024 / 1024

# Pipeline error rate
rate(pipeline_orchestration_errors_total[5m])

# CPU usage percentage
rate(process_cpu_time{service_name="expanso-edge"}[5m]) * 100

# Goroutines (if Go metrics enabled)
runtime_go_goroutines{service_name="expanso-edge"}
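
The error-rate query above can also back a Prometheus alerting rule, so sustained failures page you instead of waiting on a dashboard; the rule name, threshold, and duration here are illustrative:

# alert-rules.yml (illustrative)
groups:
  - name: expanso-edge
    rules:
      - alert: PipelineOrchestrationErrors
        expr: rate(pipeline_orchestration_errors_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pipeline orchestration errors on {{ $labels.service_name }}"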

Grafana Cloud

otel-collector-config.yaml
exporters:
  prometheusremotewrite:
    endpoint: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
    headers:
      Authorization: "Bearer ${GRAFANA_CLOUD_API_KEY}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]

Datadog

otel-collector-config.yaml
exporters:
  datadog:
    api:
      key: "${DD_API_KEY}"
      site: datadoghq.com

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [datadog]

Elastic (ELK Stack)

otel-collector-config.yaml
exporters:
  otlp/elastic:
    endpoint: "https://elastic-apm-server:8200"
    headers:
      Authorization: "Bearer ${ELASTIC_APM_TOKEN}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp/elastic]

New Relic

otel-collector-config.yaml
exporters:
  otlp/newrelic:
    endpoint: "https://otlp.nr-data.net:4317"
    headers:
      api-key: "${NEW_RELIC_LICENSE_KEY}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp/newrelic]

Honeycomb

otel-collector-config.yaml
exporters:
  otlp/honeycomb:
    endpoint: "api.honeycomb.io:443"
    headers:
      x-honeycomb-team: "${HONEYCOMB_API_KEY}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp/honeycomb]

Docker Compose Example

Complete monitoring stack with edge node, collector, Prometheus, and Grafana:

docker-compose.yml
version: '3.8'

services:
  expanso-edge:
    image: ghcr.io/expanso-io/expanso-edge:latest
    environment:
      - EXPANSO_EDGE_NAME=edge-docker-1
      - HOSTNAME=edge-docker-1
    volumes:
      - ./edge-config.yaml:/etc/expanso/config.yaml
      - edge-data:/var/lib/expanso-edge
    depends_on:
      - otel-collector

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    ports:
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml

  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-remote-write-receiver'
    ports:
      - "9090:9090"
    volumes:
      # mount the config referenced by --config.file above
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  edge-data:
  prometheus-data:
  grafana-data:
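
Bring the stack up and confirm the collector is receiving metrics:

docker compose up -d
docker compose logs -f otel-collector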

Troubleshooting

No Metrics Appearing

  1. Enable debug logging on the edge node:

edge-config.yaml
log:
  level: debug
  format: json

Check the logs for telemetry export attempts:

journalctl -u expanso-edge | grep -i telemetry
# or
docker logs expanso-edge | grep -i telemetry

  2. Verify the collector is receiving data:

docker logs otel-collector
# Look for: "OTLP receiver started"

  3. Test connectivity:

# From the edge node, verify the collector is reachable
telnet collector.example.com 4317
# or
nc -vz collector.example.com 4317

Authentication Errors

Look for authentication failures in logs:

"error": "failed to export metrics: rpc error: code = Unauthenticated"

Verify:

  • Token/API key is correct
  • Headers are properly formatted
  • Authentication type matches collector config

Metrics Have Wrong Names

Metric names may be transformed on the way into your backend. For example, the Prometheus exporters convert dots to underscores, so pipeline.readiness.duration is queried as pipeline_readiness_duration (with a _bucket suffix on histogram buckets). Check your monitoring backend's OTLP documentation for the exact transformations.

High Memory Usage

If Go metrics are enabled and memory usage is high:

telemetry:
  include_go_metrics: false      # Disable Go runtime metrics
  process_metrics_interval: 30s  # Reduce collection frequency

Security Best Practices

  1. Always use TLS in production:

     telemetry:
       endpoint: "collector.example.com:4317"
       insecure: false  # Verify TLS certificates

  2. Use authentication:

     telemetry:
       headers:
         Authorization: "Bearer ${OTEL_TOKEN}"

  3. Network isolation:

     • Keep the collector on a private network
     • Use firewall rules to restrict access to ports 4317/4318

  4. Rotate credentials regularly:

     • Use environment variables for tokens
     • Implement a token rotation policy

  5. Limit resource attributes:

     • Don't include sensitive data in attributes
     • Keep cardinality reasonable

Performance Considerations

Export Interval

telemetry:
  # Lower = more frequent updates, higher overhead
  # Higher = less overhead, delayed metrics
  export_interval: 30s  # Good default

  # High-frequency (more overhead):
  # export_interval: 10s

  # Low-frequency (less overhead):
  # export_interval: 60s

Process Metrics Interval

telemetry:
  # How often to collect process metrics
  process_metrics_interval: 15s  # Default

  # Reduce overhead with less frequent collection:
  # process_metrics_interval: 30s

Go Metrics

telemetry:
  # Disable if not needed (reduces overhead):
  include_go_metrics: false

  # Or enable for debugging memory/GC issues:
  # include_go_metrics: true

Next Steps