
OpenTelemetry Metrics Export

Monitor your Expanso Edge nodes by exporting metrics to your observability platform using OpenTelemetry (OTLP). Track pipeline performance, resource usage, and system health across your entire edge fleet.


Architecture

Expanso Edge uses OpenTelemetry Protocol (OTLP) to push metrics to a collector, which then exports to your monitoring backend:

┌──────────────────┐      OTLP (push, gRPC)      ┌──────────────────┐
│  Expanso Edge    │ ──────────────────────────> │  OpenTelemetry   │
│      Node        │                             │    Collector     │
└──────────────────┘                             └──────────────────┘
                                                          │
                                    ┌─────────────────────┼─────────────────────┐
                                    ▼                     ▼                     ▼
                              ┌──────────┐           ┌─────────┐           ┌──────────┐
                              │Prometheus│           │ Grafana │           │ Datadog  │
                              │          │           │  Cloud  │           │          │
                              └──────────┘           └─────────┘           └──────────┘

Why OTLP instead of Prometheus scraping?

  • ✅ Works through firewalls/NAT (push, not pull)
  • ✅ Single metrics pipeline for multiple backends
  • ✅ No inbound ports required on edge nodes
  • ✅ Centralized collector for filtering/routing

Quick Start

1. Configure Edge Node

Enable telemetry export in your edge configuration:

edge-config.yaml
name: edge-node-1
data_dir: /var/lib/expanso-edge

# Enable telemetry export
telemetry:
  # OpenTelemetry Collector endpoint
  endpoint: "otel-collector.example.com:4317"
  protocol: grpc
  export_interval: 30s

  # Include Go runtime metrics
  include_go_metrics: true
  process_metrics_interval: 15s

  # Tag all metrics with these attributes
  resource_attributes:
    service.name: "expanso-edge"
    environment: "production"
    region: "us-west-2"

2. Deploy OpenTelemetry Collector

otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

exporters:
  # Export to Prometheus via remote write
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
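
One way to run the collector with this config is the contrib Docker image, mirroring the Docker Compose example later in this guide (a sketch; adjust the image tag and paths to your environment):

docker run --rm \
  -p 4317:4317 -p 4318:4318 \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otel-collector-config.yaml" \
  otel/opentelemetry-collector-contrib:latest \
  --config=/etc/otel-collector-config.yaml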

3. Query Metrics

Once in Prometheus, query edge node metrics:

# Pipeline readiness time
histogram_quantile(0.95, rate(pipeline_readiness_duration_bucket[5m]))

# Memory usage in MB
process_memory_usage / 1024 / 1024

# Pipeline errors
rate(pipeline_orchestration_errors_total[5m])

Available Metrics

Pipeline Metrics

Metrics from data pipeline execution:

| Metric                        | Type      | Description                                      |
|-------------------------------|-----------|--------------------------------------------------|
| pipeline.readiness.duration   | Histogram | Time for pipeline to become ready during startup |
| pipeline.orchestration.errors | Counter   | Number of pipeline orchestration errors          |

Pipeline Component Attributes

Pipeline metrics get tagged with attributes that let you filter and identify metrics for specific components in your data pipelines. These attributes are automatically attached to pipeline-level metrics so you can monitor and troubleshoot individual components.

Available Component Attributes

| Attribute       | Description                  | Example Value                | When Available                            |
|-----------------|------------------------------|------------------------------|-------------------------------------------|
| component_id    | UUID from visual builder     | uuid-proc-123                | When component created in visual builder  |
| component_label | User-friendly component name | Data Filter, Kafka Output    | When label set in pipeline config         |
| component_name  | Benthos component type       | bloblang, kafka, http_client | Always available                          |
| component_type  | Component category           | input, processor, output     | Always available                          |

These attributes let you:

  • Filter metrics by component type (query only processor metrics or only output metrics)
  • Track specific components (monitor the component you labeled "Data Filter" across all pipelines)
  • Correlate with visual builder (match metrics to components using the UUID from your visual pipeline)
  • Group and aggregate (group error rates by component type or specific component names)

Example Queries

Monitor error rate for a specific component by label:

rate(component_errors_total{component_label="Data Filter"}[5m])

Track all Kafka outputs across your fleet:

component_throughput{component_name="kafka", component_type="output"}

Group errors by component type:

sum by (component_type) (
  rate(component_errors_total[5m])
)

Monitor a specific visual builder component:

component_status{component_id="uuid-proc-123"}

Component Labels for Better Observability

Adding descriptive labels to your pipeline components makes your metrics much easier to filter and understand. Instead of tracking "root.pipeline.processors.2", you can monitor "Data Enrichment" or "PII Filter". Labels are set in your pipeline YAML with the label field.

Setting Component Labels

Add labels to components in your pipeline configuration:

pipeline:
  processors:
    - label: "Data Filter"    # This becomes component_label
      bloblang: |
        root = this.filter(v -> v.status == "active")

    - label: "PII Redaction"  # This becomes component_label
      mapping: |
        root.email = this.email.redact_email()
        root.ssn = this.ssn.redact_ssn()

When these components report metrics, they'll include component_label="Data Filter" and component_label="PII Redaction", making them easy to identify in your monitoring dashboard.
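For example, you could then compare those two stages side by side (a sketch reusing the illustrative component_errors_total metric from the queries above):

sum by (component_label) (
  rate(component_errors_total{component_label=~"Data Filter|PII Redaction"}[5m])
)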

Process Metrics

System resource metrics (collected automatically):

| Metric                        | Type    | Description                                  |
|-------------------------------|---------|----------------------------------------------|
| process.cpu.time              | Counter | CPU time consumed by the process             |
| process.memory.usage          | Gauge   | Resident memory size (RSS)                   |
| process.memory.virtual        | Gauge   | Virtual memory size                          |
| process.open_file_descriptors | Gauge   | Number of open file descriptors (Unix/Linux) |

Platform support: Linux, macOS, Windows

Go Runtime Metrics

When include_go_metrics: true:

| Metric                       | Type      | Description                 |
|------------------------------|-----------|-----------------------------|
| runtime.go.goroutines        | Gauge     | Number of active goroutines |
| runtime.go.gc.count          | Counter   | Garbage collection count    |
| runtime.go.gc.pause_ns       | Histogram | GC pause duration           |
| runtime.go.mem.heap_alloc    | Gauge     | Bytes allocated on heap     |
| runtime.go.mem.heap_sys      | Gauge     | Heap system memory          |
| runtime.go.mem.heap_idle     | Gauge     | Idle heap memory            |
| runtime.go.mem.heap_inuse    | Gauge     | In-use heap memory          |
| runtime.go.mem.heap_released | Gauge     | Released heap memory        |

HTTP API Metrics

Metrics from the HTTP API servers on the orchestrator and edge nodes. Use these to monitor API usage, authentication issues, and validation errors.

Request Metrics

| Metric                         | Type          | Description                          |
|--------------------------------|---------------|--------------------------------------|
| http.server.request.duration   | Histogram     | Duration of HTTP requests in seconds |
| http.server.request.count      | Counter       | Total number of HTTP requests        |
| http.server.active_requests    | UpDownCounter | Currently active requests            |
| http.server.response.body.size | Histogram     | Size of HTTP responses in bytes      |

Labels:

  • http.request.method - HTTP method (GET, POST, etc.)
  • http.route - Matched route pattern (e.g., /api/v1/jobs/:id)
  • http.response.status_code - Response status code
  • error.type - Error type for 4xx/5xx responses (e.g., not_found, internal_server_error)

Auth and Validation Metrics

| Metric                          | Type    | Description                 |
|---------------------------------|---------|-----------------------------|
| http.server.auth.failures       | Counter | Authentication failures     |
| http.server.validation.failures | Counter | Request validation failures |

Authentication failure labels:

  • http.request.method - HTTP method
  • http.route - Matched route pattern
  • auth.failure.reason - Either missing_token or invalid_token

Validation failure labels:

  • http.request.method - HTTP method
  • http.route - Matched route pattern
  • validation.type - Either struct, submission, or custom

Example Queries

# Request latency (p95)
histogram_quantile(0.95, rate(http_server_request_duration_bucket[5m]))

# Request rate by HTTP method
rate(http_server_request_count[5m])

# Authentication failure rate
rate(http_server_auth_failures[5m])

# Error rate (4xx/5xx responses)
sum(rate(http_server_request_count{http_response_status_code=~"4..|5.."}[5m]))
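
To see which failure modes dominate, you can also group by the failure labels listed above (a sketch; Prometheus-style backends typically flatten dotted label names to underscores, e.g. auth.failure.reason becomes auth_failure_reason):

# Authentication failures by reason (missing_token vs invalid_token)
sum by (auth_failure_reason) (rate(http_server_auth_failures[5m]))

# Validation failures by type (struct, submission, custom)
sum by (validation_type) (rate(http_server_validation_failures[5m]))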

Metric Filtering

Edge nodes automatically drop certain high-cardinality metrics to reduce telemetry costs and noise. You can control which metrics get filtered using drop_metric_prefixes.

Default Behavior

Edge nodes drop these metrics by default:

| Prefix        | Metrics Dropped                                               | Reason                                             |
|---------------|---------------------------------------------------------------|----------------------------------------------------|
| db.           | Database client metrics (db.client.operation.duration, etc.)  | High cardinality from operation/table labels       |
| ncl.          | NCL messaging metrics                                         | Internal transport, not needed for edge monitoring |
| ncltransport. | NCL transport metrics                                         | Internal transport, not needed for edge monitoring |

Orchestrators export all metrics by default (no filtering).

Re-enabling Metrics

To export all metrics from an edge node (useful for debugging):

edge-config.yaml
telemetry:
  endpoint: "collector.example.com:4317"
  drop_metric_prefixes: []  # Empty array = keep all metrics

Custom Metric Filtering

Drop additional metrics to further reduce costs:

edge-config.yaml
telemetry:
  endpoint: "collector.example.com:4317"
  drop_metric_prefixes:
    - "db."
    - "ncl."
    - "ncltransport."
    - "store_gc."    # Drop GC cleanup metrics
    - "go-runtime."  # Drop Go runtime metrics

Available Metric Prefixes

| Prefix        | Description                                     | Component    |
|---------------|-------------------------------------------------|--------------|
| process.      | Process metrics (CPU, memory, file descriptors) | All          |
| go-runtime.   | Go runtime metrics (GC, goroutines)             | All (opt-in) |
| db.           | Database client metrics                         | All          |
| store_gc.     | Store garbage collection                        | All          |
| pipeline.     | Pipeline orchestration                          | Edge         |
| ncl.          | NCL messaging                                   | All          |
| ncltransport. | NCL transport                                   | All          |
| http.server.  | HTTP server metrics                             | Orchestrator |
| evaluation.   | Evaluation metrics                              | Orchestrator |
| scheduler.    | Scheduler metrics                               | Orchestrator |

Cost Optimization

If you're sending metrics to a hosted service like Grafana Cloud or Datadog, filtering unused metrics at the source can significantly reduce your telemetry costs. Start with the defaults and only re-enable metrics you actually need.
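If you prefer to keep the edge defaults and trim further downstream, the collector can also drop metrics before they reach the hosted backend. Here is a minimal sketch using the filter processor from opentelemetry-collector-contrib; the processor name and regex patterns are illustrative, and the exact syntax varies by collector version:

processors:
  filter/drop-noisy:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          # Illustrative patterns; match the prefixes you want to drop
          - "db\\..*"
          - "ncl.*"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [filter/drop-noisy]
      exporters: [prometheusremotewrite]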

Configuration Reference

Complete Telemetry Config

edge-config.yaml
telemetry:
  # Disable all telemetry (default: false)
  do_not_track: false

  # Collector endpoint (required)
  endpoint: "collector.example.com:4317"

  # Optional path under endpoint (e.g., "/v1/metrics")
  endpoint_path: ""

  # Protocol: "grpc" (recommended, port 4317) or "http" (port 4318)
  protocol: grpc

  # Skip TLS verification (NOT recommended for production)
  insecure: false

  # How often to export metrics (default: 30s)
  export_interval: 30s

  # Custom headers for authentication
  headers:
    Authorization: "Bearer your-api-token"
    X-Custom-Header: "value"

  # Resource attributes (tags/labels applied to all metrics)
  resource_attributes:
    service.name: "expanso-edge"
    service.version: "1.0.0"
    deployment.environment: "production"
    cloud.region: "us-west-2"
    cloud.availability_zone: "us-west-2a"
    host.name: "${HOSTNAME}"

  # Include Go runtime metrics (default: false)
  include_go_metrics: true

  # Process metrics collection interval (default: 15s)
  process_metrics_interval: 15s

  # Metric prefixes to drop (default: [] for orchestrator, ["db.", "ncl.", "ncltransport."] for edge)
  drop_metric_prefixes:
    - "db."
    - "ncl."
    - "ncltransport."

  # Alternative authentication config
  authentication:
    type: "Bearer"  # or "Basic"
    token: "your-bearer-token"
    namespace: "production"
    tenant: "acme-corp"

Authentication

Method 1: Headers (recommended)

telemetry:
  endpoint: "collector.example.com:4317"
  protocol: grpc
  headers:
    Authorization: "Bearer ${OTEL_TOKEN}"

Method 2: Authentication Config

telemetry:
  endpoint: "collector.example.com:4317"
  protocol: grpc
  authentication:
    type: "Bearer"
    token: "${OTEL_TOKEN}"
    namespace: "production"

Resource Attributes

Resource attributes are key-value pairs that identify the source of telemetry data. They're attached to every metric, trace, and log exported from your edge nodes, making it easy to filter and group data in your monitoring backend.

telemetry:
  resource_attributes:
    service.name: "expanso-edge"           # Identifies the service
    deployment.environment: "production"   # Environment (dev/staging/prod)
    cloud.region: "us-west-2"              # Geographic location
    host.name: "${HOSTNAME}"               # Uses environment variable

Common attributes:

| Attribute                | Description            | Example                  |
|--------------------------|------------------------|--------------------------|
| service.name             | Identifies the service | expanso-edge             |
| service.version          | Application version    | 1.2.0                    |
| deployment.environment   | Deployment environment | production, staging      |
| cloud.region             | Cloud region           | us-west-2, eu-central-1  |
| cloud.availability_zone  | Availability zone      | us-west-2a               |
| host.name                | Hostname               | edge-node-01             |

Tips:

  • Use consistent naming across your fleet for effective filtering
  • Avoid high-cardinality values (like UUIDs) in attributes - they can bloat your metrics database
  • Environment variables (${VAR}) are expanded at runtime
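
For instance, prefer stable, human-meaningful attributes over per-instance unique values (an illustrative sketch; the node.uuid key is hypothetical):

# Avoid: a unique value per node bloats cardinality
# resource_attributes:
#   node.uuid: "3f8a9c2e-unique-per-node"

# Prefer: stable, low-cardinality attributes
resource_attributes:
  service.name: "expanso-edge"
  deployment.environment: "production"
  host.name: "${HOSTNAME}"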

For the full list of semantic conventions, see the OpenTelemetry Semantic Conventions.


Monitoring Backend Setup

Prometheus + Grafana

OpenTelemetry Collector Config:

otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"

exporters:
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"
    external_labels:
      cluster: "edge-fleet-1"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]

Prometheus Config:

prometheus.yml
global:
  scrape_interval: 15s

# Enable remote write receiver
# Start with: --web.enable-remote-write-receiver
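
For example, a local Prometheus can be started with the remote write receiver enabled (a sketch; adjust the config path to your setup):

prometheus \
  --config.file=prometheus.yml \
  --web.enable-remote-write-receiver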

Grafana Dashboard Queries:

# Pipeline readiness (p95)
histogram_quantile(0.95,
  rate(pipeline_readiness_duration_bucket{service_name="expanso-edge"}[5m])
)

# Memory usage per node
process_memory_usage{service_name="expanso-edge"} / 1024 / 1024

# Pipeline error rate
rate(pipeline_orchestration_errors_total[5m])

# CPU usage percentage
rate(process_cpu_time{service_name="expanso-edge"}[5m]) * 100

# Goroutines (if Go metrics enabled)
runtime_go_goroutines{service_name="expanso-edge"}

Grafana Cloud

otel-collector-config.yaml
exporters:
  prometheusremotewrite:
    endpoint: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
    headers:
      Authorization: "Bearer ${GRAFANA_CLOUD_API_KEY}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]

Datadog

otel-collector-config.yaml
exporters:
  datadog:
    api:
      key: "${DD_API_KEY}"
      site: datadoghq.com

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [datadog]

Elastic (ELK Stack)

otel-collector-config.yaml
exporters:
  otlp/elastic:
    endpoint: "https://elastic-apm-server:8200"
    headers:
      Authorization: "Bearer ${ELASTIC_APM_TOKEN}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp/elastic]

New Relic

otel-collector-config.yaml
exporters:
  otlp/newrelic:
    endpoint: "https://otlp.nr-data.net:4317"
    headers:
      api-key: "${NEW_RELIC_LICENSE_KEY}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp/newrelic]

Honeycomb

otel-collector-config.yaml
exporters:
  otlp/honeycomb:
    endpoint: "api.honeycomb.io:443"
    headers:
      x-honeycomb-team: "${HONEYCOMB_API_KEY}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp/honeycomb]

Docker Compose Example

Complete monitoring stack with edge node, collector, Prometheus, and Grafana:

docker-compose.yml
version: '3.8'

services:
  expanso-edge:
    image: ghcr.io/expanso-io/expanso-edge:latest
    environment:
      - EXPANSO_EDGE_NAME=edge-docker-1
      - HOSTNAME=edge-docker-1
    volumes:
      - ./edge-config.yaml:/etc/expanso/config.yaml
      - edge-data:/var/lib/expanso-edge
    depends_on:
      - otel-collector

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml

  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-remote-write-receiver'
    ports:
      - "9090:9090"
    volumes:
      - prometheus-data:/prometheus

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  edge-data:
  prometheus-data:
  grafana-data:

Troubleshooting

No Metrics Appearing

  1. Enable debug logging on edge node:

     edge-config.yaml
     log:
       level: debug
       format: json

     Check logs for telemetry export attempts:

     journalctl -u expanso-edge | grep -i telemetry
     # or
     docker logs expanso-edge | grep -i telemetry

  2. Verify collector is receiving:

     docker logs otel-collector
     # Look for: "OTLP receiver started"

  3. Test connectivity:

     # From edge node, verify collector is reachable
     telnet collector.example.com 4317

Authentication Errors

Look for authentication failures in logs:

"error": "failed to export metrics: rpc error: code = Unauthenticated"

Verify:

  • Token/API key is correct
  • Headers are properly formatted
  • Authentication type matches collector config
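
If the collector also exposes the OTLP/HTTP port (4318), you can sanity-check the token with curl (a sketch; an empty request is not a real export, but an immediate 401/403 points at the token or header):

# Quick auth check against the collector's OTLP/HTTP endpoint
curl -i \
  -H "Authorization: Bearer ${OTEL_TOKEN}" \
  -H "Content-Type: application/x-protobuf" \
  -X POST https://collector.example.com:4318/v1/metrics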

Metrics Have Wrong Names

OTLP may transform metric names. Check your monitoring backend's OTLP documentation for name transformations.
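
For example, with the Prometheus remote-write path used in this guide, dotted OTLP names are flattened to underscores, counters gain a _total suffix, and histograms are split into _bucket/_sum/_count series, which is why the queries above look the way they do:

# OTLP metric name                  Name seen in Prometheus
pipeline.readiness.duration     ->  pipeline_readiness_duration_bucket / _sum / _count
pipeline.orchestration.errors   ->  pipeline_orchestration_errors_total
process.memory.usage            ->  process_memory_usage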

High Memory Usage

If Go metrics are enabled and memory usage is high:

telemetry:
  include_go_metrics: false       # Disable Go runtime metrics
  process_metrics_interval: 30s   # Reduce collection frequency

Security Best Practices

  1. Always use TLS in production:

    telemetry:
      endpoint: "collector.example.com:4317"
      insecure: false # Verify TLS certificates
  2. Use authentication:

    telemetry:
      headers:
        Authorization: "Bearer ${OTEL_TOKEN}"
  3. Network isolation:

    • Keep collector on private network
    • Use firewall rules to restrict access to port 4317/4318
  4. Rotate credentials regularly:

    • Use environment variables for tokens
    • Implement token rotation policy
  5. Limit resource attributes:

    • Don't include sensitive data in attributes
    • Keep cardinality reasonable

Performance Considerations

Export Interval

telemetry:
  # Lower = more frequent updates, higher overhead
  # Higher = less overhead, delayed metrics
  export_interval: 30s # Good default

  # High-frequency (more overhead):
  # export_interval: 10s

  # Low-frequency (less overhead):
  # export_interval: 60s

Process Metrics Interval

telemetry:
  # How often to collect process metrics
  process_metrics_interval: 15s # Default

  # Reduce overhead with less frequent collection:
  # process_metrics_interval: 30s

Go Metrics

telemetry:
  # Disable if not needed (reduces overhead):
  include_go_metrics: false

  # Or enable for debugging memory/GC issues:
  # include_go_metrics: true

Next Steps