OpenTelemetry Metrics Export
Monitor your Expanso Edge nodes by exporting metrics to your observability platform using OpenTelemetry (OTLP). Track pipeline performance, resource usage, and system health across your entire edge fleet.
Architecture
Expanso Edge uses OpenTelemetry Protocol (OTLP) to push metrics to a collector, which then exports to your monitoring backend:
```
┌─────────────────┐      OTLP       ┌──────────────────┐
│  Expanso Edge   │ ──────────────> │  OpenTelemetry   │
│      Node       │  (push, gRPC)   │    Collector     │
└─────────────────┘                 └──────────────────┘
                                             │
                       ┌─────────────────────┼─────────────────────┐
                       │                     │                     │
                       ▼                     ▼                     ▼
                 ┌──────────┐          ┌─────────┐           ┌──────────┐
                 │Prometheus│          │ Grafana │           │ Datadog  │
                 │          │          │  Cloud  │           │          │
                 └──────────┘          └─────────┘           └──────────┘
```
Why OTLP instead of Prometheus scraping?
- ✅ Works through firewalls/NAT (push, not pull)
- ✅ Single metrics pipeline for multiple backends
- ✅ No inbound ports required on edge nodes
- ✅ Centralized collector for filtering/routing
Quick Start
1. Configure Edge Node
Enable telemetry export in your edge configuration:
```yaml
name: edge-node-1
data_dir: /var/lib/expanso-edge

# Enable telemetry export
telemetry:
  # OpenTelemetry Collector endpoint
  endpoint: "otel-collector.example.com:4317"
  protocol: grpc
  export_interval: 30s

  # Include Go runtime metrics
  include_go_metrics: true
  process_metrics_interval: 15s

  # Tag all metrics with these attributes
  resource_attributes:
    service.name: "expanso-edge"
    environment: "production"
    region: "us-west-2"
```
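After updating the config, restart the node and confirm the exporter starts. A minimal check, assuming a systemd install using the expanso-edge unit name referenced in the troubleshooting section below (adjust for Docker or other setups):

```bash
# Restart the node so it picks up the new telemetry config,
# then watch the logs for export activity
sudo systemctl restart expanso-edge
journalctl -u expanso-edge -f | grep -i telemetry
```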
2. Deploy OpenTelemetry Collector
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

exporters:
  # Export to Prometheus via remote write
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```
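One way to run the collector with this config is the contrib Docker image, mirroring the Docker Compose example later in this guide (the file path and container name here are illustrative):

```bash
# Start the OpenTelemetry Collector (contrib build includes prometheusremotewrite)
docker run -d --name otel-collector \
  -p 4317:4317 -p 4318:4318 \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otel-collector-config.yaml" \
  otel/opentelemetry-collector-contrib:latest \
  --config=/etc/otel-collector-config.yaml
```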
3. Query Metrics
Once in Prometheus, query edge node metrics:
```promql
# Pipeline readiness time
histogram_quantile(0.95, rate(pipeline_readiness_duration_bucket[5m]))

# Memory usage in MB
process_memory_usage / 1024 / 1024

# Pipeline errors
rate(pipeline_orchestration_errors_total[5m])
```
Available Metrics
Pipeline Metrics
Metrics from data pipeline execution:
| Metric | Type | Description |
|---|---|---|
| pipeline.readiness.duration | Histogram | Time for pipeline to become ready during startup |
| pipeline.orchestration.errors | Counter | Number of pipeline orchestration errors |
Pipeline Component Attributes
Pipeline metrics are tagged with attributes that identify which component in your data pipeline they came from. These attributes are attached automatically to pipeline-level metrics, so you can filter, monitor, and troubleshoot individual components.
Available Component Attributes
| Attribute | Description | Example Value | When Available |
|---|---|---|---|
| component_id | UUID from visual builder | uuid-proc-123 | When component created in visual builder |
| component_label | User-friendly component name | Data Filter, Kafka Output | When label set in pipeline config |
| component_name | Benthos component type | bloblang, kafka, http_client | Always available |
| component_type | Component category | input, processor, output | Always available |
These attributes let you:
- Filter metrics by component type (query only processor metrics or only output metrics)
- Track specific components (monitor the component you labeled "Data Filter" across all pipelines)
- Correlate with visual builder (match metrics to components using the UUID from your visual pipeline)
- Group and aggregate (group error rates by component type or specific component names)
Example Queries
Monitor error rate for a specific component by label:

```promql
rate(component_errors_total{component_label="Data Filter"}[5m])
```

Track all Kafka outputs across your fleet:

```promql
component_throughput{component_name="kafka", component_type="output"}
```

Group errors by component type:

```promql
sum by (component_type) (
  rate(component_errors_total[5m])
)
```

Monitor a specific visual builder component:

```promql
component_status{component_id="uuid-proc-123"}
```
Adding descriptive labels to your pipeline components makes your metrics much easier to filter and understand. Instead of tracking "root.pipeline.processors.2", you can monitor "Data Enrichment" or "PII Filter". Labels are set in your pipeline YAML with the label field.
Setting Component Labels
Add labels to components in your pipeline configuration:
```yaml
pipeline:
  processors:
    - label: "Data Filter"      # This becomes component_label
      bloblang: |
        root = this.filter(v -> v.status == "active")

    - label: "PII Redaction"    # This becomes component_label
      mapping: |
        root.email = this.email.redact_email()
        root.ssn = this.ssn.redact_ssn()
```
When these components report metrics, they'll include component_label="Data Filter" and component_label="PII Redaction", making them easy to identify in your monitoring dashboard.
Process Metrics
System resource metrics (collected automatically):
| Metric | Type | Description |
|---|---|---|
| process.cpu.time | Counter | CPU time consumed by the process |
| process.memory.usage | Gauge | Resident memory size (RSS) |
| process.memory.virtual | Gauge | Virtual memory size |
| process.open_file_descriptors | Gauge | Number of open file descriptors (Unix/Linux) |
Platform support: Linux, macOS, Windows
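A couple of starter queries for these metrics, assuming your backend converts dots to underscores as in the Quick Start examples above:

```promql
# Open file descriptors (Unix/Linux only)
process_open_file_descriptors

# Approximate CPU utilization (fraction of one core)
rate(process_cpu_time[5m])
```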
Go Runtime Metrics
When include_go_metrics: true:
| Metric | Type | Description |
|---|---|---|
| runtime.go.goroutines | Gauge | Number of active goroutines |
| runtime.go.gc.count | Counter | Garbage collection count |
| runtime.go.gc.pause_ns | Histogram | GC pause duration |
| runtime.go.mem.heap_alloc | Gauge | Bytes allocated on heap |
| runtime.go.mem.heap_sys | Gauge | Heap system memory |
| runtime.go.mem.heap_idle | Gauge | Idle heap memory |
| runtime.go.mem.heap_inuse | Gauge | In-use heap memory |
| runtime.go.mem.heap_released | Gauge | Released heap memory |
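When Go metrics are enabled, two useful health checks are goroutine growth and heap residency (again assuming dot-to-underscore name conversion in your backend):

```promql
# Goroutine count; a steady climb over time suggests a leak
runtime_go_goroutines

# In-use heap memory, in MB
runtime_go_mem_heap_inuse / 1024 / 1024
```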
HTTP API Metrics
Metrics from the HTTP API servers on the orchestrator and edge nodes. Use these to monitor API usage, authentication issues, and validation errors.
Request Metrics
| Metric | Type | Description |
|---|---|---|
| http.server.request.duration | Histogram | Duration of HTTP requests in seconds |
| http.server.request.count | Counter | Total number of HTTP requests |
| http.server.active_requests | UpDownCounter | Currently active requests |
| http.server.response.body.size | Histogram | Size of HTTP responses in bytes |
Labels:
- http.request.method - HTTP method (GET, POST, etc.)
- http.route - Matched route pattern (e.g., /api/v1/jobs/:id)
- http.response.status_code - Response status code
- error.type - Error type for 4xx/5xx responses (e.g., not_found, internal_server_error)
Auth and Validation Metrics
| Metric | Type | Description |
|---|---|---|
| http.server.auth.failures | Counter | Authentication failures |
| http.server.validation.failures | Counter | Request validation failures |
Authentication failure labels:
- http.request.method - HTTP method
- http.route - Matched route pattern
- auth.failure.reason - Either missing_token or invalid_token
Validation failure labels:
- http.request.method - HTTP method
- http.route - Matched route pattern
- validation.type - Either struct, submission, or custom
Example Queries
```promql
# Request latency by endpoint (p95)
histogram_quantile(0.95, rate(http_server_request_duration_bucket[5m]))

# Request rate by HTTP method
rate(http_server_request_count[5m])

# Authentication failure rate
rate(http_server_auth_failures[5m])

# Error rate (4xx/5xx responses)
sum(rate(http_server_request_count{http_response_status_code=~"4..|5.."}[5m]))
```
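The auth and validation counters become more useful when broken down by their labels; these examples assume the same dot-to-underscore label conversion used in the queries above:

```promql
# Authentication failures by reason (missing_token vs invalid_token)
sum by (auth_failure_reason) (rate(http_server_auth_failures[5m]))

# Validation failures by type and route
sum by (validation_type, http_route) (rate(http_server_validation_failures[5m]))

# p95 latency for a single route
histogram_quantile(0.95,
  sum by (le) (rate(http_server_request_duration_bucket{http_route="/api/v1/jobs/:id"}[5m]))
)
```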
Metric Filtering
Edge nodes automatically drop certain high-cardinality metrics to reduce telemetry costs and noise. You can control which metrics get filtered using drop_metric_prefixes.
Default Behavior
Edge nodes drop these metrics by default:
| Prefix | Metrics Dropped | Reason |
|---|---|---|
| db. | Database client metrics (db.client.operation.duration, etc.) | High cardinality from operation/table labels |
| ncl. | NCL messaging metrics | Internal transport, not needed for edge monitoring |
| ncltransport. | NCL transport metrics | Internal transport, not needed for edge monitoring |
Orchestrators export all metrics by default (no filtering).
Re-enabling Metrics
To export all metrics from an edge node (useful for debugging):
```yaml
telemetry:
  endpoint: "collector.example.com:4317"
  drop_metric_prefixes: []   # Empty array = keep all metrics
```
Custom Metric Filtering
Drop additional metrics to further reduce costs:
```yaml
telemetry:
  endpoint: "collector.example.com:4317"
  drop_metric_prefixes:
    - "db."
    - "ncl."
    - "ncltransport."
    - "store_gc."    # Drop GC cleanup metrics
    - "go-runtime."  # Drop Go runtime metrics
```
Available Metric Prefixes
| Prefix | Description | Component |
|---|---|---|
| process. | Process metrics (CPU, memory, file descriptors) | All |
| go-runtime. | Go runtime metrics (GC, goroutines) | All (opt-in) |
| db. | Database client metrics | All |
| store_gc. | Store garbage collection | All |
| pipeline. | Pipeline orchestration | Edge |
| ncl. | NCL messaging | All |
| ncltransport. | NCL transport | All |
| http.server. | HTTP server metrics | Orchestrator |
| evaluation. | Evaluation metrics | Orchestrator |
| scheduler. | Scheduler metrics | Orchestrator |
If you're sending metrics to a hosted service like Grafana Cloud or Datadog, filtering unused metrics at the source can significantly reduce your telemetry costs. Start with the defaults and only re-enable metrics you actually need.
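If you would rather keep exporting everything from the nodes and filter centrally, the collector's filter processor can drop metrics before they reach your backend. A sketch using the OTTL-style syntax from recent collector-contrib releases, reusing the same default prefixes as above (check the filter processor docs for the exact syntax your collector version expects):

```yaml
processors:
  filter/drop-internal:
    error_mode: ignore
    metrics:
      metric:
        # Drop metrics whose names match these prefixes
        - 'IsMatch(name, "^db\\.")'
        - 'IsMatch(name, "^ncl")'

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [filter/drop-internal]
      exporters: [prometheusremotewrite]
```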
Configuration Reference
Complete Telemetry Config
```yaml
telemetry:
  # Disable all telemetry (default: false)
  do_not_track: false

  # Collector endpoint (required)
  endpoint: "collector.example.com:4317"

  # Optional path under endpoint (e.g., "/v1/metrics")
  endpoint_path: ""

  # Protocol: "grpc" (recommended, port 4317) or "http" (port 4318)
  protocol: grpc

  # Skip TLS verification (NOT recommended for production)
  insecure: false

  # How often to export metrics (default: 30s)
  export_interval: 30s

  # Custom headers for authentication
  headers:
    Authorization: "Bearer your-api-token"
    X-Custom-Header: "value"

  # Resource attributes (tags/labels applied to all metrics)
  resource_attributes:
    service.name: "expanso-edge"
    service.version: "1.0.0"
    deployment.environment: "production"
    cloud.region: "us-west-2"
    cloud.availability_zone: "us-west-2a"
    host.name: "${HOSTNAME}"

  # Include Go runtime metrics (default: false)
  include_go_metrics: true

  # Process metrics collection interval (default: 15s)
  process_metrics_interval: 15s

  # Metric prefixes to drop (default: [] for orchestrator, ["db.", "ncl.", "ncltransport."] for edge)
  drop_metric_prefixes:
    - "db."
    - "ncl."
    - "ncltransport."

  # Alternative authentication config
  authentication:
    type: "Bearer"   # or "Basic"
    token: "your-bearer-token"
    namespace: "production"
    tenant: "acme-corp"
```
Authentication
Method 1: Headers (recommended)
```yaml
telemetry:
  endpoint: "collector.example.com:4317"
  protocol: grpc
  headers:
    Authorization: "Bearer ${OTEL_TOKEN}"
```
Method 2: Authentication Config
```yaml
telemetry:
  endpoint: "collector.example.com:4317"
  protocol: grpc
  authentication:
    type: "Bearer"
    token: "${OTEL_TOKEN}"
    namespace: "production"
```
Resource Attributes
Resource attributes are key-value pairs that identify the source of telemetry data. They're attached to every metric, trace, and log exported from your edge nodes, making it easy to filter and group data in your monitoring backend.
```yaml
telemetry:
  resource_attributes:
    service.name: "expanso-edge"           # Identifies the service
    deployment.environment: "production"   # Environment (dev/staging/prod)
    cloud.region: "us-west-2"              # Geographic location
    host.name: "${HOSTNAME}"               # Uses environment variable
```
Common attributes:
| Attribute | Description | Example |
|---|---|---|
| service.name | Identifies the service | expanso-edge |
| service.version | Application version | 1.2.0 |
| deployment.environment | Deployment environment | production, staging |
| cloud.region | Cloud region | us-west-2, eu-central-1 |
| cloud.availability_zone | Availability zone | us-west-2a |
| host.name | Hostname | edge-node-01 |
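For example, once these attributes reach Prometheus as labels (see the resource_to_telemetry_conversion note in the Prometheus + Grafana section below), you can filter and group by them; label names here assume dots are converted to underscores:

```promql
# Memory usage (MB) for production nodes in one region
process_memory_usage{deployment_environment="production", cloud_region="us-west-2"} / 1024 / 1024

# Pipeline error rate grouped by environment
sum by (deployment_environment) (rate(pipeline_orchestration_errors_total[5m]))
```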
Tips:
- Use consistent naming across your fleet for effective filtering
- Avoid high-cardinality values (like UUIDs) in attributes; they can bloat your metrics database
- Environment variables (${VAR}) are expanded at runtime
For the full list of semantic conventions, see the OpenTelemetry Semantic Conventions.
Monitoring Backend Setup
Prometheus + Grafana
OpenTelemetry Collector Config:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"

exporters:
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"
    external_labels:
      cluster: "edge-fleet-1"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```
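The dashboard queries below filter on labels like service_name, which originate as resource attributes. The prometheusremotewrite exporter does not attach resource attributes as metric labels by default; one way to promote them is the exporter's resource_to_telemetry_conversion option (verify against your collector version):

```yaml
exporters:
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"
    # Copy resource attributes (service.name, cloud.region, ...) onto every
    # metric as labels so they can be used directly in PromQL
    resource_to_telemetry_conversion:
      enabled: true
```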
Prometheus Config:
```yaml
global:
  scrape_interval: 15s

# Enable remote write receiver
# Start with: --web.enable-remote-write-receiver
```
Grafana Dashboard Queries:
```promql
# Pipeline readiness (p95)
histogram_quantile(0.95,
  rate(pipeline_readiness_duration_bucket{service_name="expanso-edge"}[5m])
)

# Memory usage per node
process_memory_usage{service_name="expanso-edge"} / 1024 / 1024

# Pipeline error rate
rate(pipeline_orchestration_errors_total[5m])

# CPU usage percentage
rate(process_cpu_time{service_name="expanso-edge"}[5m]) * 100

# Goroutines (if Go metrics enabled)
runtime_go_goroutines{service_name="expanso-edge"}
```
Grafana Cloud
```yaml
exporters:
  prometheusremotewrite:
    endpoint: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
    headers:
      Authorization: "Bearer ${GRAFANA_CLOUD_API_KEY}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
```
Datadog
```yaml
exporters:
  datadog:
    api:
      key: "${DD_API_KEY}"
      site: datadoghq.com

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [datadog]
```
Elastic (ELK Stack)
```yaml
exporters:
  otlp/elastic:
    endpoint: "https://elastic-apm-server:8200"
    headers:
      Authorization: "Bearer ${ELASTIC_APM_TOKEN}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp/elastic]
```
New Relic
```yaml
exporters:
  otlp/newrelic:
    endpoint: "https://otlp.nr-data.net:4317"
    headers:
      api-key: "${NEW_RELIC_LICENSE_KEY}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp/newrelic]
```
Honeycomb
```yaml
exporters:
  otlp/honeycomb:
    endpoint: "api.honeycomb.io:443"
    headers:
      x-honeycomb-team: "${HONEYCOMB_API_KEY}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp/honeycomb]
```
Docker Compose Example
Complete monitoring stack with edge node, collector, Prometheus, and Grafana:
```yaml
version: '3.8'

services:
  expanso-edge:
    image: ghcr.io/expanso-io/expanso-edge:latest
    environment:
      - EXPANSO_EDGE_NAME=edge-docker-1
      - HOSTNAME=edge-docker-1
    volumes:
      - ./edge-config.yaml:/etc/expanso/config.yaml
      - edge-data:/var/lib/expanso-edge
    depends_on:
      - otel-collector

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml

  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-remote-write-receiver'
    ports:
      - "9090:9090"
    volumes:
      - prometheus-data:/prometheus

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  edge-data:
  prometheus-data:
  grafana-data:
```
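The compose file mounts an edge-config.yaml from the working directory. A minimal version for this stack might look like the following sketch, based on the Quick Start config above; the in-container config path and the otel-collector service name are taken from the compose example, and everything else is adjustable:

```yaml
# ./edge-config.yaml - minimal config for the Docker Compose stack above
name: edge-docker-1
data_dir: /var/lib/expanso-edge

telemetry:
  # Reach the collector via its Compose service name on the shared network
  endpoint: "otel-collector:4317"
  protocol: grpc
  export_interval: 30s
  # If the local collector terminates plain gRPC without TLS,
  # you may also need insecure: true
  resource_attributes:
    service.name: "expanso-edge"
    environment: "docker-dev"
```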
Troubleshooting
No Metrics Appearing
- Enable debug logging on edge node:

  ```yaml
  log:
    level: debug
    format: json
  ```

  Check logs for telemetry export attempts:

  ```bash
  journalctl -u expanso-edge | grep -i telemetry
  # or
  docker logs expanso-edge | grep -i telemetry
  ```

- Verify collector is receiving:

  ```bash
  docker logs otel-collector
  # Look for: "OTLP receiver started"
  ```

- Test connectivity:

  ```bash
  # From edge node, verify collector is reachable
  telnet collector.example.com 4317
  ```
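If telnet isn't installed on the node, nc (netcat) is a common alternative for the same reachability check (hostname is illustrative):

```bash
# Exit code 0 and a "succeeded"/"open" message mean the port is reachable
nc -vz collector.example.com 4317
```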
Authentication Errors
Look for authentication failures in logs:
"error": "failed to export metrics: rpc error: code = Unauthenticated"
Verify:
- Token/API key is correct
- Headers are properly formatted
- Authentication type matches collector config
Metrics Have Wrong Names
The collector or backend may transform metric names on export. For example, Prometheus-compatible backends convert dots to underscores, so pipeline.readiness.duration appears as pipeline_readiness_duration. Check your monitoring backend's OTLP documentation for the exact transformations.
High Memory Usage
If Go metrics are enabled and memory usage is high:
```yaml
telemetry:
  include_go_metrics: false       # Disable Go runtime metrics
  process_metrics_interval: 30s   # Reduce collection frequency
```
Security Best Practices
- Always use TLS in production:

  ```yaml
  telemetry:
    endpoint: "collector.example.com:4317"
    insecure: false   # Verify TLS certificates
  ```

- Use authentication:

  ```yaml
  telemetry:
    headers:
      Authorization: "Bearer ${OTEL_TOKEN}"
  ```

- Network isolation:
  - Keep collector on private network
  - Use firewall rules to restrict access to ports 4317/4318

- Rotate credentials regularly:
  - Use environment variables for tokens
  - Implement a token rotation policy

- Limit resource attributes:
  - Don't include sensitive data in attributes
  - Keep cardinality reasonable
Performance Considerations
Export Interval
```yaml
telemetry:
  # Lower = more frequent updates, higher overhead
  # Higher = less overhead, delayed metrics
  export_interval: 30s   # Good default

  # High-frequency (more overhead):
  # export_interval: 10s

  # Low-frequency (less overhead):
  # export_interval: 60s
```
Process Metrics Interval
```yaml
telemetry:
  # How often to collect process metrics
  process_metrics_interval: 15s   # Default

  # Reduce overhead with less frequent collection:
  # process_metrics_interval: 30s
```
Go Metrics
```yaml
telemetry:
  # Disable if not needed (reduces overhead):
  include_go_metrics: false

  # Enable only when debugging memory/GC issues:
  # include_go_metrics: true
```