
OpenTelemetry Metrics Export

Monitor your Expanso Edge nodes by exporting metrics to your observability platform using OpenTelemetry (OTLP). Track pipeline performance, resource usage, and system health across your entire edge fleet.


Architecture

Expanso Edge uses OpenTelemetry Protocol (OTLP) to push metrics to a collector, which then exports to your monitoring backend:

┌──────────────────┐      OTLP (push, gRPC)      ┌──────────────────┐
│  Expanso Edge    │ ──────────────────────────> │  OpenTelemetry   │
│      Node        │                             │    Collector     │
└──────────────────┘                             └──────────────────┘
                                                          │
                                    ┌─────────────────────┼─────────────────────┐
                                    ▼                     ▼                     ▼
                              ┌──────────┐           ┌─────────┐           ┌──────────┐
                              │Prometheus│           │ Grafana │           │ Datadog  │
                              │          │           │  Cloud  │           │          │
                              └──────────┘           └─────────┘           └──────────┘

Why OTLP instead of Prometheus scraping?

  • ✅ Works through firewalls/NAT (push, not pull)
  • ✅ Single metrics pipeline for multiple backends
  • ✅ No inbound ports required on edge nodes
  • ✅ Centralized collector for filtering/routing

Quick Start

1. Configure Edge Node

Enable telemetry export in your edge configuration:

edge-config.yaml
name: edge-node-1
data_dir: /var/lib/expanso-edge

# Enable telemetry export
telemetry:
  # OpenTelemetry Collector endpoint
  endpoint: "otel-collector.example.com:4317"
  protocol: grpc
  export_interval: 30s

  # Include Go runtime metrics
  include_go_metrics: true
  process_metrics_interval: 15s

  # Tag all metrics with these attributes
  resource_attributes:
    service.name: "expanso-edge"
    environment: "production"
    region: "us-west-2"

2. Deploy OpenTelemetry Collector

otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

exporters:
  # Export to Prometheus via remote write
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]
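
One way to run the collector with this config is the contrib Docker image, mirroring the Docker Compose example later in this guide (a sketch; adjust the image tag and paths to your environment):

docker run --rm \
  -p 4317:4317 -p 4318:4318 \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otel-collector-config.yaml" \
  otel/opentelemetry-collector-contrib:latest \
  --config=/etc/otel-collector-config.yaml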

3. Query Metrics

Once in Prometheus, query edge node metrics:

# Pipeline readiness time
histogram_quantile(0.95, rate(pipeline_readiness_duration_bucket[5m]))

# Memory usage in MB
process_memory_usage / 1024 / 1024

# Pipeline errors
rate(pipeline_orchestration_errors_total[5m])

Available Metrics

Pipeline Metrics

Metrics from data pipeline execution:

| Metric                        | Type      | Description                                      |
|-------------------------------|-----------|--------------------------------------------------|
| pipeline.readiness.duration   | Histogram | Time for pipeline to become ready during startup |
| pipeline.orchestration.errors | Counter   | Number of pipeline orchestration errors          |

Pipeline Component Attributes

Pipeline metrics get tagged with attributes that let you filter and identify metrics for specific components in your data pipelines. These attributes are automatically attached to pipeline-level metrics so you can monitor and troubleshoot individual components.

Available Component Attributes

| Attribute       | Description                  | Example Value                | When Available                            |
|-----------------|------------------------------|------------------------------|-------------------------------------------|
| component_id    | UUID from visual builder     | uuid-proc-123                | When component created in visual builder  |
| component_label | User-friendly component name | Data Filter, Kafka Output    | When label set in pipeline config         |
| component_name  | Benthos component type       | bloblang, kafka, http_client | Always available                          |
| component_type  | Component category           | input, processor, output     | Always available                          |

These attributes let you:

  • Filter metrics by component type (query only processor metrics or only output metrics)
  • Track specific components (monitor the component you labeled "Data Filter" across all pipelines)
  • Correlate with visual builder (match metrics to components using the UUID from your visual pipeline)
  • Group and aggregate (group error rates by component type or specific component names)

Example Queries

Monitor error rate for a specific component by label:

rate(component_errors_total{component_label="Data Filter"}[5m])

Track all Kafka outputs across your fleet:

component_throughput{component_name="kafka", component_type="output"}

Group errors by component type:

sum by (component_type) (
  rate(component_errors_total[5m])
)

Monitor a specific visual builder component:

component_status{component_id="uuid-proc-123"}

Component Labels for Better Observability

Adding descriptive labels to your pipeline components makes your metrics much easier to filter and understand. Instead of tracking "root.pipeline.processors.2", you can monitor "Data Enrichment" or "PII Filter". Labels are set in your pipeline YAML with the label field.

Setting Component Labels

Add labels to components in your pipeline configuration:

pipeline:
  processors:
    - label: "Data Filter"    # This becomes component_label
      bloblang: |
        root = this.filter(v -> v.status == "active")

    - label: "PII Redaction"  # This becomes component_label
      mapping: |
        root.email = this.email.redact_email()
        root.ssn = this.ssn.redact_ssn()

When these components report metrics, they'll include component_label="Data Filter" and component_label="PII Redaction", making them easy to identify in your monitoring dashboard.
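For example, you could then compare those two stages side by side (a sketch reusing the illustrative component_errors_total metric from the queries above):

sum by (component_label) (
  rate(component_errors_total{component_label=~"Data Filter|PII Redaction"}[5m])
)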

Process Metrics

System resource metrics (collected automatically):

| Metric                        | Type    | Description                                  |
|-------------------------------|---------|----------------------------------------------|
| process.cpu.time              | Counter | CPU time consumed by the process             |
| process.memory.usage          | Gauge   | Resident memory size (RSS)                   |
| process.memory.virtual        | Gauge   | Virtual memory size                          |
| process.open_file_descriptors | Gauge   | Number of open file descriptors (Unix/Linux) |

Platform support: Linux, macOS, Windows

Go Runtime Metrics

When include_go_metrics: true:

| Metric                       | Type      | Description                 |
|------------------------------|-----------|-----------------------------|
| runtime.go.goroutines        | Gauge     | Number of active goroutines |
| runtime.go.gc.count          | Counter   | Garbage collection count    |
| runtime.go.gc.pause_ns       | Histogram | GC pause duration           |
| runtime.go.mem.heap_alloc    | Gauge     | Bytes allocated on heap     |
| runtime.go.mem.heap_sys      | Gauge     | Heap system memory          |
| runtime.go.mem.heap_idle     | Gauge     | Idle heap memory            |
| runtime.go.mem.heap_inuse    | Gauge     | In-use heap memory          |
| runtime.go.mem.heap_released | Gauge     | Released heap memory        |

HTTP API Metrics

Metrics from the HTTP API servers on the orchestrator and edge nodes. Use these to monitor API usage, authentication issues, and validation errors.

Request Metrics

| Metric                         | Type          | Description                          |
|--------------------------------|---------------|--------------------------------------|
| http.server.request.duration   | Histogram     | Duration of HTTP requests in seconds |
| http.server.request.count      | Counter       | Total number of HTTP requests        |
| http.server.active_requests    | UpDownCounter | Currently active requests            |
| http.server.response.body.size | Histogram     | Size of HTTP responses in bytes      |

Labels:

  • http.request.method - HTTP method (GET, POST, etc.)
  • http.route - Matched route pattern (e.g., /api/v1/jobs/:id)
  • http.response.status_code - Response status code
  • error.type - Error type for 4xx/5xx responses (e.g., not_found, internal_server_error)

Auth and Validation Metrics

| Metric                          | Type    | Description                 |
|---------------------------------|---------|-----------------------------|
| http.server.auth.failures       | Counter | Authentication failures     |
| http.server.validation.failures | Counter | Request validation failures |

Authentication failure labels:

  • http.request.method - HTTP method
  • http.route - Matched route pattern
  • auth.failure.reason - Either missing_token or invalid_token

Validation failure labels:

  • http.request.method - HTTP method
  • http.route - Matched route pattern
  • validation.type - Either struct, submission, or custom

Example Queries

# Request latency (p95)
histogram_quantile(0.95, rate(http_server_request_duration_bucket[5m]))

# Request rate by HTTP method
rate(http_server_request_count[5m])

# Authentication failure rate
rate(http_server_auth_failures[5m])

# Error rate (4xx/5xx responses)
sum(rate(http_server_request_count{http_response_status_code=~"4..|5.."}[5m]))
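
To see which failure modes dominate, you can also group by the failure labels listed above (a sketch; Prometheus-style backends typically flatten dotted label names to underscores, e.g. auth.failure.reason becomes auth_failure_reason):

# Authentication failures by reason (missing_token vs invalid_token)
sum by (auth_failure_reason) (rate(http_server_auth_failures[5m]))

# Validation failures by type (struct, submission, custom)
sum by (validation_type) (rate(http_server_validation_failures[5m]))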

Metric Filtering

Edge nodes automatically drop certain high-cardinality metrics to reduce telemetry costs and noise. You can control which metrics get filtered using drop_metric_prefixes.

Default Behavior

Edge nodes drop these metrics by default:

| Prefix        | Metrics Dropped                                               | Reason                                             |
|---------------|---------------------------------------------------------------|----------------------------------------------------|
| db.           | Database client metrics (db.client.operation.duration, etc.)  | High cardinality from operation/table labels       |
| ncl.          | NCL messaging metrics                                         | Internal transport, not needed for edge monitoring |
| ncltransport. | NCL transport metrics                                         | Internal transport, not needed for edge monitoring |

Orchestrators export all metrics by default (no filtering).

Re-enabling Metrics

To export all metrics from an edge node (useful for debugging):

edge-config.yaml
telemetry:
  endpoint: "collector.example.com:4317"
  drop_metric_prefixes: []  # Empty array = keep all metrics

Custom Metric Filtering

Drop additional metrics to further reduce costs:

edge-config.yaml
telemetry:
  endpoint: "collector.example.com:4317"
  drop_metric_prefixes:
    - "db."
    - "ncl."
    - "ncltransport."
    - "store_gc."    # Drop GC cleanup metrics
    - "go-runtime."  # Drop Go runtime metrics

Available Metric Prefixes

| Prefix        | Description                                     | Component    |
|---------------|-------------------------------------------------|--------------|
| process.      | Process metrics (CPU, memory, file descriptors) | All          |
| go-runtime.   | Go runtime metrics (GC, goroutines)             | All (opt-in) |
| db.           | Database client metrics                         | All          |
| store_gc.     | Store garbage collection                        | All          |
| pipeline.     | Pipeline orchestration                          | Edge         |
| ncl.          | NCL messaging                                   | All          |
| ncltransport. | NCL transport                                   | All          |
| http.server.  | HTTP server metrics                             | Orchestrator |
| evaluation.   | Evaluation metrics                              | Orchestrator |
| scheduler.    | Scheduler metrics                               | Orchestrator |

Cost Optimization

If you're sending metrics to a hosted service like Grafana Cloud or Datadog, filtering unused metrics at the source can significantly reduce your telemetry costs. Start with the defaults and only re-enable metrics you actually need.
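If you prefer to keep the edge defaults and trim further downstream, the collector can also drop metrics before they reach the hosted backend. Here is a minimal sketch using the filter processor from opentelemetry-collector-contrib; the processor name and regex patterns are illustrative, and the exact syntax varies by collector version:

processors:
  filter/drop-noisy:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          # Illustrative patterns; match the prefixes you want to drop
          - "db\\..*"
          - "ncl.*"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [filter/drop-noisy]
      exporters: [prometheusremotewrite]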

Configuration Reference

Complete Telemetry Config

edge-config.yaml
telemetry:
  # Disable all telemetry (default: false)
  do_not_track: false

  # Collector endpoint (required)
  endpoint: "collector.example.com:4317"

  # Optional path under endpoint (e.g., "/v1/metrics")
  endpoint_path: ""

  # Protocol: "grpc" (recommended, port 4317) or "http" (port 4318)
  protocol: grpc

  # Skip TLS verification (NOT recommended for production)
  insecure: false

  # How often to export metrics (default: 30s)
  export_interval: 30s

  # Custom headers for authentication
  headers:
    Authorization: "Bearer your-api-token"
    X-Custom-Header: "value"

  # Resource attributes (tags/labels applied to all metrics)
  resource_attributes:
    service.name: "expanso-edge"
    service.version: "1.0.0"
    deployment.environment: "production"
    cloud.region: "us-west-2"
    cloud.availability_zone: "us-west-2a"
    host.name: "${HOSTNAME}"

  # Include Go runtime metrics (default: false)
  include_go_metrics: true

  # Process metrics collection interval (default: 15s)
  process_metrics_interval: 15s

  # Metric prefixes to drop (default: [] for orchestrator, ["db.", "ncl.", "ncltransport."] for edge)
  drop_metric_prefixes:
    - "db."
    - "ncl."
    - "ncltransport."

  # Alternative authentication config
  authentication:
    type: "Bearer"  # or "Basic"
    token: "your-bearer-token"
    namespace: "production"
    tenant: "acme-corp"

Authentication

Method 1: Headers (recommended)

telemetry:
  endpoint: "collector.example.com:4317"
  protocol: grpc
  headers:
    Authorization: "Bearer ${OTEL_TOKEN}"

Method 2: Authentication Config

telemetry:
  endpoint: "collector.example.com:4317"
  protocol: grpc
  authentication:
    type: "Bearer"
    token: "${OTEL_TOKEN}"
    namespace: "production"

Resource Attributes

Resource attributes are key-value pairs that identify the source of telemetry data. They're attached to every metric, trace, and log exported from your edge nodes, making it easy to filter and group data in your monitoring backend.

telemetry:
  resource_attributes:
    service.name: "expanso-edge"           # Identifies the service
    deployment.environment: "production"   # Environment (dev/staging/prod)
    cloud.region: "us-west-2"              # Geographic location
    host.name: "${HOSTNAME}"               # Uses environment variable

Common attributes:

| Attribute                | Description            | Example                  |
|--------------------------|------------------------|--------------------------|
| service.name             | Identifies the service | expanso-edge             |
| service.version          | Application version    | 1.2.0                    |
| deployment.environment   | Deployment environment | production, staging      |
| cloud.region             | Cloud region           | us-west-2, eu-central-1  |
| cloud.availability_zone  | Availability zone      | us-west-2a               |
| host.name                | Hostname               | edge-node-01             |

Tips:

  • Use consistent naming across your fleet for effective filtering
  • Avoid high-cardinality values (like UUIDs) in attributes - they can bloat your metrics database
  • Environment variables (${VAR}) are expanded at runtime
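
For instance, prefer stable, human-meaningful attributes over per-instance unique values (an illustrative sketch; the node.uuid key is hypothetical):

# Avoid: a unique value per node bloats cardinality
# resource_attributes:
#   node.uuid: "3f8a9c2e-unique-per-node"

# Prefer: stable, low-cardinality attributes
resource_attributes:
  service.name: "expanso-edge"
  deployment.environment: "production"
  host.name: "${HOSTNAME}"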

For the full list of semantic conventions, see the OpenTelemetry Semantic Conventions.


Monitoring Backend Setup

Prometheus + Grafana

OpenTelemetry Collector Config:

otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"

exporters:
  prometheusremotewrite:
    endpoint: "http://prometheus:9090/api/v1/write"
    external_labels:
      cluster: "edge-fleet-1"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]

Prometheus Config:

prometheus.yml
global:
  scrape_interval: 15s

# Enable remote write receiver
# Start with: --web.enable-remote-write-receiver
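
For example, a local Prometheus can be started with the remote write receiver enabled (a sketch; adjust the config path to your setup):

prometheus \
  --config.file=prometheus.yml \
  --web.enable-remote-write-receiver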

Grafana Dashboard Queries:

# Pipeline readiness (p95)
histogram_quantile(0.95,
  rate(pipeline_readiness_duration_bucket{service_name="expanso-edge"}[5m])
)

# Memory usage per node
process_memory_usage{service_name="expanso-edge"} / 1024 / 1024

# Pipeline error rate
rate(pipeline_orchestration_errors_total[5m])

# CPU usage percentage
rate(process_cpu_time{service_name="expanso-edge"}[5m]) * 100

# Goroutines (if Go metrics enabled)
runtime_go_goroutines{service_name="expanso-edge"}

Grafana Cloud

otel-collector-config.yaml
exporters:
  prometheusremotewrite:
    endpoint: "https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push"
    headers:
      Authorization: "Bearer ${GRAFANA_CLOUD_API_KEY}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheusremotewrite]

Datadog

otel-collector-config.yaml
exporters:
  datadog:
    api:
      key: "${DD_API_KEY}"
      site: datadoghq.com

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [datadog]

Elastic (ELK Stack)

otel-collector-config.yaml
exporters:
  otlp/elastic:
    endpoint: "https://elastic-apm-server:8200"
    headers:
      Authorization: "Bearer ${ELASTIC_APM_TOKEN}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp/elastic]

New Relic

otel-collector-config.yaml
exporters:
  otlp/newrelic:
    endpoint: "https://otlp.nr-data.net:4317"
    headers:
      api-key: "${NEW_RELIC_LICENSE_KEY}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp/newrelic]

Honeycomb

otel-collector-config.yaml
exporters:
  otlp/honeycomb:
    endpoint: "api.honeycomb.io:443"
    headers:
      x-honeycomb-team: "${HONEYCOMB_API_KEY}"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp/honeycomb]

Docker Compose Example

Complete monitoring stack with edge node, collector, Prometheus, and Grafana:

docker-compose.yml
version: '3.8'

services:
  expanso-edge:
    image: ghcr.io/expanso-io/expanso-edge:latest
    environment:
      - EXPANSO_EDGE_NAME=edge-docker-1
      - HOSTNAME=edge-docker-1
    volumes:
      - ./edge-config.yaml:/etc/expanso/config.yaml
      - edge-data:/var/lib/expanso-edge
    depends_on:
      - otel-collector

  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml

  prometheus:
    image: prom/prometheus:latest
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-remote-write-receiver'
    ports:
      - "9090:9090"
    volumes:
      - prometheus-data:/prometheus

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  edge-data:
  prometheus-data:
  grafana-data:

Troubleshooting

No Metrics Appearing

  1. Enable debug logging on edge node:

     edge-config.yaml
     log:
       level: debug
       format: json

     Check logs for telemetry export attempts:

     journalctl -u expanso-edge | grep -i telemetry
     # or
     docker logs expanso-edge | grep -i telemetry

  2. Verify collector is receiving:

     docker logs otel-collector
     # Look for: "OTLP receiver started"

  3. Test connectivity:

     # From edge node, verify collector is reachable
     telnet collector.example.com 4317

Authentication Errors

Look for authentication failures in logs:

"error": "failed to export metrics: rpc error: code = Unauthenticated"

Verify:

  • Token/API key is correct
  • Headers are properly formatted
  • Authentication type matches collector config
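
If the collector also exposes the OTLP/HTTP port (4318), you can sanity-check the token with curl (a sketch; an empty request is not a real export, but an immediate 401/403 points at the token or header):

# Quick auth check against the collector's OTLP/HTTP endpoint
curl -i \
  -H "Authorization: Bearer ${OTEL_TOKEN}" \
  -H "Content-Type: application/x-protobuf" \
  -X POST https://collector.example.com:4318/v1/metrics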

Metrics Have Wrong Names

OTLP may transform metric names. Check your monitoring backend's OTLP documentation for name transformations.
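
For example, with the Prometheus remote-write path used in this guide, dotted OTLP names are flattened to underscores, counters gain a _total suffix, and histograms are split into _bucket/_sum/_count series, which is why the queries above look the way they do:

# OTLP metric name                  Name seen in Prometheus
pipeline.readiness.duration     ->  pipeline_readiness_duration_bucket / _sum / _count
pipeline.orchestration.errors   ->  pipeline_orchestration_errors_total
process.memory.usage            ->  process_memory_usage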

High Memory Usage

If Go metrics are enabled and memory usage is high:

telemetry:
  include_go_metrics: false       # Disable Go runtime metrics
  process_metrics_interval: 30s   # Reduce collection frequency

Security Best Practices

  1. Always use TLS in production:

    telemetry:
      endpoint: "collector.example.com:4317"
      insecure: false # Verify TLS certificates
  2. Use authentication:

    telemetry:
      headers:
        Authorization: "Bearer ${OTEL_TOKEN}"
  3. Network isolation:

    • Keep collector on private network
    • Use firewall rules to restrict access to port 4317/4318
  4. Rotate credentials regularly:

    • Use environment variables for tokens
    • Implement token rotation policy
  5. Limit resource attributes:

    • Don't include sensitive data in attributes
    • Keep cardinality reasonable

Performance Considerations

Export Interval

telemetry:
  # Lower = more frequent updates, higher overhead
  # Higher = less overhead, delayed metrics
  export_interval: 30s # Good default

  # High-frequency (more overhead):
  # export_interval: 10s

  # Low-frequency (less overhead):
  # export_interval: 60s

Process Metrics Interval

telemetry:
  # How often to collect process metrics
  process_metrics_interval: 15s # Default

  # Reduce overhead with less frequent collection:
  # process_metrics_interval: 30s

Go Metrics

telemetry:
  # Disable if not needed (reduces overhead):
  include_go_metrics: false

  # Or enable for debugging memory/GC issues:
  # include_go_metrics: true

Next Steps