
K3s Log Collection Best Practices

Configuration recommendations for reliable and efficient K3s log collection.

Always Add Node Identifiers

Include node context in every log to identify the source edge location:

pipeline:
  processors:
    - mapping: |
        root.node_id = env("NODE_ID")
        root.location = env("LOCATION")
        root.cluster = env("CLUSTER_NAME")

Why: Essential for filtering logs by location when managing 100+ edge sites.

Set environment variables:

export NODE_ID="edge-site-42"
export LOCATION="chicago"
export CLUSTER_NAME="k3s-chicago"
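With those variables exported, every record passing through the mapping above carries its site identity. A hypothetical enriched record (the message is illustrative; the identifier values match the exports above):

{"level": "error", "msg": "connection refused", "node_id": "edge-site-42", "location": "chicago", "cluster": "k3s-chicago"}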

Use Batching for Cloud Destinations

Batch logs before sending to S3, Elasticsearch, or HTTP endpoints:

output:
  aws_s3:
    bucket: logs
    batching:
      count: 1000   # Batch size
      period: 1m    # Max wait time

Why: Reduces API calls by up to 1000x (one request per 1,000 logs instead of one per log), lowering costs and improving throughput.

Recommended batch sizes:

  • S3: 1000-5000 logs or 1-5 minutes
  • Elasticsearch: 100-500 logs or 10-30 seconds
  • HTTP: 100-1000 logs or 30-60 seconds
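
For Elasticsearch, for example, a minimal sketch using the mid-range of the recommendation above (this assumes an elasticsearch output that accepts the same batching block as aws_s3; the urls and index values are placeholders):

output:
  elasticsearch:
    urls:
      - https://es.company.com:9200   # Placeholder endpoint
    index: k3s-logs                   # Placeholder index
    batching:
      count: 500
      period: 30s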

Set restart_on_exit: true

Always enable auto-restart for the kubectl subprocess:

input:
  subprocess:
    name: kubectl
    restart_on_exit: true   # Auto-restart if kubectl exits

Why: Ensures logs keep flowing if the kubectl process crashes or exits unexpectedly.
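
A fuller sketch of the input, combining auto-restart with an explicit kubectl command (the namespace flag and label selector are assumptions; point them at whatever you actually tail):

input:
  subprocess:
    name: kubectl
    args:
      - logs
      - --follow
      - --all-containers=true
      - --prefix                 # Prefix each line with pod/container name
      - --namespace=production   # Assumed namespace
      - -l
      - app=my-app               # Hypothetical label selector
    restart_on_exit: true        # Auto-restart if kubectl exits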

Handle Large Log Messages

Set maximum buffer size to prevent memory issues:

input:
  subprocess:
    name: kubectl
    max_buffer: 1048576   # 1MB max per log line

Why: Some applications generate very large log messages (stack traces, JSON payloads). Without a limit, these can cause memory issues.

Recommended sizes:

  • Standard logs: 524288 (512KB)
  • Large logs: 1048576 (1MB)
  • Very large: 2097152 (2MB)

Configure RBAC Permissions

Create a service account with minimal required permissions:

# Create service account
kubectl create serviceaccount expanso-logs

# Create role with log read permissions
kubectl create clusterrole log-reader \
  --verb=get,list,watch \
  --resource=pods,pods/log

# Bind role to service account
kubectl create clusterrolebinding expanso-logs \
  --clusterrole=log-reader \
  --serviceaccount=default:expanso-logs

Why: Follows the principle of least privilege. Expanso only needs read access to logs, not write access to cluster resources.

Use the service account:

kubectl --as=system:serviceaccount:default:expanso-logs logs <pod-name> --follow
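
If you manage RBAC declaratively instead, the same grants as manifests (a sketch mirroring the names and namespace used in the commands above):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: expanso-logs
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: log-reader
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: expanso-logs
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: log-reader
subjects:
  - kind: ServiceAccount
    name: expanso-logs
    namespace: default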

Filter Before Sending

Apply filters early in the pipeline to reduce downstream processing:

pipeline:
  processors:
    # Parse and filter FIRST
    - mapping: |
        root = this.parse_json().catch(deleted())

    # Only keep errors; drop everything else
    - mapping: |
        root = if this.level != "error" { deleted() }

    # Then add metadata (only for logs that pass the filter)
    - mapping: |
        root.node_id = env("NODE_ID")

Why: Filtering early reduces CPU, memory, and network usage for logs that will be discarded anyway.
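
For example, given these two hypothetical records, only the first survives the filter above and gets enriched with node metadata; the second is dropped before any further processing or network cost:

{"level": "error", "msg": "upstream timeout"}
{"level": "info", "msg": "health check passed"}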

Monitor Log Pipeline Health

Add a metrics output to track pipeline performance:

output:
  broker:
    pattern: fan_out
    outputs:
      - aws_s3:
          bucket: logs

      - http_client:
          url: https://metrics.company.com
          verb: POST
        processors:
          - metric:
              type: counter
              name: logs_processed
              labels:
                node_id: ${NODE_ID}

Why: Detect issues like log collection stopping, high error rates, or performance degradation.

Handle High-Volume Namespaces

For namespaces with very high log volume, use separate pipelines:

# High-volume namespace: aggressive filtering
expanso-edge run --config production-errors-only.yaml &

# Low-volume namespaces: collect everything
expanso-edge run --config staging-all-logs.yaml &

Why: Prevents high-volume namespaces from overwhelming the pipeline or hitting rate limits.
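
A sketch of what the high-volume config might contain (the file name matches the command above; the namespace flag, label selector, and error-only filter are assumptions about your setup):

# production-errors-only.yaml
input:
  subprocess:
    name: kubectl
    args:
      - logs
      - --follow
      - --all-containers=true
      - --namespace=production
      - -l
      - app=api              # Hypothetical selector for the noisy workload
    restart_on_exit: true

pipeline:
  processors:
    - mapping: |
        root = this.parse_json().catch(deleted())
        root = if this.level != "error" { deleted() }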

Use Connection Pooling for HTTP Outputs

Configure connection pooling for HTTP destinations:

output:
  http_client:
    url: https://logs.company.com/ingest
    max_in_flight: 64   # Parallel requests
    batching:
      count: 500
      period: 30s

Why: Improves throughput for HTTP-based log ingestion endpoints.

Troubleshooting Tips

Logs not appearing:

# Verify kubectl works
kubectl get pods --all-namespaces

# Check Expanso logs
expanso-edge run --config k3s-logs.yaml --log.level=debug

kubectl process exits:

  • Check restart_on_exit: true is set
  • Verify kubeconfig is valid
  • Check RBAC permissions (one quick check is shown below)
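
One way to confirm the service account can actually read logs (the --subresource flag requires a reasonably recent kubectl):

kubectl auth can-i get pods --subresource=log \
  --as=system:serviceaccount:default:expanso-logs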

High memory usage:

  • Reduce max_buffer size
  • Add filtering to reduce log volume
  • Increase batching period

Performance issues:

  • Increase batch sizes
  • Add filtering earlier in pipeline
  • Use multiple parallel pipelines for different namespaces

Next Steps