S3 Intelligent Tiering

Route data to different S3 storage classes based on content, reducing storage costs by 60-80% while keeping critical data instantly accessible.

The Problem

Organizations storing logs, events, or telemetry in S3 often put everything into S3 Standard — even data that's rarely accessed. With 5-year retention requirements common for audit and compliance, this adds up fast:

| Storage Class | Cost (per TB/month) | Access Time |
|---|---|---|
| S3 Standard | ~$23.00 | Instant |
| S3 Standard-IA | ~$12.50 | Instant |
| S3 Glacier Instant Retrieval | ~$4.00 | Milliseconds |
| S3 Glacier Deep Archive | ~$1.00 | 12-48 hours |

Storing 100 TB for 5 years in Standard costs ~$138,000. The same data in Glacier Deep Archive costs ~$6,000. The challenge is routing the right data to the right tier — automatically, in real-time.

How Expanso Solves This

Instead of relying on S3 Lifecycle policies (which move data days or weeks after creation), Expanso Edge routes data to the correct storage class at write time. Your pipeline inspects each event and decides immediately: does this go to hot storage or cold archive?

Download & Run

Quick Start:

# Download and run directly
curl -sSL https://docs.expanso.io/examples/s3-intelligent-tiering.yaml | expanso-edge run -

# Or download first, customize, then run
curl -o my-pipeline.yaml https://docs.expanso.io/examples/s3-intelligent-tiering.yaml
expanso-edge run -f my-pipeline.yaml

Download: s3-intelligent-tiering.yaml

Complete Pipeline

This pipeline reads log events from Kafka, classifies them by severity, and routes each to the appropriate S3 storage class:

input:
  kafka:
    addresses:
      - ${KAFKA_BROKERS:localhost:9092}
    topics:
      - application-logs
    consumer_group: expanso-s3-tiering

pipeline:
  processors:
    # Parse the raw log event into structured JSON
    - mapping: |
        root = content().parse_json()

    # Classify severity for tiering
    - mapping: |
        let level = this.level.lowercase().or("info")
        root = this
        root.storage_tier = match $level {
          "error" => "hot",
          "fatal" => "hot",
          "warn" => "warm",
          _ => "cold"
        }

output:
  switch:
    cases:
      # Errors and fatals → S3 Standard (instant access)
      - check: this.storage_tier == "hot"
        output:
          aws_s3:
            bucket: ${S3_BUCKET:my-logs}
            path: hot/${!timestamp_format("2006/01/02/15")}/${!uuid_v4()}.json
            storage_class: STANDARD
            region: ${AWS_REGION:us-east-1}
            batching:
              count: 50
              period: 10s

      # Warnings → S3 Standard-IA (infrequent access)
      - check: this.storage_tier == "warm"
        output:
          aws_s3:
            bucket: ${S3_BUCKET:my-logs}
            path: warm/${!timestamp_format("2006/01/02")}/${!uuid_v4()}.json
            storage_class: STANDARD_IA
            region: ${AWS_REGION:us-east-1}
            batching:
              count: 500
              period: 60s

      # Everything else → Glacier Deep Archive
      - check: this.storage_tier == "cold"
        output:
          aws_s3:
            bucket: ${S3_BUCKET:my-logs}
            path: archive/${!timestamp_format("2006/01/02")}/${!uuid_v4()}.json
            storage_class: DEEP_ARCHIVE
            region: ${AWS_REGION:us-east-1}
            batching:
              count: 2000
              period: 300s

Configuration Breakdown

Classification Logic

- mapping: |
    let level = this.level.lowercase().or("info")
    root = this
    root.storage_tier = match $level {
      "error" => "hot",
      "fatal" => "hot",
      "warn" => "warm",
      _ => "cold"
    }

This is where the cost savings happen. Each event is tagged with a storage_tier based on its severity. You control the rules — severity is just one approach. You could also classify by:

  • Source application (billing logs stay hot, debug logs go cold)
  • Age of data (last 24h hot, everything else archive)
  • Content patterns (events matching a regex stay accessible)
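Severity is only one signal. As a sketch, here is what an age-and-source-based classifier might look like — note that the `service` and `timestamp` field names, and the special-casing of a "billing" service, are assumptions about your event schema, not part of the example above:

```yaml
pipeline:
  processors:
    - mapping: |
        root = this
        # Billing events stay hot regardless of severity (assumed service name).
        # Events less than 24 hours old stay warm; everything older goes cold.
        root.storage_tier = if this.service.or("") == "billing" {
          "hot"
        } else if (now().ts_unix() - this.timestamp.ts_unix().or(0)) < 86400 {
          "warm"
        } else {
          "cold"
        }
```

This mapping slots in place of the severity-based one; the switch output downstream needs no changes, since it only looks at storage_tier.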

Storage Class Routing

The switch output routes each event to the correct S3 destination. Notice how batching differs by tier:

  • Hot (Standard): Small batches, frequent writes — you need this data fast
  • Warm (Standard-IA): Medium batches — balances cost and accessibility
  • Cold (Deep Archive): Large batches, infrequent writes — maximizes cost savings

The storage_class field on the S3 output tells AWS which tier to use at upload time — no lifecycle policies needed.

See: AWS S3 Output for all configuration options.

Batching for Cost Efficiency

Larger batches mean fewer S3 PUT requests and lower costs. Deep Archive data is batched aggressively (2000 events or 5 minutes) since you won't need immediate access:

batching:
  count: 2000   # Batch up to 2000 events
  period: 300s  # Or flush every 5 minutes

See: Batching Guide for advanced batching patterns.

Cost Savings Example

Consider an organization ingesting 10 TB/month of logs with 5-year retention — roughly 100 TB stored within the first year:

| Approach | Monthly Cost | 5-Year Cost |
|---|---|---|
| All S3 Standard | $23/TB × 100 TB = $2,300 | $138,000 |
| Intelligent tiering (80% cold, 15% warm, 5% hot) | ~$225 | ~$13,500 |

That's a ~90% reduction in storage costs — just by routing data to the right tier at write time.

Real-world results vary

Your savings depend on the ratio of hot/warm/cold data. Most organizations find that 70-90% of log data is rarely accessed after the first 24 hours, making this pattern highly effective.
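Before committing to a tier split, you can measure your actual ratio. A minimal sketch using a counter metric labeled by tier — this assumes the classification step has already set storage_tier, and that a metrics exporter is configured for your deployment:

```yaml
pipeline:
  processors:
    # Count events per tier; the resulting ratio tells you how much
    # data would land in each storage class before you route anything.
    - metric:
        type: counter
        name: events_by_tier
        labels:
          tier: ${! json("storage_tier") }
```

Run this alongside your existing pipeline for a few days and compare the hot/warm/cold counts against the cost table above.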

Common Variations

Add Compression Before Archiving

Stack gzip compression with cold storage for maximum savings:

# In the cold storage output
output:
  aws_s3:
    bucket: ${S3_BUCKET:my-logs}
    path: archive/${!timestamp_format("2006/01/02")}/${!uuid_v4()}.json.gz
    storage_class: DEEP_ARCHIVE
    content_encoding: gzip
    region: ${AWS_REGION:us-east-1}
    batching:
      count: 2000
      period: 300s
      processors:
        - compress:
            algorithm: gzip

Dual-Write: Hot Summary + Cold Archive

Keep a summarized version in Standard for dashboards while archiving the full raw data:

output:
  broker:
    outputs:
      # Full raw data → Deep Archive
      - aws_s3:
          bucket: ${S3_BUCKET:my-logs}
          path: raw/${!timestamp_format("2006/01/02")}/${!uuid_v4()}.json
          storage_class: DEEP_ARCHIVE
          region: ${AWS_REGION:us-east-1}
          batching:
            count: 2000
            period: 300s

      # Summarized data → Standard (for dashboards)
      - processors:
          - mapping: |
              root.timestamp = this.timestamp
              root.level = this.level
              root.service = this.service
              root.message = this.message.slice(0, 200)
        aws_s3:
          bucket: ${S3_BUCKET:my-logs}
          path: summary/${!timestamp_format("2006/01/02/15")}/${!uuid_v4()}.json
          storage_class: STANDARD
          region: ${AWS_REGION:us-east-1}
          batching:
            count: 100
            period: 30s

Use Kinesis Instead of Kafka

If your logs already flow through AWS Kinesis (common in multi-account AWS setups), swap the input:

input:
  aws_kinesis:
    streams:
      - arn:aws:kinesis:us-east-1:123456789:stream/application-logs
    dynamodb:
      table: expanso-kinesis-checkpoints

The rest of the pipeline stays the same. See: Kinesis Input

Next Steps