S3 Intelligent Tiering

Route data to different S3 storage classes based on content, reducing storage costs by 60-80% while keeping critical data instantly accessible.

The Problem

Organizations storing logs, events, or telemetry in S3 often put everything into S3 Standard — even data that's rarely accessed. With 5-year retention requirements common for audit and compliance, this adds up fast:

| Storage Class | Cost (per TB/month) | Access Time |
|---|---|---|
| S3 Standard | ~$23.00 | Instant |
| S3 Standard-IA | ~$12.50 | Instant |
| S3 Glacier Instant Retrieval | ~$4.00 | Milliseconds |
| S3 Glacier Deep Archive | ~$1.00 | 12-48 hours |

Storing 100 TB for 5 years in Standard costs ~$138,000. The same data in Glacier Deep Archive costs ~$6,000. The challenge is routing the right data to the right tier — automatically, in real-time.

How Expanso Solves This

Instead of relying on S3 Lifecycle policies (which move data days or weeks after creation), Expanso Edge routes data to the correct storage class at write time. Your pipeline inspects each event and decides immediately: does this go to hot storage or cold archive?

Download & Run

Quick Start:

# Download and run directly
curl -sSL https://docs.expanso.io/examples/s3-intelligent-tiering.yaml | expanso-edge run -

# Or download first, customize, then run
curl -o my-pipeline.yaml https://docs.expanso.io/examples/s3-intelligent-tiering.yaml
expanso-edge run -f my-pipeline.yaml

Download: s3-intelligent-tiering.yaml

Complete Pipeline

This pipeline reads log events from Kafka, classifies them by severity, and routes each to the appropriate S3 storage class:

input:
  kafka:
    addresses:
      - ${KAFKA_BROKERS:localhost:9092}
    topics:
      - application-logs
    consumer_group: expanso-s3-tiering

pipeline:
  processors:
    # Parse the raw log event into structured JSON
    - mapping: |
        root = content().parse_json()

    # Classify severity for tiering
    - mapping: |
        let level = this.level.lowercase().or("info")
        root = this
        root.storage_tier = match $level {
          "error" => "hot",
          "fatal" => "hot",
          "warn" => "warm",
          _ => "cold"
        }

output:
  switch:
    cases:
      # Errors and fatals → S3 Standard (instant access)
      - check: this.storage_tier == "hot"
        output:
          aws_s3:
            bucket: ${S3_BUCKET:my-logs}
            path: hot/${!timestamp_format("2006/01/02/15")}/${!uuid_v4()}.json
            storage_class: STANDARD
            region: ${AWS_REGION:us-east-1}
            batching:
              count: 50
              period: 10s

      # Warnings → S3 Standard-IA (infrequent access)
      - check: this.storage_tier == "warm"
        output:
          aws_s3:
            bucket: ${S3_BUCKET:my-logs}
            path: warm/${!timestamp_format("2006/01/02")}/${!uuid_v4()}.json
            storage_class: STANDARD_IA
            region: ${AWS_REGION:us-east-1}
            batching:
              count: 500
              period: 60s

      # Everything else → Glacier Deep Archive
      - check: this.storage_tier == "cold"
        output:
          aws_s3:
            bucket: ${S3_BUCKET:my-logs}
            path: archive/${!timestamp_format("2006/01/02")}/${!uuid_v4()}.json
            storage_class: DEEP_ARCHIVE
            region: ${AWS_REGION:us-east-1}
            batching:
              count: 2000
              period: 300s

Configuration Breakdown

Classification Logic

- mapping: |
    let level = this.level.lowercase().or("info")
    root = this
    root.storage_tier = match $level {
      "error" => "hot",
      "fatal" => "hot",
      "warn" => "warm",
      _ => "cold"
    }

This is where the cost savings happen. Each event is tagged with a storage_tier based on its severity. You control the rules — severity is just one approach. You could also classify by:

  • Source application (billing logs stay hot, debug logs go cold)
  • Age of data (last 24h hot, everything else archive)
  • Content patterns (events matching a regex stay accessible)
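Severity is only one signal. As a sketch, here is what an age-and-source-based classifier might look like — note that the `service` and `timestamp` field names, and the special-casing of a "billing" service, are assumptions about your event schema, not part of the example above:

```yaml
pipeline:
  processors:
    - mapping: |
        root = this
        # Billing events stay hot regardless of severity (assumed service name).
        # Events less than 24 hours old stay warm; everything older goes cold.
        root.storage_tier = if this.service.or("") == "billing" {
          "hot"
        } else if (now().ts_unix() - this.timestamp.ts_unix().or(0)) < 86400 {
          "warm"
        } else {
          "cold"
        }
```

This mapping slots in place of the severity-based one; the switch output downstream needs no changes, since it only looks at storage_tier.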

Storage Class Routing

The switch output routes each event to the correct S3 destination. Notice how batching differs by tier:

  • Hot (Standard): Small batches, frequent writes — you need this data fast
  • Warm (Standard-IA): Medium batches — balances cost and accessibility
  • Cold (Deep Archive): Large batches, infrequent writes — maximizes cost savings

The storage_class field on the S3 output tells AWS which tier to use at upload time — no lifecycle policies needed.

See: AWS S3 Output for all configuration options.

Batching for Cost Efficiency

Larger batches mean fewer S3 PUT requests and lower costs. Deep Archive data is batched aggressively (2000 events or 5 minutes) since you won't need immediate access:

batching:
  count: 2000   # Batch up to 2000 events
  period: 300s  # Or flush every 5 minutes

See: Batching Guide for advanced batching patterns.

Cost Savings Example

Consider an organization ingesting 10 TB/month of logs with 5-year retention — roughly 100 TB stored within the first year:

| Approach | Monthly Cost | 5-Year Cost |
|---|---|---|
| All S3 Standard | $23/TB × 100 TB = $2,300 | $138,000 |
| Intelligent tiering (80% cold, 15% warm, 5% hot) | ~$225 | ~$13,500 |

That's a ~90% reduction in storage costs — just by routing data to the right tier at write time.

Real-world results vary

Your savings depend on the ratio of hot/warm/cold data. Most organizations find that 70-90% of log data is rarely accessed after the first 24 hours, making this pattern highly effective.
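Before committing to a tier split, you can measure your actual ratio. A minimal sketch using a counter metric labeled by tier — this assumes the classification step has already set storage_tier, and that a metrics exporter is configured for your deployment:

```yaml
pipeline:
  processors:
    # Count events per tier; the resulting ratio tells you how much
    # data would land in each storage class before you route anything.
    - metric:
        type: counter
        name: events_by_tier
        labels:
          tier: ${! json("storage_tier") }
```

Run this alongside your existing pipeline for a few days and compare the hot/warm/cold counts against the cost table above.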

Common Variations

Add Compression Before Archiving

Stack gzip compression with cold storage for maximum savings:

# In the cold storage output
output:
  aws_s3:
    bucket: ${S3_BUCKET:my-logs}
    path: archive/${!timestamp_format("2006/01/02")}/${!uuid_v4()}.json.gz
    storage_class: DEEP_ARCHIVE
    content_encoding: gzip
    region: ${AWS_REGION:us-east-1}
    batching:
      count: 2000
      period: 300s
      processors:
        - compress:
            algorithm: gzip

Dual-Write: Hot Summary + Cold Archive

Keep a summarized version in Standard for dashboards while archiving the full raw data:

output:
  broker:
    outputs:
      # Full raw data → Deep Archive
      - aws_s3:
          bucket: ${S3_BUCKET:my-logs}
          path: raw/${!timestamp_format("2006/01/02")}/${!uuid_v4()}.json
          storage_class: DEEP_ARCHIVE
          region: ${AWS_REGION:us-east-1}
          batching:
            count: 2000
            period: 300s

      # Summarized data → Standard (for dashboards)
      - processors:
          - mapping: |
              root.timestamp = this.timestamp
              root.level = this.level
              root.service = this.service
              root.message = this.message.slice(0, 200)
        aws_s3:
          bucket: ${S3_BUCKET:my-logs}
          path: summary/${!timestamp_format("2006/01/02/15")}/${!uuid_v4()}.json
          storage_class: STANDARD
          region: ${AWS_REGION:us-east-1}
          batching:
            count: 100
            period: 30s

Use Kinesis Instead of Kafka

If your logs already flow through AWS Kinesis (common in multi-account AWS setups), swap the input:

input:
  aws_kinesis:
    streams:
      - arn:aws:kinesis:us-east-1:123456789:stream/application-logs
    dynamodb:
      table: expanso-kinesis-checkpoints

The rest of the pipeline stays the same. See: Kinesis Input

Next Steps