# S3 Intelligent Tiering
Route data to different S3 storage classes based on content, reducing storage costs by 60-80% while keeping critical data instantly accessible.
## The Problem
Organizations storing logs, events, or telemetry in S3 often put everything into S3 Standard — even data that's rarely accessed. With 5-year retention requirements common for audit and compliance, this adds up fast:
| Storage Class | Cost (per TB/month) | Access Time |
|---|---|---|
| S3 Standard | ~$23.00 | Instant |
| S3 Standard-IA | ~$12.50 | Instant |
| S3 Glacier Instant Retrieval | ~$4.00 | Milliseconds |
| S3 Glacier Deep Archive | ~$1.00 | 12-48 hours |
Storing 100 TB for 5 years in Standard costs ~$138,000. The same data in Glacier Deep Archive costs ~$6,000. The challenge is routing the right data to the right tier — automatically, in real time.
## How Expanso Solves This
Instead of relying on S3 Lifecycle policies (which move data days or weeks after creation), Expanso Edge routes data to the correct storage class at write time. Your pipeline inspects each event and decides immediately: does this go to hot storage or cold archive?
## Download & Run

**Quick Start:**

```bash
# Download and run directly
curl -sSL https://docs.expanso.io/examples/s3-intelligent-tiering.yaml | expanso-edge run -

# Or download first, customize, then run
curl -o my-pipeline.yaml https://docs.expanso.io/examples/s3-intelligent-tiering.yaml
expanso-edge run -f my-pipeline.yaml
```
Download: s3-intelligent-tiering.yaml
## Complete Pipeline
This pipeline reads log events from Kafka, classifies them by severity, and routes each to the appropriate S3 storage class:
```yaml
input:
  kafka:
    addresses:
      - ${KAFKA_BROKERS:localhost:9092}
    topics:
      - application-logs
    consumer_group: expanso-s3-tiering

pipeline:
  processors:
    # Parse the log event
    - mapping: |
        root = this.parse_json()

    # Classify severity for tiering
    - mapping: |
        let level = this.level.lowercase().or("info")
        root = this
        root.storage_tier = match $level {
          "error" => "hot",
          "fatal" => "hot",
          "warn" => "warm",
          _ => "cold"
        }

output:
  switch:
    cases:
      # Errors and fatals → S3 Standard (instant access)
      - check: this.storage_tier == "hot"
        output:
          aws_s3:
            bucket: ${S3_BUCKET:my-logs}
            path: hot/${!timestamp_format("2006/01/02/15")}/${!uuid_v4()}.json
            storage_class: STANDARD
            region: ${AWS_REGION:us-east-1}
            batching:
              count: 50
              period: 10s

      # Warnings → S3 Standard-IA (infrequent access)
      - check: this.storage_tier == "warm"
        output:
          aws_s3:
            bucket: ${S3_BUCKET:my-logs}
            path: warm/${!timestamp_format("2006/01/02")}/${!uuid_v4()}.json
            storage_class: STANDARD_IA
            region: ${AWS_REGION:us-east-1}
            batching:
              count: 500
              period: 60s

      # Everything else → Glacier Deep Archive
      - check: this.storage_tier == "cold"
        output:
          aws_s3:
            bucket: ${S3_BUCKET:my-logs}
            path: archive/${!timestamp_format("2006/01/02")}/${!uuid_v4()}.json
            storage_class: DEEP_ARCHIVE
            region: ${AWS_REGION:us-east-1}
            batching:
              count: 2000
              period: 300s
```
## Configuration Breakdown

### Classification Logic
```yaml
- mapping: |
    let level = this.level.lowercase().or("info")
    root = this
    root.storage_tier = match $level {
      "error" => "hot",
      "fatal" => "hot",
      "warn" => "warm",
      _ => "cold"
    }
```
This is where the cost savings happen. Each event is tagged with a `storage_tier` based on its severity. You control the rules — severity is just one approach. You could also classify by:
- Source application (billing logs stay hot, debug logs go cold)
- Age of data (last 24h hot, everything else archive)
- Content patterns (events matching a regex stay accessible)
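As an illustration of the first alternative, here is a hedged sketch of a classification mapping keyed on source application rather than severity. The `service` field and the application names are assumptions for the example, not part of the pipeline above:

```yaml
# Classify by source application instead of severity
# (the `service` field and app names below are illustrative)
- mapping: |
    root = this
    root.storage_tier = match this.service.or("unknown") {
      "billing" => "hot",   # billing logs stay instantly accessible
      "debug"   => "cold",  # debug logs go straight to archive
      _         => "warm"
    }
```

Any field your events carry can drive the match, so the same switch output works unchanged downstream.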
### Storage Class Routing
The `switch` output routes each event to the correct S3 destination. Notice how batching differs by tier:
- Hot (Standard): Small batches, frequent writes — you need this data fast
- Warm (Standard-IA): Medium batches — balances cost and accessibility
- Cold (Deep Archive): Large batches, infrequent writes — maximizes cost savings
The `storage_class` field on the S3 output tells AWS which tier to use at upload time — no lifecycle policies needed.
See: AWS S3 Output for all configuration options.
### Batching for Cost Efficiency
Larger batches mean fewer S3 PUT requests and lower costs. Deep Archive data is batched aggressively (2000 events or 5 minutes) since you won't need immediate access:
```yaml
batching:
  count: 2000   # Batch up to 2000 events
  period: 300s  # Or flush every 5 minutes
```
See: Batching Guide for advanced batching patterns.
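Batch limits can also be expressed in bytes rather than event counts. Assuming the batching policy accepts a `byte_size` field alongside `count` and `period` (as in Benthos-style configs), a sketch:

```yaml
batching:
  count: 2000          # flush at 2000 events,
  byte_size: 5242880   # at ~5 MiB of buffered data,
  period: 300s         # or after 5 minutes, whichever comes first
```

A byte limit keeps individual S3 objects at a predictable size even when event sizes vary widely.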
## Cost Savings Example
Consider an organization holding 100 TB of logs under a 5-year retention requirement:

| Approach | Monthly Cost | 5-Year Cost |
|---|---|---|
| All S3 Standard | $23/TB × 100 TB = $2,300 | ~$138,000 |
| With intelligent tiering (80% cold, 15% warm, 5% hot) | ~$383 | ~$23,000 |

That's roughly an 83% reduction in storage costs — just by routing data to the right tier at write time.
Your savings depend on the ratio of hot/warm/cold data. Most organizations find that 70-90% of log data is rarely accessed after the first 24 hours, making this pattern highly effective.
## Common Variations

### Add Compression Before Archiving
Stack gzip compression with cold storage for maximum savings:
```yaml
# In the cold storage output
output:
  aws_s3:
    bucket: ${S3_BUCKET:my-logs}
    path: archive/${!timestamp_format("2006/01/02")}/${!uuid_v4()}.json.gz
    storage_class: DEEP_ARCHIVE
    content_encoding: gzip
    region: ${AWS_REGION:us-east-1}
    batching:
      count: 2000
      period: 300s
      processors:
        - compress:
            algorithm: gzip
```
### Dual-Write: Hot Summary + Cold Archive
Keep a summarized version in Standard for dashboards while archiving the full raw data:
```yaml
output:
  broker:
    outputs:
      # Full raw data → Deep Archive
      - aws_s3:
          bucket: ${S3_BUCKET:my-logs}
          path: raw/${!timestamp_format("2006/01/02")}/${!uuid_v4()}.json
          storage_class: DEEP_ARCHIVE
          region: ${AWS_REGION:us-east-1}
          batching:
            count: 2000
            period: 300s

      # Summarized data → Standard (for dashboards)
      - processors:
          - mapping: |
              root.timestamp = this.timestamp
              root.level = this.level
              root.service = this.service
              root.message = this.message.slice(0, 200)
        aws_s3:
          bucket: ${S3_BUCKET:my-logs}
          path: summary/${!timestamp_format("2006/01/02/15")}/${!uuid_v4()}.json
          storage_class: STANDARD
          region: ${AWS_REGION:us-east-1}
          batching:
            count: 100
            period: 30s
```
### Use Kinesis Instead of Kafka
If your logs already flow through AWS Kinesis (common in multi-account AWS setups), swap the input:
```yaml
input:
  aws_kinesis:
    streams:
      - arn:aws:kinesis:us-east-1:123456789:stream/application-logs
    dynamodb:
      table: expanso-kinesis-checkpoints
```
The rest of the pipeline stays the same. See: Kinesis Input
## Next Steps
- AWS S3 Output — Full S3 configuration options including `storage_class`
- Kafka to S3 — Simpler Kafka-to-S3 pipeline without tiering
- Batching Guide — Optimize batch sizes for cost and throughput
- Log Processing Use Case — Broader log processing patterns
- AWS Guide — AWS credentials and regional configuration