# Offline-Resilient Configuration
Handle intermittent network connectivity in edge environments with local buffering and automatic retry.
## Pipeline

```yaml
input:
  subprocess:
    name: oc
    args: [logs, --all-containers, --prefix, --follow, --all-namespaces]
    codec: lines
    restart_on_exit: true

pipeline:
  processors:
    - mapping: |
        root = this
        root.cluster = env("CLUSTER_NAME")
        root.timestamp = now()

# Buffer for offline periods
buffer:
  system_window:
    timestamp_mapping: 'root = this.timestamp'
    size: 1h

output:
  retry:
    max_retries: 10
    backoff:
      initial_interval: 30s
      max_interval: 10m
    output:
      aws_s3:
        bucket: sno-logs
        path: 'logs/${! env("CLUSTER_NAME") }/${! timestamp_unix() }.jsonl'
        batching:
          count: 5000
          period: 5m
```
## What This Does

- Local buffering: Queues up to 1 hour of logs in memory during network outages
- Automatic retry: Retries failed S3 writes up to 10 times
- Exponential backoff: Starts with a 30s delay and grows to a 10m maximum
- Seamless recovery: Automatically catches up when connectivity returns
- No data loss for short outages: Logs are not dropped as long as the outage fits within the buffer window
## Buffer Configuration

Time-based window:

```yaml
buffer:
  system_window:
    timestamp_mapping: 'root = this.timestamp'
    size: 1h # Buffer 1 hour of data
```

Size-based limit (the `system_window` size is a duration; to cap on bytes instead, use the `memory` buffer):

```yaml
buffer:
  memory:
    limit: 104857600 # Buffer up to ~100MB of data
```
Adjust based on:
- Expected outage duration
- Log volume
- Available memory
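
For example, here is a sizing sketch for a site that expects longer outages; the outage length, log rate, and log size below are illustrative numbers, not measurements from this setup:

```yaml
# Illustrative sizing: ~4h expected outage at ~2,000 logs/min and ~500 bytes/log
# => 4 * 60 * 2,000 * 500 bytes ≈ 240MB of memory headroom for the buffer
buffer:
  system_window:
    timestamp_mapping: 'root = this.timestamp'
    size: 4h
```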
## Retry Strategy

Current settings:

- Max retries: 10
- Initial interval: 30s
- Max interval: 10m

Retry schedule:

1. 30s
2. 1m
3. 2m
4. 4m
5. 8m
6. 10m (capped at max)
7-10. 10m each
Total retry time: ~60 minutes before giving up
## Customization Examples

Shorter retry window (for frequent, short outages):

```yaml
output:
  retry:
    max_retries: 5
    backoff:
      initial_interval: 10s
      max_interval: 2m
```

Longer retry window (for extended outages):

```yaml
output:
  retry:
    max_retries: 20
    backoff:
      initial_interval: 1m
      max_interval: 30m
```
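
Capping by total elapsed time instead of attempt count is another option. As a sketch, the retry output's backoff block also accepts a `max_elapsed_time` budget (verify the field and its semantics against the retry output reference for your version); setting `max_retries` to 0 removes the discrete attempt limit so the time budget governs:

```yaml
output:
  retry:
    max_retries: 0            # no fixed attempt limit; bounded by the time budget below
    backoff:
      initial_interval: 30s
      max_interval: 10m
      max_elapsed_time: 2h    # give up after two hours of failed attempts
```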
## Memory Considerations

Buffer memory usage:

- 1h of logs at 1,000 logs/min = ~60,000 logs
- Average log size of 500 bytes = ~30MB of memory

For resource-constrained SNO:

```yaml
buffer:
  system_window:
    timestamp_mapping: 'root = this.timestamp'
    size: 30m # Smaller buffer for limited memory
```
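
A related knob on constrained nodes is the output batch size: smaller, more frequent batches lower peak memory at the cost of more S3 requests. The numbers below are illustrative:

```yaml
batching:
  count: 1000 # Flush after 1,000 messages instead of 5,000
  period: 1m  # Or after one minute, whichever comes first
```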
## Network Failure Scenarios
Scenario 1: Short outage (5 minutes)
- Logs buffer locally
- S3 write fails, retry after 30s
- Connection restored, retry succeeds
- All logs delivered
Scenario 2: Extended outage (2 hours)
- Logs buffer for 1 hour (buffer limit)
- Older logs are dropped to make room for new ones
- Connection restored after 2 hours
- Last hour of logs delivered
Scenario 3: Persistent failure
- All retries exhausted after ~60 minutes
- Logs are dropped
- Error logged for monitoring
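
If dropping logs after the retries are exhausted is unacceptable, one option is to wrap the retry output in a `fallback` output that spills to local disk as a last resort. This is a sketch, not part of the configuration above, and the spool path is hypothetical; a separate step is needed to re-ship the spooled files once connectivity is stable:

```yaml
output:
  fallback:
    - retry:
        max_retries: 10
        backoff:
          initial_interval: 30s
          max_interval: 10m
        output:
          aws_s3:
            bucket: sno-logs
            path: 'logs/${! env("CLUSTER_NAME") }/${! timestamp_unix() }.jsonl'
    # Last resort: spool to local disk for manual or scheduled re-shipping
    - file:
        path: '/var/lib/log-spool/${! timestamp_unix() }.jsonl'
        codec: lines
```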
## Monitoring Buffer Health

Add metrics to track buffer usage:

```yaml
pipeline:
  processors:
    - mapping: |
        root = this
        root.cluster = env("CLUSTER_NAME")
        root.timestamp = now()
        root.buffer_size = metadata("buffer_size").or(0)
```
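
Beyond stamping fields on each record, the pipeline's own metrics are usually the more dependable signal for delivery health. Assuming a Benthos/Redpanda Connect-style runtime, a Prometheus endpoint can be enabled roughly as below and scraped for output error and retry counters (exact metric names vary by version):

```yaml
metrics:
  prometheus: {}

http:
  enabled: true
  address: 0.0.0.0:4195 # Metrics are served under /metrics on this listener
```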
## Best Practices

- Right-size buffer: Match to expected outage duration and available memory
- Monitor retries: Track failed deliveries to identify persistent network issues
- Combine with batching: Larger batches reduce network overhead when connectivity returns
- Test offline behavior: Simulate network outages to verify recovery
## Next Steps
- Best Practices: Additional resilience recommendations
- Collect Logs: Add buffering to basic log collection
- Buffer Component: Component reference