Offline-Resilient Configuration

Handle intermittent network connectivity in edge environments with local buffering and automatic retry.

Pipeline

input:
  subprocess:
    name: oc
    args: [logs, --all-containers, --prefix, --follow, --all-namespaces]
    codec: lines
    restart_on_exit: true

pipeline:
  processors:
    - mapping: |
        root = this
        root.cluster = env("CLUSTER_NAME")
        root.timestamp = now()

# Buffer for offline periods
buffer:
  system_window:
    timestamp_mapping: 'root = this.timestamp'
    size: 1h

output:
  retry:
    max_retries: 10
    backoff:
      initial_interval: 30s
      max_interval: 10m
    output:
      aws_s3:
        bucket: sno-logs
        path: 'logs/${! env("CLUSTER_NAME") }/${! timestamp_unix() }.jsonl'
        batching:
          count: 5000
          period: 5m

What This Does

  • Local buffering: Queues up to 1 hour of logs in memory during network outages
  • Automatic retry: Retries failed S3 writes up to 10 times
  • Exponential backoff: Starts with 30s delay, increases to 10m maximum
  • Seamless recovery: Automatically catches up when connectivity returns
  • No data loss: Logs are not dropped during temporary outages

Buffer Configuration

Time-based window:

buffer:
  system_window:
    timestamp_mapping: 'root = this.timestamp'
    size: 1h # Buffer 1 hour of data

Size-based buffer (the system_window size field takes a duration, so a byte cap uses the memory buffer instead):

buffer:
  memory:
    limit: 104857600 # Buffer up to 100MB of data

Note that, unlike the windowed buffer, the memory buffer applies backpressure upstream when it fills rather than dropping older entries.

Adjust based on:

  • Expected outage duration
  • Log volume
  • Available memory
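A rough sizing formula: buffer memory ≈ log rate × average log size × window length. The Memory Considerations section below works through an example.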

Retry Strategy

Current settings:

  • Max retries: 10
  • Initial interval: 30s
  • Max interval: 10m

Retry schedule:

  1. 30s
  2. 1m
  3. 2m
  4. 4m
  5. 8m
  6. 10m (max)
  7. 10m
  8. 10m
  9. 10m
  10. 10m

Total retry time: ~65 minutes before giving up
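If you would rather bound the total retry window by elapsed time than by attempt count, the backoff block can also carry a max_elapsed_time cap; a minimal sketch, assuming your Redpanda Connect / Benthos version supports this field:

output:
  retry:
    max_retries: 10
    backoff:
      initial_interval: 30s
      max_interval: 10m
      max_elapsed_time: 1h # stop retrying after one hour in total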

Customization Examples

Shorter retry window (for frequent, short outages):

output:
  retry:
    max_retries: 5
    backoff:
      initial_interval: 10s
      max_interval: 2m
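With these settings the output gives up after roughly four and a half minutes of retrying (10s + 20s + 40s + 80s + 2m).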

Longer retry window (for extended outages):

output:
  retry:
    max_retries: 20
    backoff:
      initial_interval: 1m
      max_interval: 30m
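Here the retry window stretches to roughly eight hours (1m + 2m + 4m + 8m + 16m, then 30m for each of the remaining fifteen attempts).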

Memory Considerations

Buffer memory usage:

  • 1 hour of logs at 1,000 logs/min = ~60,000 logs
  • At an average log size of 500 bytes, that is ~30MB of memory

For resource-constrained SNO:

buffer:
  system_window:
    timestamp_mapping: 'root = this.timestamp'
    size: 30m # Smaller buffer for limited memory
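At the example rate above (1,000 logs/min, 500 bytes per log), a 30-minute window holds roughly 30,000 logs, or about 15MB of memory.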

Network Failure Scenarios

Scenario 1: Short outage (5 minutes)

  • Logs buffer locally
  • S3 write fails, retry after 30s
  • Connection restored, retry succeeds
  • All logs delivered

Scenario 2: Extended outage (2 hours)

  • Logs buffer for 1 hour (buffer limit)
  • Older logs are dropped to make room for new ones
  • Connection restored after 2 hours
  • Last hour of logs delivered

Scenario 3: Persistent failure

  • All retries exhausted after ~60 minutes
  • Logs are dropped
  • Error logged for monitoring
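If dropping data after exhausted retries is unacceptable, one option is to wrap the retry in a fallback output that spills to local disk when S3 stays unreachable. A minimal sketch, assuming a writable local path (/var/log/sno-buffer.jsonl is a hypothetical example):

output:
  fallback:
    # Primary: retry S3 writes as configured above
    - retry:
        max_retries: 10
        backoff:
          initial_interval: 30s
          max_interval: 10m
        output:
          aws_s3:
            bucket: sno-logs
            path: 'logs/${! env("CLUSTER_NAME") }/${! timestamp_unix() }.jsonl'
    # Fallback: write to local disk once the primary gives up
    - file:
        path: /var/log/sno-buffer.jsonl # hypothetical local spill path
        codec: lines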

Monitoring Buffer Health

Add metrics to track buffer usage:

pipeline:
  processors:
    - mapping: |
        root = this
        root.cluster = env("CLUSTER_NAME")
        root.timestamp = now()
        root.buffer_size = metadata("buffer_size").or(0)
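Beyond per-record fields, the pipeline's own metrics (including output send errors and retries) can be exposed for scraping. A minimal sketch, assuming a Prometheus scraper and the default HTTP listener address:

metrics:
  prometheus: {}

http:
  address: 0.0.0.0:4195 # metrics are served at /metrics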

Best Practices

Right-size buffer: Match to expected outage duration and available memory

Monitor retries: Track failed deliveries to identify persistent network issues

Combine with batching: Larger batches reduce network overhead when connectivity returns

Test offline behavior: Simulate network outages to verify recovery

Next Steps