Tail Log Files With Rotation and Restart-Safe Checkpoints

This page collects five end-to-end pipelines that use the tail input to follow local log files. Each pipeline is a working configuration you can drop into a job spec; the surrounding processors and outputs are kept minimal so the focus stays on the tailing configuration.

For the full configuration surface, see the tail input reference.

1. Basic application log tail

The simplest possible deployment: follow every .log file under one directory and forward each line unchanged to stdout. On the first run, only newly appended lines are forwarded, because start_at defaults to end. Every subsequent restart resumes from the persisted checkpoint, so the edge node can be patched, restarted, or rebooted after a crash without losing lines; anything still in flight at that moment is re-delivered under the input's at-least-once contract.

input:
  tail:
    paths: [ /var/log/myapp/*.log ]

pipeline:
  processors: []

output:
  stdout: {}

What you get. One message per appended line. Each message carries file_path and file_name metadata identifying the source file, so a multi-file glob can be routed downstream by source. The checkpoint lives at <dataDir>/executions/<pipelineID>/state/tail/<hash-of-paths>/, anchored to the pipeline so re-deploying under a new pipeline ID starts fresh by design.

On rotation. When logrotate renames app.log → app.log.1 and creates a fresh app.log, the existing file is finished cleanly before its state is dropped, and the new file is picked up at offset 0. Up to three generations of rotated files are tracked concurrently, so a fast-rotating tail (size-based rotation under heavy write) never silently misses lines.

2. Multi-line Java stack traces

Java logs ship every stack trace as a header line plus dozens of indented frame lines. Treating each line as a message scatters a single error across many records and breaks downstream parsing. Use multiline to reassemble them.

input:
  tail:
    paths: [ /var/log/myapp/*.log ]
    multiline:
      line_start_pattern: '^\d{4}-\d{2}-\d{2}'

pipeline:
  processors:
    - mapping: |
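        # A tailed entry is raw text, not JSON, so read it with content() rather than this.
        let line = content().string()
        root.timestamp   = $line.re_find_all_object("^(?P<ts>\\S+ \\S+)").index(0).ts.or("")
        root.level       = $line.re_find_all_object("\\] (?P<lvl>[A-Z]+) ").index(0).lvl.or("INFO")
        root.message     = $line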
        root.source_file = metadata("file_path")

output:
  stdout: {}

What line_start_pattern does. Every line that does not match ^\d{4}-\d{2}-\d{2} (a YYYY-MM-DD timestamp at column zero) is appended to the previous entry. So:

2026-05-22 14:23:11 [main] ERROR Could not connect
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl...
        at java.base/java.net...
2026-05-22 14:23:12 [main] INFO Retry in 5s

becomes two messages: the ERROR entry (the header line plus three continuation lines: the exception line and two stack frames) and the INFO line. A trailing partial entry with no following match is held until the next match arrives, the file rotates, or the input shuts down.
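The grouping rule can be sketched outside the pipeline to test a pattern against sample logs before deploying it. The following Python is a minimal illustration of the behavior described above, not the tail input's implementation; the function name group_by_start is hypothetical.

```python
import re
from typing import Iterable, List


def group_by_start(lines: Iterable[str], start_pattern: str) -> List[str]:
    """Group lines into entries; a line matching start_pattern opens a new entry."""
    start = re.compile(start_pattern)
    entries: List[str] = []
    current: List[str] = []
    for line in lines:
        # A matching line closes the entry collected so far and starts a new one.
        if start.search(line) and current:
            entries.append("\n".join(current))
            current = []
        current.append(line)
    # A live tailer would hold this trailing entry until the next match,
    # a rotation, or shutdown. The input here is finite, so flush it.
    if current:
        entries.append("\n".join(current))
    return entries


sample = [
    "2026-05-22 14:23:11 [main] ERROR Could not connect",
    "java.net.ConnectException: Connection refused",
    "        at sun.nio.ch.SocketChannelImpl...",
    "        at java.base/java.net...",
    "2026-05-22 14:23:12 [main] INFO Retry in 5s",
]
entries = group_by_start(sample, r"^\d{4}-\d{2}-\d{2}")
```

Running a candidate pattern over a real log excerpt this way shows quickly whether the pattern is too loose (entries split mid-trace) or too strict (unrelated lines glued together).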

start_at and partial first entries. The default start_at: end skips existing content, so on the first run the first multiline entry may be partial if the input attaches mid-entry. Set start_at: beginning when capturing the whole file matters more than skipping content that already exists.

3. JSON logs with a text fallback

Many applications emit JSON logs in production but fall back to plain text in development or when something goes wrong before the logger initializes. Parse the JSON when present; preserve the text otherwise.

input:
  tail:
    paths: [ /var/log/myapp/*.log ]

pipeline:
  processors:
    - mapping: |
        # Drop empty / whitespace-only lines (common around log restarts).
        # Done on the raw line first, so JSON records without a "message" field survive.
        root = if content().string().trim() == "" { deleted() }

    - mapping: |
        # Try to parse JSON; fall back to wrapping the raw line as { "message": ... }.
        let raw = content().string()
        let parsed = $raw.parse_json().catch(null)
        root = if $parsed.type() == "object" { $parsed } else { { "message": $raw } }

        # Add tail-supplied identity to every record.
        root.ingested_at = now()
        root.source_file = metadata("file_path")

output:
  stdout: {}

Drop ack semantics. Lines dropped with deleted() in the mapping processor still ack the input, so tail correctly advances its checkpoint past them. Only a true downstream nack (an output error, a mapping processor returning a non-nil error, etc.) holds the watermark. This is the behavior most pipelines want: dropping is not data loss.
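One way to picture the watermark is as the highest byte offset below which every line has been settled, where a dropped line counts as settled just like an acked one. The Python below is a hypothetical model of that idea only; the tail input's actual bookkeeping is not described on this page, and the Watermark class and its method names are assumptions.

```python
from typing import Dict


class Watermark:
    """Toy model: the checkpoint advances only over contiguous settled lines."""

    def __init__(self) -> None:
        self.committed = 0                  # byte offset safe to checkpoint
        self._settled: Dict[int, int] = {}  # start offset -> end offset

    def settle(self, start: int, end: int) -> None:
        """Record that the line covering [start, end) was acked or dropped."""
        self._settled[start] = end
        # Advance across every settled range that begins at the watermark.
        while self.committed in self._settled:
            self.committed = self._settled.pop(self.committed)


wm = Watermark()
wm.settle(25, 40)   # third line acked first: watermark cannot move yet
wm.settle(0, 10)    # first line dropped with deleted(): watermark moves to 10
held = wm.committed
wm.settle(10, 25)   # second line finally acked after a nack and retry
final = wm.committed
```

In this model a nacked line simply never settles, so everything after it stays behind the watermark and is re-read after a restart.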

4. Multiple log sources, routed by file_path

A single edge node often tails several applications at once. Globs and explicit paths can be combined in one paths list; downstream can discriminate by the per-message file_path and file_name metadata.

input:
  tail:
    paths:
      - /var/log/nginx/*.log
      - /var/log/myapp/server.log
      - /var/log/myapp/audit-*.log
    exclude:
      - /var/log/nginx/*.gz             # skip rotated archives

pipeline:
  processors:
    - mapping: |
        root.line        = content().string()   # raw text line; this would require valid JSON
        root.source_file = metadata("file_path")
        root.app = if metadata("file_path").contains("/nginx/") {
          "nginx"
        } else if metadata("file_path").contains("/audit-") {
          "audit"
        } else {
          "server"
        }

output:
  switch:
    cases:
      - check: this.app == "nginx"
        output:
          aws_s3:
            bucket: prod-edge-logs
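            # Nanosecond suffix: each message is written as its own object, so a
            # per-second suffix would let lines from the same file overwrite each other.
            path: nginx/${! @file_name }-${! timestamp_unix_nano() }.jsonl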

      - check: this.app == "audit"
        output:
          file:
            path: /var/lib/expanso/audit-trail.jsonl
            codec: lines

      - check: this.app == "server"
        output:
          kafka:
            addresses: [ "broker:9092" ]
            topic: edge.server.logs

Why one input, not three. A single tail shares one polling loop, one in-memory queue, and one checkpoint key per (sorted) path set — lower memory, fewer goroutines, and a single restart resume point. Use multiple tail inputs only when you need genuinely independent checkpoint identities (different retention policies, separately resettable state).
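To see why a key derived from the sorted path set behaves this way, consider the sketch below. It is an illustration only: the hash function (SHA-256), the newline join, and the 16-character truncation are assumptions chosen for the example, not the input's documented key format, and implicit_key is a hypothetical name.

```python
import hashlib
from typing import Iterable


def implicit_key(paths: Iterable[str]) -> str:
    """Derive a stable key from a path set, ignoring the order paths are listed in."""
    joined = "\n".join(sorted(paths))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


original = implicit_key([
    "/var/log/nginx/*.log",
    "/var/log/myapp/server.log",
])
reordered = implicit_key([
    "/var/log/myapp/server.log",
    "/var/log/nginx/*.log",
])
extended = implicit_key([
    "/var/log/nginx/*.log",
    "/var/log/myapp/server.log",
    "/var/log/myapp/audit-*.log",
])
```

Reordering the list leaves the key unchanged, while adding or removing a path produces a different key and therefore a different checkpoint.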

5. Hardened production setup with explicit checkpoint identity

When the same log set is consumed by long-lived pipelines, set an explicit checkpoint_id so the checkpoint key is stable across config edits. Adding or removing files from paths then no longer invalidates the checkpoint, and you can bump the id deliberately to start fresh.

input:
  tail:
    paths:
      - /var/log/myapp/*.log
      - /var/log/myapp/sidecar-*.log

    # Explicit identity. Path edits keep the same checkpoint; bumping this
    # value (e.g. "myapp-logs-v1" → "myapp-logs-v2") starts fresh.
    checkpoint_id: myapp-logs-v1

    start_at: end
    poll_interval: 200ms
    encoding: utf-8
    max_log_size: 10MiB             # raise for known-large JSON blobs

    multiline:
      line_start_pattern: '^\d{4}-\d{2}-\d{2}'

    # Default: re-deliver nacked batches automatically. Set to false to make
    # a permanent downstream rejection visibly stall the input (no data loss,
    # but the pipeline halts until you fix the rejection).
    auto_replay_nacks: true

pipeline:
  processors:
    - mapping: |
        # Build a deterministic event id so a downstream replay can dedupe.
        # slice() applies to the hex digest only, so a long path cannot crowd out the hash.
        root.event_id = metadata("file_path") + ":" + content().hash("sha256").encode("hex").slice(0, 40)
        root.line     = content().string()
        root.source   = { "file": metadata("file_path"), "name": metadata("file_name") }
        root.ingested_at = now()

    - signature: {}             # sign every line with the edge node's identity

output:
  http_client:
    url: https://ingest.example.com/v1/logs
    verb: POST
    metadata:
      include_patterns:
        - "^expanso_"           # propagate the signature as transport headers

Why each piece matters.

checkpoint_id: myapp-logs-v1 — survives edits to paths. Without it, adding sidecar-*.log would change the implicit key (it's a hash of the sorted paths) and the input would re-tail from start_at.
max_log_size: 10MiB — raised from the 1 MiB default for known-large JSON-blob log lines. Entries larger than this are split at the boundary rather than dropped.
auto_replay_nacks: true — the default; explicit here for visibility. With it off, a permanent downstream rejection (e.g. a malformed message that always fails validation) would stall the checkpoint rather than silently dropping the data — a deliberate trade-off that surfaces in monitoring instead of losing lines.
Deterministic event_id from file_path + content hash — paired with a downstream store keyed by this id, this gives effective deduplication on top of tail's at-least-once delivery. A crash-induced re-delivery of an already-stored line is a no-op.
signature: {} after the mapping step — signs every line with the edge node's Ed25519 identity. Downstream verifies that the line came from the producing edge and was not altered after tail emitted it. See Sign Pipeline Messages.
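The deduplication point can be sketched from the receiving side. The Python below is a minimal model under stated assumptions: an in-memory dictionary stands in for the downstream store, the digest is truncated to 40 hex characters, and the names event_id and DedupStore are hypothetical.

```python
import hashlib
from typing import Dict


def event_id(file_path: str, line: bytes) -> str:
    """Deterministic id: source path plus a truncated SHA-256 of the line content."""
    digest = hashlib.sha256(line).hexdigest()[:40]
    return f"{file_path}:{digest}"


class DedupStore:
    """Stand-in for a downstream store keyed by event id."""

    def __init__(self) -> None:
        self._rows: Dict[str, bytes] = {}

    def put(self, file_path: str, line: bytes) -> bool:
        """Store the line; return False if this id was already stored."""
        key = event_id(file_path, line)
        if key in self._rows:
            return False
        self._rows[key] = line
        return True

    def __len__(self) -> int:
        return len(self._rows)


store = DedupStore()
path = "/var/log/myapp/server.log"
first = store.put(path, b"2026-05-22 14:23:11 [main] ERROR Could not connect")
replayed = store.put(path, b"2026-05-22 14:23:11 [main] ERROR Could not connect")
```

One caveat of a content-derived id: two byte-identical lines in the same file map to the same id, so this approach suits logs whose lines carry a timestamp or sequence number that makes each one unique.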

Common variations

Read each log file from the beginning on first run

The default start_at: end skips existing content. To capture the whole file (e.g. for one-time ingest of an existing log directory):

input:
  tail:
    paths: [ /var/log/myapp/*.log ]
    start_at: beginning

Once a checkpoint exists, it always overrides start_at. Restarting with start_at: beginning still resumes from the last delivered byte — you cannot accidentally re-read a file just by toggling the field.
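The precedence rule is small enough to state as a function. This Python sketch models only the decision described above; the function name resolve_start_offset and its signature are hypothetical, and the real input tracks more state than a single offset.

```python
from typing import Optional


def resolve_start_offset(checkpoint: Optional[int], start_at: str, file_size: int) -> int:
    """Pick the byte offset to start reading from."""
    # A persisted checkpoint always wins, whatever start_at says.
    if checkpoint is not None:
        return checkpoint
    # No checkpoint yet: start_at decides, on the first run only.
    if start_at == "beginning":
        return 0
    if start_at == "end":
        return file_size
    raise ValueError(f"unsupported start_at: {start_at!r}")


first_run_end = resolve_start_offset(None, "end", file_size=4096)
first_run_beginning = resolve_start_offset(None, "beginning", file_size=4096)
after_restart = resolve_start_offset(1500, "beginning", file_size=4096)
```

To deliberately re-read a file, the checkpoint itself has to change, for example by deploying under a new pipeline ID.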

Tune the poll interval for high-throughput logs

The default poll_interval: 200ms suits most workloads. For latency-sensitive feeds (live alerting), drop it to 50ms; on nodes where polling cost adds up (many tail inputs on one node), raise it to 1s:

input:
  tail:
    paths: [ /var/log/critical-app/*.log ]
    poll_interval: 50ms

Read non-UTF-8 files

Legacy Windows logs and some European-locale systems still emit windows-1252 or latin-1. Bytes are decoded to text before line splitting, so a multibyte character that straddles two reads is handled correctly:

input:
  tail:
    paths: [ /var/log/legacy/*.log ]
    encoding: windows-1252

An unsupported encoding name is rejected when the job is submitted, not later when the pipeline starts.
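Decoding before splitting matters because a read can end in the middle of a character. The sketch below uses Python's standard incremental decoders to show the general technique; it is not the tail input's code, and the helper name decode_chunks is hypothetical.

```python
import codecs
from typing import Iterable, Iterator


def decode_chunks(chunks: Iterable[bytes], encoding: str) -> Iterator[str]:
    """Decode a stream of byte chunks, buffering any character split across chunks."""
    codecs.lookup(encoding)  # raises LookupError up front for an unknown name
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # Flush whatever the decoder is still holding at end of input.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# windows-1252 is single-byte: 0x80 is the euro sign.
legacy = "".join(decode_chunks([b"price: \x80", b"5\n"], "cp1252"))

# In UTF-8, "é" is the two bytes 0xC3 0xA9; here they arrive in separate reads.
split = "".join(decode_chunks([b"caf\xc3", b"\xa9\n"], "utf-8"))
```

Splitting on newlines only after this step means a line boundary is never mistaken for part of a character, and vice versa.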

Group with a terminator instead of a header

For log formats where each entry ends with a known marker:

input:
  tail:
    paths: [ /var/log/myapp/transactions.log ]
    multiline:
      line_end_pattern: '^---END---$'

Only one of line_start_pattern or line_end_pattern may be set; the validator rejects configs that set both.
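Terminator-based grouping mirrors header-based grouping: an entry closes when a line matches, instead of opening when one does. The Python below is a minimal sketch of that rule; it assumes the terminator line is kept as the last line of its entry, which this page does not specify, and group_by_end is a hypothetical name.

```python
import re
from typing import Iterable, List, Tuple


def group_by_end(lines: Iterable[str], end_pattern: str) -> Tuple[List[str], List[str]]:
    """Return (complete entries, lines still waiting for a terminator)."""
    end = re.compile(end_pattern)
    entries: List[str] = []
    pending: List[str] = []
    for line in lines:
        pending.append(line)
        # A matching line closes the entry, terminator included (an assumption).
        if end.search(line):
            entries.append("\n".join(pending))
            pending = []
    return entries, pending


complete, waiting = group_by_end(
    [
        "txn 1 begin",
        "debit 10",
        "---END---",
        "txn 2 begin",
    ],
    r"^---END---$",
)
```

Here the first transaction is emitted as one entry, and the opening line of the second is held until its terminator arrives.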

Next Steps

tail input reference — every field, default, validation rule, and the full at-least-once delivery / checkpoint contract.
signature processor — sign tailed log lines so downstream can verify the producing edge.
Bloblang guide — parse, filter, and reshape log lines.
Pipeline error handling — route lines that fail parsing to a dead-letter sink.
Quick Start — how to actually deploy and run these pipelines on an edge node.
