tail

Follows local files (glob + rotation) and emits one message per line. Unlike the file input — which is line-streaming but does not survive file rotation or process restarts safely — tail uses content fingerprinting to track file identity across rename and inode reuse, and persists an ack-gated checkpoint so a restarted edge resumes from the last delivered byte. Delivery is at-least-once: the checkpoint never advances past undelivered data, so a crash re-reads only the un-acked tail.

input:
  tail:
    paths: [ /var/log/myapp/*.log ]
    start_at: end
    multiline:
      line_start_pattern: '^\d{4}-\d{2}-\d{2}'

The tail input is the right choice for production log tailing on the edge: it handles rotation (logrotate, k8s log rotation, daily-named files), follows files across renames, groups multi-line stack traces or events into single messages, and recovers cleanly from edge restarts with no silent data loss.

When to Use

Use the tail input when you need to:

  • Tail application or system log files that rotate or roll over by date.
  • Survive edge restarts without data loss — tail resumes from the persisted checkpoint, not from the current end-of-file.
  • Group multi-line entries — Java stack traces, JSON-on-multiple-lines, ASCII-art banners — into single messages with multiline.
  • Read files in non-UTF-8 encodings: latin-1, utf-16, and other text encodings are supported via encoding.
  • Discover new files dropped into a directory at runtime — globs are re-evaluated on every poll_interval.

Don't use this if:

  • You need to read completed files once (load-then-finish), rather than continuously following — use the file input with scanner: lines.
  • The files live on S3 / GCS / Azure — use the cloud-storage inputs (aws_s3, gcp_cloud_storage).
  • You need exactly-once delivery — tail is at-least-once. A crash between the downstream ack and the checkpoint flush will re-deliver the in-flight batch. Plan for downstream idempotence (deduplication key, signed events, an append-only store with a unique constraint).

Common Patterns

Tail Application Logs

Watch for new lines continuously, skipping anything written before the input started:

input:
  tail:
    paths: [ /var/log/myapp/*.log ]

The default start_at: end means existing content is skipped on first run; only newly appended lines are forwarded. On every subsequent restart, the persisted checkpoint takes over and reading resumes from the last delivered byte regardless of start_at.
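
The precedence between a stored checkpoint and start_at can be sketched in a few lines. This is an illustration only, not the input's actual code; the function name and its arguments are hypothetical.

```python
from typing import Optional


def initial_offset(checkpoint: Optional[int], start_at: str, file_size: int) -> int:
    """Pick the byte offset at which to start reading a file (illustrative)."""
    if checkpoint is not None:
        # A persisted checkpoint always takes precedence over start_at.
        return checkpoint
    if start_at == "beginning":
        return 0
    if start_at == "end":
        # Skip everything already in the file; only new appends are read.
        return file_size
    raise ValueError(f"unsupported start_at value: {start_at!r}")
```

With no checkpoint, start_at decides; once a checkpoint exists, the same call returns the checkpointed offset whatever start_at says.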

Group Java Stack Traces

Match the first line of each entry with a timestamp regex:

input:
  tail:
    paths: [ /var/log/myapp/*.log ]
    start_at: beginning
    multiline:
      line_start_pattern: '^\d{4}-\d{2}-\d{2}'

Every line that does not match the pattern is appended to the previous entry. A 50-line stack trace following a timestamped header becomes one message.

Read Multiple Log Sources With One Input

Globs and explicit paths can be mixed; matched files are deduplicated:

input:
  tail:
    paths:
      - /var/log/nginx/*.log
      - /var/log/myapp/server.log
      - /var/log/myapp/audit-*.log
    exclude:
      - /var/log/nginx/*.gz
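
As a rough mental model of what happens on each poll, the Python sketch below expands the patterns, drops excluded files, and deduplicates the result. It is not the input's implementation: resolve_paths is a hypothetical helper, and Python's fnmatch lets * match path separators, which may differ from the glob dialect tail uses.

```python
import fnmatch
import glob
import os


def resolve_paths(paths, exclude=()):
    """Expand glob patterns into a sorted, deduplicated list of files."""
    matched = set()
    for pattern in paths:
        for candidate in glob.glob(pattern):
            if not os.path.isfile(candidate):
                continue  # skip directories matched by a broad glob
            if any(fnmatch.fnmatch(candidate, ex) for ex in exclude):
                continue  # excluded, e.g. compressed rotated archives
            # A set removes files matched by more than one pattern.
            matched.add(os.path.abspath(candidate))
    return sorted(matched)
```

Calling this again on the next poll picks up files that appeared in the meantime, which mirrors how globs are re-evaluated every poll_interval.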

Configuration

# Common config fields, showing default values.
input:
  tail:
    paths: []                    # required: glob patterns or absolute paths
    start_at: end                # "beginning" | "end"
    multiline:                   # optional
      line_start_pattern: ""     # set EITHER start OR end, not both
      line_end_pattern: ""

Fields

| Field | Type | Default | Description |
|---|---|---|---|
| paths | list<string> | | Required. One or more glob patterns or absolute paths of files to tail (e.g. /var/log/app/*.log, /srv/data/audit.log). At least one path must be supplied. |
| exclude | list<string> | [] | Glob patterns to skip. Useful for excluding rotated archives like *.gz from a directory glob. |
| start_at | enum | end | Where to begin reading when no checkpoint exists for a file. beginning reads the file from the start; end skips existing content and reads only new lines appended after the input starts. Once a checkpoint exists, it always takes precedence over start_at. |
| poll_interval | duration | 200ms | How often to check for new data and new/rotated files. Smaller values reduce ingest latency at the cost of CPU; values from 50ms to 1s are typical. |
| multiline.line_start_pattern | string | "" | Regex that matches the first line of each logical log entry. Lines that do not match are appended to the previous entry. Mutually exclusive with line_end_pattern. |
| multiline.line_end_pattern | string | "" | Regex that matches the last line of each logical log entry. Lines after a match start a new entry. Mutually exclusive with line_start_pattern. |
| encoding | string | utf-8 | Character encoding of the source files. Accepts common encodings (utf-8, utf-16, latin-1, windows-1252, etc.). An unsupported value fails at parse time, not at runtime. |
| max_log_size | byte size | 1MiB | Maximum size of a single log entry. Entries exceeding this size are split at the boundary. Accepts 512KiB, 1MiB, 10MiB, etc. |
| checkpoint_id | string | "" | Advanced: explicit checkpoint identity. If empty, the checkpoint key is derived from the sorted, normalized paths, so adding/removing files in the glob does not invalidate the checkpoint. Set this to take manual control (v1 → v2 to start fresh) or when two tail inputs in the same pipeline must share state. See Checkpointing below. |
| auto_replay_nacks | bool | true | When a downstream component nacks a batch, automatically re-deliver it. With this off, a permanently nacked batch intentionally stalls the input's checkpoint (no data is silently dropped). See Nack and at-least-once delivery. |

The submission validator rejects unknown fields, an empty paths list, an invalid encoding value, multiline configs that set both patterns, and max_log_size values that fail to parse — every error is caught at submit time rather than at pipeline start.

Metadata

Every emitted message carries two metadata fields populated from the source file:

| Metadata key | Description |
|---|---|
| file_path | Absolute path of the file the line was read from. |
| file_name | Base name of the file (e.g. server.log). |

Reference them with the metadata("file_path") Bloblang function, or with @file_path / @file_name:

pipeline:
  processors:
    - mapping: |
        root.source_file = metadata("file_path")
        root.app = @file_name.split(".").index(0) # filename → app id

When the pipeline reads from multiple log sources via one tail, these keys are the primary way to discriminate downstream.

Rotation and file identity

tail does not match files by name alone. It computes a content fingerprint from each file's first kilobyte, which gives every distinct file a stable identity. The fingerprint is what survives rotation:

  • Rename rotation (server.log → server.log.1, fresh server.log created): the renamed file keeps its identity and is read to its current end; the new server.log becomes a new tracked file at offset 0.
  • Copy-truncate rotation (logrotate copies then truncates): the truncated file is recognized as a new file (its first kilobyte changed) and is read from the start.
  • Inode reuse (the OS reassigns the inode of a deleted file to a new file with the same name): the fingerprint differs from the prior file, so the new file is read from the start without losing offset state for the old one.

Up to three generations of rotated files are tracked at once, so a file that rotated mid-read (server.log → server.log.1 → server.log.2) is still finished cleanly before its state is dropped. The rotation-handling engine underneath tail is a battle-tested file-follower used widely in industrial log collection; tail wraps it with ack-gated checkpointing and exposes it through the Expanso Edge configuration surface.
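
To make the idea of content-based identity concrete, here is a minimal Python sketch. It assumes the fingerprint is a SHA-256 hash over the first 1024 bytes; the source only states that the fingerprint comes from the first kilobyte, so the hash choice and the fingerprint helper are illustrative assumptions.

```python
import hashlib

FINGERPRINT_BYTES = 1024  # "first kilobyte", per the description above


def fingerprint(path):
    """Identify a file by its leading bytes rather than by name or inode."""
    with open(path, "rb") as f:
        # Hashing is an assumption of this sketch; any stable function of
        # the leading bytes would serve the same purpose.
        return hashlib.sha256(f.read(FINGERPRINT_BYTES)).hexdigest()
```

Because the value depends only on content, renaming server.log to server.log.1 leaves it unchanged, while a fresh server.log with different leading bytes gets a different one.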

Checkpointing

tail persists checkpoint state under the edge data directory:

<dataDir>/executions/<pipelineID>/state/tail/<checkpoint-key>/

The checkpoint key is one of:

  • sanitize(checkpoint_id) when you set the field explicitly. Sanitization makes the key path-safe (no directory traversal).
  • A 128-bit SHA-256 prefix of the sorted, separator-normalized paths when checkpoint_id is empty. Path separators are normalized to / before hashing so a config authored on one OS produces the same key when run on another (a config never silently loses its checkpoint by moving between Linux and Windows agents).
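
The implicit key derivation can be approximated as follows. The source specifies sorting, separator normalization, and a 128-bit SHA-256 prefix; the newline join and hex encoding in this Python sketch are assumptions, so the output will not necessarily match the keys tail writes to disk.

```python
import hashlib


def implicit_checkpoint_key(paths):
    """Derive a stable key from the configured paths (illustrative)."""
    # Normalize Windows separators to "/" and sort, so neither the OS
    # nor the order of entries in the config changes the key.
    normalized = sorted(p.replace("\\", "/") for p in paths)
    digest = hashlib.sha256("\n".join(normalized).encode("utf-8")).digest()
    # Keep the first 16 bytes: a 128-bit prefix of the SHA-256 digest.
    return digest[:16].hex()
```

Reordering the paths list or switching separator style yields the same key, while adding or removing a paths entry yields a different one.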

When state is written

Checkpoints are ack-driven, not time-driven. The flow is:

  1. The polling loop emits new lines as a batch, tagged with an internal monotonic sequence number.
  2. The batch is delivered to the next pipeline stage. The checkpoint state for that batch is held pending in memory.
  3. Downstream confirms delivery with an ack (filtered messages ack normally — they correctly advance the checkpoint).
  4. A background flusher writes the latest in-memory snapshot whose tagged sequence is strictly below the lowest un-acked batch. Older snapshots are discarded.

The result is that the on-disk checkpoint never moves past undelivered data. On a clean shutdown the final poll completes, the queue drains, and the checkpoint reflects everything that was acked. On a crash, the next start re-reads only the un-acked tail.
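
The four steps above can be modeled with a small in-memory tracker. This Python sketch is a simplified illustration, not the input's code: the AckGatedCheckpoint class, its method names, and the shape of the offsets dictionary are all hypothetical.

```python
class AckGatedCheckpoint:
    """Track per-batch snapshots and expose only what is safe to flush."""

    def __init__(self):
        self._next_seq = 0
        self._unacked = set()   # sequence numbers still awaiting an ack
        self._snapshots = {}    # seq -> offsets as of the end of that batch

    def emit(self, offsets):
        """Register a batch; returns its monotonic sequence number."""
        seq = self._next_seq
        self._next_seq += 1
        self._unacked.add(seq)
        self._snapshots[seq] = dict(offsets)
        return seq

    def ack(self, seq):
        self._unacked.discard(seq)

    def flushable(self):
        """Latest snapshot strictly below the lowest un-acked batch, or None."""
        low = min(self._unacked) if self._unacked else self._next_seq
        eligible = [s for s in self._snapshots if s < low]
        if not eligible:
            return None
        snapshot = self._snapshots[max(eligible)]
        for s in eligible:
            del self._snapshots[s]  # older snapshots are discarded
        return snapshot
```

An out-of-order ack does not unlock anything: if batch 1 is acked while batch 0 is still pending, flushable() returns None until batch 0 is acked too.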

Writes are atomic and fsynced

Each checkpoint file is written as:

tmp file → write → fsync (same fd) → rename → fsync(dir)

A power loss at any point in this sequence yields either the old checkpoint (rename not committed) or the new one (rename committed and dir-fsynced), never a half-written file. A genuinely undecodable checkpoint (corrupt blob, mid-write tear caught by a checksum failure) is discarded and the file is re-tailed per start_at; the input does not brick on bad state.
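
For readers unfamiliar with the pattern, the write sequence looks roughly like this in Python on a POSIX system. It is a generic sketch of the technique, not tail's implementation; atomic_write and the temp-file prefix are hypothetical, and the directory fsync step does not apply on Windows.

```python
import os
import tempfile


def atomic_write(path, data):
    """tmp file -> write -> fsync -> rename -> fsync(dir), POSIX only."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on
    # one filesystem and is therefore atomic.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # file contents are durable
        os.replace(tmp, path)      # atomic swap of old for new
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)           # the rename itself is durable
    finally:
        os.close(dir_fd)
```

A reader that opens the target path at any moment sees either the complete previous contents or the complete new contents.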

Sharing or resetting state

  • Share state across config edits. Set checkpoint_id: "logs-v1" explicitly. Adding or removing entries from paths no longer changes the checkpoint, so the input continues from where it left off.
  • Start fresh. Bump the explicit checkpoint_id (e.g. logs-v1 → logs-v2). The new key starts with no state and reads each file from start_at.
  • Migrate to a new pipeline. Checkpoints are anchored under the pipeline ID, so deploying the same tail config under a new pipeline starts fresh by design. To carry state forward, copy the checkpoint directory between the old and new pipeline state paths.

Nack and at-least-once delivery

A message that is nacked (any downstream component reports an error instead of a successful ack) does not advance the checkpoint. The checkpoint never moves past undelivered data — this is the at-least-once contract.

What happens after a nack depends on auto_replay_nacks:

  • true (default). The pipeline runtime redelivers the nacked batch from memory until it acks successfully. The user-visible behavior is: stable backpressure, no data loss, no checkpoint stall.
  • false. A permanently rejected batch intentionally stalls the input's checkpoint rather than silently dropping the data. Use this when you would rather have a stuck pipeline (which surfaces in monitoring) than lose data in the rare case of a permanent downstream rejection. The stall is observable as a non-advancing checkpoint and growing in-memory queue depth.

Consumers should plan for occasional duplicates on restart: a crash between an ack and the checkpoint flush can re-deliver a batch. For deduplication, pair tail with the signature processor and a downstream store keyed by content hash, or with a Bloblang mapping that builds a deterministic event ID from the source file path, the line content, and a timestamp taken from the line itself (an ingest-time timestamp changes on redelivery and would defeat deduplication).
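
The downstream side of that contract might look like the following Python sketch. It assumes the timestamp is taken from the log line; event_id and Deduplicator are hypothetical names, and a real store would persist the seen IDs (for example behind a unique constraint) rather than keep them in memory.

```python
import hashlib


def event_id(file_path, line, timestamp):
    """Build a deterministic ID from path, log timestamp, and content."""
    h = hashlib.sha256()
    for field in (file_path, timestamp, line):
        data = field.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


class Deduplicator:
    """In-memory stand-in for a store with a unique constraint on the ID."""

    def __init__(self):
        self._seen = set()

    def accept(self, eid):
        if eid in self._seen:
            return False  # redelivered after a crash or replay; drop it
        self._seen.add(eid)
        return True
```

One caveat of this scheme: two genuinely distinct lines with identical path, timestamp, and text collapse into one event, so include a byte offset or counter in the ID if your logs can repeat exactly.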

Multiline grouping

Many log formats produce entries that span multiple physical lines — Java stack traces, structured-then-formatted JSON, ASCII-art startup banners. multiline reassembles them into one logical message before the line reaches the pipeline.

Set one of the two patterns (the validator rejects both being set):

  • line_start_pattern is a regex that matches the first line of each entry. Every subsequent non-matching line is appended to the in-flight entry. Best when entries reliably begin with a timestamp or a level prefix.

    multiline:
      line_start_pattern: '^\d{4}-\d{2}-\d{2}' # "2026-05-22 ..." starts an entry

  • line_end_pattern is a regex that matches the last line of each entry. The line following a match begins a new entry. Useful when entries have a known terminator (e.g. a </request> marker) but no consistent prefix.

    multiline:
      line_end_pattern: '^---END---$'

A trailing partial entry (no terminator yet) is held until the next match arrives, the file rotates, or the input shuts down. On first run with start_at: end, the first multi-line entry may be partial until the next entry begins; switch to start_at: beginning if you need to capture the whole file.
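
The two grouping modes behave roughly like the Python generators below. This is a sketch of the technique rather than the input's code: the function names are hypothetical, joining lines with a newline is an assumption, and Python's regex dialect differs slightly from Go's.

```python
import re


def group_by_start(lines, pattern):
    """Start a new entry whenever a line matches the start pattern."""
    start = re.compile(pattern)
    entry = []
    for line in lines:
        if start.search(line) and entry:
            yield "\n".join(entry)  # previous entry is complete
            entry = []
        entry.append(line)
    if entry:
        yield "\n".join(entry)      # flush the trailing entry at shutdown


def group_by_end(lines, pattern):
    """Close the current entry whenever a line matches the end pattern."""
    end = re.compile(pattern)
    entry = []
    for line in lines:
        entry.append(line)
        if end.search(line):
            yield "\n".join(entry)
            entry = []
    if entry:
        yield "\n".join(entry)      # partial entry with no terminator yet
```

The final flush in each function corresponds to the shutdown case described above; in a live tail the trailing entry would otherwise stay buffered until the next match.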

Encoding and large entries

encoding accepts any IANA-registered text encoding name resolvable by Go's golang.org/x/text/encoding/ianaindex: utf-8, utf-16, utf-16be, utf-16le, latin-1 / iso-8859-1, windows-1252, gbk, and the like. Bytes are decoded before lines are split, so multibyte characters that straddle read boundaries are handled correctly. An unsupported value fails at parse time.
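
Why decoding must be stateful across reads is easiest to see with an incremental decoder. The Python sketch below uses the standard codecs module as an analogy; tail itself relies on Go's text encoding packages, and decode_chunks is a hypothetical helper.

```python
import codecs


def decode_chunks(chunks, encoding):
    """Decode a stream of byte chunks whose boundaries may split characters."""
    decoder = codecs.getincrementaldecoder(encoding)()
    # The decoder buffers an incomplete trailing sequence until the
    # next chunk supplies the remaining bytes.
    parts = [decoder.decode(chunk) for chunk in chunks]
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)
```

Decoding each chunk independently would fail or produce replacement characters whenever a read ends in the middle of a multibyte sequence.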

max_log_size caps a single logical entry. An entry longer than this is split at the boundary and the remainder is forwarded as a separate message. Tune this for known-bursty entries: 1MiB is fine for most application logs; raise it (10MiB) when JSON-blob log lines or stack-trace dumps can legitimately exceed a megabyte.
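
The splitting rule can be pictured with this small Python sketch. It assumes the cap is applied to raw bytes and that every remainder chunk is capped the same way; split_entry is a hypothetical helper, not part of tail.

```python
def split_entry(entry, max_log_size):
    """Split one oversized entry into chunks of at most max_log_size bytes."""
    if max_log_size <= 0:
        raise ValueError("max_log_size must be positive")
    chunks = [entry[i:i + max_log_size] for i in range(0, len(entry), max_log_size)]
    return chunks or [entry]  # an empty entry stays a single empty message
```

Each chunk is forwarded as its own message, so a downstream JSON parser will see fragments rather than one valid document; raising max_log_size avoids that for legitimately large entries.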

Related

  • Worked examples: five end-to-end pipelines covering a basic tail, multi-line stack traces, JSON-with-fallback parsing, multi-file routing by source, and a hardened production setup with explicit checkpoint identity.
  • file input — load-then-finish file ingestion; use when you don't need rotation handling or checkpoint resume.
  • signature processor — sign every line on its way out, so downstream consumers can verify the line came from the tailing edge node and was not altered.
  • mapping / Bloblang — extract structured fields from log lines, build deterministic event IDs for downstream deduplication.
  • Pipeline error handling — route lines that fail parsing to a dead-letter sink.

Limitations

  • At-least-once, not exactly-once. A crash between the downstream ack and the checkpoint flush re-delivers the in-flight batch. Plan for downstream idempotence.
  • One writer per checkpoint key. Two tail inputs in the same process pointing at the same checkpoint_id (or, with the implicit key, the same sorted paths) write to the same on-disk state and will race. Give each its own checkpoint_id, or fold them into one input with both globs.
  • Local files only. tail reads from the local filesystem of the edge node. For network shares, mount them locally; for object stores, use the dedicated cloud-storage inputs.
  • No exactly-once tail of remote logs. If you need cryptographic guarantees that every line is delivered exactly once across an edge restart and network partition, combine tail with a downstream content-addressable store and a deduplication processor — neither responsibility belongs to the input itself.