file
Consumes data from files on disk, emitting messages according to a chosen codec.
When to Use
Use file input when you need to:
- Read log files from /var/log/ or application directories
- Process CSV/JSON files dropped into a directory
- Tail files like tail -f for continuous streaming
Don't use this if:
- Files are on S3/GCS: use aws_s3 or gcp_cloud_storage
- You're reading from SFTP: use sftp
- You need stdin: use stdin
Common Patterns
Tail Log Files
Watch for new lines continuously:
input:
  file:
    paths: ["/var/log/app/*.log"]
    scanner:
      lines: {}
Process All JSON Files
Read complete files as documents:
input:
  file:
    paths: ["/data/incoming/*.json"]
    scanner:
      json_documents: {}
CSV Processing
Parse CSV with headers:
input:
  file:
    paths: ["/data/*.csv"]
    scanner:
      csv: {}
# Common config fields, showing default values
input:
  label: ""
  file:
    paths: [] # No default (required)
    scanner:
      lines: {}
    auto_replay_nacks: true
# All config fields, showing default values
input:
  label: ""
  file:
    paths: [] # No default (required)
    scanner:
      lines: {}
    delete_on_finish: false
    auto_replay_nacks: true
Metadata
This input adds the following metadata fields to each message:
- path
- mod_time_unix
- mod_time (RFC3339)
You can access these metadata fields using function interpolation.
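As a minimal sketch of what that interpolation can look like, the config below logs the source path and modification time of each message. It assumes a log processor and the meta interpolation function are available in your build (function names can differ between versions), and the log text is purely illustrative.

```yaml
input:
  file:
    paths: ["/var/log/app/*.log"]
    scanner:
      lines: {}

pipeline:
  processors:
    # Assumption: the log processor and the meta() interpolation function
    # exist in your version. The message text is illustrative only.
    - log:
        level: INFO
        message: 'Read a line from ${! meta("path") } (modified ${! meta("mod_time") })'
```

The same interpolation syntax should work in any config field that supports interpolation, for example an output path derived from the input file name.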
Fields
paths
A list of paths to consume sequentially. Glob patterns are supported, including super globs (double star).
Type: array
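For illustration, a super glob might be used as follows. The directory layout is hypothetical; the intent is to pick up .log files nested at any depth under /var/log/app, alongside one explicitly named file.

```yaml
input:
  file:
    paths:
      # Hypothetical layout: *.log files at any depth below /var/log/app
      - "/var/log/app/**/*.log"
      # Plain paths can be listed next to glob patterns
      - "/var/log/syslog"
    scanner:
      lines: {}
```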
scanner
The scanner by which the stream of bytes consumed will be broken out into individual messages. Scanners are useful for processing large sources of data without holding the entirety of it within memory. For example, the csv scanner allows you to process individual CSV rows without loading the entire CSV file in memory at once.
Type: scanner
Default: {"lines":{}}
delete_on_finish
Whether to delete input files from the disk once they are fully consumed.
Type: bool
Default: false
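A sketch of a drop-directory setup that cleans up after itself, assuming the paths shown are placeholders for your own:

```yaml
input:
  file:
    # Placeholder path: files dropped here are read as JSON documents
    paths: ["/data/incoming/*.json"]
    scanner:
      json_documents: {}
    # Remove each file from disk once it has been fully consumed
    delete_on_finish: true
```

Because deletion is irreversible, it is worth trying a config like this against copies of the data first.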
auto_replay_nacks
Whether messages that are rejected (nacked) at the output level should be automatically replayed indefinitely, eventually resulting in back pressure if the cause of the rejections is persistent. If set to false, these messages are deleted instead. Disabling auto replays can greatly improve the memory efficiency of high-throughput streams, as the original shape of the data can be discarded immediately upon consumption and mutation.
Type: bool
Default: true
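A sketch of disabling replays for a high-throughput stream where occasional loss is acceptable (the path is a placeholder):

```yaml
input:
  file:
    paths: ["/var/log/app/*.log"]
    scanner:
      lines: {}
    # Rejected (nacked) messages are deleted rather than replayed
    auto_replay_nacks: false
```

The trade-off is lower memory use in exchange for losing any message the output rejects, so this setting suits data that can tolerate gaps.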
Examples
Read a Bunch of CSVs
To consume a directory of CSV files as structured documents, we can use a glob pattern and the csv scanner:
input:
  file:
    paths: [ ./data/*.csv ]
    scanner:
      csv: {}