Skip to main content

Data Lineage

Overview

Regulated, multi-region, and high-trust deployments need to prove what their data pipelines did — which inputs were read, which outputs were written, when the run started and ended, and whether it completed cleanly. Centralized lineage tools answer this only if every system contributing data feeds them events. When pipelines run at the edge — across thousands of nodes, in disconnected sites, under regional sovereignty rules — the lineage trail starts at the edge, not at the central catalog.

Expanso's Approach to Data Lineage

Expanso Edge emits signed OpenLineage events natively, directly from the node where each pipeline ran. A lifecycle event fires at every transition (START, COMPLETE, FAIL, ABORT) and is delivered to any OpenLineage-compatible backend — Marquez, data catalogs, or custom collectors — over HTTP or persisted to disk for air-gapped collection.

Key capabilities:

  • Native emission at the source. Events fire from the edge agent running the pipeline, not from a centralized scrubbing job — so the lineage record matches the operational truth even when nodes are offline, regionally segregated, or running under data-sovereignty rules.
  • Signed events. Every event carries an Ed25519 signature over the canonical event bytes. Downstream consumers verify both producer identity and payload integrity, supporting tamper-evident audit trails.
  • Schema-conformant. Events validate against the OpenLineage 2.0.2 RunEvent schema and drop into Marquez or any OpenLineage-aware tool without translation.
  • Two transports. HTTP delivery for connected sites; JSON-Lines file output for air-gapped collection, shipped onward by your existing log infrastructure.
  • Dataset extraction. Input and output components are mapped to OpenLineage Datasets automatically (Kafka topics, S3 buckets, files, SQL tables, HTTP endpoints).
  • Pipeline-safe. Lineage emission is fully asynchronous; failures route to the node's health surface and never block, fail, or slow the pipeline.
  • Aligned message metadata. The metadata processor stamps the same run_id on each message body, so per-event records correlate back to the lifecycle events emitted by the edge.

Benefits of Edge-Emitted Lineage

Audit & Compliance

  • Tamper-evident, signed event stream from every node — supports SOC 2, ISO 27001, regulated-industry audit requirements
  • Lineage record produced where the data lives, not reconstructed after the fact
  • Per-run identity, dataset references, and outcome facets satisfy the "who, what, when" questions in regulatory reviews
  • File-transport mode lets air-gapped or sovereignty-bound sites produce lineage without phoning home

Data Governance

  • Populate Marquez, DataHub, and other OpenLineage-aware catalogs without writing connectors per pipeline
  • Discover which jobs read or write each dataset across a fleet, automatically
  • Distinguish natural completions from operator interventions (COMPLETE vs ABORT) for incident reviews
  • Custom expanso_run facet carries per-run counters (records in/out, bytes in/out, error count) for capacity and quality reporting

Operational Resilience

  • Lineage delivery never blocks or fails the pipeline — drops are counted and visible, but do not corrupt runtime behavior
  • HTTP failures surface in the node's HealthTracker for the affected execution, visible via expanso execution describe
  • A single, low-cardinality metric (lineage_events_dropped_total) drives alerts without overwhelming telemetry budgets
  • File transport keeps lineage flowing even when the central backend is down

Common Patterns

Edge-First Audit Trail

Configure every edge node to emit lineage to a regional collector with signed events. Auditors verify the per-run signature against each node's published key to prove that the record came from the producing node and was not altered downstream. See Verify Lineage Event Signatures for the verification recipe.

Air-Gapped Collection

Use the file transport on disconnected nodes. A standard log-shipper (Fluent Bit, vector, your own script) reads the JSON-Lines file and forwards events to a regional collector when connectivity is available. Each line is a complete, schema-conformant event; ordering and replay are managed by the shipper.

lineage:
enabled: true
transport: file
file:
path: /var/lib/expanso/lineage/events.jsonl
rotation_size_mb: 64

Correlated Body Metadata

Pair the lineage emitter with the metadata processor so every message carries the same run_id that appears on the lifecycle event. Downstream stores can join records back to the lifecycle events without bespoke joins.

pipeline:
processors:
- metadata:
include: [core, pipeline]
target: body
format: nested
body_key: lineage

Regional Marquez per Sovereignty Zone

Run one Marquez instance per region. Each edge node's lineage config points at its regional endpoint, so events stay within the sovereignty boundary while still feeding a centralized governance view through Marquez's existing federation patterns.

Failure Forensics

FAIL events carry an errorMessage facet with the classified pipeline error and stack trace. Catalogs that index lineage events become a queryable history of pipeline failures across the fleet — useful for postmortems and for spotting silent regressions in production.

Example Use Cases

  • Regulated financial services producing per-execution audit trails for every trade-data pipeline, with signed lineage events feeding a compliance catalog and a separate audit archive
  • Healthcare data platforms demonstrating HIPAA "minimum necessary" by recording which datasets each pipeline read at the edge and what it wrote downstream
  • Multi-region SaaS running Marquez per region for data sovereignty, with cross-region lineage roll-up through Marquez federation rather than centralizing the raw events
  • Industrial / IoT operators with disconnected sites using the file transport to emit lineage locally and ship to a central collector when connectivity returns

Getting Started

  1. Enable emission — follow the OpenLineage Emission how-to. The minimal change is two lines of edge config plus a restart.
  2. Stand up Marquez locally — the how-to includes a docker-compose.marquez.yaml for seeing your edge's events in a UI within minutes.
  3. Pair with body metadata — add the metadata processor to your pipelines so per-message records carry the same run_id.
  4. Verify signatures downstream — once events are flowing, plug in the signature verification recipe at your collector or audit boundary.