Job Deployments with Health Checks and Rollback
When you update jobs running on edge nodes, Expanso validates new versions before fully committing to them. If health checks fail, the orchestrator automatically rolls back to the previous version to keep your edge infrastructure stable.
This guide explains how job deployments work, how to configure health checks, and how the rollback system keeps your edge infrastructure resilient.
Overview
When you deploy a standard job, the new version starts immediately. Deployment jobs work differently—they enter a validation period with health checks before becoming fully active. If validation fails, Expanso automatically reverts to the last working version.
This gives you automatic rollback when deployments fail, prevents bad deployments from destabilizing your edge fleet, and keeps things running even during failed updates.
Job State Lifecycle
When you deploy a job, it moves through different states as it rolls out to your edge fleet and starts processing data. Knowing these states helps you monitor deployments and debug issues when things go wrong.
Job States vs. Execution States
Expanso tracks two different kinds of states, and it's important to understand the difference:
- Job states show the overall status of your job across your entire fleet (like deploying, running, or degraded)
- Execution states show what's happening with individual pipeline instances on specific nodes (like starting, validating, or running)
The job state is what matters for monitoring—it reflects the deployment and health status of your job across all edge nodes. Execution states are implementation details that the orchestrator uses internally.
Job State Transitions
When you deploy or update a job, it flows through these states:
```
pending → queued → deploying → running
                       ↓
               rollout_paused (manual pause)
                       ↓
               rollout_failed (health checks failed)
```
After a job reaches running, it can transition to degraded if health issues pop up:
```
running ↔ degraded (daemon jobs only)
```
State Descriptions
Here's what each state means and when you'll see it.
pending: Your job is created but hasn't been scheduled yet. The orchestrator is preparing to start the deployment.
queued: The job is scheduled but there aren't any nodes available to run it. You'll see this when your fleet is at capacity or when no nodes match the job's selector labels.
deploying: The rollout is actively happening. The orchestrator is deploying the job to nodes, and executions are going through health validation.
During this state, the job is eligible for automatic rollback if health checks fail. This is the critical window where Expanso validates your deployment before it goes fully live.
running: Everything's healthy and stable. The rollout completed successfully, and your job is processing data normally across the fleet.
rollout_paused: You manually paused the rollout. The deployment is frozen—no new nodes get updated, but nodes that are already running the job keep working.
You can resume the rollout whenever you're ready, or roll back to the previous version if something looks wrong.
rollout_failed: Health checks failed during deployment, and Expanso automatically rolled back the job. It won't make further deployment progress until you fix the issue and redeploy.
This happens when the orchestrator detects that executions aren't passing health validation during the deploying state.
degraded: Your job was running fine, but some executions have become unhealthy. This only applies to daemon-type jobs that continuously retry failed executions.
The orchestrator monitors degraded jobs but doesn't automatically roll back, since the deployment was previously validated. Degradation usually means there's an issue with specific nodes or data sources, not the job configuration itself.
completed: The job finished successfully. This only applies to batch-type jobs that have a defined end state.
failed: The job failed—typically this means all executions failed or the job hit a terminal error it couldn't recover from.
stopped: You explicitly stopped the job using the API or CLI. The job won't restart until you start it again.
Rollout Completion Semantics
Here's a gotcha to watch out for: the job's status.rollout.completed_at timestamp tells you when a rollout finished—not the job state.
A rollout is complete when completed_at has a non-null value, regardless of whether the job state is running, degraded, or rollout_failed.
When you're checking if a rollout is still active, use completed_at == null to detect it—don't rely on the job state.
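In client code, this check is a one-liner. Here's a minimal Python sketch, assuming the job status is available as a parsed JSON-like dict (the dict shapes around status.rollout.completed_at are illustrative):

```python
def rollout_active(job: dict) -> bool:
    """Return True while a rollout is still in progress.

    A rollout is finished exactly when status.rollout.completed_at is
    non-null, regardless of whether the job state reads running,
    degraded, or rollout_failed.
    """
    rollout = job.get("status", {}).get("rollout", {})
    return rollout.get("completed_at") is None

# A degraded job whose rollout already finished is NOT an active rollout:
degraded_job = {"status": {"state": "degraded",
                           "rollout": {"completed_at": "2025-06-01T10:05:00Z"}}}
# A deploying job with no completed_at timestamp still is:
deploying_job = {"status": {"state": "deploying", "rollout": {}}}
```

Note that the function never looks at the state field, which is exactly the point of this section.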
Daemon vs. Ops Jobs
State transitions work differently depending on your job type.
Daemon jobs run continuously and process data indefinitely:
- They can transition between running and degraded based on ongoing health
- Edge nodes retry failed executions with exponential backoff
- The orchestrator monitors degraded daemon jobs but doesn't immediately roll them back
Ops jobs are one-time operations that finish when the work is done:
- They transition from running to completed or failed when finished
- They don't use the degraded state since they're not continuously retrying
- There's no automatic rollback after completion—if you need to undo changes, use manual rollback
How Deployments Work
Execution State Lifecycle
When you deploy a new version of a job with deployment configuration, executions go through an extended lifecycle:
```
Pending → Starting → Validating → Running
                         ↓
                      Failed → Orchestrator Rollback
                         ↓
                      Degraded (retrying)
```
Here's what each state means:

- Pending: the execution is scheduled but hasn't started yet
- Starting: the execution is being initialized
- Validating: the execution is running but still undergoing health validation, so it remains eligible for rollback
- Running: the execution passed health checks and is stable
- Failed: the execution is terminal (typically for ops-type pipelines)
- Degraded: the execution is unhealthy but retrying (typically for daemon-type pipelines)
The Validating state is unique to deployments. During this window, health checks evaluate the execution using consecutive time intervals, a deadline limits how long validation can take, and if validation fails, the orchestrator initiates rollback. Once an execution reaches Running state, it's marked as stable.
When a daemon-type pipeline encounters failures, edge nodes mark the execution as Degraded and retry locally with backoff. The orchestrator monitors these Degraded executions—if too many nodes fail, it triggers a coordinated rollback across your fleet.
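The fleet-level decision reduces to a threshold check over execution states. Here's an illustrative Python sketch; the actual fraction of failing nodes that triggers a coordinated rollback isn't specified in this guide, so max_degraded_fraction is an assumed parameter:

```python
def should_rollback_fleet(execution_states: list[str],
                          max_degraded_fraction: float = 0.5) -> bool:
    """Decide whether degraded daemon executions warrant a coordinated rollback.

    The 0.5 cutoff is an illustrative assumption, not Expanso's documented
    threshold.
    """
    if not execution_states:
        return False
    degraded = sum(1 for s in execution_states if s == "Degraded")
    return degraded / len(execution_states) > max_degraded_fraction
```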
Health Check Configuration
When you deploy a new job, you want to make sure it's actually working before it takes over from the previous version. Health checks let Expanso watch your deployment as it starts up, validate that it's processing data without errors, and automatically roll back if something goes wrong.
Expanso uses window-based health evaluation—it watches your job over consecutive time intervals and counts how many are healthy vs. unhealthy. This prevents both false positives (one lucky success) and false negatives (one transient error) from affecting your deployment.
Here's how to configure them:
```yaml
name: my-api-service
type: pipeline
config:
  # ... your pipeline config ...
rollout:
  health_check:
    interval: 10s            # Evaluation window duration (default: 10s)
    success_threshold: 2     # Consecutive healthy windows needed (default: 2)
    failure_threshold: 3     # Consecutive unhealthy windows before rollback (default: 3)
    max_error_rate: 0.10     # Maximum error rate per window (default: 0.10 = 10%)
    deadline: 5m             # Maximum time to wait for validation
  # Other rollout options...
selector:
  match_labels:
    env: production
```
interval (default: 10s) sets how long each health evaluation window lasts. The system calculates error rates per interval, not over the lifetime of the execution.
success_threshold (default: 2) is how many consecutive healthy intervals you need before the execution is validated and marked as stable. This prevents transient successes from passing validation.
failure_threshold (default: 3) is how many consecutive unhealthy intervals trigger a rollback. This prevents transient errors from causing unnecessary rollbacks.
max_error_rate (default: 0.10) sets the maximum error rate allowed within an interval. If errors exceed 10% of processed messages in a window, that window is marked unhealthy.
deadline (default varies) sets the maximum time allowed for the execution to achieve validation. If this expires without reaching the success threshold, the execution fails and triggers rollback.
How It Works
When your execution starts, it enters the Starting state. Once it begins running, it transitions to Validating.
During validation, the health check evaluates each interval. If the error rate is below max_error_rate, the window is healthy. Otherwise it's unhealthy. Idle windows with no traffic don't count toward either threshold.
After success_threshold consecutive healthy windows, the execution transitions to Running (stable). After failure_threshold consecutive unhealthy windows, it's marked as Failed and triggers rollback.
If the deadline expires without validation, the execution is also marked as Failed and rolls back.
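The per-window classification can be sketched in a few lines of Python (an illustrative model of the rules above, not Expanso's internal code):

```python
def evaluate_window(processed: int, errors: int,
                    max_error_rate: float = 0.10) -> str:
    """Classify a single health-check interval.

    Idle windows (no traffic) are "pending" and count toward neither
    threshold; otherwise the window is healthy iff the error rate stays
    at or below max_error_rate.
    """
    if processed == 0:
        return "pending"
    return "unhealthy" if errors / processed > max_error_rate else "healthy"
```

Note that the rate is computed per window, not over the lifetime of the execution: 10 errors out of 100 messages in one window is exactly at the default threshold and still healthy, while 15 out of 100 is not.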
How Health Evaluation Works
During validation, Expanso checks each pipeline component separately to pinpoint exactly what's failing. Instead of getting a vague "pipeline unhealthy" message, you'll know which specific input, processor, or output has a problem.
Component-by-Component Evaluation
Expanso applies different health checks depending on the component type:
Inputs (e.g., input.kafka.0, input.http_server.0):
- Checks connection errors: input_connection_failed and input_connection_lost metrics
- Connection failures mark the input as unhealthy
- Error rate checks don't apply—inputs receive data, they don't process it
Processors (e.g., processor.mapping.0, processor.filter.1):
- Checks error rate: ratio of processor_error to processor_received messages
- If error rate exceeds the max_error_rate threshold, the processor is unhealthy
- Connection checks don't apply—processors don't maintain connections
Outputs (e.g., output.kafka.0, output.http_client.0):
- Checks both connection errors AND error rate
- Connection metrics: output_connection_failed and output_connection_lost
- Error rate: output_error / output_sent
- Either connection failures or high error rate marks the output as unhealthy
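Putting the three rule sets together, a per-component check might look like the following Python sketch. The metric names are the ones listed above; the flat metrics-dict shape and return convention are illustrative assumptions:

```python
def component_health(kind: str, metrics: dict,
                     max_error_rate: float = 0.10) -> tuple[bool, str]:
    """Evaluate one component by type: inputs check connections,
    processors check error rate, outputs check both."""
    if kind in ("input", "output"):
        failed = metrics.get(f"{kind}_connection_failed", 0)
        lost = metrics.get(f"{kind}_connection_lost", 0)
        if failed or lost:
            return (False, "connection failed")
    if kind == "processor":
        received = metrics.get("processor_received", 0)
        if received and metrics.get("processor_error", 0) / received > max_error_rate:
            return (False, "error rate exceeds threshold")
    if kind == "output":
        sent = metrics.get("output_sent", 0)
        if sent and metrics.get("output_error", 0) / sent > max_error_rate:
            return (False, "error rate exceeds threshold")
    return (True, "healthy")
```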
Fail-Fast Strategy
Expanso checks components in order and stops at the first unhealthy one. This fail-fast approach gives you:
- Precise diagnostics: You know exactly which component failed (e.g., "Kafka Output [kafka]: connection failed")
- Early detection: No need to wait for all components to be evaluated
- Clear root cause: Instead of "pipeline unhealthy," you get "Data Filter [bloblang]: error rate 15.2% exceeds threshold 10.0%"
Activity-Based Evaluation
Components with no traffic during a window become Pending (not Unhealthy). This makes sense during:
- Startup: Pipeline is initializing, connections are establishing
- Low traffic: Waiting for data to arrive
- Idle intervals: Between bursts of activity
Pending windows don't count toward consecutive healthy or unhealthy thresholds. This prevents idle periods from triggering unnecessary rollbacks.
Health Check Lifecycle
For each interval window (default: 10 seconds):
- Metrics collection: OTel exporter forwards component-specific metrics to the health tracker
- Window evaluation: When the interval ends, Expanso evaluates each component:
  - Check for activity (any traffic?)
  - Check connection errors (inputs and outputs)
  - Check error rate (processors and outputs)
- Consecutive tracking:
  - Healthy window → increment consecutive healthy count, reset unhealthy count
  - Unhealthy window → increment consecutive unhealthy count, reset healthy count
  - Pending window → don't change either count
- State transition:
  - If consecutive healthy count reaches success_threshold → transition to Running (stable)
  - If consecutive unhealthy count reaches failure_threshold → transition to Failed (triggers rollback)
  - If deadline expires → transition to Failed (triggers rollback)
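The counting rules above can be condensed into a small tracker. This is an illustrative Python model, not Expanso's internal API; the threshold defaults mirror the documented ones:

```python
class HealthTracker:
    """Track consecutive window results and decide the execution state."""

    def __init__(self, success_threshold: int = 2, failure_threshold: int = 3):
        self.success_threshold = success_threshold
        self.failure_threshold = failure_threshold
        self.healthy = 0
        self.unhealthy = 0

    def observe(self, window: str, deadline_expired: bool = False) -> str:
        if window == "healthy":
            self.healthy += 1
            self.unhealthy = 0
        elif window == "unhealthy":
            self.unhealthy += 1
            self.healthy = 0
        # "pending" windows change neither counter
        if self.healthy >= self.success_threshold:
            return "Running"   # stable
        if self.unhealthy >= self.failure_threshold or deadline_expired:
            return "Failed"    # triggers rollback
        return "Validating"
```

Notice how a pending window in the middle of a healthy streak neither helps nor hurts: healthy, pending, healthy still validates with the default success_threshold of 2.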
Continuous Health Monitoring
Health monitoring doesn't stop after your deployment passes initial validation. Expanso continuously monitors health throughout the entire execution lifecycle—whether the execution is in Validating, Running, or Degraded state.
This continuous approach ensures that stable deployments stay tracked for health degradation, late-joining nodes get properly validated before being marked as stable, and degraded executions must prove they're healthy before returning to Running state.
How Continuous Monitoring Works
During Validating State:
When an execution is first validating, health checks evaluate it using the configured deadline. If health is proven (HealthHealthy), the execution transitions to Running and gets a StableAt timestamp. If health fails (HealthUnhealthy), the execution transitions to Degraded for daemon jobs or Failed for ops jobs.
Here's where it gets interesting: if the deadline expires while health is still HealthPending, the system gives "benefit of the doubt" and marks the execution as Running. This deadline-based benefit of doubt helps deployment waves progress even when data traffic is low or health checks haven't reported yet.
During Running State (Stable):
Once an execution has a StableAt timestamp, it's considered stable—but Expanso keeps monitoring. If health becomes HealthUnhealthy, the execution transitions to Degraded. There's no deadline here—the execution stays Running until it's proven unhealthy. This prevents stable deployments from being unnecessarily marked as degraded during idle periods.
During Running State (Late-Joiners):
Late-joiners are nodes that come online after the initial deployment. These executions don't have a StableAt timestamp yet, and they're handled differently. They wait indefinitely for health to be proven—there's no deadline pressure and no "benefit of doubt" for late-joiners. They must achieve HealthHealthy status before getting a StableAt timestamp. This ensures late-arriving nodes don't get marked as stable without validation.
During Degraded State:
Degraded executions must prove HealthHealthy to transition back to Running. There's no "benefit of doubt" for recovery—health must be explicitly proven. If health remains HealthUnhealthy or HealthPending, the execution stays Degraded. Edge nodes continue retrying with exponential backoff while reporting Degraded state, preventing executions from bouncing between states without actual recovery.
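The four cases above can be condensed into one transition sketch. State and health names follow this doc; the function shape itself is illustrative:

```python
def next_state(state: str, health: str,
               deadline_expired: bool = False,
               job_type: str = "daemon") -> str:
    """Continuous-monitoring transitions for a single execution."""
    if state == "Validating":
        if health == "HealthHealthy":
            return "Running"        # StableAt is set at this point
        if health == "HealthUnhealthy":
            return "Degraded" if job_type == "daemon" else "Failed"
        if deadline_expired:
            return "Running"        # benefit of the doubt on HealthPending
        return "Validating"
    if state == "Running":
        # Stable executions and late-joiners alike: no deadline here, only
        # proven unhealthiness demotes them.
        return "Degraded" if health == "HealthUnhealthy" else "Running"
    if state == "Degraded":
        # Recovery must be explicitly proven; no benefit of the doubt.
        return "Running" if health == "HealthHealthy" else "Degraded"
    return state
```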
Late-Joiner Behavior
A "late-joiner" is an edge node that comes online after a job deployment has already rolled out to other nodes. This commonly happens when you add a new edge node to your fleet after a job is deployed, an existing node was offline during deployment and reconnects later, or an edge node restarts and needs to catch up with the current job version.
How Late-Joiners Are Handled:
When a late-joiner execution starts, it's placed directly into the Running state (bypassing Validating). However, it doesn't immediately get a StableAt timestamp—it must prove health first.
The key difference from initial deployment validation is that late-joiners wait indefinitely for health proof. There's no deadline, and there's no "benefit of doubt" if health data is pending. The execution must explicitly achieve HealthHealthy status before receiving a StableAt timestamp. This conservative approach ensures that late-arriving nodes are fully validated before being considered part of your stable fleet, even if they join long after the initial deployment wave.
Example Timeline:
Here's what this looks like in practice. Job v2 is deployed at 10:00 AM to 50 nodes. All 50 nodes validate health and reach Running (stable) state by 10:05 AM. A new node joins your fleet at 2:00 PM and receives Job v2. The new node starts the execution in Running state (no StableAt yet), and the system continuously monitors health without any deadline. Once HealthHealthy is confirmed, the execution gets its StableAt timestamp, and the node is now fully validated and part of your stable deployment.
State Transition Summary
Here's how executions transition between states based on health monitoring:
| Current State | Health Status | Deadline Expired? | Transition |
|---|---|---|---|
| Validating | HealthHealthy | N/A | → Running + StableAt |
| Validating | HealthUnhealthy | N/A | → Degraded (daemon) or Failed (ops) |
| Validating | HealthPending | Yes | → Running (benefit of the doubt) |
| Running (stable) | HealthUnhealthy | N/A | → Degraded |
| Running (late-joiner) | HealthHealthy | N/A | StableAt set (fully validated) |
| Degraded | HealthHealthy | N/A | → Running |