Job Deployments with Health Checks and Rollback
When you update jobs running on edge nodes, Expanso validates new versions before fully committing to them. If health checks fail, the orchestrator automatically rolls back to the previous version to keep your edge infrastructure stable.
This guide explains how job deployments work, how to configure health checks, and how the rollback system keeps your edge infrastructure resilient.
Overview
When you deploy a standard job, the new version starts immediately. Deployment jobs work differently—they enter a validation period with health checks before becoming fully active. If validation fails, Expanso automatically reverts to the last working version.
This gives you automatic rollback when deployments fail, prevents bad deployments from destabilizing your edge fleet, and keeps things running even during failed updates.
Job State Lifecycle
When you deploy a job, it moves through different states as it rolls out to your edge fleet and starts processing data. Knowing these states helps you monitor deployments and debug issues when things go wrong.
Job States vs. Execution States
Expanso tracks two different kinds of states, and it's important to understand the difference:
- Job states show the overall status of your job across your entire fleet (like deploying, running, or degraded)
- Execution states show what's happening with individual pipeline instances on specific nodes (like starting, validating, or running)
The job state is what matters for monitoring—it reflects the deployment and health status of your job across all edge nodes. Execution states are implementation details that the orchestrator uses internally.
Job State Transitions
When you deploy or update a job, it flows through these states:
pending → queued → deploying → running
↓
rollout_paused (manual pause)
↓
rollout_failed (health checks failed)
After a job reaches running, it can transition to degraded if health issues pop up:
running ↔ degraded (daemon jobs only)
State Descriptions
Here's what each state means and when you'll see it.
pending: Your job is created but hasn't been scheduled yet. The orchestrator is preparing to start the deployment.
queued: The job is scheduled but there aren't any nodes available to run it. You'll see this when your fleet is at capacity or when no nodes match the job's selector labels.
deploying: The rollout is actively happening. The orchestrator is deploying the job to nodes, and executions are going through health validation.
During this state, the job is eligible for automatic rollback if health checks fail. This is the critical window where Expanso validates your deployment before it goes fully live.
running: Everything's healthy and stable. The rollout completed successfully, and your job is processing data normally across the fleet.
rollout_paused: You manually paused the rollout. The deployment is frozen—no new nodes get updated, but nodes that are already running the job keep working.
You can resume the rollout whenever you're ready, or roll back to the previous version if something looks wrong.
rollout_failed: Health checks failed during deployment, and Expanso automatically rolled back the job. It won't make further deployment progress until you fix the issue and redeploy.
This happens when the orchestrator detects that executions aren't passing health validation during the deploying state.
degraded: Your job was running fine, but some executions have become unhealthy. This only applies to daemon-type jobs that continuously retry failed executions.
The orchestrator monitors degraded jobs but doesn't automatically roll back, since the deployment was previously validated. Degradation usually means there's an issue with specific nodes or data sources, not the job configuration itself.
completed: The job finished successfully. This only applies to batch-type jobs that have a defined end state.
failed: The job failed—typically this means all executions failed or the job hit a terminal error it couldn't recover from.
stopped: You explicitly stopped the job using the API or CLI. The job won't restart until you start it again.
Rollout Completion Semantics
Here's a gotcha to watch out for: the job's status.rollout.completed_at timestamp tells you when a rollout finished—not the job state.
A rollout is complete when completed_at has a non-null value, regardless of whether the job state is running, degraded, or rollout_failed.
When you're checking if a rollout is still active, use completed_at == null to detect it—don't rely on the job state.
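If you're scripting around this, a quick check might look like the sketch below. This is illustrative only: it assumes the job document is available from a GET endpoint under the same /api/v1/jobs path used elsewhere in this guide, and that jq is installed.
# Fetch the job and check whether a rollout is still active.
# The GET path is an assumption -- adjust it to your API.
COMPLETED_AT=$(curl -s "https://cloud.expanso.io/api/v1/jobs/my-job" \
  -H "Authorization: Bearer $EXPANSO_API_TOKEN" \
  | jq -r '.status.rollout.completed_at')

if [ "$COMPLETED_AT" = "null" ] || [ -z "$COMPLETED_AT" ]; then
  echo "Rollout still in progress"
else
  echo "Rollout completed at $COMPLETED_AT"
fi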
Daemon vs. Ops Jobs
State transitions work differently depending on your job type.
Daemon jobs run continuously and process data indefinitely:
- They can transition between running and degraded based on ongoing health
- Edge nodes retry failed executions with exponential backoff
- The orchestrator monitors degraded daemon jobs but doesn't immediately roll them back
Ops jobs are one-time operations that finish when the work is done:
- They transition from running to completed or failed when finished
- They don't use the degraded state since they're not continuously retrying
- There's no automatic rollback after completion—if you need to undo changes, use manual rollback
How Deployments Work
Execution State Lifecycle
When you deploy a new version of a job with deployment configuration, executions go through an extended lifecycle:
Pending → Starting → Validating → Running
↓
Failed → Orchestrator Rollback
↓
Degraded (retrying)
Here's what each state means:
- Pending: The execution is scheduled but not started yet.
- Starting: The execution is being initialized.
- Validating: This is where things get interesting—the execution is running but undergoing health validation, so it's still eligible for rollback.
- Running: The execution passed health checks and is stable.
- Failed: The execution is terminal (typically for ops-type pipelines).
- Degraded: The execution is unhealthy but retrying (typically for daemon-type pipelines).
The Validating state is unique to deployments. During this window, health checks evaluate the execution using consecutive time intervals, a deadline limits how long validation can take, and if validation fails, the orchestrator initiates rollback. Once an execution reaches Running state, it's marked as stable.
When a daemon-type pipeline encounters failures, edge nodes mark the execution as Degraded and retry locally with backoff. The orchestrator monitors these Degraded executions—if too many nodes fail, it triggers a coordinated rollback across your fleet.
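You can watch this play out during a rollout by following execution states across the fleet; the command below is the same one covered in the monitoring section later in this guide.
# Follow execution states while a deployment validates. Look for executions
# moving from validating to running, or dropping into degraded/failed.
expanso-cli execution list --job my-job --watch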
Health Check Configuration
When you deploy a new job, you want to make sure it's actually working before it takes over from the previous version. Health checks let Expanso watch your deployment as it starts up, validate that it's processing data without errors, and automatically roll back if something goes wrong.
Expanso uses window-based health evaluation—it watches your job over consecutive time intervals and counts how many are healthy vs. unhealthy. This prevents both false positives (one lucky success) and false negatives (one transient error) from affecting your deployment.
Here's how to configure them:
name: my-api-service
type: pipeline
config:
# ... your pipeline config ...
rollout:
health_check:
interval: 10s # Evaluation window duration (default: 10s)
success_threshold: 2 # Consecutive healthy windows needed (default: 2)
failure_threshold: 3 # Consecutive unhealthy windows before rollback (default: 3)
max_error_rate: 0.10 # Maximum error rate per window (default: 0.10 = 10%)
deadline: 5m # Maximum time to wait for validation
# Other rollout options...
selector:
match_labels:
env: production
interval (default: 10s) sets how long each health evaluation window lasts. The system calculates error rates per interval, not over the lifetime of the execution.
success_threshold (default: 2) is how many consecutive healthy intervals you need before the execution is validated and marked as stable. This prevents transient successes from passing validation.
failure_threshold (default: 3) is how many consecutive unhealthy intervals trigger a rollback. This prevents transient errors from causing unnecessary rollbacks.
max_error_rate (default: 0.10) sets the maximum error rate allowed within an interval. If errors exceed 10% of processed messages in a window, that window is marked unhealthy.
deadline (default varies) sets the maximum time allowed for the execution to achieve validation. If this expires without reaching the success threshold, the execution fails and triggers rollback.
How It Works
When your execution starts, it enters the Starting state. Once it begins running, it transitions to Validating.
During validation, the health check evaluates each interval. If the error rate is below max_error_rate, the window is healthy. Otherwise it's unhealthy. Idle windows with no traffic don't count toward either threshold.
After success_threshold consecutive healthy windows, the execution transitions to Running (stable). After failure_threshold consecutive unhealthy windows, it's marked as Failed and triggers rollback.
If the deadline expires without validation, the execution is also marked as Failed and rolls back.
How Health Evaluation Works
During validation, Expanso checks each pipeline component separately to pinpoint exactly what's failing. Instead of getting a vague "pipeline unhealthy" message, you'll know which specific input, processor, or output has a problem.
Component-by-Component Evaluation
Expanso applies different health checks depending on the component type:
Inputs (e.g., input.kafka.0, input.http_server.0):
- Checks connection errors: input_connection_failed and input_connection_lost metrics
- Connection failures mark the input as unhealthy
- Error rate checks don't apply—inputs receive data, they don't process it
Processors (e.g., processor.mapping.0, processor.filter.1):
- Checks error rate: ratio of processor_error to processor_received messages
- If error rate exceeds the max_error_rate threshold, the processor is unhealthy
- Connection checks don't apply—processors don't maintain connections
Outputs (e.g., output.kafka.0, output.http_client.0):
- Checks both connection errors AND error rate
- Connection metrics: output_connection_failed and output_connection_lost
- Error rate: output_error / output_sent
- Either connection failures or high error rate marks the output as unhealthy
Fail-Fast Strategy
Expanso checks components in order and stops at the first unhealthy one. This fail-fast approach gives you:
- Precise diagnostics: You know exactly which component failed (e.g., "Kafka Output [kafka]: connection failed")
- Early detection: No need to wait for all components to be evaluated
- Clear root cause: Instead of "pipeline unhealthy," you get "Data Filter [bloblang]: error rate 15.2% exceeds threshold 10.0%"
Activity-Based Evaluation
Components with no traffic during a window become Pending (not Unhealthy). This makes sense during:
- Startup: Pipeline is initializing, connections are establishing
- Low traffic: Waiting for data to arrive
- Idle intervals: Between bursts of activity
Pending windows don't count toward consecutive healthy or unhealthy thresholds. This prevents idle periods from triggering unnecessary rollbacks.
Health Check Lifecycle
For each interval window (default: 10 seconds):
- Metrics collection: OTel exporter forwards component-specific metrics to the health tracker
- Window evaluation: When the interval ends, Expanso evaluates each component:
- Check for activity (any traffic?)
- Check connection errors (inputs and outputs)
- Check error rate (processors and outputs)
- Consecutive tracking (see the sketch after this list):
- Healthy window → increment consecutive healthy count, reset unhealthy count
- Unhealthy window → increment consecutive unhealthy count, reset healthy count
- Pending window → don't change either count
- State transition:
- If consecutive healthy count reaches success_threshold → transition to Running (stable)
- If consecutive unhealthy count reaches failure_threshold → transition to Failed (triggers rollback)
- If deadline expires → transition to Failed (triggers rollback)
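The counting rules above are simple enough to sketch. The following is not Expanso code, just an illustration of how the consecutive counters evolve over a made-up sequence of window results:
#!/usr/bin/env bash
# Illustration only: consecutive healthy/unhealthy window tracking.
SUCCESS_THRESHOLD=2
FAILURE_THRESHOLD=3
healthy=0
unhealthy=0

for window in healthy pending healthy unhealthy healthy healthy; do
  case "$window" in
    healthy)   healthy=$((healthy + 1)); unhealthy=0 ;;
    unhealthy) unhealthy=$((unhealthy + 1)); healthy=0 ;;
    pending)   ;;  # idle window: neither counter changes
  esac
  echo "window=$window consecutive_healthy=$healthy consecutive_unhealthy=$unhealthy"

  if [ "$healthy" -ge "$SUCCESS_THRESHOLD" ]; then
    echo "-> validated: transition to Running (stable)"
    break
  fi
  if [ "$unhealthy" -ge "$FAILURE_THRESHOLD" ]; then
    echo "-> failed: trigger rollback"
    break
  fi
done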
Continuous Health Monitoring
Health monitoring doesn't stop after your deployment passes initial validation. Expanso continuously monitors health throughout the entire execution lifecycle—whether the execution is in Validating, Running, or Degraded state.
This continuous approach ensures that stable deployments stay tracked for health degradation, late-joining nodes get properly validated before being marked as stable, and degraded executions must prove they're healthy before returning to Running state.
How Continuous Monitoring Works
During Validating State:
When an execution is first validating, health checks evaluate it using the configured deadline. If health is proven (HealthHealthy), the execution transitions to Running and gets a StableAt timestamp. If health fails (HealthUnhealthy), the execution transitions to Degraded for daemon jobs or Failed for ops jobs.
Here's where it gets interesting: if the deadline expires while health is still HealthPending, the system gives "benefit of the doubt" and marks the execution as Running. This deadline-based benefit of doubt helps deployment waves progress even when data traffic is low or health checks haven't reported yet.
During Running State (Stable):
Once an execution has a StableAt timestamp, it's considered stable—but Expanso keeps monitoring. If health becomes HealthUnhealthy, the execution transitions to Degraded. There's no deadline here—the execution stays Running until it's proven unhealthy. This prevents stable deployments from being unnecessarily marked as degraded during idle periods.
During Running State (Late-Joiners):
Late-joiners are nodes that come online after the initial deployment. These executions don't have a StableAt timestamp yet, and they're handled differently. They wait indefinitely for health to be proven—there's no deadline pressure and no "benefit of doubt" for late-joiners. They must achieve HealthHealthy status before getting a StableAt timestamp. This ensures late-arriving nodes don't get marked as stable without validation.
During Degraded State:
Degraded executions must prove HealthHealthy to transition back to Running. There's no "benefit of doubt" for recovery—health must be explicitly proven. If health remains HealthUnhealthy or HealthPending, the execution stays Degraded. Edge nodes continue retrying with exponential backoff while reporting Degraded state, preventing executions from bouncing between states without actual recovery.
Late-Joiner Behavior
A "late-joiner" is an edge node that comes online after a job deployment has already rolled out to other nodes. This commonly happens when you add a new edge node to your fleet after a job is deployed, an existing node was offline during deployment and reconnects later, or an edge node restarts and needs to catch up with the current job version.
How Late-Joiners Are Handled:
When a late-joiner execution starts, it's placed directly into the Running state (bypassing Validating). However, it doesn't immediately get a StableAt timestamp—it must prove health first.
The key difference from initial deployment validation is that late-joiners wait indefinitely for health proof. There's no deadline, and there's no "benefit of doubt" if health data is pending. The execution must explicitly achieve HealthHealthy status before receiving a StableAt timestamp. This conservative approach ensures that late-arriving nodes are fully validated before being considered part of your stable fleet, even if they join long after the initial deployment wave.
Example Timeline:
Here's what this looks like in practice. Job v2 is deployed at 10:00 AM to 50 nodes. All 50 nodes validate health and reach Running (stable) state by 10:05 AM. A new node joins your fleet at 2:00 PM and receives Job v2. The new node starts the execution in Running state (no StableAt yet), and the system continuously monitors health without any deadline. Once HealthHealthy is confirmed, the execution gets its StableAt timestamp, and the node is now fully validated and part of your stable deployment.
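To confirm that a late-joining node has finished validating, inspect its execution and look for the Stable At timestamp (these commands are described in the monitoring section later in this guide):
# Find the execution running on the new node
expanso-cli execution list --job my-job

# Describe it. A populated "Stable At" field means the late-joiner has
# proven health and is now part of the stable fleet.
expanso-cli execution describe <exec-id>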
State Transition Summary
Here's how executions transition between states based on health monitoring:
| Current State | Health Status | Deadline Expired? | Transition |
|---|---|---|---|
| Validating | HealthHealthy | N/A | → Running + StableAt |
| Validating | HealthUnhealthy | N/A | → Degraded (daemon) or Failed (ops) |
| Validating | HealthPending | Yes | → Running + StableAt (benefit of doubt) |
| Validating | No Runtime | Yes | → Degraded (daemon) or Failed (ops) |
| Running (stable) | HealthUnhealthy | N/A | → Degraded |
| Running (stable) | HealthHealthy or HealthPending | N/A | No change (stays Running) |
| Running (late-joiner) | HealthHealthy | N/A | Set StableAt (now fully stable) |
| Running (late-joiner) | HealthUnhealthy | N/A | → Degraded |
| Running (late-joiner) | HealthPending | N/A | Wait indefinitely (no benefit of doubt) |
| Degraded | HealthHealthy | N/A | → Running |
| Degraded | HealthUnhealthy or HealthPending | N/A | Stay Degraded (no benefit of doubt) |
The deadline only applies during the Validating state to help initial deployment waves progress. The "benefit of doubt" is only given during Validating—allowing low-traffic deployments to proceed when health data hasn't arrived yet. Late-joiners and degraded executions must prove health explicitly before transitioning to a stable state.
New Nodes During Active Deployments
When you scale your edge infrastructure by adding new nodes, the scheduler needs to decide whether to deploy jobs to them. This decision depends on the current deployment state.
The behavior here is different from late-joiners—nodes that come online after a deployment has already finished. During an active deployment, the scheduler protects you from accidentally deploying problematic versions to fresh capacity.
When New Nodes Get Jobs
Your new nodes will receive job deployments in most cases:
- No active deployment: Normal scheduling applies. New nodes get the current job version immediately.
- In-progress deployment: New nodes join the deployment and receive the new version as part of the rollout.
- Completed deployment: The deployment is done, so normal scheduling resumes.
- Cancelled deployment: The deployment was aborted, so normal scheduling resumes.
In these states, your infrastructure scales normally. New capacity comes online without manual intervention, and you don't need to do anything special.
When New Nodes Are Blocked
There are two deployment states where new nodes will NOT receive job deployments:
- Paused deployment: The deployment is frozen, waiting for your action. New nodes remain idle until you resume or cancel the deployment.
- Failed deployment: The deployment failed health checks and triggered a rollback. New nodes remain idle until you resolve the issue and resume or cancel the deployment.
This blocking behavior is intentional—it prevents new nodes from receiving potentially unstable job versions.
Here's why this matters: if you're investigating a deployment issue or waiting for a maintenance window, you don't want fresh capacity automatically picking up a problematic version. The scheduler holds back new nodes until you explicitly decide how to proceed.
Unblocking New Nodes
If you add nodes during a paused or failed deployment, you'll need to take action to unblock them:
- Resume the rollout if you want to continue rolling out the new version:
curl -X POST https://api.expanso.io/jobs/{job-id}/rollout/resume \
  -H "Authorization: Bearer $TOKEN"
- Cancel the rollout if you want to stop the rollout and let new nodes get the stable version:
curl -X POST https://api.expanso.io/jobs/{job-id}/rollout/cancel \
  -H "Authorization: Bearer $TOKEN"
- Roll back if you want to revert all nodes (including new ones) to the previous version:
curl -X POST https://api.expanso.io/jobs/{job-id}/rollback \
  -H "Authorization: Bearer $TOKEN"
Check your fleet status to see if nodes are waiting for jobs. Idle nodes during an active deployment might indicate a paused or failed state that needs attention.
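A quick way to spot this situation is to check the job state before assuming something is wrong with the nodes themselves:
# If STATE shows rollout_paused or rollout_failed, new nodes won't receive
# the job until you resume, cancel, or roll back.
expanso-cli job list

# Drill into a specific job to see its ROLLOUT section
expanso-cli job describe my-job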
Job State Summary
The scheduler's behavior for new nodes depends on the current job state:
| Job State | New Nodes Get Job? | What To Do |
|---|---|---|
| pending, queued, running, completed, stopped | ✅ Yes | Nothing—normal scheduling |
| deploying | ✅ Yes | Nothing—nodes join the rollout |
| rollout_paused | ❌ No | Resume or roll back the rollout |
| rollout_failed | ❌ No | Investigate, then redeploy or roll back |
| degraded | ✅ Yes | Monitor—job is running but unhealthy |
When a job is in rollout_paused or rollout_failed state, the scheduler blocks new nodes from receiving the job. This prevents potentially unstable versions from deploying to fresh capacity while you're investigating issues.
Execution States and Failure Handling
When deployments fail health checks, edge nodes report their execution state to the orchestrator. Understanding these states helps you monitor deployments and troubleshoot issues.
Execution States
Running: The execution passed health validation and is stable. This is the healthy, steady state for a deployment.
Degraded: The execution is unhealthy but retrying. This typically happens with daemon-type pipelines (like continuous data processing) that can recover from transient failures. Edge nodes retry locally with 30-second backoff intervals, giving the pipeline time to recover.
Failed: The execution is terminal and won't retry. This typically happens with ops-type pipelines or when the orchestrator explicitly stops a failed deployment.
Validating: The execution is running but still undergoing health checks. It hasn't yet proven stable, so it's still eligible for rollback if health checks fail.
How Edge Nodes Handle Failures
Edge nodes don't decide when to roll back—they report state to the orchestrator, which makes all rollback decisions.
When a daemon pipeline fails health checks, the edge node marks it as Degraded and retries locally with exponential backoff (starting at 30 seconds). The pipeline keeps attempting to recover while reporting its degraded state to the orchestrator.
When an ops pipeline fails, the edge node marks it as Failed (terminal state) and reports this to the orchestrator. There's no local retry since ops pipelines are typically one-time operations.
How the Orchestrator Decides Rollback
The orchestrator monitors execution states across your entire edge fleet. When calculating deployment health, it counts both Failed and Degraded executions as failures.
If the number of failed/degraded nodes exceeds your deployment's health threshold, the orchestrator initiates a coordinated rollback. The orchestrator atomically:
- Updates the job spec to the previous version's configuration
- Cancels the failed deployment (marked as failed with rollback message)
- Creates a new deployment with immediate strategy using the rolled-back spec
- Increments the job version (e.g., v1 → v2 fails → v3 with v1's spec)
This coordinated approach ensures all nodes return to a known-good configuration together, even if some nodes are retrying locally.
The system tracks when executions become stable to prevent rollback loops. When validation succeeds, the execution gets a StableAt timestamp. If a previously-validated execution degrades later, it won't trigger another rollback—it uses normal retry instead.
This prevents scenarios where version A is stable, version B fails and rolls back (creating v3 with A's spec), and then v3 also fails, which would otherwise trigger endless rollback loops.
First Deployment Behavior
For the first deployment of a job (when no previous version exists), the rollback behavior is different. If the deployment fails validation, the orchestrator stops the execution since there's no healthy version to roll back to.
This prevents infinite retry loops when the first deployment fails validation. If there's no healthy version, the system stops rather than looping.
Manual Deployment Control
Expanso automatically handles deployment validation and rollback, but you can also take manual control through job-scoped APIs. Just provide the job name or ID—no deployment ID needed. You can pause, resume, or roll back deployments on demand.
When to Use Manual Controls
Pause stops a rolling deployment temporarily. The current wave freezes—no new nodes get updated, but already-updated nodes keep running. Use this to investigate issues, coordinate with other systems, or wait for a maintenance window.
Resume continues a paused rollout. The deployment picks up where it left off and progresses through the remaining waves.
Roll back reverts to the previous version immediately. This is useful when you spot critical issues before automatic health checks trigger, or when you want to roll back during a maintenance window to minimize user impact.
Job-Scoped Rollout APIs
Expanso provides job-scoped APIs to control rollouts. You only need the job name:
- POST /jobs/{id}/rollout/pause
- POST /jobs/{id}/rollout/resume
- POST /jobs/{id}/rollback
All endpoints require authentication and support dry-run mode and reason tracking.
Pausing a Rollout
Pause an active rollout to temporarily stop its progress:
curl -X POST https://cloud.expanso.io/api/v1/jobs/my-job/rollout/pause \
-H "Authorization: Bearer $EXPANSO_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"reason": "Pausing for investigation"
}'
Response:
{
"job_id": "job-abc123",
"deployment_id": "dep-xyz789"
}
Dry run mode lets you check if the operation would succeed without actually pausing:
curl -X POST https://cloud.expanso.io/api/v1/jobs/my-job/rollout/pause \
-H "Authorization: Bearer $EXPANSO_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"reason": "Checking if pause is possible",
"dry_run": true
}'
Error cases:
- 409 Conflict: No active deployment to pause, or deployment is already paused
- 404 Not Found: Job doesn't exist
- 400 Bad Request: Deployment is in a terminal state (completed, failed, canceled)
Via CLI:
You can also pause using the CLI:
# Pause rollout by job name
expanso-cli job pause-rollout my-job
# Pause with a reason
expanso-cli job pause-rollout my-job --reason "investigating performance issue"
# Dry run to check if pause is possible
expanso-cli job pause-rollout my-job --dry-run
Resuming a Deployment
Resume a paused deployment to continue the rollout:
curl -X POST https://cloud.expanso.io/api/v1/jobs/my-job/rollout/resume \
-H "Authorization: Bearer $EXPANSO_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"reason": "Investigation complete, resuming rollout"
}'
Response:
{
"job_id": "job-abc123",
"deployment_id": "dep-xyz789"
}
The deployment continues from where it was paused—if it was in the middle of wave 2, it completes wave 2 and then proceeds to wave 3.
Via CLI:
You can also resume using the CLI:
# Resume rollout by job name
expanso-cli job resume-rollout my-job
# Resume with a reason
expanso-cli job resume-rollout my-job --reason "issue resolved, continuing deployment"
# Dry run to check if resume is possible
expanso-cli job resume-rollout my-job --dry-run
Rolling Back a Job
Roll back a job to a previous version. This cancels any active deployment and creates a new immediate deployment with the previous version's configuration:
# Roll back to previous version (automatic)
curl -X POST https://cloud.expanso.io/api/v1/jobs/my-job/rollback \
-H "Authorization: Bearer $EXPANSO_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"reason": "Manual rollback due to increased error rate"
}'
Response:
{
"job_id": "job-abc123",
"deployment_id": "dep-new456",
"from_version": 3,
"to_version": 4,
"target_spec": 2,
"created_new": true,
"cancelled_prior": true,
"warnings": []
}
Roll back to a specific version by providing the version number:
curl -X POST "https://cloud.expanso.io/api/v1/jobs/my-job/rollback?version=1" \
-H "Authorization: Bearer $EXPANSO_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"reason": "Rolling back to last known good version"
}'
Understanding the response:
| Field | Description |
|---|---|
| from_version | The version you're rolling back from (e.g., 3) |
| to_version | The new version number after rollback (e.g., 4) - increments because the job spec is updated |
| target_spec | The version whose configuration you're rolling back to (e.g., 2) |
| created_new | Always true - rollback creates a new immediate deployment |
| cancelled_prior | True if there was an active deployment that got canceled |
| warnings | Any non-fatal issues encountered |
Version semantics:
When you roll back from version 3 to version 2's configuration:
- Current deployment (version 3) is canceled
- Job spec is reverted to version 2's configuration
- Job version increments to 4 (new spec version)
- New immediate deployment is created (version 4 with version 2's config)
This means "rollback" creates a new version with the old configuration, not a true rewind.
Error cases:
- 400 Bad Request: Already at first version (no previous version to roll back to)
- 404 Not Found: Target version doesn't exist in version history
- 409 Conflict: Cannot roll back in current state
Common Patterns
Pause for investigation:
# 1. Pause rollout
curl -X POST https://cloud.expanso.io/api/v1/jobs/api-service/rollout/pause \
-H "Authorization: Bearer $EXPANSO_API_TOKEN" \
-d '{"reason": "Investigating error spike"}'
# 2. Investigate (check logs, metrics, etc.)
expanso-cli job logs api-service --level error
# 3. Resume if OK, or roll back if not
curl -X POST https://cloud.expanso.io/api/v1/jobs/api-service/rollout/resume \
-H "Authorization: Bearer $EXPANSO_API_TOKEN" \
-d '{"reason": "False alarm, continuing rollout"}'
Emergency rollback:
# Immediately roll back when you spot critical issues
curl -X POST https://cloud.expanso.io/api/v1/jobs/payment-processor/rollback \
-H "Authorization: Bearer $EXPANSO_API_TOKEN" \
-d '{"reason": "Critical: payment failures detected"}'
Manual vs Automatic Control
When to use manual control:
- You spot critical issues before automatic health checks fail
- You want to control rollout timing during a maintenance window
- You're seeing early warning signs (increased latency, memory usage)
- You need immediate action for business reasons
- You want to pause to coordinate with other systems
When to rely on automatic behavior:
- Normal deployment operations with proper health check configuration
- Rolling out to large fleets where manual intervention isn't practical
- When you want consistent behavior based on objective health metrics
- During off-hours when no one is actively monitoring
Manual Rollback
Something's wrong with your deployment. Maybe you're seeing increased error rates, or users are reporting issues—and you don't want to wait for automatic rollback thresholds to kick in. You can manually trigger a rollback using the Expanso API or CLI to immediately revert to the previous version.
When to Manually Roll Back
You're seeing critical issues before automatic thresholds trigger. Your monitoring shows a spike in errors or performance degradation right after a deployment. Rather than wait for automatic health checks to fail, you want to roll back immediately to minimize impact.
Early warning signs indicate problems. You notice unusual patterns—increased latency, memory usage climbing, or warning messages in logs. The deployment hasn't technically failed health checks yet, but you want to roll back during your maintenance window before users are affected.
Triggering a Rollback
Let's walk through how to trigger a rollback using both the API and CLI.
Via API:
curl -X POST https://cloud.expanso.io/api/v1/jobs/{job-id}/rollback \
-H "Authorization: Bearer $EXPANSO_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"reason": "Rolling back due to increased error rate"
}'
Via CLI:
Trigger the rollback directly on the job:
# Rollback to previous version
expanso-cli job rollback my-job --reason "Manual rollback due to errors"
# Or rollback to a specific version
expanso-cli job rollback my-job --version 3 --reason "Rollback to known good version"
Want to preview the rollback first? Use dry run mode to see what would happen without actually triggering it:
curl -X POST https://cloud.expanso.io/api/v1/jobs/{job-id}/rollback \
-H "Authorization: Bearer $EXPANSO_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"dry_run": true
}'
This returns what would happen—which version you'd roll back to, any warnings, and whether the rollback can proceed—without actually making changes.
What Happens During Rollback
Once you trigger a rollback, here's what happens behind the scenes:
First, the system validates that rollback is possible. The deployment can't be in a terminal state like completed or canceled, and there must be a previous version to roll back to (you can't roll back your first deployment—there's nothing to roll back to).
The orchestrator atomically performs the rollback in a single transaction:
- Updates the job spec to the target version's configuration
- Cancels the active deployment (if one exists) and marks it as failed
- Creates a new immediate deployment with the rolled-back spec
- Increments the job version to create a new version number
The new deployment rolls out immediately using the immediate deployment strategy. All nodes receive the rolled-back configuration at once, ensuring consistent state across your fleet.
The job version increments even though you're using an older spec. For example: v1 → v2 (fails) → v3 (with v1's spec). This preserves a clear audit trail showing the rollback occurred.
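To see that audit trail, list the job's version history; after a rollback you should find a new version whose configuration matches the older version you rolled back to (exact output format may vary):
expanso-cli job versions my-job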
Understanding the Rollback Response
The rollback API returns information about what happened:
{
"job_id": "job-abc123",
"deployment_id": "deploy-new456",
"from_version": 3,
"to_version": 4,
"target_spec": 2,
"created_new": true,
"cancelled_prior": true,
"warnings": []
}
| Field | Description |
|---|---|
| job_id | ID of the job that was rolled back |
| deployment_id | ID of the new rollback deployment that was created |
| from_version | Version you're rolling back from (e.g., 3) |
| to_version | New version number after rollback (e.g., 4) - always increments |
| target_spec | Version whose spec was used for rollback (e.g., 2) |
| created_new | Always true - rollback creates a new deployment |
| cancelled_prior | True if an active deployment was cancelled |
| warnings | Any non-fatal issues encountered during rollback |
Rollback Limitations
Not all deployments can be rolled back. Here's what you need to know:
| Can Roll Back | Cannot Roll Back |
|---|---|
| Failed deployments | First deployment (no previous version exists) |
| In-progress deployments | Already completed deployments |
| Pending deployments | Already canceled deployments |
Once a deployment successfully completes, it's considered stable. To revert to an earlier version, create a new deployment targeting that version rather than rolling back the completed one.
Monitoring Rollback Progress
After triggering a rollback, you'll want to watch it progress. Here's how to monitor the state changes:
# Check the job rollout status
expanso-cli job describe my-job
# Watch execution states in real-time
expanso-cli execution list --job my-job --watch
# Verify executions at the rollback version are coming online
expanso-cli execution list --job my-job --version <to-version>
What to look for:
- Job status: Should show rolling rollback (in progress) in the ROLLOUT section
- Rollback Version: Indicates the target version for the rollback
- Progress: Shows how many nodes have been rolled back
- Executions: Should show the rollback version starting across all nodes
Depending on your fleet size and network conditions, rollback can take several minutes. Edge nodes need to stop current executions and start new executions with the rolled-back configuration. This is normal—monitor progress and ensure nodes are transitioning as expected.
Configuration Examples
Basic Deployment with Health Checks
name: data-processor
type: pipeline
config:
input:
http_server:
path: /ingest
pipeline:
processors:
- mapping: |
root.processed = true
root.timestamp = now()
output:
kafka:
addresses: ["kafka:9092"]
topic: processed-data
rollout:
health_check:
interval: 10s
success_threshold: 2
failure_threshold: 3
max_error_rate: 0.10
deadline: 2m
selector:
match_labels:
env: production
region: us-west
Behavior:
- Health is evaluated every 10 seconds (default)
- Needs 2 consecutive healthy windows to be validated (20 seconds minimum)
- 3 consecutive unhealthy windows trigger rollback (30 seconds of sustained errors)
- If not healthy within 2 minutes, rollback occurs
- Automatic rollback to previous version if validation fails
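To try this out, save the configuration above (for example as data-processor.yaml, a filename chosen here for illustration), deploy it, and watch the rollout:
# Deploy the job with its rollout health checks
expanso-cli job deploy data-processor.yaml

# Follow the rollout and per-node validation
expanso-cli job describe data-processor
expanso-cli execution list --job data-processor --watch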
Conservative Deployment for Critical Services
name: critical-api
type: pipeline
config:
# ... pipeline config ...
rollout:
health_check:
interval: 30s # Longer evaluation windows
success_threshold: 4 # More consecutive healthy windows
failure_threshold: 2 # Fail faster on errors
max_error_rate: 0.05 # Stricter error rate (5%)
deadline: 10m # More time to stabilize
selector:
match_labels:
service: critical
env: production
Behavior:
- Health evaluated every 30 seconds (longer windows for stability)
- Requires 4 consecutive healthy windows (2 minutes minimum)
- Only 2 unhealthy windows needed to trigger rollback (fail fast)
- Stricter error threshold of 5% instead of default 10%
- More conservative approach for critical infrastructure
Fast-Paced Deployment for Non-Critical Services
name: log-aggregator
type: pipeline
config:
# ... pipeline config ...
rollout:
health_check:
interval: 5s # Faster evaluation
success_threshold: 2 # Default threshold
failure_threshold: 3 # Default threshold
deadline: 1m # Fast feedback
selector:
match_labels:
criticality: low
Behavior:
- Health evaluated every 5 seconds (faster feedback)
- Only needs 2 consecutive healthy windows (10 seconds minimum)
- Fails fast (1 minute deadline) for quick iteration
- Suitable for non-critical workloads where speed matters
Monitoring Deployments
Once you've kicked off a job deployment, you'll want to track how it's progressing across your edge nodes. Expanso gives you several ways to monitor rollout status, check which nodes have completed validation, and identify any issues that come up during the deployment process.
Check Job Status
To see the overall rollout status and progress for a job:
expanso-cli job describe data-processor
When a rollout is active, you'll see a ROLLOUT section with detailed progress information:
ROLLOUT
-------
Status: rolling (in progress)
Progress: 75% (3/4 nodes)
Wave: 2/3
Failed Nodes: 0
Health Check: 5m deadline, 10% max error rate
Started: 2025-01-03T10:15:30Z
The ROLLOUT section shows you everything you need to know about the deployment. The Status field tells you which rollout strategy is running (rolling or immediate) and what state it's in—in progress, paused, failed, or completed. Progress shows the percentage complete along with how many nodes have been updated out of the total. For rolling deployments, Wave indicates which wave you're currently in and how many total waves there are. If any nodes fail their health checks, you'll see a Failed Nodes count. The Health Check line summarizes your validation configuration, and Started shows when the rollout began. Once the rollout finishes, you'll also see a Completed timestamp.
When a job is actively running a rollback, the Status field shows the rollback type and which version you're rolling back to:
Status: rolling rollback (in progress)
Rollback Version: 42
If the rollout is paused or has failed, you'll also see the Rollback Version field indicating the target version for the rollback.
Check Job List
To get a quick overview of all jobs and their rollout status:
expanso-cli job list
Jobs that are actively running a rollback operation show a (rollback) indicator in the VERSION column:
NAME TYPE STATE VERSION LABELS
data-processor pipeline deploying 43 (rollback) env=prod
log-aggregator pipeline running 15 env=prod
You'll see the (rollback) indicator when the rollout type is "rollback" and the job state is deploying, rollout_paused, or rollout_failed.
Check Execution State
To monitor execution states and see how validation is progressing on individual nodes:
# List executions for a job
expanso-cli execution list --job data-processor
# Sample output:
# EXECUTION ID STATE NODE STARTED
# exec-abc-123 validating edge-node-01 2m ago
# exec-xyz-789 running edge-node-02 15m ago (stable)
The execution state tells you where each node is in the validation process. When you see validating, that execution is currently in its validation window. Once it passes validation, the state changes to running and gets marked as stable. If an execution hits failed, that means it failed validation and will trigger (or has triggered) a rollback.
View Execution Details
For detailed information about a specific execution:
expanso-cli execution describe exec-abc-123
The output now includes rollout-specific fields:
EXECUTION DETAILS
-----------------
ID: exec-abc-123
Job: data-processor
Version: 43
State: validating
Node: edge-node-01
Created: 2025-01-03T10:20:45Z
Updated: 2025-01-03T10:21:15Z
Rollout Wave: 2
Once the execution passes validation and becomes stable, you'll see additional information:
EXECUTION DETAILS
-----------------
ID: exec-xyz-789
Job: data-processor
Version: 43
State: running
Node: edge-node-02
Created: 2025-01-03T10:15:30Z
Updated: 2025-01-03T10:16:45Z
Rollout Wave: 1
Stable At: 2025-01-03T10:16:45Z
The Rollout Wave field shows which wave this execution belongs to (only displayed if greater than 0). Once validation completes, you'll see Stable At with the timestamp when the execution passed its health checks. Use these fields to track validation progress and identify which executions have completed the health check window.
View Health Diagnostics
When a deployment fails validation, detailed diagnostics help you pinpoint exactly what went wrong. Let's look at the health information available in execution status.
Check execution health:
expanso-cli execution describe exec-abc-123
When you describe an execution, you'll see these health diagnostic fields:
- status: Pending, Healthy, or Unhealthy
- message: Human-readable diagnostic showing component label and type (e.g., "Data Filter [bloblang]: error rate 15.2% exceeds threshold 10.0%")
- details: Machine-parseable metadata including:
  - component_id: UUID from visual builder (if set)
  - component_label: User-friendly component name (if set)
  - component_name: Benthos component type (e.g., "bloblang", "kafka", "http_client")
  - component_type: Component category (input, processor, or output)
  - component_path: Technical Benthos path (e.g., "root.pipeline.processors.0")
  - failure_type: Reason for failure (connection_failed, connection_lost, or error_rate)
  - consecutive_unhealthy_windows: How many consecutive windows were unhealthy
  - error_count: Number of errors in the failing component
  - message_count: Number of messages processed
  - error_rate: Calculated error rate percentage
  - error_message: Most recent error message from the component
You should see output like:
Health Status: Unhealthy
Message: Kafka Output [kafka]: connection failed 3 times (3 consecutive unhealthy windows)
Details:
component_id: uuid-output-kafka-1
component_label: Kafka Output
component_name: kafka
component_type: output
component_path: root.output
failure_type: connection_failed
consecutive_unhealthy_windows: 3
connection_failed_count: 3
error_message: dial tcp 10.0.1.5:9092: connect: connection refused
This level of detail lets you identify the exact component causing failures, understand whether it's a connection issue or processing error, see the actual error messages to diagnose root cause, and determine if the issue is transient or sustained.
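If you're triaging many executions, you can narrow the describe output down to the fields that matter most. A small sketch, assuming the diagnostic fields appear in the text output as shown above:
# Pull out just the failure classification and the latest error message
expanso-cli execution describe exec-abc-123 | grep -E 'failure_type|error_rate|error_message'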
Check for Rollback Events
# View execution history to see state transitions
expanso-cli execution history exec-abc-123
# Look for transitions like:
# Validating → Failed (triggers rollback)
# Then new deployment created with old spec
Best Practices
1. Set Appropriate Thresholds
success_threshold (default: 2 windows):
- Too low (1 window): Single lucky success passes validation, more false positives
- Too high (5+ windows): Delays safe rollouts, increases deployment time
- Recommendation:
- Use default (2) for most services
- Use 3-4 for critical services requiring extra confidence
- Use 1 for fast-iteration development environments
failure_threshold (default: 3 windows):
- Too low (1-2 windows): Transient errors cause unnecessary rollbacks
- Too high (5+ windows): Bad deployments stay active longer before rollback
- Recommendation:
- Use default (3) for most services
- Use 2 for critical services (fail faster)
- Use 4-5 for non-critical services (more tolerance)
interval (default: 10s):
- Too short (< 5s): Noisy evaluation, not enough data per window
- Too long (> 30s): Slower feedback, delays detection of issues
- Recommendation:
- Use default (10s) for most services
- Use 5s for fast-feedback development
- Use 20-30s for services with variable traffic patterns
2. Set Realistic Deadlines
Deadline should account for:
- Minimum validation time: success_threshold × interval (e.g., 2 × 10s = 20s)
- Startup time (loading configs, establishing connections)
- Initialization delays (cache warming, etc.)
- Traffic ramp-up time (if no traffic = no health evaluation)
Formula:
Deadline = (success_threshold × interval) + StartupTime + Buffer
Example:
success_threshold: 3
interval: 10s
startup_time: 15s
buffer: 30s
Deadline = (3 × 10s) + 15s + 30s = 75s, so round up to 2m
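If you prefer to compute this in a script before writing the job spec, the arithmetic is the same (values below are the ones from the example):
# Suggested deadline in seconds = (success_threshold x interval) + startup + buffer
SUCCESS_THRESHOLD=3
INTERVAL_S=10
STARTUP_S=15
BUFFER_S=30
DEADLINE_S=$(( SUCCESS_THRESHOLD * INTERVAL_S + STARTUP_S + BUFFER_S ))
echo "Suggested deadline: ${DEADLINE_S}s"   # 75s -> round up to 2m in the job spec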
3. Test Rollback Scenarios
Verify rollback works before production:
# Deploy a job version that will fail
expanso-cli job deploy broken-version.yaml
# Watch for automatic rollback creating new deployment
expanso-cli deployment list --job my-job --watch
# Verify job version incremented with old spec
expanso-cli job versions my-job
4. Monitor Deployment Metrics
Track these metrics:
- Validation success rate
- Time spent in validation state
- Rollback frequency
- Time to rollback
5. Handle First Deployments
For the first deployment of a job:
- No rollback target exists
- Failed validation uses normal retry logic
- Consider more conservative health check settings
Troubleshooting
Execution Stuck in Validating State
Symptoms: Execution remains in validating state for extended period
Diagnosis:
expanso-cli execution describe <exec-id>
# Check: created_at vs current time
# Check: health_check deadline configuration
Possible Causes:
- Execution is flapping (alternating healthy/unhealthy windows)
- Not receiving traffic (idle windows don't count toward success threshold)
- Deadline is very long
- Success threshold requirements not yet met
Solution:
- Check edge node logs for execution errors (see the commands after this list)
- Verify execution is actually processing data (needs traffic for health evaluation)
- Look for alternating healthy/unhealthy patterns (indicates instability)
- Consider adjusting thresholds if too conservative
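For the first two checks, something like this is usually enough:
# Look for recent errors from the job
expanso-cli job logs <job-name> --level error

# Check how long the execution has been validating; the health details
# (message_count) show whether any traffic is arriving at all
expanso-cli execution describe <exec-id>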
Rollback Not Occurring
Symptoms: Execution fails but doesn't trigger rollback
Diagnosis:
expanso-cli execution describe <failed-exec-id>
# Check: rollback_to_version field (should be > 0)
# Check: stable_at field (should be empty if new deployment)
# Check deployment state
expanso-cli deployment describe <deployment-id>
Possible Causes:
- First deployment (no previous version to roll back to)
- Execution was previously stable (has stable_at set)
- Job doesn't have deployment configuration
- Orchestrator rollback hasn't triggered yet (waiting for threshold)
Solution:
- Verify deployment configuration exists in job spec
- Check if previous version exists using expanso-cli job versions <job>
- Review execution history to see if it was previously validated
- For cluster-wide issues, check if orchestrator rollback is in progress
Manual Rollback Fails
Symptoms: Manual rollback API call returns error
Diagnosis:
# Try with dry_run to see what would happen
curl -X POST https://cloud.expanso.io/api/v1/jobs/{job-id}/rollback \
-H "Authorization: Bearer $EXPANSO_API_TOKEN" \
-H "Content-Type: application/json" \
-d '{"dry_run": true}'
Common Errors:
"rollback already in progress"
- Another rollback is in progress
- Wait for current rollback to complete
- Check job state:
expanso-cli job describe <job-id>
"rollout in terminal state"
- Rollout already completed or canceled
- Cannot roll back completed rollouts
- If you need to revert, deploy the previous version as a new deployment
"no previous version to rollback to"
- This is the first deployment (RollbackToVersion = 0)
- No previous version exists
- Stop the job and redeploy with fixes:
expanso-cli job stop <job-id>
Solution:
- Review error message for specific constraint violated
- Check deployment state and history
- Use dry run mode to validate before attempting rollback
First Deployment Fails Validation
Symptoms: First deployment of a job fails validation and execution stops (doesn't retry)
Diagnosis:
# Check if this is first deployment
expanso-cli job describe my-job
# If this is the first deployment, there's no previous version to roll back to
# Check execution state
expanso-cli execution describe <exec-id>
# Should show: state=stopped, message about no rollback target
Behavior:
- First deployment has no previous version to roll back to
- Edge nodes stop failed executions instead of retrying indefinitely
- This prevents infinite retry loops on permanently broken deployments
Solution:
- Fix the underlying issue (check validation failure cause)
- Deploy corrected version:
# Update job configuration
expanso-cli job deploy fixed-version.yaml
- Consider more lenient health check thresholds for initial deployment:
rollout:
health_check:
failure_threshold: 5 # More tolerance for startup issues
deadline: 10m # More time to stabilize
Infinite Rollback Loops
Symptoms: Job keeps switching between versions
This should not happen due to StableAt protection, but if it does:
Diagnosis:
# Check execution history
expanso-cli execution history <exec-id-version-a>
expanso-cli execution history <exec-id-version-b>
# Look for stable_at timestamps
expanso-cli execution describe <exec-id>
Solution:
- Report as a bug - StableAt mechanism should prevent this
- Manually stop problematic job:
expanso-cli job stop <job-name>
- Deploy a known-good version
Target Version Missing from History
Symptoms: Rollback fails with "target version not found" error
Diagnosis:
# Check what versions exist in history
expanso-cli job versions <job-name>
# Check if the target version spec exists
expanso-cli job describe <job-name> --version <target-version>
Possible Causes:
- Target version was never created (version gap in history)
- Job version history is incomplete or corrupted
- Requested version number doesn't exist
Solution:
- Verify the version exists in history:
expanso-cli job versions <job>
- Check for gaps in version sequence
- If version history is corrupt, manually deploy a known-good configuration
- Use the --version flag to specify a version that exists
When to Use Deployments
Use deployment configuration when:
- Updating production services that require high availability
- Deploying to critical edge infrastructure
- You want automatic safety nets for bad deployments
- You're rolling out changes across a large edge fleet
Skip deployment configuration when:
- Rapidly iterating in development environments
- Deploying one-off data processing jobs
- Managing system/operational jobs (cleanup tasks, etc.)
- Immediate updates are acceptable (no validation period needed)
Related Documentation
- Edge Node Deployment: How to deploy and manage edge nodes themselves
- Fleet Monitoring: Monitor deployment health across your fleet
- CLI Job Commands: Manage job deployments via CLI
- Execution States: Understanding execution lifecycle