Job Deployments with Health Checks and Rollback
When you update jobs running on edge nodes, Expanso validates new versions before fully committing to them. If health checks fail, the orchestrator automatically rolls back to the previous version to keep your edge infrastructure stable.
This guide explains how job deployments work, how to configure health checks, and how the rollback system keeps your edge infrastructure resilient.
Overview
When you deploy a standard job, the new version starts immediately. Deployment jobs work differently—they enter a validation period with health checks before becoming fully active. If validation fails, Expanso automatically reverts to the last working version.
This gives you automatic rollback when deployments fail, prevents bad deployments from destabilizing your edge fleet, and keeps things running even during failed updates.
Job State Lifecycle
When you deploy a job, it moves through different states as it rolls out to your edge fleet and starts processing data. Knowing these states helps you monitor deployments and debug issues when things go wrong.
Job States vs. Execution States
Expanso tracks two different kinds of states, and it's important to understand the difference:
- Job states show the overall status of your job across your entire fleet (like deploying, running, or degraded)
- Execution states show what's happening with individual pipeline instances on specific nodes (like starting, validating, or running)
The job state is what matters for monitoring—it reflects the deployment and health status of your job across all edge nodes. Execution states are implementation details that the orchestrator uses internally.
Job State Transitions
When you deploy or update a job, it flows through these states:
```
pending → queued → deploying → running
                       ↓
               rollout_paused (manual pause)
                       ↓
               rollout_failed (health checks failed)
```
After a job reaches running, it can transition to degraded if health issues pop up:
```
running ↔ degraded (daemon jobs only)
```
State Descriptions
Here's what each state means and when you'll see it.
pending: Your job is created but hasn't been scheduled yet. The orchestrator is preparing to start the deployment.
queued: The job is scheduled but there aren't any nodes available to run it. You'll see this when your fleet is at capacity or when no nodes match the job's selector labels.
deploying: The rollout is actively happening. The orchestrator is deploying the job to nodes, and executions are going through health validation.
During this state, the job is eligible for automatic rollback if health checks fail. This is the critical window where Expanso validates your deployment before it goes fully live.
running: Everything's healthy and stable. The rollout completed successfully, and your job is processing data normally across the fleet.
rollout_paused: You manually paused the rollout. The deployment is frozen—no new nodes get updated, but nodes that are already running the job keep working.
You can resume the rollout whenever you're ready, or roll back to the previous version if something looks wrong.
rollout_failed: Health checks failed during deployment, and Expanso automatically rolled back the job. It won't make further deployment progress until you fix the issue and redeploy.
This happens when the orchestrator detects that executions aren't passing health validation during the deploying state.
degraded: Your job was running fine, but some executions have become unhealthy. This only applies to daemon-type jobs that continuously retry failed executions.
The orchestrator monitors degraded jobs but doesn't automatically roll back, since the deployment was previously validated. Degradation usually means there's an issue with specific nodes or data sources, not the job configuration itself.
completed: The job finished successfully. This only applies to batch-type jobs that have a defined end state.
failed: The job failed—typically this means all executions failed or the job hit a terminal error it couldn't recover from.
stopped: You explicitly stopped the job using the API or CLI. The job won't restart until you start it again.
Rollout Completion Semantics
Here's a gotcha to watch out for: the job's status.rollout.completed_at timestamp tells you when a rollout finished—not the job state.
A rollout is complete when completed_at has a non-null value, regardless of whether the job state is running, degraded, or rollout_failed.
When you're checking if a rollout is still active, use completed_at == null to detect it—don't rely on the job state.
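In client code, this check is a one-liner. Here's a minimal Python sketch, assuming the job status is available as a parsed JSON-like dict (the dict shapes around status.rollout.completed_at are illustrative):

```python
def rollout_active(job: dict) -> bool:
    """Return True while a rollout is still in progress.

    A rollout is finished exactly when status.rollout.completed_at is
    non-null, regardless of whether the job state reads running,
    degraded, or rollout_failed.
    """
    rollout = job.get("status", {}).get("rollout", {})
    return rollout.get("completed_at") is None

# A degraded job whose rollout already finished is NOT an active rollout:
degraded_job = {"status": {"state": "degraded",
                           "rollout": {"completed_at": "2025-06-01T10:05:00Z"}}}
# A deploying job with no completed_at timestamp still is:
deploying_job = {"status": {"state": "deploying", "rollout": {}}}
```

Note that the function never looks at the state field, which is exactly the point of this section.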
Daemon vs. Ops Jobs
State transitions work differently depending on your job type.
Daemon jobs run continuously and process data indefinitely:
- They can transition between running and degraded based on ongoing health
- Edge nodes retry failed executions with exponential backoff
- The orchestrator monitors degraded daemon jobs but doesn't immediately roll them back
Ops jobs are one-time operations that finish when the work is done:
- They transition from running to completed or failed when finished
- They don't use the degraded state since they're not continuously retrying
- There's no automatic rollback after completion—if you need to undo changes, use manual rollback
How Deployments Work
Execution State Lifecycle
When you deploy a new version of a job with deployment configuration, executions go through an extended lifecycle:
```
Pending → Starting → Validating → Running
                         ↓
                      Failed → Orchestrator Rollback
                         ↓
                      Degraded (retrying)
```
Here's what each state means:

- Pending: the execution is scheduled but hasn't started yet
- Starting: the execution is being initialized
- Validating: the execution is running but still undergoing health validation, so it remains eligible for rollback
- Running: the execution passed health checks and is stable
- Failed: the execution is terminal (typically for ops-type pipelines)
- Degraded: the execution is unhealthy but retrying (typically for daemon-type pipelines)
The Validating state is unique to deployments. During this window, health checks evaluate the execution using consecutive time intervals, a deadline limits how long validation can take, and if validation fails, the orchestrator initiates rollback. Once an execution reaches Running state, it's marked as stable.
When a daemon-type pipeline encounters failures, edge nodes mark the execution as Degraded and retry locally with backoff. The orchestrator monitors these Degraded executions—if too many nodes fail, it triggers a coordinated rollback across your fleet.
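The fleet-level decision reduces to a threshold check over execution states. Here's an illustrative Python sketch; the actual fraction of failing nodes that triggers a coordinated rollback isn't specified in this guide, so max_degraded_fraction is an assumed parameter:

```python
def should_rollback_fleet(execution_states: list[str],
                          max_degraded_fraction: float = 0.5) -> bool:
    """Decide whether degraded daemon executions warrant a coordinated rollback.

    The 0.5 cutoff is an illustrative assumption, not Expanso's documented
    threshold.
    """
    if not execution_states:
        return False
    degraded = sum(1 for s in execution_states if s == "Degraded")
    return degraded / len(execution_states) > max_degraded_fraction
```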
Health Check Configuration
When you deploy a new job, you want to make sure it's actually working before it takes over from the previous version. Health checks let Expanso watch your deployment as it starts up, validate that it's processing data without errors, and automatically roll back if something goes wrong.
Expanso uses window-based health evaluation—it watches your job over consecutive time intervals and counts how many are healthy vs. unhealthy. This prevents both false positives (one lucky success) and false negatives (one transient error) from affecting your deployment.
Here's how to configure them:
```yaml
name: my-api-service
type: pipeline
config:
  # ... your pipeline config ...
rollout:
  health_check:
    interval: 10s            # Evaluation window duration (default: 10s)
    success_threshold: 2     # Consecutive healthy windows needed (default: 2)
    failure_threshold: 3     # Consecutive unhealthy windows before rollback (default: 3)
    max_error_rate: 0.10     # Maximum error rate per window (default: 0.10 = 10%)
    deadline: 5m             # Maximum time to wait for validation
  # Other rollout options...
selector:
  match_labels:
    env: production
```
interval (default: 10s) sets how long each health evaluation window lasts. The system calculates error rates per interval, not over the lifetime of the execution.
success_threshold (default: 2) is how many consecutive healthy intervals you need before the execution is validated and marked as stable. This prevents transient successes from passing validation.
failure_threshold (default: 3) is how many consecutive unhealthy intervals trigger a rollback. This prevents transient errors from causing unnecessary rollbacks.
max_error_rate (default: 0.10) sets the maximum error rate allowed within an interval. If errors exceed 10% of processed messages in a window, that window is marked unhealthy.
deadline (default varies) sets the maximum time allowed for the execution to achieve validation. If this expires without reaching the success threshold, the execution fails and triggers rollback.
How It Works
When your execution starts, it enters the Starting state. Once it begins running, it transitions to Validating.
During validation, the health check evaluates each interval. If the error rate is below max_error_rate, the window is healthy. Otherwise it's unhealthy. Idle windows with no traffic don't count toward either threshold.
After success_threshold consecutive healthy windows, the execution transitions to Running (stable). After failure_threshold consecutive unhealthy windows, it's marked as Failed and triggers rollback.
If the deadline expires without validation, the execution is also marked as Failed and rolls back.
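The per-window classification can be sketched in a few lines of Python (an illustrative model of the rules above, not Expanso's internal code):

```python
def evaluate_window(processed: int, errors: int,
                    max_error_rate: float = 0.10) -> str:
    """Classify a single health-check interval.

    Idle windows (no traffic) are "pending" and count toward neither
    threshold; otherwise the window is healthy iff the error rate stays
    at or below max_error_rate.
    """
    if processed == 0:
        return "pending"
    return "unhealthy" if errors / processed > max_error_rate else "healthy"
```

Note that the rate is computed per window, not over the lifetime of the execution: 10 errors out of 100 messages in one window is exactly at the default threshold and still healthy, while 15 out of 100 is not.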
How Health Evaluation Works
During validation, Expanso checks each pipeline component separately to pinpoint exactly what's failing. Instead of getting a vague "pipeline unhealthy" message, you'll know which specific input, processor, or output has a problem.
Component-by-Component Evaluation
Expanso applies different health checks depending on the component type:
Inputs (e.g., input.kafka.0, input.http_server.0):
- Checks connection errors: input_connection_failed and input_connection_lost metrics
- Connection failures mark the input as unhealthy
- Error rate checks don't apply—inputs receive data, they don't process it
Processors (e.g., processor.mapping.0, processor.filter.1):
- Checks error rate: ratio of processor_error to processor_received messages
- If error rate exceeds the max_error_rate threshold, the processor is unhealthy
- Connection checks don't apply—processors don't maintain connections
Outputs (e.g., output.kafka.0, output.http_client.0):
- Checks both connection errors AND error rate
- Connection metrics: output_connection_failed and output_connection_lost
- Error rate: output_error / output_sent
- Either connection failures or high error rate marks the output as unhealthy
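Putting the three rule sets together, a per-component check might look like the following Python sketch. The metric names are the ones listed above; the flat metrics-dict shape and return convention are illustrative assumptions:

```python
def component_health(kind: str, metrics: dict,
                     max_error_rate: float = 0.10) -> tuple[bool, str]:
    """Evaluate one component by type: inputs check connections,
    processors check error rate, outputs check both."""
    if kind in ("input", "output"):
        failed = metrics.get(f"{kind}_connection_failed", 0)
        lost = metrics.get(f"{kind}_connection_lost", 0)
        if failed or lost:
            return (False, "connection failed")
    if kind == "processor":
        received = metrics.get("processor_received", 0)
        if received and metrics.get("processor_error", 0) / received > max_error_rate:
            return (False, "error rate exceeds threshold")
    if kind == "output":
        sent = metrics.get("output_sent", 0)
        if sent and metrics.get("output_error", 0) / sent > max_error_rate:
            return (False, "error rate exceeds threshold")
    return (True, "healthy")
```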
Fail-Fast Strategy
Expanso checks components in order and stops at the first unhealthy one. This fail-fast approach gives you:
- Precise diagnostics: You know exactly which component failed (e.g., "Kafka Output [kafka]: connection failed")
- Early detection: No need to wait for all components to be evaluated
- Clear root cause: Instead of "pipeline unhealthy," you get "Data Filter [bloblang]: error rate 15.2% exceeds threshold 10.0%"
Activity-Based Evaluation
Components with no traffic during a window become Pending (not Unhealthy). This makes sense during:
- Startup: Pipeline is initializing, connections are establishing
- Low traffic: Waiting for data to arrive
- Idle intervals: Between bursts of activity
Pending windows don't count toward consecutive healthy or unhealthy thresholds. This prevents idle periods from triggering unnecessary rollbacks.
Health Check Lifecycle
For each interval window (default: 10 seconds):
- Metrics collection: OTel exporter forwards component-specific metrics to the health tracker
- Window evaluation: When the interval ends, Expanso evaluates each component:
  - Check for activity (any traffic?)
  - Check connection errors (inputs and outputs)
  - Check error rate (processors and outputs)
- Consecutive tracking:
  - Healthy window → increment consecutive healthy count, reset unhealthy count
  - Unhealthy window → increment consecutive unhealthy count, reset healthy count
  - Pending window → don't change either count
- State transition:
  - If consecutive healthy count reaches success_threshold → transition to Running (stable)
  - If consecutive unhealthy count reaches failure_threshold → transition to Failed (triggers rollback)
  - If deadline expires → transition to Failed (triggers rollback)
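The counting rules above can be condensed into a small tracker. This is an illustrative Python model, not Expanso's internal API; the threshold defaults mirror the documented ones:

```python
class HealthTracker:
    """Track consecutive window results and decide the execution state."""

    def __init__(self, success_threshold: int = 2, failure_threshold: int = 3):
        self.success_threshold = success_threshold
        self.failure_threshold = failure_threshold
        self.healthy = 0
        self.unhealthy = 0

    def observe(self, window: str, deadline_expired: bool = False) -> str:
        if window == "healthy":
            self.healthy += 1
            self.unhealthy = 0
        elif window == "unhealthy":
            self.unhealthy += 1
            self.healthy = 0
        # "pending" windows change neither counter
        if self.healthy >= self.success_threshold:
            return "Running"   # stable
        if self.unhealthy >= self.failure_threshold or deadline_expired:
            return "Failed"    # triggers rollback
        return "Validating"
```

Notice how a pending window in the middle of a healthy streak neither helps nor hurts: healthy, pending, healthy still validates with the default success_threshold of 2.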
Continuous Health Monitoring
Health monitoring doesn't stop after your deployment passes initial validation. Expanso continuously monitors health throughout the entire execution lifecycle—whether the execution is in Validating, Running, or Degraded state.
This continuous approach ensures that stable deployments stay tracked for health degradation, late-joining nodes get properly validated before being marked as stable, and degraded executions must prove they're healthy before returning to Running state.
How Continuous Monitoring Works
During Validating State:
When an execution is first validating, health checks evaluate it using the configured deadline. If health is proven (HealthHealthy), the execution transitions to Running and gets a StableAt timestamp. If health fails (HealthUnhealthy), the execution transitions to Degraded for daemon jobs or Failed for ops jobs.
Here's where it gets interesting: if the deadline expires while health is still HealthPending, the system gives "benefit of the doubt" and marks the execution as Running. This deadline-based benefit of doubt helps deployment waves progress even when data traffic is low or health checks haven't reported yet.
During Running State (Stable):
Once an execution has a StableAt timestamp, it's considered stable—but Expanso keeps monitoring. If health becomes HealthUnhealthy, the execution transitions to Degraded. There's no deadline here—the execution stays Running until it's proven unhealthy. This prevents stable deployments from being unnecessarily marked as degraded during idle periods.
During Running State (Late-Joiners):
Late-joiners are nodes that come online after the initial deployment. These executions don't have a StableAt timestamp yet, and they're handled differently. They wait indefinitely for health to be proven—there's no deadline pressure and no "benefit of doubt" for late-joiners. They must achieve HealthHealthy status before getting a StableAt timestamp. This ensures late-arriving nodes don't get marked as stable without validation.
During Degraded State:
Degraded executions must prove HealthHealthy to transition back to Running. There's no "benefit of doubt" for recovery—health must be explicitly proven. If health remains HealthUnhealthy or HealthPending, the execution stays Degraded. Edge nodes continue retrying with exponential backoff while reporting Degraded state, preventing executions from bouncing between states without actual recovery.
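The four cases above can be condensed into one transition sketch. State and health names follow this doc; the function shape itself is illustrative:

```python
def next_state(state: str, health: str,
               deadline_expired: bool = False,
               job_type: str = "daemon") -> str:
    """Continuous-monitoring transitions for a single execution."""
    if state == "Validating":
        if health == "HealthHealthy":
            return "Running"        # StableAt is set at this point
        if health == "HealthUnhealthy":
            return "Degraded" if job_type == "daemon" else "Failed"
        if deadline_expired:
            return "Running"        # benefit of the doubt on HealthPending
        return "Validating"
    if state == "Running":
        # Stable executions and late-joiners alike: no deadline here, only
        # proven unhealthiness demotes them.
        return "Degraded" if health == "HealthUnhealthy" else "Running"
    if state == "Degraded":
        # Recovery must be explicitly proven; no benefit of the doubt.
        return "Running" if health == "HealthHealthy" else "Degraded"
    return state
```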
Late-Joiner Behavior
A "late-joiner" is an edge node that comes online after a job deployment has already rolled out to other nodes. This commonly happens when you add a new edge node to your fleet after a job is deployed, an existing node was offline during deployment and reconnects later, or an edge node restarts and needs to catch up with the current job version.
How Late-Joiners Are Handled:
When a late-joiner execution starts, it's placed directly into the Running state (bypassing Validating). However, it doesn't immediately get a StableAt timestamp—it must prove health first.
The key difference from initial deployment validation is that late-joiners wait indefinitely for health proof. There's no deadline, and there's no "benefit of doubt" if health data is pending. The execution must explicitly achieve HealthHealthy status before receiving a StableAt timestamp. This conservative approach ensures that late-arriving nodes are fully validated before being considered part of your stable fleet, even if they join long after the initial deployment wave.
Example Timeline:
Here's what this looks like in practice. Job v2 is deployed at 10:00 AM to 50 nodes. All 50 nodes validate health and reach Running (stable) state by 10:05 AM. A new node joins your fleet at 2:00 PM and receives Job v2. The new node starts the execution in Running state (no StableAt yet), and the system continuously monitors health without any deadline. Once HealthHealthy is confirmed, the execution gets its StableAt timestamp, and the node is now fully validated and part of your stable deployment.
State Transition Summary
Here's how executions transition between states based on health monitoring:
| Current State | Health Status | Deadline Expired? | Transition |
|---|---|---|---|
| Validating | HealthHealthy | N/A | → Running + StableAt |
| Validating | HealthUnhealthy | N/A | → Degraded (daemon) or Failed (ops) |
| Validating | HealthPending | Yes | → Running (benefit of the doubt) |
| Running (stable) | HealthUnhealthy | N/A | → Degraded |
| Running (late-joiner) | HealthHealthy | N/A | StableAt set (fully validated) |
| Degraded | HealthHealthy | N/A | → Running |