Pipeline Stuck in Degraded State

Sometimes a pipeline gets stuck in a degraded state and won't recover on its own. This usually happens when the executor has a stale process that ignores restart signals from the scheduler.

Diagnose the Problem

First, check if your execution is actually stuck:

# List executions by state
expanso-cli execution list --state degraded

# Check how long the execution has been degraded
expanso-cli execution describe <execution-id>

# If the execution has been degraded well beyond your retry backoff period
# (e.g., 10+ minutes when the backoff is 5 minutes), it's likely stuck
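
The rule of thumb above can be written as a small shell check. This is only a sketch: `is_likely_stuck` is a hypothetical helper, not part of expanso-cli, and the default multiplier of 2 is an assumption taken from the 10-minute / 5-minute example. You would supply the degraded duration yourself from the `describe` output.

```shell
# Hypothetical helper: decide whether a degraded execution is likely stuck.
# Args: seconds spent degraded, retry backoff in seconds, optional multiplier.
is_likely_stuck() {
  degraded_secs=$1
  backoff_secs=$2
  factor=${3:-2}   # assumption: 2x backoff, matching the 10 min / 5 min example
  [ "$degraded_secs" -ge $((backoff_secs * factor)) ]
}

# Example: degraded for 12 minutes with a 5-minute backoff
if is_likely_stuck 720 300; then
  echo "likely stuck"
else
  echo "may still recover"
fi
```

Raising the multiplier makes the check more conservative, which may suit pipelines whose retries are expected to take several backoff cycles.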

Solutions

Stop and restart the job

# Stop the job (terminates stuck executions)
expanso-cli job stop <job-id>

# Restart the job
expanso-cli job deploy your-pipeline.yaml

Restart the edge agent

If stopping and redeploying the job doesn't clear the degraded state, restart the agent on the affected node:

# On the edge node
sudo systemctl restart expanso-edge
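
After a restart it can help to confirm that the agent actually came back before you re-check the execution. The sketch below polls a check command until it succeeds. `wait_until` is a hypothetical helper; the usage lines assume the systemd unit is named `expanso-edge`, as in the command above, and use the standard `systemctl is-active --quiet` check.

```shell
# Hypothetical helper: retry a check command until it succeeds.
# Args: max attempts, delay in seconds between attempts, then the command.
# Always makes at least one attempt.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}

# Usage on the edge node (left commented so the sketch runs anywhere):
# sudo systemctl restart expanso-edge
# wait_until 10 3 systemctl is-active --quiet expanso-edge \
#   || echo "expanso-edge did not become active" >&2
```

An active service does not by itself mean the execution has recovered, so follow up with `expanso-cli execution list --state degraded`.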

Monitoring and Prevention

Catch stuck executions automatically:

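# Check for executions stuck in Degraded for >10 minutes
# Assumes updated_at is an ISO 8601 UTC string (e.g., 2024-01-01T00:00:00Z) that must be
# converted before comparing with now; drop the conversion if it is already epoch seconds.
# Note: fromdateiso8601 does not accept fractional seconds.
expanso-cli execution list --state degraded --format json | \
jq '.[] | select((.updated_at | fromdateiso8601) < (now - 600))'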

To prevent this in the future:

  • Set alerts for executions stuck in Degraded state
  • Use health checks and automatic rollback when available
  • Design pipelines with error handling and retries
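
One possible way to wire up the first item is a small check that a cron job or monitoring agent can run. This is a sketch under assumptions: `alert_if_stuck` is a hypothetical helper that expects one stuck execution per input line (for example, the output of the jq filter shown earlier run with `jq -c`), and writing to stderr stands in for whatever alerting channel you actually use.

```shell
# Hypothetical helper: read one stuck execution per line on stdin.
# Prints an alert to stderr and returns 1 if any are found, otherwise returns 0.
alert_if_stuck() {
  count=0
  # The "|| [ -n ... ]" clause handles a final line without a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    if [ -n "$line" ]; then
      count=$((count + 1))
    fi
  done
  if [ "$count" -gt 0 ]; then
    echo "ALERT: $count execution(s) stuck in Degraded state" >&2
    return 1
  fi
  return 0
}

# Usage (left commented; requires expanso-cli and jq):
# expanso-cli execution list --state degraded --format json \
#   | jq -c '<filter for executions degraded longer than 10 minutes>' \
#   | alert_if_stuck
```

The non-zero return code lets a scheduler or monitoring wrapper treat any stuck execution as a failed check.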