Pipeline Stuck in Degraded State
A pipeline can get stuck in a degraded state and fail to recover on its own. This usually happens when the executor is holding a stale process that ignores restart signals from the scheduler.
Diagnose the Problem
First, check whether your execution is actually stuck:
# List executions by state
expanso-cli execution list --state degraded
# Check how long the execution has been degraded
expanso-cli execution describe <execution-id>
# If the execution has been degraded longer than your retry backoff period
# (e.g., 10+ minutes when backoff is 5 minutes), it's likely stuck
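The rule of thumb above can be sketched as a small shell helper. This is only a sketch, not part of expanso-cli. The function name is_likely_stuck is hypothetical, and the threshold of twice the backoff period is an assumption taken from the 10-minutes-for-a-5-minute-backoff example. You would need to supply the degraded-since timestamp yourself from the describe output.

```shell
#!/bin/sh
# Hypothetical helper (not an expanso-cli feature): decide whether an
# execution has been degraded long enough to be treated as stuck.
# Args: degraded_since_epoch now_epoch backoff_seconds
# Assumption: "stuck" means degraded for at least 2x the retry backoff.
is_likely_stuck() {
    degraded_since=$1
    now=$2
    backoff=$3
    elapsed=$((now - degraded_since))
    threshold=$((backoff * 2))
    # Succeeds (exit 0) when the elapsed time has reached the threshold
    [ "$elapsed" -ge "$threshold" ]
}

# Example: degraded 12 minutes ago with a 5-minute backoff
now=$(date +%s)
if is_likely_stuck "$((now - 720))" "$now" 300; then
    echo "likely stuck"
else
    echo "still within the normal retry window"
fi
```

With these example values the elapsed time (720 seconds) exceeds the 600-second threshold, so the script prints "likely stuck". Adjust the multiplier if your retry policy uses a different backoff scheme.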
Solutions
Stop and restart the job
# Stop the job (terminates stuck executions)
expanso-cli job stop <job-id>
# Restart the job
expanso-cli job deploy your-pipeline.yaml
Restart the edge agent
If stopping and redeploying the job doesn't clear the degraded state, restart the agent on the affected node:
# On the edge node
sudo systemctl restart expanso-edge
Monitoring and Prevention
Catch stuck executions automatically:
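# Check for executions stuck in Degraded for >10 minutes (600 seconds).
# Note: this comparison assumes updated_at is a Unix epoch timestamp in seconds.
# If your output uses ISO 8601 strings, convert first, e.g. (.updated_at | fromdateiso8601).
expanso-cli execution list --state degraded --format json | \
  jq '.[] | select(.updated_at < (now - 600))'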
To prevent this in the future:
- Set alerts for executions stuck in Degraded state
- Use health checks and automatic rollback when available
- Design pipelines with error handling and retries
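One way to wire up the first item is a small check that a cron job or monitoring agent can run. This is a minimal sketch under stated assumptions: check_stuck_count is a hypothetical function, not an expanso-cli feature, and the exit codes follow the common monitoring-plugin convention (0 for OK, 2 for CRITICAL) rather than anything Expanso defines.

```shell
#!/bin/sh
# Hypothetical alert check (not part of expanso-cli).
# Arg: number of executions stuck in Degraded past your threshold
# Exit codes: 0 = OK, 2 = CRITICAL (common monitoring-plugin convention)
check_stuck_count() {
    count=$1
    if [ "$count" -gt 0 ]; then
        echo "CRITICAL: $count execution(s) stuck in Degraded"
        return 2
    fi
    echo "OK: no stuck executions"
    return 0
}
```

The count could come from wrapping the jq filter shown earlier in [ ... ] | length and passing the result as the argument. The monitoring system then only needs to alert on a non-zero exit status.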