Nodes Show Connected After Restart
After an orchestrator restart (upgrades, crashes, deployments), nodes may briefly show "Connected" even though they haven't actually reconnected yet.
Symptoms
- Nodes appear "Connected" immediately after orchestrator restart
- Jobs scheduled during the first 90 seconds may have executions stuck in Pending
- After about 90 seconds, nodes that failed to reconnect transition to "Disconnected"
Why This Happens
The orchestrator saves node state to disk. When it restarts, it loads the last saved state, so nodes still show as "Connected" even though the underlying network connections no longer exist.
To handle this, the orchestrator uses a 90-second grace period after startup. During this window, it doesn't mark nodes as disconnected, giving them time to re-establish heartbeats. Nodes that don't reconnect within this window are then marked as disconnected.
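The grace-period rule can be modeled as a small decision function. This is an illustrative sketch, not the orchestrator's actual code: the function name, the `heartbeat_timeout` parameter, and its 30-second default are assumptions made for the example. Only the 90-second grace period comes from the description above.

```python
GRACE_PERIOD_SECONDS = 90  # grace period described above


def displayed_state(now, orchestrator_started_at, last_heartbeat_at,
                    heartbeat_timeout=30.0):
    """Return the state shown for a node that was persisted as "Connected".

    All times are seconds on the same clock. last_heartbeat_at is None
    when no heartbeat has arrived since the restart.
    heartbeat_timeout is a hypothetical value, not a documented setting.
    """
    in_grace = (now - orchestrator_started_at) < GRACE_PERIOD_SECONDS
    if in_grace:
        # Heartbeat timeouts are not checked yet, so the persisted
        # (possibly stale) state is still reported.
        return "Connected"
    if (last_heartbeat_at is not None
            and (now - last_heartbeat_at) <= heartbeat_timeout):
        return "Connected"
    return "Disconnected"
```

Under this model, a node that never reconnects is reported as "Connected" at 10 seconds after startup and as "Disconnected" at 95 seconds, which matches the symptoms listed earlier.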
How It Self-Heals
You typically don't need to do anything. The system recovers automatically:
- Grace period begins: The orchestrator waits 90 seconds before checking heartbeat timeouts
- Nodes reconnect: Edge nodes detect the restart and re-establish their connections
- State stabilizes: Nodes with fresh heartbeats stay connected; others transition to disconnected
After 90 seconds, the system accurately reflects which nodes are actually connected.
Debugging
If you suspect connection issues after a restart:
# List nodes and their connection status
expanso-cli node list
# Check whether the orchestrator is receiving heartbeats from a specific node
expanso-cli node describe <node-id>
# Check execution states for recently deployed jobs
expanso-cli execution list --job <job-id>
If executions stay in Pending for more than 2 minutes, the node may have genuinely lost connectivity (network issues, node offline, etc.).
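The 2-minute rule of thumb can be applied mechanically once execution data is available in a script. The sketch below assumes each execution is a dict with hypothetical `id`, `state`, and `created_at` (epoch seconds) fields; these names are illustrative and are not the CLI's output format.

```python
PENDING_THRESHOLD_SECONDS = 120  # "more than 2 minutes" from the text above


def stuck_executions(executions, now):
    """Return the ids of executions that have been Pending for too long.

    executions: iterable of dicts with hypothetical keys
                "id", "state", and "created_at" (epoch seconds).
    now:        current time in epoch seconds.
    """
    return [
        e["id"]
        for e in executions
        if e["state"] == "Pending"
        and (now - e["created_at"]) > PENDING_THRESHOLD_SECONDS
    ]
```

Any id returned here points at a node worth inspecting with `expanso-cli node describe <node-id>`.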
To verify which nodes reconnected after a restart:
# After orchestrator restart, verify which nodes reconnected
expanso-cli node list
# Look for nodes still in "Connecting" or "Disconnected" state
# These nodes may have network issues preventing reconnection
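To run that check from a script, the command can be wrapped and its output filtered. This is a minimal sketch that assumes the state name appears verbatim somewhere in each row of the `expanso-cli node list` output; the actual output layout is not specified here, so treat the substring match as an assumption and adjust it to what your CLI prints.

```python
import subprocess

# States worth a closer look after a restart (named in the text above).
WATCH_STATES = ("Connecting", "Disconnected")


def rows_needing_attention(lines):
    """Keep only the rows that mention one of the watched states."""
    return [line for line in lines if any(s in line for s in WATCH_STATES)]


def list_unhealthy_nodes():
    """Run `expanso-cli node list` and return the rows that need attention."""
    result = subprocess.run(
        ["expanso-cli", "node", "list"],
        capture_output=True,
        text=True,
        check=True,  # raise if the CLI exits with a non-zero status
    )
    return rows_needing_attention(result.stdout.splitlines())
```

Calling `list_unhealthy_nodes()` once the 90-second grace period has passed avoids counting nodes whose "Connected" state is still the stale one loaded from disk.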
You don't need to wait before deploying jobs after an orchestrator restart. The grace period handles the transition automatically. Executions assigned to a node that has not yet reconnected may sit in Pending for a short time, and they proceed once the node has re-established its connection.