types.OrchestratorConfig
admin_auth object
AdminAuth configures the route-scoped admin JWT middleware that gates web-api → orchestrator admin actions (ENG-2307). Read at startup by the future ENG-2284 enrollment handler. Zero value (Enabled=false) means admin routes that require this middleware will refuse to register; the orchestrator keeps serving its UI, NATS, and OTel surfaces. Top-level rather than nested under APIConfig because APIConfig is shared with the edge agent and admin auth is orchestrator-only.
ActorClaim names the JWT claim carrying the human-operator identity to attach to the request context for audit. Empty means the middleware uses its default ("email").
Audience is the required aud claim value. It is distinct from the UI JWT audience (urn:expanso:orchestrator); admin tokens carry urn:expanso:orchestrator-admin so a UI-scoped token cannot be replayed against an admin route.
ClockSkew is the tolerance applied to iat/nbf/exp validation. Zero means the middleware uses its default (60s).
Possible values: a duration as integer nanoseconds (Go time.Duration), e.g. 1000000000 (1s), 60000000000 (1m), 3600000000000 (1h).
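The skew tolerance described above can be sketched as follows. This is a minimal illustration, not the middleware's code: the function names are hypothetical, the claims are taken as Unix seconds, and the exact comparison applied to iat (rejecting tokens issued too far in the future) is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// effectiveSkew applies the documented default: zero means 60s.
func effectiveSkew(configured time.Duration) time.Duration {
	if configured == 0 {
		return 60 * time.Second
	}
	return configured
}

// timeClaimsValid checks exp, nbf, and iat (Unix seconds) against now,
// widening each comparison by the skew tolerance.
// Hypothetical helper; the real middleware may treat iat differently.
func timeClaimsValid(now, iat, nbf, exp int64, skew time.Duration) bool {
	tol := int64(effectiveSkew(skew) / time.Second)
	if now >= exp+tol { // expired, even allowing for skew
		return false
	}
	if nbf > now+tol { // not yet valid
		return false
	}
	if iat > now+tol { // issued in the future
		return false
	}
	return true
}

func main() {
	now := time.Now().Unix()
	// A token that expired 30s ago still passes under the 60s default.
	fmt.Println(timeClaimsValid(now, now-300, now-300, now-30, 0))
}
```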
Enabled gates registration of admin-protected routes. When false (the zero value), routes that require admin auth refuse to register; the orchestrator still serves its UI, NATS, and OTel surfaces.
Issuer is the required iss claim value. Typically the web-api's hostname (e.g. "https://app.expanso.io").
MaxBodyBytes caps the request body size the middleware will buffer for ath verification. Zero means the middleware uses its default (64 KiB).
PubKeyPaths is the SET of PEM files holding ES256 public keys the orchestrator accepts for admin JWT signatures. A token verifies if its signature matches ANY key in the set.
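The "matches ANY key in the set" rule can be illustrated with the standard library alone. This is a sketch under assumptions: the helper names are hypothetical, keys are generated in memory rather than loaded from PEM files, and the signature is assumed to be in the 64-byte r||s form that JWS uses for ES256.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
	"math/big"
)

// verifyES256Any reports whether sig (64-byte r||s) is a valid ES256
// signature over signingInput under at least one key in keys.
func verifyES256Any(keys []*ecdsa.PublicKey, signingInput, sig []byte) bool {
	if len(sig) != 64 {
		return false
	}
	r := new(big.Int).SetBytes(sig[:32])
	s := new(big.Int).SetBytes(sig[32:])
	digest := sha256.Sum256(signingInput)
	for _, key := range keys {
		if ecdsa.Verify(key, digest[:], r, s) {
			return true
		}
	}
	return false
}

// signES256 produces the r||s signature form; used only for the demo.
func signES256(priv *ecdsa.PrivateKey, signingInput []byte) ([]byte, error) {
	digest := sha256.Sum256(signingInput)
	r, s, err := ecdsa.Sign(rand.Reader, priv, digest[:])
	if err != nil {
		return nil, err
	}
	sig := make([]byte, 64)
	r.FillBytes(sig[:32])
	s.FillBytes(sig[32:])
	return sig, nil
}

// demo signs with the second of two trusted keys, then checks that the
// trusted set accepts the signature and an unrelated key rejects it.
func demo() (bool, bool, error) {
	privs := make([]*ecdsa.PrivateKey, 3)
	for i := range privs {
		p, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
		if err != nil {
			return false, false, err
		}
		privs[i] = p
	}
	input := []byte("header.payload")
	sig, err := signES256(privs[1], input)
	if err != nil {
		return false, false, err
	}
	trusted := []*ecdsa.PublicKey{&privs[0].PublicKey, &privs[1].PublicKey}
	stranger := []*ecdsa.PublicKey{&privs[2].PublicKey}
	accepted := verifyES256Any(trusted, input, sig)
	rejected := !verifyES256Any(stranger, input, sig)
	return accepted, rejected, nil
}

func main() {
	accepted, rejected, err := demo()
	fmt.Println(accepted, rejected, err)
}
```

Accepting a set rather than a single key is what allows a new key to be added before the old one is retired.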
RequireATH controls whether the middleware enforces the ath claim (base64url-encoded SHA-256 of the request body). True by default per the design: admin tokens MUST be payload-bound to prevent replay against a different payload. This is a pointer type so that the zero value (nil) means "default to enforce" rather than "silently off."
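As an illustration of the ath binding and the nil-means-enforce rule, here is a minimal sketch. The function names are hypothetical, and unpadded base64url is an assumption (the description says only "base64url-encoded").

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/base64"
	"fmt"
)

// computeATH returns the base64url (unpadded, by assumption) SHA-256
// digest of the request body.
func computeATH(body []byte) string {
	sum := sha256.Sum256(body)
	return base64.RawURLEncoding.EncodeToString(sum[:])
}

// athMatches compares the token's ath claim with the digest of the body
// actually received, in constant time.
func athMatches(claim string, body []byte) bool {
	return subtle.ConstantTimeCompare([]byte(claim), []byte(computeATH(body))) == 1
}

// shouldEnforceATH mirrors the pointer semantics: nil defaults to enforce,
// and only an explicit false turns the check off.
func shouldEnforceATH(requireATH *bool) bool {
	return requireATH == nil || *requireATH
}

func main() {
	body := []byte(`{"action":"enroll"}`)
	claim := computeATH(body)
	fmt.Println(athMatches(claim, body))                 // same payload
	fmt.Println(athMatches(claim, []byte(`{"other":1}`))) // different payload
	fmt.Println(shouldEnforceATH(nil))
}
```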
api object
auth object
Auth configures authentication for the API
jwt object
JWT/OIDC auth
Audience is the expected 'aud' claim in JWTs. Defaults to "urn:expanso:orchestrator" if not set. Set to empty string explicitly to disable audience validation.
Issuer is the issuer URL, required to enable JWT authentication (e.g., "https://cloud.expanso.io"). The JWKS URL is derived by appending /.well-known/jwks.json. The same value is used to validate the 'iss' claim in JWTs. If empty, JWT authentication is disabled.
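The JWKS derivation described above amounts to a string append. The sketch below assumes a hypothetical helper name and assumes that a trailing slash on the issuer is trimmed before appending; the source does not say how trailing slashes are handled.

```go
package main

import (
	"fmt"
	"strings"
)

// jwksURL derives the JWKS location from the issuer URL.
// An empty issuer means JWT authentication is disabled, so no URL is returned.
// Trimming a trailing slash is an assumption made for this sketch.
func jwksURL(issuer string) string {
	if issuer == "" {
		return ""
	}
	return strings.TrimRight(issuer, "/") + "/.well-known/jwks.json"
}

func main() {
	fmt.Println(jwksURL("https://cloud.expanso.io"))
}
```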
NetworkClaimName is the JWT claim containing network IDs (default: "networkId")
OrganizationClaimName is the JWT claim containing organization IDs (default: "organizationId")
TokenEndpoint is the OAuth2-compatible endpoint for exchanging API keys (exp_ak_*) for short-lived JWTs. Optional — when set, the orchestrator accepts API keys as Bearer tokens and exchanges them server-side. Requires Issuer to be configured (the exchanged JWTs are validated via JWKS).
OrganizationID is the organization this node belongs to. Optional. If empty, organization validation is skipped (all access is allowed).
cors object
CORS configures Cross-Origin Resource Sharing for browser-based clients
AllowedOrigins is a list of origins that are allowed to make cross-origin requests. Use exact origins like "https://cloud.expanso.io" or patterns like "https://localhost:*" to match any port on localhost.
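One way to read the two origin forms (exact match and a ":*" port wildcard) is sketched below. The matching rules here are an assumption for illustration, with hypothetical function names; the orchestrator's actual pattern matcher may support more than a trailing port wildcard.

```go
package main

import (
	"fmt"
	"strings"
)

// isPort reports whether s is a non-empty run of ASCII digits.
func isPort(s string) bool {
	if s == "" {
		return false
	}
	for _, c := range s {
		if c < '0' || c > '9' {
			return false
		}
	}
	return true
}

// originAllowed accepts an origin if it equals an allowed entry exactly,
// or if the entry ends in ":*" and the origin differs only in its port.
func originAllowed(allowed []string, origin string) bool {
	for _, pattern := range allowed {
		if pattern == origin {
			return true
		}
		if strings.HasSuffix(pattern, ":*") {
			prefix := strings.TrimSuffix(pattern, "*") // keeps the trailing colon
			if strings.HasPrefix(origin, prefix) && isPort(origin[len(prefix):]) {
				return true
			}
		}
	}
	return false
}

func main() {
	allowed := []string{"https://cloud.expanso.io", "https://localhost:*"}
	fmt.Println(originAllowed(allowed, "https://localhost:3000"))
	fmt.Println(originAllowed(allowed, "https://example.com"))
}
```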
Listen address. Defaults to localhost:9010. An empty string disables the API server.
Core data directory. All subdirectories are managed automatically.
evaluation_broker object
InitialRetryDelay is the delay before re-enqueuing a Nacked evaluation for the first time. Defaults to 5 seconds if not set. Set a lower value (e.g., 100ms) for tests to avoid blocking subsequent evaluations for the same job.
Possible values: a duration as integer nanoseconds (Go time.Duration), e.g. 1000000000 (1s), 60000000000 (1m), 3600000000000 (1h).
MaxRetryCount specifies the maximum number of times an evaluation can be retried before being marked as failed.
RecoveryRecencyWindow bounds how far back the startup-recovery scan looks for jobs/executions whose recent state changes need re-evaluation. Defaults to the periodic reconciler interval if not set. Increase if your environment experiences extended orchestrator downtime; the periodic reconciler is the long-term safety net regardless.
Possible values: a duration as integer nanoseconds (Go time.Duration), e.g. 1000000000 (1s), 60000000000 (1m), 3600000000000 (1h).
VisibilityTimeout specifies how long an evaluation can be claimed before it's returned to the queue.
Possible values: a duration as integer nanoseconds (Go time.Duration), e.g. 1000000000 (1s), 60000000000 (1m), 3600000000000 (1h).
log object
Log format: json, text
Log level: trace, debug, info, warn, error. Defaults to info.
Node name. Defaults to the hostname.
Name provider for auto-generation: "cloud", "hostname", "uuid", "machine-id"
node_manager object
ConnectedAfter is how long a node must be stable in Connecting state before being promoted to Connected. This provides flapping protection - a node that keeps crashing and restarting will reset this timer on each handshake. Default: 30s. Must be less than disconnect_timeout.
Possible values: a duration as integer nanoseconds (Go time.Duration), e.g. 1000000000 (1s), 60000000000 (1m), 3600000000000 (1h).
DisconnectTimeout is how long to wait without heartbeats before marking a node as disconnected. This value is sent to edge nodes during handshake so both sides use the same threshold. Default: 90s. Increase for unreliable networks.
Possible values: a duration as integer nanoseconds (Go time.Duration), e.g. 1000000000 (1s), 60000000000 (1m), 3600000000000 (1h).
HeartbeatInterval is how often edge nodes should send heartbeats. This value is sent to edge nodes during handshake. Default: 15s.
Possible values: a duration as integer nanoseconds (Go time.Duration), e.g. 1000000000 (1s), 60000000000 (1m), 3600000000000 (1h).
LostTimeout is how long a node must remain disconnected before marking it as lost. Default: 1h. Must be greater than disconnect_timeout. Lost nodes are removed from scheduling and become eligible for garbage collection.
Possible values: a duration as integer nanoseconds (Go time.Duration), e.g. 1000000000 (1s), 60000000000 (1m), 3600000000000 (1h).
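The ordering constraints stated for node_manager (connected_after below disconnect_timeout, lost_timeout above it) can be checked with a small validator. This is a sketch with a hypothetical function name; the orchestrator's own validation and error wording are not shown in this reference.

```go
package main

import (
	"fmt"
	"time"
)

// validateNodeManager enforces the documented ordering:
// connected_after < disconnect_timeout < lost_timeout.
func validateNodeManager(connectedAfter, disconnectTimeout, lostTimeout time.Duration) error {
	if connectedAfter >= disconnectTimeout {
		return fmt.Errorf("connected_after (%s) must be less than disconnect_timeout (%s)",
			connectedAfter, disconnectTimeout)
	}
	if lostTimeout <= disconnectTimeout {
		return fmt.Errorf("lost_timeout (%s) must be greater than disconnect_timeout (%s)",
			lostTimeout, disconnectTimeout)
	}
	return nil
}

func main() {
	// The documented defaults: 30s, 90s, 1h.
	fmt.Println(validateNodeManager(30*time.Second, 90*time.Second, time.Hour))
	// A disconnect timeout shorter than the stability window is rejected.
	fmt.Println(validateNodeManager(30*time.Second, 20*time.Second, time.Hour))
}
```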
scheduler object
ExecutionLimitBackoff is the duration to wait before creating a new scheduling run when hitting execution limits.
Possible values: a duration as integer nanoseconds (Go time.Duration), e.g. 1000000000 (1s), 60000000000 (1m), 3600000000000 (1h).
MaxExecutionsPerEvaluation is the safety ceiling on total scheduler operations per evaluation (creates + stops + updates + failures). Plans exceeding the ceiling fall back to the delayed-evaluation self-heal path.
MaxExecutionsPerTransaction is the maximum number of new-execution writes per planner transaction. The planner splits a plan's NewExecutions into batches of this size, committing each in its own transaction; a final transaction commits job state, follow-up evaluations, and marks the evaluation complete.
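The batching behaviour described for MaxExecutionsPerTransaction can be pictured as a simple slice split. The sketch below is illustrative only: the function name is hypothetical, executions are represented as string IDs, and treating a non-positive batch size as "one batch" is an assumption.

```go
package main

import "fmt"

// splitIntoBatches divides execution IDs into consecutive batches of at
// most size entries; each batch would be committed in its own transaction.
// A non-positive size is treated as a single batch (assumption).
func splitIntoBatches(ids []string, size int) [][]string {
	if size <= 0 {
		size = len(ids)
	}
	var batches [][]string
	for start := 0; start < len(ids); start += size {
		end := start + size
		if end > len(ids) {
			end = len(ids)
		}
		batches = append(batches, ids[start:end])
	}
	return batches
}

func main() {
	newExecutions := []string{"e1", "e2", "e3", "e4", "e5"}
	for i, batch := range splitIntoBatches(newExecutions, 2) {
		fmt.Printf("transaction %d: %v\n", i+1, batch)
	}
	fmt.Println("final transaction: job state, follow-up evaluations, mark evaluation complete")
}
```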
QueueBackoff specifies the time to wait before retrying a failed job.
Possible values: a duration as integer nanoseconds (Go time.Duration), e.g. 1000000000 (1s), 60000000000 (1m), 3600000000000 (1h).
DefaultQueueTimeoutNeverRestart is the default queue timeout for jobs with "never" restart policy. Batch jobs get fast feedback when no matching nodes exist. Set to 0 for no default (wait indefinitely).
Possible values: a duration as integer nanoseconds (Go time.Duration), e.g. 1000000000 (1s), 60000000000 (1m), 3600000000000 (1h).
DefaultQueueTimeoutOtherPolicies is the default queue timeout for "always" or "on-failure" restart policies. Services wait for matching nodes (e.g., auto-scaling scenarios). Set to 0 for no default (wait indefinitely).
Possible values: a duration as integer nanoseconds (Go time.Duration), e.g. 1000000000 (1s), 60000000000 (1m), 3600000000000 (1h).
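How the two default queue timeouts might be selected by restart policy is sketched below. The function name and the concrete durations in main are hypothetical; only the policy names and the "0 means wait indefinitely" rule come from the descriptions above.

```go
package main

import (
	"fmt"
	"time"
)

// defaultQueueTimeout picks the default by restart policy. The boolean is
// false when the chosen timeout is 0, meaning the job waits indefinitely.
func defaultQueueTimeout(restartPolicy string, never, other time.Duration) (time.Duration, bool) {
	timeout := other // "always" and "on-failure"
	if restartPolicy == "never" {
		timeout = never
	}
	return timeout, timeout > 0
}

func main() {
	// Illustrative values only; the real defaults are not stated here.
	never, other := 5*time.Minute, time.Duration(0)
	for _, policy := range []string{"never", "always", "on-failure"} {
		timeout, bounded := defaultQueueTimeout(policy, never, other)
		fmt.Println(policy, timeout, bounded)
	}
}
```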
WorkerCount specifies the number of concurrent workers for job scheduling.
ShutdownTimeout is the maximum time to wait for graceful shutdown
Possible values: a duration as integer nanoseconds (Go time.Duration), e.g. 1000000000 (1s), 60000000000 (1m), 3600000000000 (1h).
signing_key object
SigningKey points the orchestrator at the on-disk ES256 signing key it uses to mint Node Access Tokens (Internal Node Credentials design §3.2). Empty path is valid — an orchestrator without a signing key continues to run in legacy mode (NATS .creds delivered by cell-api) and the v2 mint / JWKS endpoints stay unregistered. Operators opt into v2 by setting SigningKey.Path.
Top-level field rather than nested under an AuthConfig because AuthConfig already exists in types/config.go for HTTP API auth (JWT/OIDC) and the two concerns are independent — mixing them would force the OIDC-validation surface to know about token-mint state and vice versa.
Path is the filesystem path to the signing-key PEM. Empty means the orchestrator does not mint Node Access Tokens — INC-1 / INC-2 endpoints stay unregistered, and edge nodes continue to use the legacy NATS-creds path. Required to enable v2 token issuance.
store object
Backend selects the persistence engine.
- "boltdb" - the default. Single-file embedded KV store, mature path with the most production miles.
- "sqlite" - WAL-mode SQLite backend. Single-file under DataDir, embedded, no external service needed.
- "postgres" - external PostgreSQL backend. POC: validates the managed-DB / multi-writer model against SQLite/Bolt on the orchestrator side. Requires store.postgres.connection_string (or PG* env vars; see StorePostgresConfig).
Empty defaults to "boltdb" so existing config files keep working untouched. Validate() rejects any other value with a precise error.
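The backend selection rules above (empty means "boltdb", anything outside the three names is rejected) can be sketched as follows. The function name and error text are hypothetical and do not reproduce the real Validate() output.

```go
package main

import "fmt"

// resolveBackend applies the documented default and rejects unknown names.
func resolveBackend(configured string) (string, error) {
	switch configured {
	case "":
		return "boltdb", nil // keeps existing config files working
	case "boltdb", "sqlite", "postgres":
		return configured, nil
	default:
		return "", fmt.Errorf("unsupported store backend %q (want boltdb, sqlite, or postgres)", configured)
	}
}

func main() {
	for _, name := range []string{"", "sqlite", "mysql"} {
		backend, err := resolveBackend(name)
		fmt.Printf("%q -> %q, err=%v\n", name, backend, err)
	}
}
```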
gc object
DeletedJobsRetention is how long to keep soft-deleted jobs before permanent deletion. Default: 7 days. Increase for longer audit history.
Possible values: a duration as integer nanoseconds (Go time.Duration), e.g. 1000000000 (1s), 60000000000 (1m), 3600000000000 (1h).
LostNodesRetention is how long to keep lost node records after they're marked as lost. Default: 7 days. Measured from when the node transitions to Lost state. Independent of node_manager.lost_timeout.
Possible values: a duration as integer nanoseconds (Go time.Duration), e.g. 1000000000 (1s), 60000000000 (1m), 3600000000000 (1h).
TerminalExecutionsRetention is how long to keep terminal execution records (complete/failed/stopped). Default: 7 days. Increase for longer execution history.
Possible values: a duration as integer nanoseconds (Go time.Duration), e.g. 1000000000 (1s), 60000000000 (1m), 3600000000000 (1h).
postgres object
Postgres carries the connection settings used when Backend == "postgres". Ignored for other backends.
ConnectionString is the libpq DSN (URI or keyword form). When set, takes precedence over the structured fields below.
Structured connection fields. Used when ConnectionString is empty. Any blank field falls back to the matching PG* env var.
MaxOpenConns caps the size of the connection pool. Tune to your database's connection budget — small managed instances often cap at 30-100 total connections, so scaling to multiple orchestrator replicas requires sizing this down accordingly.
Password is the postgres credential; prefer the PGPASSWORD env var (or any libpq pgpass file) over committing this to YAML.
Schema selects the Postgres schema. Honoured regardless of whether ConnectionString or structured fields are used.
streaming_proxy object
Streaming proxy configuration
ReadTimeoutSeconds is the read timeout in seconds for connections to the remote log server before the connection is closed. A value of 0 means no timeout.
RemoteEndpoint is the endpoint of the log server to proxy to, typically a Loki instance
RemoteToken is the authentication token used to access the log server
telemetry object
authentication object
Authentication configures authentication for telemetry exporters.
Namespace groups telemetry data for all nodes in a namespace.
Token is the authentication token or password.
Type is the authentication type (e.g., "Basic").
DoNotTrack disables telemetry collection when true.
DropMetricPrefixes specifies metric name prefixes to drop.
Endpoint is the telemetry collector endpoint (host:port).
EndpointPath is an optional path prefix under which the collector serves /v1/metrics (or similar).
error_reporting object
ErrorReporting configures error reporting (e.g., Sentry).
Endpoint is the DSN/URL for the error reporting service (e.g., Sentry DSN).
ExportInterval is how often metrics are exported.
Possible values: a duration as integer nanoseconds (Go time.Duration), e.g. 1000000000 (1s), 60000000000 (1m), 3600000000000 (1h).
headers object
Headers are sent with every export request (e.g. auth headers).
IncludeGoMetrics enables collection of Go runtime metrics.
Insecure disables TLS verification.
ProcessMetricsInterval is how often process metrics are collected.
Possible values: a duration as integer nanoseconds (Go time.Duration), e.g. 1000000000 (1s), 60000000000 (1m), 3600000000000 (1h).
Protocol specifies the export protocol: "grpc" or "http".
resource_attributes object
ResourceAttributes are additional OTel resource attributes.
ResourceDetectors enables optional OTel resource detectors that add labels to all metrics. Empty by default — opt in to what you need. Supported values: "host" (host.name, host.id, host.arch), "os" (os.type), "container" (container.id)
token object
Token configures the Node Access Token issuance policy: how long minted tokens stay valid and what the iss claim contains. Read at startup by the /token handler (ENG-2319). Zero values fall back to sensible defaults (1h TTL, empty iss surfaces as the orchestrator's HTTP base URL). Independent of SigningKey because issuance policy is a runtime decision (how long the token is good for, who claims to have issued it) while the signing key is the cryptographic material that backs the signature.
Issuer is the value written to each token's iss claim. Empty (default) means the handler falls back to the orchestrator's HTTP base URL inferred from the request — the same URL the JWKS endpoint will be published at, so downstream verifiers can resolve issuer→JWKS via OIDC discovery (ENG-2320). Set explicitly when the orchestrator runs behind a TLS-terminating proxy that rewrites Host.
TTL is the lifetime of each issued Node Access Token. Zero (default) means 1 hour. Operators can shorten this for high-rotation environments or lengthen it where DPoP refresh cost is meaningful.
Possible values: a duration as integer nanoseconds (Go time.Duration), e.g. 1000000000 (1s), 60000000000 (1m), 3600000000000 (1h).
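The zero-value fallbacks for the token section (1 hour TTL, issuer inferred from the request) can be sketched as two small helpers. The names are hypothetical, and the request base URL is passed in as a plain string rather than inferred from a real HTTP request.

```go
package main

import (
	"fmt"
	"time"
)

// effectiveTTL applies the documented default: zero means 1 hour.
func effectiveTTL(configured time.Duration) time.Duration {
	if configured == 0 {
		return time.Hour
	}
	return configured
}

// effectiveIssuer returns the configured issuer, or falls back to the
// orchestrator's HTTP base URL as seen on the request.
func effectiveIssuer(configured, requestBaseURL string) string {
	if configured != "" {
		return configured
	}
	return requestBaseURL
}

func main() {
	fmt.Println(effectiveTTL(0), effectiveTTL(15*time.Minute))
	// Hypothetical base URL, standing in for the value inferred from the request.
	fmt.Println(effectiveIssuer("", "https://orchestrator.example.com"))
}
```

Setting the issuer explicitly matters behind a proxy that rewrites Host, because the inferred base URL would otherwise differ from the URL verifiers use to find the JWKS endpoint.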
transport object
ListenAddr, when set, runs an embedded server for nodes to connect to. If specified, it overrides any server address from credentials or bootstrapping.
Connection config settings
{
"admin_auth": {
"actor_claim": "string",
"audience": "string",
"clock_skew": -9223372036854776000,
"enabled": true,
"issuer": "string",
"max_body_bytes": 0,
"pubkey_paths": [
"string"
],
"require_ath": true
},
"api": {
"auth": {
"jwt": {
"audience": "string",
"issuer": "string",
"network_claim_name": "string",
"organization_claim_name": "string",
"token_endpoint": "string"
},
"organization_id": "string"
},
"cors": {
"allowed_origins": [
"string"
]
},
"listen_addr": "string"
},
"data_dir": "string",
"evaluation_broker": {
"initial_retry_delay": -9223372036854776000,
"max_retry_count": 0,
"recovery_recency_window": -9223372036854776000,
"visibility_timeout": -9223372036854776000
},
"log": {
"format": "string",
"level": "string"
},
"name": "string",
"name_provider": "string",
"node_manager": {
"connected_after": -9223372036854776000,
"disconnect_timeout": -9223372036854776000,
"heartbeat_interval": -9223372036854776000,
"lost_timeout": -9223372036854776000
},
"scheduler": {
"execution_limit_backoff": -9223372036854776000,
"max_executions_per_evaluation": 0,
"max_executions_per_transaction": 0,
"queue_backoff": -9223372036854776000,
"queue_timeout_never": -9223372036854776000,
"queue_timeout_other": -9223372036854776000,
"worker_count": 0
},
"shutdown_timeout": -9223372036854776000,
"signing_key": {
"path": "string"
},
"store": {
"backend": "string",
"gc": {
"deleted_jobs_retention": -9223372036854776000,
"lost_nodes_retention": -9223372036854776000,
"terminal_executions_retention": -9223372036854776000
},
"postgres": {
"connection_string": "string",
"database": "string",
"host": "string",
"max_open_conns": 0,
"password": "string",
"port": 0,
"schema": "string",
"sslmode": "string",
"user": "string"
}
},
"streaming_proxy": {
"read_timeout_seconds": 0,
"remote_endpoint": "string",
"remote_token": "string"
},
"telemetry": {
"authentication": {
"namespace": "string",
"token": "string",
"type": "string"
},
"do_not_track": true,
"drop_metric_prefixes": [
"string"
],
"endpoint": "string",
"endpoint_path": "string",
"error_reporting": {
"endpoint": "string"
},
"export_interval": -9223372036854776000,
"headers": {},
"include_go_metrics": true,
"insecure": true,
"process_metrics_interval": -9223372036854776000,
"protocol": "string",
"resource_attributes": {},
"resource_detectors": [
"string"
]
},
"token": {
"issuer": "string",
"ttl": -9223372036854776000
},
"transport": {
"address": "string",
"credentials_path": "string",
"insecure": true,
"listen_addr": "string",
"network_id": "string",
"node_id": "string",
"refresh_address": "string",
"require_tls": true,
"reverse_proxy": true
}
}