
types.OrchestratorConfig

admin_auth object

AdminAuth configures the route-scoped admin JWT middleware that gates web-api → orchestrator admin actions (ENG-2307). Read at startup by the future ENG-2284 enrollment handler. Zero value (Enabled=false) means admin routes that require this middleware will refuse to register; the orchestrator keeps serving its UI, NATS, and OTel surfaces. Top-level rather than nested under APIConfig because APIConfig is shared with the edge agent and admin auth is orchestrator-only.

actor_claim string

ActorClaim names the JWT claim carrying the human-operator identity to attach to the request context for audit. Empty means the middleware uses its default ("email").

audience string

Audience is the required aud claim value. Distinct from the UI JWT audience (urn:expanso:orchestrator); admin tokens carry urn:expanso:orchestrator-admin so a UI-scoped token cannot be replayed against an admin route.

clock_skew integer<int64>

ClockSkew is the tolerance applied to iat/nbf/exp validation. Zero means the middleware uses its default (60s).

Possible values: any int64 duration in nanoseconds. Reference points: 1 (1 ns), 1000 (1 µs), 1000000 (1 ms), 1000000000 (1 s), 60000000000 (1 min), 3600000000000 (1 h).
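The skew tolerance can be pictured as widening the token's validity window on both sides. The following is a minimal sketch, assuming clock_skew is a nanosecond count (as the integer<int64> type suggests) and that claims are compared as Unix seconds; `effective_skew_seconds` and `within_window` are hypothetical names for illustration, not the middleware's actual API.

```python
NS_PER_SECOND = 1_000_000_000
DEFAULT_CLOCK_SKEW_NS = 60 * NS_PER_SECOND  # documented default: 60s


def effective_skew_seconds(clock_skew_ns: int) -> float:
    """Zero selects the documented default; otherwise convert ns to seconds."""
    if clock_skew_ns == 0:
        clock_skew_ns = DEFAULT_CLOCK_SKEW_NS
    return clock_skew_ns / NS_PER_SECOND


def within_window(now: float, iat: float, nbf: float, exp: float,
                  clock_skew_ns: int = 0) -> bool:
    """Check iat/nbf/exp against 'now', tolerating the configured skew."""
    skew = effective_skew_seconds(clock_skew_ns)
    issued_ok = iat <= now + skew      # not issued too far in the future
    started_ok = nbf <= now + skew     # not-before has (nearly) passed
    not_expired = now < exp + skew     # expiry has not (quite) passed
    return issued_ok and started_ok and not_expired
```

With the default skew, a token whose nbf lies 30 seconds in the future is accepted; with a 1-second skew it is rejected.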

enabled boolean

Enabled gates registration of admin-protected routes. When false (the zero value), routes that require admin auth refuse to register; the orchestrator still serves its UI, NATS, and OTel surfaces.

issuer string

Issuer is the required iss claim value. Typically the web-api's hostname (e.g. "https://app.expanso.io").

max_body_bytes integer

MaxBodyBytes caps the request body size the middleware will buffer for ath verification. Zero means the middleware uses its default (64 KiB).

pubkey_paths string[]

PubKeyPaths is the SET of PEM files holding ES256 public keys the orchestrator accepts for admin JWT signatures. A token verifies if its signature matches ANY key in the set.

require_ath boolean

RequireATH controls whether the middleware enforces the ath claim (base64url-encoded SHA-256 of the request body). True by default per the design — admin tokens MUST be payload-bound to prevent replay against a different payload. Pointer type so the zero value (nil) means "default to enforce" rather than "silently off."
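To make the payload binding concrete, here is a minimal sketch of how an ath value could be computed and checked. It follows the description above (base64url-encoded SHA-256 of the request body); stripping the trailing "=" padding is an assumption, and `compute_ath` and `ath_matches` are hypothetical helper names rather than the middleware's real functions.

```python
import base64
import hashlib
import hmac


def compute_ath(body: bytes, strip_padding: bool = True) -> str:
    """Return the base64url-encoded SHA-256 digest of the request body."""
    digest = hashlib.sha256(body).digest()
    encoded = base64.urlsafe_b64encode(digest).decode("ascii")
    # Assumption: padding is stripped, as is common for JWT-style claims.
    return encoded.rstrip("=") if strip_padding else encoded


def ath_matches(claim: str, body: bytes) -> bool:
    """Compare the token's ath claim with the digest of the actual body."""
    expected = compute_ath(body).encode("ascii")
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(claim.encode("utf-8"), expected)
```

A token minted for one payload fails `ath_matches` when replayed with a different body, which is the replay protection the field describes.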

api object
auth object

Auth configures authentication for the API

jwt object

JWT/OIDC auth

audience string

Audience is the expected 'aud' claim in JWTs. Defaults to "urn:expanso:orchestrator" if not set. Set to empty string explicitly to disable audience validation.

issuer string

Issuer URL, required to enable JWT authentication (e.g., "https://cloud.expanso.io"). The JWKS URL is derived by appending /.well-known/jwks.json. Also used to validate the 'iss' claim in JWTs. If empty, JWT authentication is disabled.
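The JWKS derivation described above amounts to a string append. A minimal sketch follows; trimming a trailing slash before appending is an assumption, and `jwks_url` is a hypothetical helper name.

```python
from typing import Optional


def jwks_url(issuer: str) -> Optional[str]:
    """Derive the JWKS URL from the issuer, or None when JWT auth is disabled."""
    if not issuer:
        return None  # empty issuer: JWT authentication is disabled
    # Assumption: a trailing slash on the issuer is trimmed to avoid "//".
    return issuer.rstrip("/") + "/.well-known/jwks.json"
```

For the example issuer "https://cloud.expanso.io" this yields "https://cloud.expanso.io/.well-known/jwks.json".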

network_claim_name string

NetworkClaimName is the JWT claim containing network IDs (default: "networkId")

organization_claim_name string

OrganizationClaimName is the JWT claim containing organization IDs (default: "organizationId")

token_endpoint string

TokenEndpoint is the OAuth2-compatible endpoint for exchanging API keys (exp_ak_*) for short-lived JWTs. Optional — when set, the orchestrator accepts API keys as Bearer tokens and exchanges them server-side. Requires Issuer to be configured (the exchanged JWTs are validated via JWKS).

organization_id string

OrganizationID is the organization this node belongs to (optional). If empty, organization validation is skipped (allow all access).

cors object

CORS configures Cross-Origin Resource Sharing for browser-based clients

allowed_origins string[]

AllowedOrigins is a list of origins that are allowed to make cross-origin requests. Use exact origins like "https://cloud.expanso.io" or patterns like "https://localhost:*" to match any port on localhost.
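One way to read the two forms of entry (exact origin, or a pattern ending in ":*" for any port) is sketched below. The matching rules here are an assumption made for illustration, since this reference does not spell out the pattern grammar; `origin_allowed` is a hypothetical name.

```python
from typing import List


def origin_allowed(origin: str, allowed_origins: List[str]) -> bool:
    """Return True if the origin matches an exact entry or a ':*' port pattern."""
    for pattern in allowed_origins:
        if pattern.endswith(":*"):
            prefix = pattern[:-1]  # keep the trailing colon, drop the '*'
            port = origin[len(prefix):]
            # Assumption: '*' stands for a numeric port only.
            if origin.startswith(prefix) and port.isdigit():
                return True
        elif origin == pattern:
            return True
    return False
```

Under this reading, "https://localhost:3000" matches "https://localhost:*", while "https://localhost.evil.com:3000" does not, because the prefix must include the colon that follows the host.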

listen_addr string

Listen address. Defaults to localhost:9010. An empty string disables the API server.

data_dir string

Core data directory - all subdirectories are managed automatically

evaluation_broker object
initial_retry_delay integer<int64>

InitialRetryDelay is the delay before re-enqueuing a Nacked evaluation for the first time. Defaults to 5 seconds if not set. Set a lower value (e.g., 100ms) for tests to avoid blocking subsequent evaluations for the same job.

Possible values: any int64 duration in nanoseconds. Reference points: 1 (1 ns), 1000 (1 µs), 1000000 (1 ms), 1000000000 (1 s), 60000000000 (1 min), 3600000000000 (1 h).
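Because the duration fields in this schema are typed integer<int64>, values such as "5 seconds" or "100ms" need to be expressed as nanosecond counts in the JSON form. The helper below is a small sketch of that conversion; `to_nanoseconds` and `UNITS_NS` are hypothetical names, and whether the config loader also accepts string durations is not stated in this reference.

```python
# Nanoseconds per unit, matching the reference points listed for duration fields.
UNITS_NS = {
    "ns": 1,
    "us": 1_000,
    "ms": 1_000_000,
    "s": 1_000_000_000,
    "m": 60_000_000_000,
    "h": 3_600_000_000_000,
}


def to_nanoseconds(value: int, unit: str) -> int:
    """Convert a (value, unit) pair into the int64 nanosecond count."""
    return value * UNITS_NS[unit]
```

For example, the 5-second default corresponds to 5000000000 and a 100 ms test setting to 100000000.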

max_retry_count integer

MaxRetryCount specifies the maximum number of times an evaluation can be retried before being marked as failed.

recovery_recency_window integer<int64>

RecoveryRecencyWindow bounds how far back the startup-recovery scan looks for jobs/executions whose recent state changes need re-evaluation. Defaults to the periodic reconciler interval if not set. Increase if your environment experiences extended orchestrator downtime; the periodic reconciler is the long-term safety net regardless.

Possible values: any int64 duration in nanoseconds. Reference points: 1 (1 ns), 1000 (1 µs), 1000000 (1 ms), 1000000000 (1 s), 60000000000 (1 min), 3600000000000 (1 h).

visibility_timeout integer<int64>

VisibilityTimeout specifies how long an evaluation can be claimed before it's returned to the queue.

Possible values: any int64 duration in nanoseconds. Reference points: 1 (1 ns), 1000 (1 µs), 1000000 (1 ms), 1000000000 (1 s), 60000000000 (1 min), 3600000000000 (1 h).

log object
format string

Log format: json, text

level string

Log level: trace, debug, info, warn, error - defaults to info

name string

Node name - defaults to hostname

name_provider string

Name provider for auto-generation: "cloud", "hostname", "uuid", "machine-id"

node_manager object
connected_after integer<int64>

ConnectedAfter is how long a node must be stable in Connecting state before being promoted to Connected. This provides flapping protection - a node that keeps crashing and restarting will reset this timer on each handshake. Default: 30s. Must be less than disconnect_timeout.

Possible values: any int64 duration in nanoseconds. Reference points: 1 (1 ns), 1000 (1 µs), 1000000 (1 ms), 1000000000 (1 s), 60000000000 (1 min), 3600000000000 (1 h).

disconnect_timeout integer<int64>

DisconnectTimeout is how long to wait without heartbeats before marking a node as disconnected. This value is sent to edge nodes during handshake so both sides use the same threshold. Default: 90s. Increase for unreliable networks.

Possible values: any int64 duration in nanoseconds. Reference points: 1 (1 ns), 1000 (1 µs), 1000000 (1 ms), 1000000000 (1 s), 60000000000 (1 min), 3600000000000 (1 h).

heartbeat_interval integer<int64>

HeartbeatInterval is how often edge nodes should send heartbeats. This value is sent to edge nodes during handshake. Default: 15s.

Possible values: any int64 duration in nanoseconds. Reference points: 1 (1 ns), 1000 (1 µs), 1000000 (1 ms), 1000000000 (1 s), 60000000000 (1 min), 3600000000000 (1 h).

lost_timeout integer<int64>

LostTimeout is how long a node must remain disconnected before marking it as lost. Default: 1h. Must be greater than disconnect_timeout. Lost nodes are removed from scheduling and become eligible for garbage collection.

Possible values: any int64 duration in nanoseconds. Reference points: 1 (1 ns), 1000 (1 µs), 1000000 (1 ms), 1000000000 (1 s), 60000000000 (1 min), 3600000000000 (1 h).
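The node_manager fields carry two ordering constraints: connected_after must be less than disconnect_timeout, and lost_timeout must be greater than disconnect_timeout. The sketch below checks them against the documented defaults (30s, 90s, 15s, 1h); the `NodeManagerTimeouts` class and `validate` function are hypothetical, and values are assumed to be nanosecond counts.

```python
from dataclasses import dataclass
from typing import List

SECOND = 1_000_000_000  # nanoseconds


@dataclass
class NodeManagerTimeouts:
    connected_after: int = 30 * SECOND
    disconnect_timeout: int = 90 * SECOND
    heartbeat_interval: int = 15 * SECOND
    lost_timeout: int = 3600 * SECOND


def validate(cfg: NodeManagerTimeouts) -> List[str]:
    """Return a list of violated ordering constraints (empty if valid)."""
    errors = []
    if not cfg.connected_after < cfg.disconnect_timeout:
        errors.append("connected_after must be less than disconnect_timeout")
    if not cfg.lost_timeout > cfg.disconnect_timeout:
        errors.append("lost_timeout must be greater than disconnect_timeout")
    return errors
```

The defaults pass both checks; raising connected_after to 120s without also raising disconnect_timeout would be flagged.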

scheduler object
execution_limit_backoff integer<int64>

ExecutionLimitBackoff is the duration to wait before creating a new scheduling run when hitting execution limits.

Possible values: any int64 duration in nanoseconds. Reference points: 1 (1 ns), 1000 (1 µs), 1000000 (1 ms), 1000000000 (1 s), 60000000000 (1 min), 3600000000000 (1 h).

max_executions_per_evaluation integer

MaxExecutionsPerEvaluation is the safety ceiling on total scheduler operations per evaluation (creates + stops + updates + failures). Plans exceeding the ceiling fall back to the delayed-evaluation self-heal path.

max_executions_per_transaction integer

MaxExecutionsPerTransaction is the maximum number of new-execution writes per planner transaction. The planner splits a plan's NewExecutions into batches of this size, committing each in its own transaction; a final transaction commits job state, follow-up evaluations, and marks the evaluation complete.
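The batching described above can be sketched as a simple chunking step. This is an illustration only: `split_into_batches` is a hypothetical name, and the real planner also commits a final transaction for job state and follow-up evaluations, which is not modeled here.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_batches(new_executions: Sequence[T],
                       max_per_transaction: int) -> List[List[T]]:
    """Split new executions into batches of at most max_per_transaction items."""
    if max_per_transaction <= 0:
        raise ValueError("max_per_transaction must be positive")
    return [
        list(new_executions[i:i + max_per_transaction])
        for i in range(0, len(new_executions), max_per_transaction)
    ]
```

With five new executions and a limit of two, the plan is committed in three batches of sizes 2, 2, and 1.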

queue_backoff integer<int64>

QueueBackoff specifies the time to wait before retrying a failed job.

Possible values: any int64 duration in nanoseconds. Reference points: 1 (1 ns), 1000 (1 µs), 1000000 (1 ms), 1000000000 (1 s), 60000000000 (1 min), 3600000000000 (1 h).

queue_timeout_never integer<int64>

DefaultQueueTimeoutNeverRestart is the default queue timeout for jobs with "never" restart policy. Batch jobs get fast feedback when no matching nodes exist. Set to 0 for no default (wait indefinitely).

Possible values: any int64 duration in nanoseconds. Reference points: 1 (1 ns), 1000 (1 µs), 1000000 (1 ms), 1000000000 (1 s), 60000000000 (1 min), 3600000000000 (1 h).

queue_timeout_other integer<int64>

DefaultQueueTimeoutOtherPolicies is the default queue timeout for "always" or "on-failure" restart policies. Services wait for matching nodes (e.g., auto-scaling scenarios). Set to 0 for no default (wait indefinitely).

Possible values: any int64 duration in nanoseconds. Reference points: 1 (1 ns), 1000 (1 µs), 1000000 (1 ms), 1000000000 (1 s), 60000000000 (1 min), 3600000000000 (1 h).
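The two queue-timeout defaults are selected by restart policy. A minimal sketch of that selection is shown below, assuming the policy is passed as one of the strings "never", "always", or "on-failure"; `default_queue_timeout` is a hypothetical name, and None stands for "wait indefinitely".

```python
from typing import Optional


def default_queue_timeout(restart_policy: str,
                          queue_timeout_never: int,
                          queue_timeout_other: int) -> Optional[int]:
    """Pick the default queue timeout (ns) for a job, or None for no timeout."""
    if restart_policy == "never":
        timeout = queue_timeout_never   # batch jobs: fail fast
    else:
        timeout = queue_timeout_other   # "always" / "on-failure": wait for nodes
    # Zero means no default, so the job waits indefinitely.
    return None if timeout == 0 else timeout
```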

worker_count integer

WorkerCount specifies the number of concurrent workers for job scheduling.

shutdown_timeout integer<int64>

ShutdownTimeout is the maximum time to wait for graceful shutdown

Possible values: any int64 duration in nanoseconds. Reference points: 1 (1 ns), 1000 (1 µs), 1000000 (1 ms), 1000000000 (1 s), 60000000000 (1 min), 3600000000000 (1 h).

signing_key object

SigningKey points the orchestrator at the on-disk ES256 signing key it uses to mint Node Access Tokens (Internal Node Credentials design §3.2). Empty path is valid — an orchestrator without a signing key continues to run in legacy mode (NATS .creds delivered by cell-api) and the v2 mint / JWKS endpoints stay unregistered. Operators opt into v2 by setting SigningKey.Path.

Top-level field rather than nested under an AuthConfig because AuthConfig already exists in types/config.go for HTTP API auth (JWT/OIDC) and the two concerns are independent — mixing them would force the OIDC-validation surface to know about token-mint state and vice versa.

path string

Path is the filesystem path to the signing-key PEM. Empty means the orchestrator does not mint Node Access Tokens — INC-1 / INC-2 endpoints stay unregistered, and edge nodes continue to use the legacy NATS-creds path. Required to enable v2 token issuance.

store object
backend string

Backend selects the persistence engine.

  • "boltdb" - the default. Single-file embedded KV store, mature path with the most production miles.
  • "sqlite" - WAL-mode SQLite backend. Single-file under DataDir, embedded, no external service needed.
  • "postgres" - external PostgreSQL backend. POC: validates the managed-DB / multi-writer model against SQLite/Bolt on the orchestrator side. Requires store.postgres.connection_string (or PG* env vars; see StorePostgresConfig).

Empty defaults to "boltdb" so existing config files keep working untouched. Validate() rejects any other value with a precise error.
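The defaulting and rejection rules for backend can be sketched as follows. This mirrors the description above rather than the actual Validate() implementation; `resolve_backend` and the error wording are hypothetical.

```python
SUPPORTED_BACKENDS = ("boltdb", "sqlite", "postgres")


def resolve_backend(backend: str) -> str:
    """Apply the empty-means-boltdb default and reject unknown values."""
    if backend == "":
        return "boltdb"  # keeps existing config files working untouched
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            "unsupported store backend %r; expected one of %s"
            % (backend, ", ".join(SUPPORTED_BACKENDS))
        )
    return backend
```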

gc object
deleted_jobs_retention integer<int64>

DeletedJobsRetention is how long to keep soft-deleted jobs before permanent deletion. Default: 7 days. Increase for longer audit history.

Possible values: any int64 duration in nanoseconds. Reference points: 1 (1 ns), 1000 (1 µs), 1000000 (1 ms), 1000000000 (1 s), 60000000000 (1 min), 3600000000000 (1 h).

lost_nodes_retention integer<int64>

LostNodesRetention is how long to keep lost node records after they're marked as lost. Default: 7 days. Measured from when the node transitions to Lost state. Independent of node_manager.lost_timeout.

Possible values: any int64 duration in nanoseconds. Reference points: 1 (1 ns), 1000 (1 µs), 1000000 (1 ms), 1000000000 (1 s), 60000000000 (1 min), 3600000000000 (1 h).

terminal_executions_retention integer<int64>

TerminalExecutionsRetention is how long to keep terminal execution records (complete/failed/stopped). Default: 7 days. Increase for longer execution history.

Possible values: any int64 duration in nanoseconds. Reference points: 1 (1 ns), 1000 (1 µs), 1000000 (1 ms), 1000000000 (1 s), 60000000000 (1 min), 3600000000000 (1 h).

postgres object

Postgres carries the connection settings used when Backend == "postgres". Ignored for other backends.

connection_string string

ConnectionString is the libpq DSN (URI or keyword form). When set, takes precedence over the structured fields below.

database string
host string

Structured connection fields. Used when ConnectionString is empty. Any blank field falls back to the matching PG* env var.

max_open_conns integer

MaxOpenConns caps the size of the connection pool. Tune to your database's connection budget — small managed instances often cap at 30-100 total connections, so scaling to multiple orchestrator replicas requires sizing this down accordingly.

password string

Password is the postgres credential; prefer the PGPASSWORD env var (or any libpq pgpass file) over committing this to YAML.

port integer
schema string

Schema selects the Postgres schema. Honoured regardless of whether ConnectionString or structured fields are used.

sslmode string
user string
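The precedence between connection_string and the structured fields can be sketched as below. The field-to-variable mapping uses the standard libpq environment variable names (PGHOST, PGPORT, PGDATABASE, PGUSER, PGPASSWORD, PGSSLMODE), but which variable backs which field in the orchestrator is an assumption; `resolve_postgres` is a hypothetical helper.

```python
import os
from typing import Dict, Mapping

# Assumed mapping from structured config fields to libpq environment variables.
ENV_FALLBACKS = {
    "host": "PGHOST",
    "port": "PGPORT",
    "database": "PGDATABASE",
    "user": "PGUSER",
    "password": "PGPASSWORD",
    "sslmode": "PGSSLMODE",
}


def resolve_postgres(cfg: Mapping[str, object],
                     env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Resolve connection settings: DSN first, then fields, then PG* env vars."""
    if cfg.get("connection_string"):
        # A DSN, when set, takes precedence over the structured fields.
        return {"dsn": str(cfg["connection_string"])}
    resolved = {}
    for field, env_name in ENV_FALLBACKS.items():
        value = cfg.get(field)
        if value in (None, "", 0):  # blank field: fall back to the environment
            value = env.get(env_name, "")
        resolved[field] = str(value)
    return resolved
```

Passing an explicit `env` mapping keeps the function easy to test; by default it reads the process environment.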
streaming_proxy object

Streaming proxy configuration

read_timeout_seconds integer

ReadTimeoutSeconds is the read timeout in seconds for connections to the remote log server before the connection is closed. A value of 0 means no timeout.

remote_endpoint string

RemoteEndpoint is the endpoint of the log server to proxy to, typically a Loki instance

remote_token string

RemoteToken is the authentication token used to access the log server

telemetry object
authentication object

Authentication configures authentication for telemetry exporters.

namespace string

Namespace groups telemetry data for all nodes in a namespace.

token string

Token is the authentication token or password.

type string

Type is the authentication type (e.g., "Basic").

do_not_track boolean

DoNotTrack disables telemetry collection when true.

drop_metric_prefixes string[]

DropMetricPrefixes specifies metric name prefixes to drop.

endpoint string

Endpoint is the telemetry collector endpoint (host:port).

endpoint_path string

EndpointPath is an optional path prefix under which the collector serves /v1/metrics (or similar).

error_reporting object

ErrorReporting configures error reporting (e.g., Sentry).

endpoint string

Endpoint is the DSN/URL for the error reporting service (e.g., Sentry DSN).

export_interval integer<int64>

ExportInterval is how often metrics are exported.

Possible values: any int64 duration in nanoseconds. Reference points: 1 (1 ns), 1000 (1 µs), 1000000 (1 ms), 1000000000 (1 s), 60000000000 (1 min), 3600000000000 (1 h).

headers object

Headers are sent with every export request (e.g. auth headers).

property name* string
include_go_metrics boolean

IncludeGoMetrics enables collection of Go runtime metrics.

insecure boolean

Insecure disables TLS verification.

process_metrics_interval integer<int64>

ProcessMetricsInterval is how often process metrics are collected.

Possible values: any int64 duration in nanoseconds. Reference points: 1 (1 ns), 1000 (1 µs), 1000000 (1 ms), 1000000000 (1 s), 60000000000 (1 min), 3600000000000 (1 h).

protocol string

Protocol specifies the export protocol: "grpc" or "http".

resource_attributes object

ResourceAttributes are additional OTel resource attributes.

property name* string
resource_detectors string[]

ResourceDetectors enables optional OTel resource detectors that add labels to all metrics. Empty by default — opt in to what you need. Supported values: "host" (host.name, host.id, host.arch), "os" (os.type), "container" (container.id)

token object

Token configures the Node Access Token issuance policy: how long minted tokens stay valid and what the iss claim contains. Read at startup by the /token handler (ENG-2319). Zero values fall back to sensible defaults (1h TTL, empty iss surfaces as the orchestrator's HTTP base URL). Independent of SigningKey because issuance policy is a runtime decision (how long the token is good for, who claims to have issued it) while the signing key is the cryptographic material that backs the signature.

issuer string

Issuer is the value written to each token's iss claim. Empty (default) means the handler falls back to the orchestrator's HTTP base URL inferred from the request — the same URL the JWKS endpoint will be published at, so downstream verifiers can resolve issuer→JWKS via OIDC discovery (ENG-2320). Set explicitly when the orchestrator runs behind a TLS-terminating proxy that rewrites Host.

ttl integer<int64>

TTL is the lifetime of each issued Node Access Token. Zero (default) means 1 hour. Operators can shorten this for high-rotation environments or lengthen it where DPoP refresh cost is meaningful.

Possible values: any int64 duration in nanoseconds. Reference points: 1 (1 ns), 1000 (1 µs), 1000000 (1 ms), 1000000000 (1 s), 60000000000 (1 min), 3600000000000 (1 h).
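The two fallbacks in the token block (empty issuer, zero TTL) can be illustrated together. This is a sketch under stated assumptions: the claim layout (iat/exp as Unix seconds) and the names `token_claims` and `request_base_url` are hypothetical, and ttl is assumed to be a nanosecond count.

```python
from typing import Dict, Union

NS_PER_SECOND = 1_000_000_000
DEFAULT_TTL_NS = 3_600_000_000_000  # documented default: 1 hour


def token_claims(issuer: str, ttl_ns: int, request_base_url: str,
                 now: int) -> Dict[str, Union[str, int]]:
    """Build iss/iat/exp, applying the documented zero-value fallbacks."""
    effective_ttl = ttl_ns if ttl_ns != 0 else DEFAULT_TTL_NS
    # Empty issuer falls back to the base URL inferred from the request.
    effective_iss = issuer if issuer else request_base_url
    return {
        "iss": effective_iss,
        "iat": now,
        "exp": now + effective_ttl // NS_PER_SECOND,
    }
```

With both fields left at their zero values, a token issued at time t carries the request's base URL as iss and expires at t + 3600.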

transport object
address string
credentials_path string
insecure boolean
listen_addr string

ListenAddr, when set, runs an embedded server for nodes to connect to. If specified, this overrides any server address from credentials/bootstrapping.

network_id string
node_id string

Connection config settings

refresh_address string
require_tls boolean
reverse_proxy boolean
types.OrchestratorConfig
{
  "admin_auth": {
    "actor_claim": "string",
    "audience": "string",
    "clock_skew": -9223372036854776000,
    "enabled": true,
    "issuer": "string",
    "max_body_bytes": 0,
    "pubkey_paths": [
      "string"
    ],
    "require_ath": true
  },
  "api": {
    "auth": {
      "jwt": {
        "audience": "string",
        "issuer": "string",
        "network_claim_name": "string",
        "organization_claim_name": "string",
        "token_endpoint": "string"
      },
      "organization_id": "string"
    },
    "cors": {
      "allowed_origins": [
        "string"
      ]
    },
    "listen_addr": "string"
  },
  "data_dir": "string",
  "evaluation_broker": {
    "initial_retry_delay": -9223372036854776000,
    "max_retry_count": 0,
    "recovery_recency_window": -9223372036854776000,
    "visibility_timeout": -9223372036854776000
  },
  "log": {
    "format": "string",
    "level": "string"
  },
  "name": "string",
  "name_provider": "string",
  "node_manager": {
    "connected_after": -9223372036854776000,
    "disconnect_timeout": -9223372036854776000,
    "heartbeat_interval": -9223372036854776000,
    "lost_timeout": -9223372036854776000
  },
  "scheduler": {
    "execution_limit_backoff": -9223372036854776000,
    "max_executions_per_evaluation": 0,
    "max_executions_per_transaction": 0,
    "queue_backoff": -9223372036854776000,
    "queue_timeout_never": -9223372036854776000,
    "queue_timeout_other": -9223372036854776000,
    "worker_count": 0
  },
  "shutdown_timeout": -9223372036854776000,
  "signing_key": {
    "path": "string"
  },
  "store": {
    "backend": "string",
    "gc": {
      "deleted_jobs_retention": -9223372036854776000,
      "lost_nodes_retention": -9223372036854776000,
      "terminal_executions_retention": -9223372036854776000
    },
    "postgres": {
      "connection_string": "string",
      "database": "string",
      "host": "string",
      "max_open_conns": 0,
      "password": "string",
      "port": 0,
      "schema": "string",
      "sslmode": "string",
      "user": "string"
    }
  },
  "streaming_proxy": {
    "read_timeout_seconds": 0,
    "remote_endpoint": "string",
    "remote_token": "string"
  },
  "telemetry": {
    "authentication": {
      "namespace": "string",
      "token": "string",
      "type": "string"
    },
    "do_not_track": true,
    "drop_metric_prefixes": [
      "string"
    ],
    "endpoint": "string",
    "endpoint_path": "string",
    "error_reporting": {
      "endpoint": "string"
    },
    "export_interval": -9223372036854776000,
    "headers": {},
    "include_go_metrics": true,
    "insecure": true,
    "process_metrics_interval": -9223372036854776000,
    "protocol": "string",
    "resource_attributes": {},
    "resource_detectors": [
      "string"
    ]
  },
  "token": {
    "issuer": "string",
    "ttl": -9223372036854776000
  },
  "transport": {
    "address": "string",
    "credentials_path": "string",
    "insecure": true,
    "listen_addr": "string",
    "network_id": "string",
    "node_id": "string",
    "refresh_address": "string",
    "require_tls": true,
    "reverse_proxy": true
  }
}