KEYCLOAK-108- Spike - Zero down time realm import

Spike Overview

https://folio-org.atlassian.net/browse/KEYCLOAK-108

1. Executive Summary

The current FOLIO procedure for moving a tenant between Keycloak clusters relies on kc.sh export on the source side and kc.sh import on the destination side. The destination step requires all Keycloak nodes in the cluster to be stopped, which translates into tenant downtime during the cutover. This spike investigates whether that downtime can be eliminated for the migration scenario specific to FOLIO — namely, importing a realm into a destination cluster where no realm with that name already exists.

A review of the official Keycloak 26.5.x documentation, the source code of DefaultExportImportManager and PartialImportManager, and the official GitHub discussion thread on the topic leads to the following conclusion:

A zero-downtime migration is achievable as a pure operational procedure, with no FOLIO code changes. The supported zero-downtime path is to feed the exported RealmRepresentation JSON to the running destination cluster's Admin REST API (POST /admin/realms) using a standard curl/jq shell script. This is the same code path that mgr-tenants already uses to create new realms during normal tenant onboarding. Doing this against a non-existing realm avoids every consistency hazard that the documentation warns about, since those hazards are specifically about overwrite — and overwrite is not part of FOLIO's migration scenario.

The kc.sh import command must continue to be treated as offline-only, and a DB-level copy must be treated as unsupported and dangerous for a multi-node cluster because Keycloak's local Infinispan caches are invalidation-based, not replicated.


2. Background and Problem Statement

2.1 Current state

  • FOLIO tenants map 1:1 to Keycloak realms. mgr-tenants creates, updates, and deletes realms via the Keycloak Admin REST API (POST /admin/realms, DELETE /admin/realms/{realm}).

  • When a tenant must be moved from cluster A to cluster B, the operations team:

    1. Performs kc.sh export --realm <tenant> on a clone of cluster A's database (this avoids stopping production).

    2. Performs kc.sh import --file <tenant>-realm.json on cluster B — and this requires the destination cluster to be shut down.

  • Step 2 is the source of the downtime that this spike must eliminate.

3. What the Keycloak Documentation Actually Says

Quoting verbatim from keycloak.org/server/importExport (current as of Keycloak 26.5.x):

"The import and export commands are not designed to be run from the same machine as a running server instance, which may result in port or other conflicts."

"It is recommended that all Keycloak nodes are stopped prior to using the kc.[sh|bat] export command. This ensures that the results will have no consistency issues with user or realm modifications during the export."

"It is required that all Keycloak nodes are stopped prior to using the kc.[sh|bat] import command with the override option. The command does not attach to the cache cluster, so overwriting a realm will lead to inconsistent caches in the cluster, which then would show and use inconsistent or outdated information."

"Instead of overwriting a realm with the import command, consider using the Admin API to delete realms that need to be overwritten prior to running the import."

"Your Keycloak server instance must not be started when invoking this command." (applies to both import and export commands.)

And from a Keycloak collaborator (shawkins) responding to the official "How to Import & Export without downtime" discussion (keycloak#31044):

"The primary concern has been consistency, so the recommendation has been to shutdown down the cluster. The hope is that #37991 will alleviate that concern for most scenarios. […] import/export commands aren't specifically tweaked to run alongside a running server — since they are effectively a server launch that terminates after the import / export operation there can be conflicts. The current workaround is to run the import/export command in a separate container or Pod."

4. How Keycloak Imports a Realm (Source-Code View)

This section answers the question "why does the documentation say to shut down Keycloak?" by showing what actually happens inside the JVM during an import.

4.1 Bootstrap path (kc.sh import)

kc.sh import is implemented as a special server launch that runs the import inside Keycloak's normal bootstrap sequence and then exits. The relevant code lives in KeycloakApplication.bootstrap():

// Bootstrap master realm, import realms and create admin user.
protected ExportImportManager bootstrap(KeycloakSession session) {
    logger.debug("bootstrap");
    // ...
    exportImportManager.runImport();
    createTemporaryAdmin(session);
    // ...
    if (!bootstrapState.newInstall) {
        bootstrapState.exportImportManager.runImport();
    }
}

runImport() calls DefaultExportImportManager.importRealm(InputStream), which executes within a single JPA transaction that the bootstrap launcher configures via:

ExportImportConfig.setSingleTransaction(true);
// ... runImport() ...
ExportImportConfig.setSingleTransaction(false);

Two important consequences follow:

  1. kc.sh import is not "a CLI tool talking to the DB" — it is a full JVM startup that happens to skip serving HTTP and exits afterwards. That JVM goes through Liquibase migrations, builds Quarkus, opens a JDBC pool, and then runs the import inside the same KeycloakSession machinery the live server uses.

  2. It does not join the cluster. It does not register itself as a JGroups node, does not subscribe to the Infinispan invalidation channel, and does not publish RealmUpdatedEvents.

That second point is exactly what the documentation warns about for --override: when the CLI rewrites an existing REALM row, the live nodes' local caches still hold the old CachedRealm and CachedClient objects and there is no message telling them to drop those entries.

4.2 Admin REST API path (POST /admin/realms)

When you POST a RealmRepresentation JSON to the running cluster's Admin REST API, the request is handled by RealmsAdminResource.importRealm(...), which calls the same DefaultExportImportManager.importRealm(InputStream) method — but now from inside a regular running Keycloak node. Because the call is happening inside a normal KeycloakSession:

  • Liquibase migrations are not re-run.

  • The single transaction is the request transaction, committed normally.

  • RealmModel creation fires RealmModel.RealmCreationEvent through KeycloakSession.getKeycloakSessionFactory().publish(...), which Infinispan listens for and propagates as a cluster invalidation.

  • All other nodes in the cluster pick up the new realm on first access, or via the standard cache invalidation mechanism described in Keycloak's caching docs (local cache + invalidation messages, not replication).

In other words: the same JSON file, fed to the same Java method, behaves completely differently depending on whether the surrounding process is kc.sh import (bootstrap, no cluster) or a running node serving the Admin REST API (cluster-aware, transactional, event-publishing).

4.3 What the partial import path covers

PartialImportManager is the implementation behind POST /admin/realms/{realm}/partialImport. The constructor on main (source) reads:

public PartialImportManager(PartialImportRepresentation rep, KeycloakSession session, RealmModel realm) {
    this.rep = rep;
    this.session = session;
    this.realm = realm;
    partialImports.add(new ClientsPartialImport());
    partialImports.add(new RolesPartialImport());
    partialImports.add(new IdentityProvidersPartialImport());
    partialImports.add(new IdentityProviderMappersPartialImport());
    partialImports.add(new GroupsPartialImport());
    partialImports.add(new UsersPartialImport());
}

Note carefully what is not in that list:

  • No ClientScopesPartialImport (confirmed open issue keycloak#16289, "Partial realm import ignores client scopes and client scope mappings").

  • No authentication flows / authenticator configs / required actions.

  • No realm-level settings (token lifespans, brute-force, themes, SMTP, password policy, …).

  • No components (key providers, user storage federation, custom SPI configs).

  • No authorization policies/permissions (resource server config).

  • No client-scope mappings or default scopes.

  • No realm events configuration, localization, or organization data.

For FOLIO, this is a hard blocker: realms used by FOLIO contain a mod-tenant-managed authentication flow, custom client scopes (per KC_LOGIN_CLIENT_SUFFIX), authorization permissions populated by mgr-tenants from module descriptors (mgr-tenants README), and per-tenant clients (-login-application, sidecar-module-access-client, password-reset-client). A "partial import" therefore reproduces only a fraction of a FOLIO realm and will not produce a working tenant.
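The gap can be made concrete by comparing the six handlers above with the top-level keys of an exported realm file. The following is a minimal Python sketch, assuming the export uses the usual RealmRepresentation field names for those six resource types (clients, roles, identityProviders, identityProviderMappers, groups, users). The helper name partial_import_gaps and the sample realm are hypothetical and only illustrate the idea; the check looks at top-level keys only.

```python
# Top-level export keys assumed to correspond to the six handlers registered
# in the PartialImportManager constructor quoted above.
PARTIAL_IMPORT_KEYS = frozenset({
    "clients", "roles", "identityProviders",
    "identityProviderMappers", "groups", "users",
})

# Keys that merely identify the realm; excluded so the report lists content only.
IDENTITY_KEYS = frozenset({"id", "realm"})


def partial_import_gaps(realm: dict) -> list[str]:
    """Return the top-level keys of an export that partial import would not carry."""
    skipped = PARTIAL_IMPORT_KEYS | IDENTITY_KEYS
    return sorted(key for key in realm if key not in skipped)


if __name__ == "__main__":
    # Hypothetical, heavily trimmed realm export used only for illustration.
    sample = {
        "realm": "tenant1",
        "clients": [],
        "users": [],
        "clientScopes": [],
        "authenticationFlows": [],
        "components": {},
    }
    print(partial_import_gaps(sample))
```

Running such a check against a real FOLIO export would list every section that has to travel by some other route if partial import were used.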

4.4 Cache architecture — why DB copy is dangerous

Per the Infinispan invalidation cache documentation, Keycloak's realms, users, keys, and authorization caches are configured as invalidation caches, not replicated caches:

"When a cache is configured for invalidation, each data change in a cache triggers a message to other caches in the cluster, informing them that their data is now stale and should be removed from memory. […] Never use invalidation mode without a persistent store […] some nodes will keep seeing the stale value."

The only mechanism by which a Keycloak node learns that another node has changed a realm is the JGroups invalidation message that Keycloak sends when it executes the change through KeycloakSession. A direct INSERT into the REALM table (and the related ~90 tables) does not generate such a message. The result is a cluster where:

  • Some nodes return 404 for /realms/<tenant>/.well-known/openid-configuration because their local cache has a negative entry for the realm name.

  • Other nodes successfully serve the realm.

  • The behaviour is non-deterministic and stays broken until a node restarts or is forced to invalidate.

This is the same root cause as the --override warning — and applies equally to any approach that mutates the DB without going through a running KeycloakSession.
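The failure mode can be illustrated with a toy model. The sketch below is not Keycloak code: the Node class, its methods, and the simulate helper are hypothetical, and it assumes, as described above, that a node remembers a negative lookup for a realm name. It only shows why a write that bypasses invalidation leaves one node stale while another serves the realm.

```python
class Node:
    """One node with a local realm cache (toy model, not Keycloak code)."""

    def __init__(self, name, db, cluster):
        self.name = name
        self.db = db          # shared "database": realm name -> config
        self.cache = {}       # local cache, may hold negative (None) entries
        self.cluster = cluster
        cluster.append(self)

    def get_realm(self, realm):
        # On a cache miss, read the DB and remember the answer, even "absent".
        if realm not in self.cache:
            self.cache[realm] = self.db.get(realm)
        return self.cache[realm]

    def create_realm(self, realm, config):
        # Session-style write: persist, then tell every node to drop its entry.
        self.db[realm] = config
        for node in self.cluster:
            node.cache.pop(realm, None)


def simulate(direct_db_write: bool):
    """Return what node-1 and node-2 see after the realm is created."""
    db, cluster = {}, []
    node1 = Node("node-1", db, cluster)
    node2 = Node("node-2", db, cluster)
    node1.get_realm("tenant1")                  # node-1 caches "not found"
    if direct_db_write:
        db["tenant1"] = {"enabled": True}       # bypasses invalidation
    else:
        node2.create_realm("tenant1", {"enabled": True})
    return node1.get_realm("tenant1"), node2.get_realm("tenant1")
```

In this model, simulate(True) leaves node-1 answering None while node-2 returns the realm, whereas simulate(False) gives both nodes the same answer because the write went through the invalidating path.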

5. Detailed Analysis of Each Option

The four options put forward in the spike charter are analysed below. Each section ends with a Verdict line summarising whether the option is viable for FOLIO's "move tenant to a clean destination cluster, no FOLIO code changes" scenario.

5.1 Option A — kc.sh import against the running destination cluster

What it would look like. Run bin/kc.sh import --file <tenant>-realm.json on a sidecar pod that connects to the destination cluster's database, while the destination cluster's other Keycloak pods continue to serve traffic.

What actually happens in practice.

  • If executed on the same host as a Keycloak node, it conflicts on ports and on the optimised build cache, per the official documentation ("not designed to be run from the same machine as a running server instance, which may result in port or other conflicts" — importExport docs).

  • If executed in a separate container/pod with the same KC_DB_URL, the port/build conflict disappears (this is the workaround explicitly recommended by maintainer shawkins), but the cache-invalidation problem remains for any existing realm.

  • For a brand-new realm (FOLIO's case), no node has a CachedRealm for the new name yet, so there is no stale entry to invalidate. The first request that hits any live node for /realms/<new-tenant>/... goes through RealmProvider.getRealmByName(), which on cache-miss reads from JPA and populates the cache from the freshly-inserted rows. In theory the new-realm case is therefore safe.

  • However, two practical risks remain even for the new-realm case:

    1. Liquibase race. kc.sh import runs Keycloak's full bootstrap, which includes a Liquibase schema check. If the destination cluster is on a different patch version than the importing CLI image, Liquibase may attempt schema changes while live nodes are running — this is exactly the scenario the Keycloak upgrading guide warns about for online schema migration.

    2. Liquibase advisory lock. Liquibase takes an advisory lock on the DATABASECHANGELOGLOCK table at startup. If a live node restarts (HPA scale-up, rolling restart, pod eviction) while kc.sh import holds that lock, the live node will block on bootstrap until the lock releases, producing partial-cluster downtime that looks like a flake.

  • Maintainer shawkins confirms there is no plan to make kc.sh import cluster-aware: "import/export commands aren't specifically tweaked to run along side a running server" (keycloak#31044).

Pros

  • Reuses the existing exported file format.

  • No FOLIO code changes.

  • Imports the realm in full fidelity.

Cons

  • Officially unsupported as an online operation.

  • Liquibase lock can stall live node restarts.

  • Optimised-build side effect can affect the next start of any node sharing storage.

  • Behaviour is documented as "may have conflicts" — not "will work" — so it cannot be relied on in production runbooks.

Verdict. Not recommended as a primary path. Acceptable only as a fallback for an offline migration window. The "separate pod, separate JVM" workaround removes the port conflict but does not remove the Liquibase lock or the build-cache side effects.

5.2 Option B — Admin Console / Admin REST API "Partial Import"

What it would look like. Manually create an empty realm via the destination Admin Console, then upload the source export through the Admin Console's "Partial Import" dialog.

What it actually covers. Per the source code and the official documentation:

"The Admin Console partial import can also import files created by the CLI export command. […] If the file contains users, those users will also be available for importing into the current realm." (importExport docs)

But — also from the docs: "the Admin Console also offers only the capability to partially export a realm. […] The users for that realm cannot be exported using this method." And the partial import handler list is fixed at six resource types (clients, roles, identity providers, identity-provider mappers, groups, users).

Gaps that matter for FOLIO

| FOLIO realm element | Covered by partial import? |
|---|---|
| Realm-level settings (token lifespans, password policy, brute-force, SSL mode) | ❌ |
| Authentication flows (FOLIO's mod-tenant flow, broker flows from SSO Configuration In Keycloak for Folio) | ❌ |
| Required actions and authenticator configs | ❌ |
| Client scopes (-login-application scopes) | ❌ (keycloak#16289) |
| Default & optional client scope mappings | ❌ |
| Authorization permissions and resources (populated by mgr-tenants) | ❌ |
| Components (key providers, RSA keys, HMAC keys) | ❌ |
| Realm SMTP, themes, events config, localisation | ❌ |
| Organizations (Keycloak 26 organisations feature) | ❌ |
| Clients | ✅ |
| Roles (realm + client) | ✅ |
| Groups | ✅ |
| Users + credentials | ✅ |
| Identity providers + mappers | ✅ |

Pros

  • Online operation by design.

  • Cache stays consistent across the cluster.

  • No FOLIO code changes.

Cons

  • Loses authentication flows, client scopes, authorization permissions, components, realm-level config. These are not optional for a working FOLIO tenant.

  • Manual UI workflow is not scriptable for production rollout.

Verdict. Not sufficient as a standalone solution. Could be used as a complement to Option C (e.g., to stream users in batches), but cannot carry a FOLIO migration on its own.

5.3 Option C — Full realm POST /admin/realms against the running cluster (Recommended)

This is the recommended path.

Pros

  • Fully online. No destination-cluster downtime.

  • Cache-correct. Goes through KeycloakSession; invalidation events fire as expected.

  • Full fidelity. All realm content carried over.

  • No FOLIO code changes. Pure operational shell script.

  • No new operational surface. Same endpoint mgr-tenants already uses for tenant creation.

  • Reversible. If the import fails, a DELETE /admin/realms/{name} cleans up cleanly.

Cons

  • Source export must still be done from a DB clone or during a quiet period to avoid in-flight inconsistency (this is unchanged from today).

  • Body size requires chunking for large tenants — handled in the script.

  • Client secrets travel in plaintext through the migration channel — must be in-VPC or TLS-only.

  • Refresh-token signing keys must be migrated together; otherwise existing refresh tokens are invalidated at next use.

Verdict. Recommended primary path. Aligns with FOLIO's existing integration model, satisfies zero-downtime requirement, and sidesteps every documented hazard — without modifying any FOLIO/EUREKA component.

5.4 Option D — Database-level copy / dump / pg_dump --table

What it would look like. Identify the ~90 Keycloak tables that contain rows scoped by realm (e.g. REALM, CLIENT, CLIENT_SCOPE, USER_ENTITY, CREDENTIAL, RESOURCE_SERVER, RESOURCE_SERVER_RESOURCE, COMPONENT, KEYCLOAK_ROLE, USER_ROLE_MAPPING, GROUP_ROLE_MAPPING, OFFLINE_USER_SESSION, …) and copy the rows where REALM_ID = <source> to the destination DB.

Why this is dangerous in a live cluster.

  1. Cache invalidation. A direct DB write does not produce a JGroups invalidation event. Live destination nodes that have ever cached a negative lookup for the new realm name will continue to return 404 until they are restarted or the cache TTL expires.

  2. Foreign-key graph is large and not flat. Keycloak's schema has ~90 tables (Keycloak forum). The realm-scoped subgraph includes RESOURCE_SERVER (tied to client UUIDs that must be preserved), COMPONENT_CONFIG (key-value config rows tied to component UUIDs), CLIENT_AUTH_FLOW_BINDINGS, IDENTITY_PROVIDER_CONFIG, AUTHENTICATION_EXECUTION (which references AUTHENTICATION_FLOW.ID, AUTHENTICATION_CONFIG.ID, and PARENT_FLOW), and USER_FEDERATION_*. Hand-crafting a transactional copy of this subgraph is feasible but brittle — any new table introduced by a future Keycloak version (e.g. organisations in 26.x — see keycloak#38258) silently breaks the procedure.

  3. Encryption keys live in components. KeyProvider components store the realm signing material in COMPONENT_CONFIG rows, including private keys. Copying these is correct (preserves token signatures), but missing them is a silent bug — newly issued tokens will be signed with destination keys while the realm advertises the source keys' KIDs, breaking JWKS verification.

  4. Schema drift. Source and destination clusters may be on different Keycloak patch versions. Even within 26.5.x, Liquibase changesets may differ. A row-level copy can succeed at the SQL level and still be schema-incompatible at the Java level.

  5. Online-schema migration warning from upstream. The Keycloak upgrading guide explicitly warns that index creation during upgrades may need manual application; this assumes cluster operators control the schema evolution. A manual realm copy that runs around this assumption can leave indexes out of sync.

Pros

  • Theoretically, the fastest mechanism for tenants with millions of users.

  • Independent of Keycloak's feature flags or admin-API versioning.

Cons

  • Unsupported by Keycloak. No documentation, no upstream test coverage.

  • Cluster-cache hazard unless every destination node is restarted after the copy — which reintroduces downtime for every tenant on the destination cluster.

  • Schema-drift fragility between Keycloak patch versions.

  • Encryption-key correctness is not enforced by anything.

  • Future-proofing risk: new tables in Keycloak patch releases will silently be omitted.

Verdict. Not recommended. If this option is ever selected, it must be combined with a destination-cluster rolling restart, which defeats the zero-downtime objective for the other tenants on the cluster.

6. What Happens Step-by-Step During Option C — Detailed Walkthrough

This section explains, in detail, what occurs at every layer of the system when the recommended migration script runs. This is the level of detail needed to gain operations approval and to debug problems during a real migration.

6.1 The actors involved

  • Source cluster — the Keycloak cluster the tenant is being moved from. Plays no active role during destination import; only its database clone is used to generate the export bundle.

  • DB clone of source — a temporary PostgreSQL instance restored from a backup of the source cluster. Used purely as a read source for kc.sh export.

  • Migration runner — a workstation, jumpbox, or CI pod that has network reach to (a) the DB clone and (b) the destination cluster's Admin REST endpoint.

  • Destination cluster — the Keycloak cluster the tenant is being moved to. Has N nodes (typically 2 or 3) behind a load balancer. Stays running throughout.

  • Destination Postgres — the destination cluster's primary database. Receives the new realm rows when the destination Keycloak processes the API call.

  • Master-realm admin client — the existing admin service-account on the destination cluster (KC_ADMIN_CLIENT_ID / KC_ADMIN_CLIENT_SECRET) that the script uses to authenticate. No new credentials are needed.

6.2 Walkthrough — what happens at each step

Step 1 — Generate the export bundle on the DB clone

The migration runner triggers kc.sh export --realm <tenant> --users different_files --dir /export against the DB clone (not the live source DB).

Inside the JVM that runs the export:

  • Keycloak boots in "export mode" — a special Quarkus profile that loads the realm provider and JPA datasource but does not open the HTTP port.

  • It reads the requested realm's rows from Postgres into a RealmRepresentation object using the same JPA mappers the live server uses for reads.

  • It serialises the RealmRepresentation to <tenant>-realm.json and writes user batches to <tenant>-users-0.json, <tenant>-users-1.json, …

  • The JVM exits.

What ends up in the bundle:

  • <tenant>-realm.json — every realm-scoped configuration item: realm settings, clients, client scopes (with mappings), authentication flows, authenticator configs, required actions, identity providers, IdP mappers, groups, realm/client roles, components (including KeyProvider components with private signing keys), authorization settings on each client, organisations, SMTP, themes, password policy, OTP policy, browser security headers, events config.

  • <tenant>-users-N.json — user records with their credentials (password hashes, OTP secrets, federated-identity links), roles, group memberships, attributes.

Why the clone matters: an export from the live source DB risks consistency issues (e.g., a user being added during the read pass). The clone is a frozen snapshot. This is unchanged from current FOLIO practice.
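Once the export has finished, the runner needs to locate the bundle files in a predictable order. The following is a minimal Python sketch, assuming the file naming described above (<tenant>-realm.json plus <tenant>-users-N.json). The function name find_bundle is hypothetical.

```python
import re
from pathlib import Path


def find_bundle(export_dir: str, tenant: str) -> tuple[Path, list[Path]]:
    """Return (realm_file, user_files), with user files in numeric batch order."""
    root = Path(export_dir)
    realm_file = root / f"{tenant}-realm.json"
    if not realm_file.is_file():
        raise FileNotFoundError(f"missing realm file: {realm_file}")

    pattern = re.compile(rf"^{re.escape(tenant)}-users-(\d+)\.json$")
    numbered = []
    for path in root.iterdir():
        match = pattern.match(path.name)
        if match:
            numbered.append((int(match.group(1)), path))

    # Sort on the integer suffix so that users-10 follows users-9, not users-1.
    return realm_file, [path for _, path in sorted(numbered)]
```

Sorting on the parsed integer rather than the file name keeps batch order stable once a tenant has ten or more user files.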

Step 2 — Pre-flight checks on the destination

Before any write, the script verifies:

  • Network reach — destination Admin endpoint responds to GET /health/ready.

  • Realm absence — GET /admin/realms/<tenant> returns 404 Not Found. If it returns 200, the script aborts. This is the fundamental safety check that keeps us in the "no overwrite" code path.

  • Token acquisition — script obtains an admin access token using the destination's KC_ADMIN_CLIENT_ID / KC_ADMIN_CLIENT_SECRET via the client_credentials grant against /realms/master/protocol/openid-connect/token.

  • Bundle sanity — script verifies the export contains a components array with at least one entry of provider type org.keycloak.keys.KeyProvider (otherwise refresh tokens issued by the source will not validate after migration).
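The checks above can be sketched as small, separately testable helpers. This is a hedged Python illustration, not the migration script itself: the function names are hypothetical, the base URL is assumed to have no extra context path, and the components section is assumed to be keyed by provider type (with a fallback for a flat list carrying a providerType field).

```python
import urllib.parse
import urllib.request

KEY_PROVIDER = "org.keycloak.keys.KeyProvider"


def token_request(base_url: str, client_id: str, client_secret: str) -> urllib.request.Request:
    """Build (but do not send) the client_credentials token request."""
    form = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    url = f"{base_url.rstrip('/')}/realms/master/protocol/openid-connect/token"
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return urllib.request.Request(url, data=form, headers=headers, method="POST")


def realm_is_absent(status: int) -> bool:
    """Interpret the HTTP status of GET /admin/realms/<tenant>."""
    if status == 404:
        return True
    if status == 200:
        return False
    # Anything else (401, 403, 5xx, ...) is not evidence either way.
    raise RuntimeError(f"unexpected status {status}; aborting pre-flight")


def has_key_provider(realm: dict) -> bool:
    """True if the export carries at least one KeyProvider component."""
    components = realm.get("components") or {}
    if isinstance(components, dict):
        # Assumed shape: {"<provider type>": [component, ...], ...}
        return bool(components.get(KEY_PROVIDER))
    # Fallback shape: flat list of components with a providerType field.
    return any(c.get("providerType") == KEY_PROVIDER for c in components)
```

Treating every status other than 404 and 200 as an error keeps an expired token or a proxy failure from being mistaken for "realm does not exist".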

Step 3 — Strip users from the realm body

Before posting the realm, the script removes any users and federatedUsers arrays from <tenant>-realm.json and writes a body file <tenant>-realm-no-users.json. Reasons:

  • The default Quarkus body limit on the Admin REST endpoint is 10 MB; large user arrays push the body over that limit.

  • Streaming users separately allows the script to chunk them into batches and recover from per-batch failures.

The realm body remaining after the strip still contains everything else (clients, flows, components, …), so realm fidelity is not affected.
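The strip-and-chunk step can be sketched as follows. This is a minimal Python illustration with hypothetical helper names (strip_users, batches). It assumes users and federatedUsers appear as top-level arrays, as stated above, and it returns the removed arrays rather than discarding them.

```python
import copy


def strip_users(realm: dict) -> tuple[dict, dict]:
    """Return (realm_without_users, removed_arrays) without mutating the input."""
    body = copy.deepcopy(realm)
    removed = {
        key: body.pop(key, None) or []
        for key in ("users", "federatedUsers")
    }
    return body, removed


def batches(items: list, size: int):
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Working on a deep copy leaves the original export untouched, so the same bundle can be re-read if a later step fails and the migration has to be retried.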

Step 4 — POST the realm body

The script issues:

POST https://<destination>/admin/realms
Authorization: Bearer <admin-token>
Content-Type: application/json
Body: <tenant>-realm-no-users.json

The load balancer routes this to one of the destination Keycloak nodes — call it node-1.
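For illustration, the request above could be assembled with the Python standard library as sketched below. The function names are hypothetical, the base URL is assumed to have no extra context path, and the 10 MB ceiling is taken from the body-limit figure given in Step 3 rather than read from the server.

```python
import json
import urllib.request

# Assumption: mirrors the 10 MB default body limit mentioned in Step 3.
MAX_BODY_BYTES = 10 * 1024 * 1024


def realm_post_request(base_url: str, token: str, realm_body: dict,
                       max_bytes: int = MAX_BODY_BYTES) -> urllib.request.Request:
    """Build (but do not send) the POST /admin/realms request."""
    payload = json.dumps(realm_body).encode("utf-8")
    if len(payload) > max_bytes:
        raise ValueError(f"realm body is {len(payload)} bytes, over the {max_bytes} limit")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/admin/realms",
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 300.0) -> int:
    """Send the request and return the HTTP status code."""
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Separating request construction from sending lets the body size and headers be checked before anything reaches the destination cluster.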

What node-1 does, in order:

  1. Authenticates the request. The bearer token is validated against the master realm's signing keys. The script's caller is identified as a master-realm admin.

  2. Authorizes the request. Master-realm admins have the create-realm role; the request is allowed.

  3. Begins a JTA transaction that wraps the entire import.

  4. Calls DefaultExportImportManager.importRealm(InputStream). This iterates the JSON, creating each entity in dependency order: realm row → roles → client scopes → clients → authentication flows → authenticator configs → required actions → components (including key providers) → identity providers + mappers → groups → authorization settings on clients → organisations → realm-default config (themes, SMTP, password policy, …).

  5. Each entity creation goes through the normal KeycloakSession providers. This means side effects fire correctly — for example, creating a KeyProvider component triggers key registration in the realm's key cache.