ARCH-346: Keycloak resource optimization

Overview

Old Permission Model

New Permission Model

Performance Impact on Authorization Requests

Will it have any performance impact?
Yes. Runtime authorization (read/evaluation) should become noticeably faster, while writes (assignments) take on some additional overhead.

  • Runtime Authorization (Positive Impact): The authorization check uses the UMA ticket grant (grant_type=urn:ietf:params:oauth:grant-type:uma-ticket). Keycloak's policy evaluation engine performs significantly better when there are fewer permission objects to scan and evaluate. By consolidating policies under permission placeholders (e.g., grouping all roles/users that can access /users/{id}#GET into aggregated policies assigned to a single permission), Keycloak spends less time resolving the 1-1-1 relationships while it evaluates the token request. This directly reduces latency for the authorization call.

  • Assignment/Write Operations (Negative Impact / Bottleneck): Because mod-roles-keycloak is deployed in HA mode, concurrent role/user assignments could lead to race conditions or database deadlocks in Keycloak when multiple containers attempt to update the same permission or policy simultaneously (e.g., adding multiple users to the same permission policy at the exact same time).
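For reference, the runtime check described above can be sketched as a form-encoded request to the realm's token endpoint. This is a minimal sketch that only builds the request and sends nothing. The base URL, realm, and client name are hypothetical, and the path assumes a Keycloak distribution without the legacy /auth prefix. The caller's access token would travel in the Authorization: Bearer header.

```python
from urllib.parse import urlencode

UMA_GRANT = "urn:ietf:params:oauth:grant-type:uma-ticket"


def build_decision_request(base_url, realm, audience, resource, scope):
    """Return (url, form_body) for a UMA decision request; nothing is sent."""
    url = f"{base_url}/realms/{realm}/protocol/openid-connect/token"
    body = urlencode({
        "grant_type": UMA_GRANT,
        "audience": audience,                  # resource server client id
        "permission": f"{resource}#{scope}",   # the resource#scope being checked
        "response_mode": "decision",           # ask for a yes/no result, not an RPT
    })
    return url, body


# Hypothetical host, realm, and client, used only for illustration.
url, body = build_decision_request(
    "https://keycloak.example.org", "diku", "diku-application", "/users/{id}", "GET"
)
print(url)
print(body)
```

Each such request makes Keycloak evaluate the permissions attached to the requested resource and scope, which is why the number of permission objects matters for latency.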

How Difficult Will it Be to Implement?

Difficulty Level: Medium to High

The primary complexity does not lie in creating the Keycloak resources, but rather in handling the distributed state and concurrency in your HA deployment without relying on a message broker.

  • Concurrency Management: Since you cannot use messaging or sticky sessions (routing to a single container), you will need to implement distributed locking. You can achieve this using tools you likely already have in your stack:

    • Database-level locks: E.g., pessimistic locking or Postgres advisory locks if mod-roles-keycloak shares a database.

    • Redis/Hazelcast distributed locks: Using something like Redisson if Redis is available in the FOLIO environment.

    • Without locks, Keycloak's Admin API will likely return 409 Conflict or raise concurrent modification errors when multiple pods try to update the same policy simultaneously. You will need robust retries with exponential backoff to handle these 409s gracefully. Even with retries, parallel updates to the same entity can still lose data.

  • API Interactions: Implementing the "check if exists, create if not, link" logic requires multiple REST calls to the Keycloak Admin API. This makes the assignment process stateful and relatively slow. Bulk operations or Keycloak extensions might be needed if assignment performance becomes a bottleneck.
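The retry-with-backoff handling mentioned above could look roughly like the following. This is a sketch, not the module's actual code: ConflictError is a hypothetical stand-in for an HTTP 409 from the Admin API, and the attempt count and delays are arbitrary. The operation passed in should repeat the whole GET -> modify -> PUT cycle, so that every retry works on fresh state.

```python
import random
import time


class ConflictError(Exception):
    """Hypothetical stand-in for an HTTP 409 Conflict from the Keycloak Admin API."""


def with_backoff(operation, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call operation(); on ConflictError wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConflictError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the conflict to the caller
            delay = base_delay * (2 ** attempt)
            # Random jitter keeps pods from retrying in lockstep.
            sleep(delay + random.uniform(0, delay))


# Usage: a fake update that conflicts twice before succeeding.
calls = {"count": 0}


def flaky_update():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConflictError("409 Conflict")
    return "updated"


delays = []  # record the waits instead of actually sleeping
result = with_backoff(flaky_update, sleep=delays.append)
print(result, calls["count"])
```

Retries only help when Keycloak actually reports the conflict. A silent overwrite produces no 409, so backoff alone does not prevent lost updates.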

Fail-Safe Policy Assignment Options

1. Database as a Queue (The Outbox / Inbox Pattern)

Instead of updating Keycloak synchronously during the HTTP request, the HA containers write the intent to update into a database table, and a background process handles the actual Keycloak API calls.

How it works:

  • A request comes into Pod A to assign a user to a permission.

  • Pod A saves this assignment to its local DB (e.g., table keycloak_sync_tasks) and immediately returns a success response to the caller.

  • A scheduled background worker polls this table. To prevent multiple pods from processing the same tasks, you use a DB lock mechanism (like SELECT ... FOR UPDATE SKIP LOCKED in PostgreSQL).

  • The worker groups tasks by permission and updates Keycloak sequentially, completely avoiding concurrent modification conflicts.

Pros: API responses return quickly because no Keycloak call happens in the request path. Race conditions in Keycloak are avoided because updates are applied sequentially. Bulk updates are handled naturally by grouping tasks per permission.

Cons: Introduces eventual consistency. There will be a short delay (milliseconds to seconds) between the user getting the role and Keycloak actually enforcing it.
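The worker's grouping step can be sketched as follows. The column names and status value in the query are assumptions (only the table name keycloak_sync_tasks comes from the text), the SQL is shown as a string and not executed, and a plain dict stands in for Keycloak.

```python
from collections import defaultdict

# Illustrative PostgreSQL claim query (hypothetical columns); not executed here.
CLAIM_SQL = """
SELECT id, permission_id, policy_id
FROM keycloak_sync_tasks
WHERE status = 'PENDING'
ORDER BY id
LIMIT 100
FOR UPDATE SKIP LOCKED
"""


def group_by_permission(tasks):
    """Collect the policy ids to add, keyed by permission, preserving task order."""
    grouped = defaultdict(list)
    for task in tasks:
        policies = grouped[task["permission_id"]]
        if task["policy_id"] not in policies:
            policies.append(task["policy_id"])
    return dict(grouped)


def apply_batch(tasks, keycloak):
    """Apply one read-modify-write per permission, one permission after another."""
    for permission_id, policy_ids in group_by_permission(tasks).items():
        current = keycloak.get(permission_id, [])
        keycloak[permission_id] = current + [p for p in policy_ids if p not in current]


# Usage with an in-memory stand-in for Keycloak's permission -> policies mapping.
tasks = [
    {"id": 1, "permission_id": "perm-users-get", "policy_id": "policy-role-a"},
    {"id": 2, "permission_id": "perm-users-get", "policy_id": "policy-role-b"},
    {"id": 3, "permission_id": "perm-orders-post", "policy_id": "policy-role-a"},
]
keycloak = {"perm-users-get": ["policy-role-x"]}
apply_batch(tasks, keycloak)
print(keycloak)
```

Grouping turns three assignments into two Keycloak updates here. The saving grows with the number of queued tasks that target the same permission.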

2. The Custom Keycloak SPI (Atomic Backend Operation)

Instead of forcing mod-roles-keycloak to handle the GET -> PUT logic over HTTP, you can write a lightweight custom Keycloak REST API Extension (SPI).

How it works: You deploy a custom JAR to Keycloak that exposes a new endpoint, e.g., POST /realms/{realm}/custom-authz/permissions/{permId}/policies/{policyId}.

Why it fixes the problem: Inside this Java code running within Keycloak, you can utilize Keycloak's internal JPA provider to perform the update within a single database transaction. The Keycloak database will handle row-level locking natively, completely eliminating the HTTP race condition.

Difficulty: Requires Java/Keycloak development knowledge and managing a custom plugin in your Keycloak deployment.

3. Distributed Locking via the Database (Advisory Locks)

If you require Strict Consistency (Keycloak must be updated before the HTTP request returns to the user), you can use the database to serialize access to specific Keycloak resources.

How it works: PostgreSQL supports advisory locks, which are application-level locks managed by the database (MySQL offers comparable named locks through GET_LOCK).

  • Pod A and Pod B receive requests to update the same permission simultaneously.

  • Both attempt to acquire a database advisory lock using a unique identifier (e.g., a hash of the tenant_id + permission_name).

  • Pod A gets the lock, makes the REST call to Keycloak, and releases the lock.

  • Pod B waits at the database level until Pod A releases the lock, then it proceeds to fetch the latest state from Keycloak and make its update.

Pros: Strict consistency. Guarantees that only one thread across the entire HA cluster is modifying a specific Keycloak permission at any given time.

Cons: Threads block while waiting for the lock, so a slow Keycloak slows down your API responses.
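One way to derive the lock identifier is sketched below. PostgreSQL advisory lock functions take a signed 64-bit key, so the tenant and permission name are hashed and truncated to 8 bytes. The function name, the ":" separator, and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib


def advisory_lock_key(tenant_id, permission_name):
    """Map (tenant_id, permission_name) to a signed 64-bit integer usable as a lock key."""
    # The separator keeps ("ab", "c") and ("a", "bc") from hashing the same input.
    raw = f"{tenant_id}:{permission_name}".encode("utf-8")
    digest = hashlib.sha256(raw).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


# The key would then be passed to PostgreSQL, for example:
#   SELECT pg_advisory_xact_lock(<key>);   -- released when the transaction ends
#   SELECT pg_advisory_lock(<key>);        -- held until pg_advisory_unlock(<key>)
key = advisory_lock_key("diku", "/users/{id}#GET")
print(key)
```

A hash collision between two different permissions would only make their updates wait on each other unnecessarily. It would not affect correctness.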

Assignment Explanation

Keycloak requires a Read-Modify-Write cycle: you must perform a GET to fetch the permission, append the new policy ID to the JSON array in memory, and then perform a full PUT to replace the entity. If two mod-roles-keycloak pods execute this GET -> PUT cycle concurrently on a popular endpoint, you will experience Lost Updates (Pod A overwrites Pod B's addition). Furthermore, sending a PUT payload with hundreds of policy IDs for every assignment is heavy and inefficient.
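The lost update can be reproduced with a small in-memory simulation. FakeKeycloak is a hypothetical stand-in that only mimics the full-replace PUT semantics. The interleaving is fixed by hand so the outcome is deterministic.

```python
import copy


class FakeKeycloak:
    """In-memory stand-in: GET returns a copy, PUT replaces the whole entity."""

    def __init__(self):
        self.permissions = {"perm-users-get": {"policies": ["policy-1"]}}

    def get(self, permission_id):
        return copy.deepcopy(self.permissions[permission_id])

    def put(self, permission_id, representation):
        self.permissions[permission_id] = copy.deepcopy(representation)


kc = FakeKeycloak()

# Both pods read the permission before either one writes.
pod_a_view = kc.get("perm-users-get")
pod_b_view = kc.get("perm-users-get")

# Each pod appends its own policy id to its private copy.
pod_a_view["policies"].append("policy-A")
pod_b_view["policies"].append("policy-B")

# Pod B writes first, then Pod A replaces the entity with its stale copy.
kc.put("perm-users-get", pod_b_view)
kc.put("perm-users-get", pod_a_view)

final = kc.get("perm-users-get")["policies"]
print(final)  # ['policy-1', 'policy-A'] -- policy-B was silently lost
```

Neither PUT fails in this scenario, which is why the problem cannot be detected from response codes alone.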

Migration Approach

Migrating from a large number of 1-1-1 permissions to the new aggregated model on a live production system requires careful planning.

Suggested Migration Strategy: Blue/Green or Phased Migration

  • Preparation Phase (Code Delivery)

    • Deploy the updated mgr-tenant-entitlements to start generating the new permission placeholders (scopes, resources, and empty permissions) alongside the existing structure.

    • Deploy the updated mod-roles-keycloak with dual-write/read capabilities, or feature-flagged to use the old approach until ready.

  • Data Migration (Background Job)

    • Create a migration job within mod-roles-keycloak (executed via API trigger) that iterates through existing tenants.

    • For each tenant, it reads the old 1-1-1 permissions, maps the users/roles, creates the new aggregated policies, and links them to the new placeholders.

    • Crucial: This must be done in batches during off-peak hours to avoid overloading the Keycloak database.

  • Cutover Phase

    • Flip a configuration toggle (feature flag) in mod-roles-keycloak to start using the new permission structure for all new assignments.

  • Cleanup Phase

    • Once the new system is verified and stable, run a cleanup script to delete the legacy 1-1-1 permissions and policies. The performance gains are only fully realized after this step.
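The core of the data migration step can be sketched as a grouping operation. This sketch assumes each legacy 1-1-1 permission links exactly one resource, one scope, and one policy, and that the aggregated model keys permissions by resource#scope. The field names are hypothetical.

```python
from collections import defaultdict


def consolidate(legacy_permissions):
    """Group legacy (resource, scope, policy) entries into one permission per resource#scope."""
    aggregated = defaultdict(list)
    for perm in legacy_permissions:
        key = f'{perm["resource"]}#{perm["scope"]}'
        if perm["policy"] not in aggregated[key]:
            aggregated[key].append(perm["policy"])
    return dict(aggregated)


def batches(items, size):
    """Yield fixed-size slices so the migration can pause between Keycloak calls."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Usage with hypothetical legacy data.
legacy = [
    {"resource": "/users/{id}", "scope": "GET", "policy": "policy-role-admin"},
    {"resource": "/users/{id}", "scope": "GET", "policy": "policy-user-jdoe"},
    {"resource": "/users/{id}", "scope": "PUT", "policy": "policy-role-admin"},
]
print(consolidate(legacy))
print(list(batches(list(range(5)), 2)))
```

In this example, three legacy permissions collapse into two aggregated ones. Processing the aggregated permissions batch by batch limits the load placed on the Keycloak database at any one time.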

Plain and consolidated authorization permissions do not conflict and can be deployed side by side.

Authorization requests will behave as expected during the transition.

After migrating to the consolidated approach, start the cleanup job.

Implementation Approach

Current approach

New Approach (Outbox)

New Approach (Advisory Lock)

Conclusion

  1. Performance test results show faster authorization requests and faster population and removal of realms.

  2. Distributed Locking via the Database (Advisory Locks) is the chosen approach for ensuring data consistency in Keycloak, as discussed at the Weekly Eureka Architecture Meeting.