MODROLESKC-380 Spike - Technical implementation plan for cleanup of orphaned records and keycloak permission reuse

MODROLESKC-380 Spike - Technical implementation plan for cleanup of orphaned records and keycloak permission reuse

https://folio-org.atlassian.net/browse/MODROLESKC-380

Executive Summary

When roles are deleted in FOLIO/Keycloak, their associated role policies, scope permissions, and resource permissions are not removed from Keycloak's authorization tables. Over time this creates significant bloat — thousands of orphaned records that degrade performance, complicate debugging, and may cause unexpected authorization behaviour. This document defines three alternative technical approaches for cleaning up these orphaned records across all realms, compares their trade-offs, and provides low-level designs that can be translated into user stories. The cleanup of orphaned records going forward (i.e., at role-deletion time) is out of scope and will be handled in a separate story.

Background: How Orphaned Records Are Created

In mod-roles-keycloak, when a capability or capability set is assigned to a role, the module creates corresponding Keycloak authorization objects:

  • A role policy (type role) referencing the Keycloak role ID, named Policy for role: {roleId}.

  • Scope permissions and/or resource permissions that reference the role policy via the associated_policy table.

  • Supporting rows in policy_config, scope_policy, resource_policy, etc.

When a role is deleted, Keycloak removes the role itself from the keycloak_role table but does not cascade to the authorization objects. The role policy's policy_config still contains the old role UUID in a JSON blob, the permissions still reference the dead policy, and all these rows remain permanently.

Approach 1: Direct SQL Cleanup (Current Script)

Description

A PostgreSQL script that identifies dead role policies by parsing the JSON in policy_config.value, cross-referencing against keycloak_role, then deletes all orphaned policies and their dependent permissions in the correct FK order.

How It Works

  1. Extract all role UUIDs referenced in policy_config for policies of type role.

  2. LEFT JOIN against keycloak_role — if zero referenced roles still exist, the policy is "dead."

  3. Find permissions (type scope or resource) where all associated policies are dead.

  4. Delete in FK order: resource_server_perm_ticketscope_policyresource_policyassociated_policypolicy_configresource_server_policy.

Script

-- ==================================================================== -- COMMIT (GUI-safe): performs deletions and commits. -- -- Use in tools like DBeaver. -- -- Before running this in production: -- 1) Run the DRY RUN script and review the preview list -- 2) Backup / snapshot -- 3) Run in a maintenance window (deletes + locks) -- -- Optional scope: -- - Edit tmp_params.client_id_like to limit by Keycloak client_id -- Example: '%-application' or 'diku%-application' -- ==================================================================== -- SAFETY LATCH: -- To run this script you must edit the next line to: -- IF 'YES' <> 'YES' THEN ... DO $$ BEGIN IF 'CHANGE_ME_TO_YES' <> 'YES' THEN RAISE EXCEPTION 'Refusing to run: set the safety latch to YES in delete_dead_role_policies_and_permissions_commit.sql'; END IF; END $$; BEGIN; CREATE TEMP TABLE tmp_params ( client_id_like text NOT NULL ) ON COMMIT DROP; -- Default: scan everything -- Default (safer): scan only *-application clients INSERT INTO tmp_params (client_id_like) VALUES ('%-application'); CREATE TEMP TABLE tmp_role_policy_role_refs ( role_policy_id varchar(36) NOT NULL, role_policy_name varchar(255) NOT NULL, resource_server_id varchar(36) NOT NULL, client_id varchar(255) NOT NULL, realm_name varchar(255), role_id varchar(36) NOT NULL ) ON COMMIT DROP; INSERT INTO tmp_role_policy_role_refs (role_policy_id, role_policy_name, resource_server_id, client_id, realm_name, role_id) SELECT rsp.id AS role_policy_id, rsp.name AS role_policy_name, rsp.resource_server_id, c.client_id, r.name AS realm_name, m[1] AS role_id FROM resource_server_policy rsp JOIN policy_config pc ON pc.policy_id = rsp.id AND pc.name = 'roles' JOIN client c ON c.id = rsp.resource_server_id AND c.client_id LIKE (SELECT client_id_like FROM tmp_params) LEFT JOIN realm r ON r.id = c.realm_id CROSS JOIN LATERAL regexp_matches( COALESCE(pc.value, ''), '"id"\s*:\s*"([0-9a-fA-F-]{36})"', 'g' ) AS m WHERE rsp.type = 'role'; CREATE TEMP TABLE tmp_dead_role_policies ( role_policy_id varchar(36) PRIMARY KEY, role_policy_name varchar(255) NOT NULL, resource_server_id varchar(36) NOT NULL, client_id varchar(255) NOT NULL, realm_name varchar(255), role_ids_total bigint NOT NULL, role_ids_existing bigint NOT NULL ) ON COMMIT DROP; INSERT INTO tmp_dead_role_policies (role_policy_id, role_policy_name, resource_server_id, client_id, realm_name, role_ids_total, role_ids_existing) SELECT rpr.role_policy_id, rpr.role_policy_name, rpr.resource_server_id, rpr.client_id, rpr.realm_name, COUNT(DISTINCT rpr.role_id)::bigint AS role_ids_total, COUNT(DISTINCT rpr.role_id) FILTER (WHERE kr.id IS NOT NULL)::bigint AS role_ids_existing FROM tmp_role_policy_role_refs rpr LEFT JOIN keycloak_role kr ON kr.id = rpr.role_id GROUP BY rpr.role_policy_id, rpr.role_policy_name, rpr.resource_server_id, rpr.client_id, rpr.realm_name HAVING COUNT(DISTINCT rpr.role_id) > 0 AND COUNT(DISTINCT rpr.role_id) FILTER (WHERE kr.id IS NOT NULL) = 0; CREATE TEMP TABLE tmp_dead_permissions ( permission_id varchar(36) PRIMARY KEY, permission_name varchar(255) NOT NULL, permission_type varchar(255) NOT NULL, resource_server_id varchar(36) NOT NULL, client_id varchar(255) NOT NULL, realm_name varchar(255) ) ON COMMIT DROP; INSERT INTO tmp_dead_permissions (permission_id, permission_name, permission_type, resource_server_id, client_id, realm_name) SELECT perm.id AS permission_id, perm.name AS permission_name, perm.type AS permission_type, perm.resource_server_id, c.client_id, r.name AS realm_name FROM resource_server_policy perm JOIN client c ON c.id = perm.resource_server_id AND c.client_id LIKE (SELECT client_id_like FROM tmp_params) LEFT JOIN realm r ON r.id = c.realm_id WHERE perm.type IN ('scope', 'resource') AND EXISTS ( SELECT 1 FROM associated_policy ap JOIN tmp_dead_role_policies drp ON drp.role_policy_id = ap.associated_policy_id AND drp.resource_server_id = perm.resource_server_id WHERE ap.policy_id = perm.id ) AND NOT EXISTS ( SELECT 1 FROM associated_policy ap LEFT JOIN tmp_dead_role_policies drp ON drp.role_policy_id = ap.associated_policy_id AND drp.resource_server_id = perm.resource_server_id WHERE ap.policy_id = perm.id AND drp.role_policy_id IS NULL ); CREATE TEMP TABLE tmp_policies_to_delete ( resource_server_id varchar(36) NOT NULL, realm_name varchar(255), client_id varchar(255) NOT NULL, policy_id varchar(36) PRIMARY KEY, policy_type varchar(255) NOT NULL, policy_name varchar(255) NOT NULL ) ON COMMIT DROP; INSERT INTO tmp_policies_to_delete (resource_server_id, realm_name, client_id, policy_id, policy_type, policy_name) SELECT rsp.resource_server_id, r.name AS realm_name, c.client_id, rsp.id AS policy_id, rsp.type AS policy_type, rsp.name AS policy_name FROM ( SELECT drp.resource_server_id, drp.role_policy_id AS policy_id FROM tmp_dead_role_policies drp UNION SELECT dp.resource_server_id, dp.permission_id AS policy_id FROM tmp_dead_permissions dp ) ids JOIN resource_server_policy rsp ON rsp.id = ids.policy_id AND rsp.resource_server_id = ids.resource_server_id JOIN client c ON c.id = rsp.resource_server_id AND c.client_id LIKE (SELECT client_id_like FROM tmp_params) LEFT JOIN realm r ON r.id = c.realm_id WHERE rsp.type IN ('role', 'scope', 'resource'); -- ===================== -- PREVIEW OUTPUTS -- (keep these; they will appear in DBeaver before the delete) -- ===================== SELECT realm_name AS "Realm", client_id AS "Client", role_policy_name AS "Dead Role Policy", role_policy_id AS "Role Policy ID", role_ids_total AS "Role IDs (total)", role_ids_existing AS "Role IDs (existing)" FROM tmp_dead_role_policies ORDER BY realm_name NULLS LAST, client_id, role_policy_name; SELECT COUNT(*) FILTER (WHERE policy_type = 'role') AS "Role Policies", COUNT(*) FILTER (WHERE policy_type IN ('scope','resource')) AS "Permissions", COUNT(*) AS "Total Policies" FROM tmp_policies_to_delete; -- ===================== -- DELETE (ordered by FKs) -- ===================== DELETE FROM resource_server_perm_ticket t USING tmp_policies_to_delete d WHERE t.policy_id = d.policy_id; DELETE FROM scope_policy sp USING tmp_policies_to_delete d WHERE sp.policy_id = d.policy_id; DELETE FROM resource_policy rp USING tmp_policies_to_delete d WHERE rp.policy_id = d.policy_id; DELETE FROM associated_policy ap USING tmp_policies_to_delete d WHERE ap.policy_id = d.policy_id OR ap.associated_policy_id = d.policy_id; DELETE FROM policy_config pc USING tmp_policies_to_delete d WHERE pc.policy_id = d.policy_id; DELETE FROM resource_server_policy rsp USING tmp_policies_to_delete d WHERE rsp.id = d.policy_id; COMMIT;

Advantages

  • Fast, efficient, operates across all realms/clients in a single transaction.

  • Already implemented and tested.

  • Includes safety latch, dry-run mode, and preview output.

Disadvantages

  • Bypasses Keycloak's Infinispan authorization cache — requires a Keycloak restart or cache clear after execution to avoid stale cache entries.​

  • Tight coupling to Keycloak's internal DB schema; schema changes in future Keycloak versions could break the script.

  • Requires direct database access and a maintenance window (locks during delete).

  • Does not use any Keycloak-supported API.

Approach 2: Keycloak Admin REST API via Bash Script

Description

A bash/curl script that iterates over all realms and *-application clients, lists policies and permissions via the Keycloak Admin REST API, identifies orphans, and deletes them one by one through the API.

Relevant API Endpoints

The Keycloak Admin REST API provides the following endpoints for authorisation management:

Operation

Endpoint

Operation

Endpoint

List realms

GET /admin/realms

List clients

GET /admin/realms/{realm}/clients?clientId={pattern}&first=0&max=1000

List policies

GET /admin/realms/{realm}/clients/{client-uuid}/authz/resource-server/policy?type=role&first=0&max=1000

Get policy detail

GET /admin/realms/{realm}/clients/{client-uuid}/authz/resource-server/policy/{policy-id}

List permissions

GET /admin/realms/{realm}/clients/{client-uuid}/authz/resource-server/permission?first=0&max=1000

Get permission associated policies

GET /admin/realms/{realm}/clients/{client-uuid}/authz/resource-server/policy/{perm-id}/associatedPolicies

Delete policy/permission

DELETE /admin/realms/{realm}/clients/{client-uuid}/authz/resource-server/policy/{policy-id}

List realm roles

GET /admin/realms/{realm}/roles

Script Logic (Pseudocode)

bash #!/bin/bash # 1. Obtain admin token TOKEN=$(curl -s -d "client_id=admin-cli" -d "username=$ADMIN" \ -d "password=$PASS" -d "grant_type=password" \ "$KC_URL/realms/master/protocol/openid-connect/token" | jq -r '.access_token') # 2. For each realm for REALM in $(curl -s -H "Authorization: Bearer $TOKEN" "$KC_URL/admin/realms" | jq -r '.[].realm'); do # 3. Get all existing role IDs in this realm ROLE_IDS=$(curl -s -H "Authorization: Bearer $TOKEN" \ "$KC_URL/admin/realms/$REALM/roles" | jq -r '.[].id') # 4. For each *-application client for CLIENT_UUID in $(curl -s -H "Authorization: Bearer $TOKEN" \ "$KC_URL/admin/realms/$REALM/clients?max=1000" \ | jq -r '.[] | select(.clientId | endswith("-application")) | .id'); do # 5. List all role-type policies POLICIES=$(curl -s -H "Authorization: Bearer $TOKEN" \ "$KC_URL/admin/realms/$REALM/clients/$CLIENT_UUID/authz/resource-server/policy?type=role&first=0&max=5000") # 6. For each policy, check if referenced roles still exist for POLICY_ID in $(echo "$POLICIES" | jq -r '.[].id'); do POLICY_DETAIL=$(curl -s -H "Authorization: Bearer $TOKEN" \ "$KC_URL/admin/realms/$REALM/clients/$CLIENT_UUID/authz/resource-server/policy/role/$POLICY_ID") REFERENCED_ROLES=$(echo "$POLICY_DETAIL" | jq -r '.roles[].id') ALL_DEAD=true for RID in $REFERENCED_ROLES; do if echo "$ROLE_IDS" | grep -q "$RID"; then ALL_DEAD=false; break; fi done if $ALL_DEAD; then # 7. Find and delete dependent permissions first # ... (check associatedPolicies for each permission) # 8. Delete the orphaned policy curl -s -X DELETE -H "Authorization: Bearer $TOKEN" \ "$KC_URL/admin/realms/$REALM/clients/$CLIENT_UUID/authz/resource-server/policy/$POLICY_ID" fi done done done

Advantages

  • Uses Keycloak's supported API — Keycloak handles cache invalidation automatically.

  • No direct DB access required; works through the admin endpoint.

  • Safe for clustered deployments — no cache coherence issues.

  • Can be run from any machine with network access to Keycloak admin.

Disadvantages

  • Slow: No bulk delete API exists; each policy/permission must be deleted individually via HTTP.​

  • For environments with thousands of orphaned records across many realms, this can take minutes to hours.

  • Token refresh logic needed for long-running scripts.

Approach 3: mod-roles-keycloak Asynchronous Cluster-Safe Cleanup Task

Description

Embed the cleanup logic directly into mod-roles-keycloak as an asynchronous, cluster-safe background job. The module already uses the Keycloak Admin Client Java library to manage roles and authorization policies. The cleanup runs after startup in the background and uses a distributed lock so that, in a multi-instance deployment, only one instance performs the cleanup at a time.

Execution model

  • The job is triggered via Spring Boot @Scheduled or ApplicationReadyEvent but immediately offloads work to a separate executor (@Async or custom TaskExecutor) so module startup is not blocked.​

  • To avoid duplicate execution when multiple instances of mod-roles-keycloak are running, a distributed lock is used, backed by the shared PostgreSQL database. Libraries like ShedLock are a good fit: they use a lock table and ensure a given job runs only once across the cluster.

High-level flow

  1. On startup, mod-roles-keycloak schedules OrphanedPolicyCleanupJob with @Scheduled (run once shortly after start, or at a low-traffic cron time).

  2. OrphanedPolicyCleanupJob attempts to acquire a distributed lock (e.g., ORPHANED_POLICY_CLEANUP) in the shared DB.

  3. If the lock is acquired, the job runs asynchronously:

    • For each entitled realm/tenant:

      • Locate the *-application client.

      • Enumerate Keycloak role policies and referenced roles via Keycloak Admin Client.

      • Detect policies whose referenced roles no longer exist.

      • Find dependent permissions whose associated policies are all dead.

      • Delete dependent permissions first, then delete dead role policies.

  4. If the lock is not acquired, the instance simply skips the run (another pod is already doing the cleanup).

  5. On completion, the job releases the lock and logs metrics (per realm counts, duration).

Cluster-safe implementation sketch (ShedLock-style)

java @Configuration @EnableScheduling @EnableAsync public class SchedulingConfig { @Bean public TaskExecutor cleanupExecutor() { ThreadPoolTaskExecutor exec = new ThreadPoolTaskExecutor(); exec.setCorePoolSize(2); exec.setMaxPoolSize(4); exec.setThreadNamePrefix("orphan-cleanup-"); exec.initialize(); return exec; } // If using ShedLock with JDBC // @Bean // public LockProvider lockProvider(DataSource dataSource) { // return new JdbcTemplateLockProvider( // JdbcTemplateLockProvider.Configuration.builder() // .withJdbcTemplate(new JdbcTemplate(dataSource)) // .usingDbTime() // to avoid clock skew // .build() // ); // } }
java @Service @RequiredArgsConstructor @Slf4j public class OrphanedPolicyCleanupJob { private final OrphanedPolicyCleanupService cleanupService; private final TaskExecutor cleanupExecutor; // run once after startup or via cron, but don't block the caller @Scheduled(initialDelayString = "PT2M", fixedDelayString = "PT24H") public void scheduleCleanup() { cleanupExecutor.execute(this::runClusterSafeCleanup); } // If using ShedLock: // @SchedulerLock(name = "ORPHANED_POLICY_CLEANUP", lockAtMostFor = "PT1H", lockAtLeastFor = "PT1M") private void runClusterSafeCleanup() { try { log.info("Starting orphaned policy cleanup (cluster-safe)..."); cleanupService.cleanupAllRealms(); log.info("Finished orphaned policy cleanup."); } catch (Exception e) { log.error("Orphaned policy cleanup failed", e); } } }

Advantages

  • Non-blocking startup: cleanup runs asynchronously; mod-roles-keycloak can start serving requests immediately.

  • Cluster-safe: distributed lock ensures that even with multiple instances/pods, cleanup runs only once per scheduled window.

  • Cache-safe: all operations go through the Keycloak Admin Client, so Keycloak handles cache invalidation automatically.​

  • Configurable: schedule, enable/disable flag, and lock timeouts can be adjusted per environment.

Disadvantages

  • Requires a code change and release of mod-roles-keycloak.

  • Adds startup time; needs careful error handling so a cleanup failure does not block module startup.

  • Must handle pagination for realms/clients with very large numbers of policies.

  • Performance depends on Keycloak Admin API throughput (same per-item limitation as Approach 2).

Comparison Matrix

Criteria

Approach 1: SQL Script

Approach 2: Bash/API

Approach 3: mod-roles-keycloak

Criteria

Approach 1: SQL Script

Approach 2: Bash/API

Approach 3: mod-roles-keycloak

Speed for bulk cleanup

Very fast (single transaction)

Slow (per-item HTTP)

Moderate (per-item via Java client)

Cache safety

Requires restart/clear

Automatic

Automatic

Direct DB access needed

Yes

No

No

Maintenance window

Required

Not required

Not required

Schema coupling

High (breaks on KC upgrade)

None (uses API)

None (uses API)

Automation potential

Low (manual trigger)

Medium (cron/scheduled)

High (startup/scheduled)

Code change required

No

No (script only)

Yes (mod-roles-keycloak PR)

Audit logging

Manual (SQL output)

Script logging

Application logging

Multi-realm support

Yes (all at once)

Yes (iterates)

Yes (iterates entitled realms)

Conclusion

The analysis considered three implementation options for cleaning up orphaned Keycloak authorization records: a direct SQL script, a bash script using the Keycloak Admin REST API, and an integrated cleanup task inside mod-roles-keycloak. Each option addresses the same functional goal but differs in coupling, operational complexity, and impact on existing services.

For the FOLIO deployment, the bash script using the Keycloak Admin REST API is the most pragmatic solution. It runs outside of the module lifecycle, so it does not affect mod-roles-keycloak startup time, does not require feature flags or additional configuration toggles, and does not introduce new runtime behaviours into existing modules. It leverages Keycloak’s supported admin endpoints, which ensures proper cache invalidation and avoids tight coupling to the internal Keycloak database schema, at the cost of a slower, per-item deletion process. Operationally, the script can be executed on demand (for example, as part of a maintenance runbook) whenever orphaned records need to be cleaned, without any code changes or module redeployments.