MODROLESKC-380 Spike - Technical implementation plan for cleanup of orphaned records and keycloak permission reuse
https://folio-org.atlassian.net/browse/MODROLESKC-380
- 1 Executive Summary
- 2 Background: How Orphaned Records Are Created
- 3 Approach 1: Direct SQL Cleanup (Current Script)
- 3.1 Description
- 3.2 How It Works
- 3.3 Script
- 3.4 Advantages
- 3.5 Disadvantages
- 4 Approach 2: Keycloak Admin REST API via Bash Script
- 5 Script Logic (Pseudocode)
- 5.1 Advantages
- 5.2 Disadvantages
- 6 Approach 3: mod-roles-keycloak Asynchronous Cluster-Safe Cleanup Task
- 7 Comparison Matrix
- 8 Conclusion
Executive Summary
When roles are deleted in FOLIO/Keycloak, their associated role policies, scope permissions, and resource permissions are not removed from Keycloak's authorization tables. Over time this creates significant bloat — thousands of orphaned records that degrade performance, complicate debugging, and may cause unexpected authorization behaviour. This document defines three alternative technical approaches for cleaning up these orphaned records across all realms, compares their trade-offs, and provides low-level designs that can be translated into user stories. The cleanup of orphaned records going forward (i.e., at role-deletion time) is out of scope and will be handled in a separate story.
Background: How Orphaned Records Are Created
In mod-roles-keycloak, when a capability or capability set is assigned to a role, the module creates corresponding Keycloak authorization objects:
A role policy (type
role) referencing the Keycloak role ID, namedPolicy for role: {roleId}.Scope permissions and/or resource permissions that reference the role policy via the
associated_policytable.Supporting rows in
policy_config,scope_policy,resource_policy, etc.
When a role is deleted, Keycloak removes the role itself from the keycloak_role table but does not cascade to the authorization objects. The role policy's policy_config still contains the old role UUID in a JSON blob, the permissions still reference the dead policy, and all these rows remain permanently.
Approach 1: Direct SQL Cleanup (Current Script)
Description
A PostgreSQL script that identifies dead role policies by parsing the JSON in policy_config.value, cross-referencing against keycloak_role, then deletes all orphaned policies and their dependent permissions in the correct FK order.
How It Works
Extract all role UUIDs referenced in
policy_configfor policies of typerole.LEFT JOIN against
keycloak_role— if zero referenced roles still exist, the policy is "dead."Find permissions (type
scopeorresource) where all associated policies are dead.Delete in FK order:
resource_server_perm_ticket→scope_policy→resource_policy→associated_policy→policy_config→resource_server_policy.
Script
-- ====================================================================
-- COMMIT (GUI-safe): performs deletions and commits.
--
-- Use in tools like DBeaver.
--
-- Before running this in production:
-- 1) Run the DRY RUN script and review the preview list
-- 2) Backup / snapshot
-- 3) Run in a maintenance window (deletes + locks)
--
-- Optional scope:
-- - Edit tmp_params.client_id_like to limit by Keycloak client_id
-- Example: '%-application' or 'diku%-application'
-- ====================================================================
-- SAFETY LATCH:
-- To run this script you must edit the next line to:
-- IF 'YES' <> 'YES' THEN ...
DO $$
BEGIN
IF 'CHANGE_ME_TO_YES' <> 'YES' THEN
RAISE EXCEPTION 'Refusing to run: set the safety latch to YES in delete_dead_role_policies_and_permissions_commit.sql';
END IF;
END $$;
BEGIN;
CREATE TEMP TABLE tmp_params (
client_id_like text NOT NULL
) ON COMMIT DROP;
-- Default: scan everything
-- Default (safer): scan only *-application clients
INSERT INTO tmp_params (client_id_like) VALUES ('%-application');
CREATE TEMP TABLE tmp_role_policy_role_refs (
role_policy_id varchar(36) NOT NULL,
role_policy_name varchar(255) NOT NULL,
resource_server_id varchar(36) NOT NULL,
client_id varchar(255) NOT NULL,
realm_name varchar(255),
role_id varchar(36) NOT NULL
) ON COMMIT DROP;
INSERT INTO tmp_role_policy_role_refs (role_policy_id, role_policy_name, resource_server_id, client_id, realm_name, role_id)
SELECT
rsp.id AS role_policy_id,
rsp.name AS role_policy_name,
rsp.resource_server_id,
c.client_id,
r.name AS realm_name,
m[1] AS role_id
FROM resource_server_policy rsp
JOIN policy_config pc ON pc.policy_id = rsp.id AND pc.name = 'roles'
JOIN client c ON c.id = rsp.resource_server_id
AND c.client_id LIKE (SELECT client_id_like FROM tmp_params)
LEFT JOIN realm r ON r.id = c.realm_id
CROSS JOIN LATERAL regexp_matches(
COALESCE(pc.value, ''),
'"id"\s*:\s*"([0-9a-fA-F-]{36})"',
'g'
) AS m
WHERE rsp.type = 'role';
CREATE TEMP TABLE tmp_dead_role_policies (
role_policy_id varchar(36) PRIMARY KEY,
role_policy_name varchar(255) NOT NULL,
resource_server_id varchar(36) NOT NULL,
client_id varchar(255) NOT NULL,
realm_name varchar(255),
role_ids_total bigint NOT NULL,
role_ids_existing bigint NOT NULL
) ON COMMIT DROP;
INSERT INTO tmp_dead_role_policies (role_policy_id, role_policy_name, resource_server_id, client_id, realm_name, role_ids_total, role_ids_existing)
SELECT
rpr.role_policy_id,
rpr.role_policy_name,
rpr.resource_server_id,
rpr.client_id,
rpr.realm_name,
COUNT(DISTINCT rpr.role_id)::bigint AS role_ids_total,
COUNT(DISTINCT rpr.role_id) FILTER (WHERE kr.id IS NOT NULL)::bigint AS role_ids_existing
FROM tmp_role_policy_role_refs rpr
LEFT JOIN keycloak_role kr ON kr.id = rpr.role_id
GROUP BY rpr.role_policy_id, rpr.role_policy_name, rpr.resource_server_id, rpr.client_id, rpr.realm_name
HAVING COUNT(DISTINCT rpr.role_id) > 0
AND COUNT(DISTINCT rpr.role_id) FILTER (WHERE kr.id IS NOT NULL) = 0;
CREATE TEMP TABLE tmp_dead_permissions (
permission_id varchar(36) PRIMARY KEY,
permission_name varchar(255) NOT NULL,
permission_type varchar(255) NOT NULL,
resource_server_id varchar(36) NOT NULL,
client_id varchar(255) NOT NULL,
realm_name varchar(255)
) ON COMMIT DROP;
INSERT INTO tmp_dead_permissions (permission_id, permission_name, permission_type, resource_server_id, client_id, realm_name)
SELECT
perm.id AS permission_id,
perm.name AS permission_name,
perm.type AS permission_type,
perm.resource_server_id,
c.client_id,
r.name AS realm_name
FROM resource_server_policy perm
JOIN client c ON c.id = perm.resource_server_id
AND c.client_id LIKE (SELECT client_id_like FROM tmp_params)
LEFT JOIN realm r ON r.id = c.realm_id
WHERE perm.type IN ('scope', 'resource')
AND EXISTS (
SELECT 1
FROM associated_policy ap
JOIN tmp_dead_role_policies drp ON drp.role_policy_id = ap.associated_policy_id
AND drp.resource_server_id = perm.resource_server_id
WHERE ap.policy_id = perm.id
)
AND NOT EXISTS (
SELECT 1
FROM associated_policy ap
LEFT JOIN tmp_dead_role_policies drp ON drp.role_policy_id = ap.associated_policy_id
AND drp.resource_server_id = perm.resource_server_id
WHERE ap.policy_id = perm.id
AND drp.role_policy_id IS NULL
);
CREATE TEMP TABLE tmp_policies_to_delete (
resource_server_id varchar(36) NOT NULL,
realm_name varchar(255),
client_id varchar(255) NOT NULL,
policy_id varchar(36) PRIMARY KEY,
policy_type varchar(255) NOT NULL,
policy_name varchar(255) NOT NULL
) ON COMMIT DROP;
INSERT INTO tmp_policies_to_delete (resource_server_id, realm_name, client_id, policy_id, policy_type, policy_name)
SELECT
rsp.resource_server_id,
r.name AS realm_name,
c.client_id,
rsp.id AS policy_id,
rsp.type AS policy_type,
rsp.name AS policy_name
FROM (
SELECT drp.resource_server_id, drp.role_policy_id AS policy_id
FROM tmp_dead_role_policies drp
UNION
SELECT dp.resource_server_id, dp.permission_id AS policy_id
FROM tmp_dead_permissions dp
) ids
JOIN resource_server_policy rsp ON rsp.id = ids.policy_id
AND rsp.resource_server_id = ids.resource_server_id
JOIN client c ON c.id = rsp.resource_server_id
AND c.client_id LIKE (SELECT client_id_like FROM tmp_params)
LEFT JOIN realm r ON r.id = c.realm_id
WHERE rsp.type IN ('role', 'scope', 'resource');
-- =====================
-- PREVIEW OUTPUTS
-- (keep these; they will appear in DBeaver before the delete)
-- =====================
SELECT
realm_name AS "Realm",
client_id AS "Client",
role_policy_name AS "Dead Role Policy",
role_policy_id AS "Role Policy ID",
role_ids_total AS "Role IDs (total)",
role_ids_existing AS "Role IDs (existing)"
FROM tmp_dead_role_policies
ORDER BY realm_name NULLS LAST, client_id, role_policy_name;
SELECT
COUNT(*) FILTER (WHERE policy_type = 'role') AS "Role Policies",
COUNT(*) FILTER (WHERE policy_type IN ('scope','resource')) AS "Permissions",
COUNT(*) AS "Total Policies"
FROM tmp_policies_to_delete;
-- =====================
-- DELETE (ordered by FKs)
-- =====================
DELETE FROM resource_server_perm_ticket t
USING tmp_policies_to_delete d
WHERE t.policy_id = d.policy_id;
DELETE FROM scope_policy sp
USING tmp_policies_to_delete d
WHERE sp.policy_id = d.policy_id;
DELETE FROM resource_policy rp
USING tmp_policies_to_delete d
WHERE rp.policy_id = d.policy_id;
DELETE FROM associated_policy ap
USING tmp_policies_to_delete d
WHERE ap.policy_id = d.policy_id
OR ap.associated_policy_id = d.policy_id;
DELETE FROM policy_config pc
USING tmp_policies_to_delete d
WHERE pc.policy_id = d.policy_id;
DELETE FROM resource_server_policy rsp
USING tmp_policies_to_delete d
WHERE rsp.id = d.policy_id;
COMMIT;
Advantages
Fast, efficient, operates across all realms/clients in a single transaction.
Already implemented and tested.
Includes safety latch, dry-run mode, and preview output.
Disadvantages
Bypasses Keycloak's Infinispan authorization cache — requires a Keycloak restart or cache clear after execution to avoid stale cache entries.
Tight coupling to Keycloak's internal DB schema; schema changes in future Keycloak versions could break the script.
Requires direct database access and a maintenance window (locks during delete).
Does not use any Keycloak-supported API.
Approach 2: Keycloak Admin REST API via Bash Script
Description
A bash/curl script that iterates over all realms and *-application clients, lists policies and permissions via the Keycloak Admin REST API, identifies orphans, and deletes them one by one through the API.
Relevant API Endpoints
The Keycloak Admin REST API provides the following endpoints for authorisation management:
Operation | Endpoint |
|---|---|
List realms |
|
List clients |
|
List policies |
|
Get policy detail |
|
List permissions |
|
Get permission associated policies |
|
Delete policy/permission |
|
List realm roles |
|
Script Logic (Pseudocode)
bash
#!/bin/bash
# 1. Obtain admin token
TOKEN=$(curl -s -d "client_id=admin-cli" -d "username=$ADMIN" \
-d "password=$PASS" -d "grant_type=password" \
"$KC_URL/realms/master/protocol/openid-connect/token" | jq -r '.access_token')
# 2. For each realm
for REALM in $(curl -s -H "Authorization: Bearer $TOKEN" "$KC_URL/admin/realms" | jq -r '.[].realm'); do
# 3. Get all existing role IDs in this realm
ROLE_IDS=$(curl -s -H "Authorization: Bearer $TOKEN" \
"$KC_URL/admin/realms/$REALM/roles" | jq -r '.[].id')
# 4. For each *-application client
for CLIENT_UUID in $(curl -s -H "Authorization: Bearer $TOKEN" \
"$KC_URL/admin/realms/$REALM/clients?max=1000" \
| jq -r '.[] | select(.clientId | endswith("-application")) | .id'); do
# 5. List all role-type policies
POLICIES=$(curl -s -H "Authorization: Bearer $TOKEN" \
"$KC_URL/admin/realms/$REALM/clients/$CLIENT_UUID/authz/resource-server/policy?type=role&first=0&max=5000")
# 6. For each policy, check if referenced roles still exist
for POLICY_ID in $(echo "$POLICIES" | jq -r '.[].id'); do
POLICY_DETAIL=$(curl -s -H "Authorization: Bearer $TOKEN" \
"$KC_URL/admin/realms/$REALM/clients/$CLIENT_UUID/authz/resource-server/policy/role/$POLICY_ID")
REFERENCED_ROLES=$(echo "$POLICY_DETAIL" | jq -r '.roles[].id')
ALL_DEAD=true
for RID in $REFERENCED_ROLES; do
if echo "$ROLE_IDS" | grep -q "$RID"; then ALL_DEAD=false; break; fi
done
if $ALL_DEAD; then
# 7. Find and delete dependent permissions first
# ... (check associatedPolicies for each permission)
# 8. Delete the orphaned policy
curl -s -X DELETE -H "Authorization: Bearer $TOKEN" \
"$KC_URL/admin/realms/$REALM/clients/$CLIENT_UUID/authz/resource-server/policy/$POLICY_ID"
fi
done
done
done
Advantages
Uses Keycloak's supported API — Keycloak handles cache invalidation automatically.
No direct DB access required; works through the admin endpoint.
Safe for clustered deployments — no cache coherence issues.
Can be run from any machine with network access to Keycloak admin.
Disadvantages
Slow: No bulk delete API exists; each policy/permission must be deleted individually via HTTP.
For environments with thousands of orphaned records across many realms, this can take minutes to hours.
Token refresh logic needed for long-running scripts.
Approach 3: mod-roles-keycloak Asynchronous Cluster-Safe Cleanup Task
Description
Embed the cleanup logic directly into mod-roles-keycloak as an asynchronous, cluster-safe background job. The module already uses the Keycloak Admin Client Java library to manage roles and authorization policies. The cleanup runs after startup in the background and uses a distributed lock so that, in a multi-instance deployment, only one instance performs the cleanup at a time.
Execution model
The job is triggered via Spring Boot
@ScheduledorApplicationReadyEventbut immediately offloads work to a separate executor (@Asyncor customTaskExecutor) so module startup is not blocked.To avoid duplicate execution when multiple instances of
mod-roles-keycloakare running, a distributed lock is used, backed by the shared PostgreSQL database. Libraries like ShedLock are a good fit: they use a lock table and ensure a given job runs only once across the cluster.
High-level flow
On startup,
mod-roles-keycloakschedulesOrphanedPolicyCleanupJobwith@Scheduled(run once shortly after start, or at a low-traffic cron time).OrphanedPolicyCleanupJobattempts to acquire a distributed lock (e.g.,ORPHANED_POLICY_CLEANUP) in the shared DB.If the lock is acquired, the job runs asynchronously:
For each entitled realm/tenant:
Locate the
*-applicationclient.Enumerate Keycloak role policies and referenced roles via Keycloak Admin Client.
Detect policies whose referenced roles no longer exist.
Find dependent permissions whose associated policies are all dead.
Delete dependent permissions first, then delete dead role policies.
If the lock is not acquired, the instance simply skips the run (another pod is already doing the cleanup).
On completion, the job releases the lock and logs metrics (per realm counts, duration).
Cluster-safe implementation sketch (ShedLock-style)
java
@Configuration
@EnableScheduling
@EnableAsync
public class SchedulingConfig {
@Bean
public TaskExecutor cleanupExecutor() {
ThreadPoolTaskExecutor exec = new ThreadPoolTaskExecutor();
exec.setCorePoolSize(2);
exec.setMaxPoolSize(4);
exec.setThreadNamePrefix("orphan-cleanup-");
exec.initialize();
return exec;
}
// If using ShedLock with JDBC
// @Bean
// public LockProvider lockProvider(DataSource dataSource) {
// return new JdbcTemplateLockProvider(
// JdbcTemplateLockProvider.Configuration.builder()
// .withJdbcTemplate(new JdbcTemplate(dataSource))
// .usingDbTime() // to avoid clock skew
// .build()
// );
// }
}
java
@Service
@RequiredArgsConstructor
@Slf4j
public class OrphanedPolicyCleanupJob {
private final OrphanedPolicyCleanupService cleanupService;
private final TaskExecutor cleanupExecutor;
// run once after startup or via cron, but don't block the caller
@Scheduled(initialDelayString = "PT2M", fixedDelayString = "PT24H")
public void scheduleCleanup() {
cleanupExecutor.execute(this::runClusterSafeCleanup);
}
// If using ShedLock:
// @SchedulerLock(name = "ORPHANED_POLICY_CLEANUP", lockAtMostFor = "PT1H", lockAtLeastFor = "PT1M")
private void runClusterSafeCleanup() {
try {
log.info("Starting orphaned policy cleanup (cluster-safe)...");
cleanupService.cleanupAllRealms();
log.info("Finished orphaned policy cleanup.");
} catch (Exception e) {
log.error("Orphaned policy cleanup failed", e);
}
}
}
Advantages
Non-blocking startup: cleanup runs asynchronously;
mod-roles-keycloakcan start serving requests immediately.Cluster-safe: distributed lock ensures that even with multiple instances/pods, cleanup runs only once per scheduled window.
Cache-safe: all operations go through the Keycloak Admin Client, so Keycloak handles cache invalidation automatically.
Configurable: schedule, enable/disable flag, and lock timeouts can be adjusted per environment.
Disadvantages
Requires a code change and release of
mod-roles-keycloak.Adds startup time; needs careful error handling so a cleanup failure does not block module startup.
Must handle pagination for realms/clients with very large numbers of policies.
Performance depends on Keycloak Admin API throughput (same per-item limitation as Approach 2).
Comparison Matrix
Criteria | Approach 1: SQL Script | Approach 2: Bash/API | Approach 3: mod-roles-keycloak |
|---|---|---|---|
Speed for bulk cleanup | Very fast (single transaction) | Slow (per-item HTTP) | Moderate (per-item via Java client) |
Cache safety | Requires restart/clear | Automatic | Automatic |
Direct DB access needed | Yes | No | No |
Maintenance window | Required | Not required | Not required |
Schema coupling | High (breaks on KC upgrade) | None (uses API) | None (uses API) |
Automation potential | Low (manual trigger) | Medium (cron/scheduled) | High (startup/scheduled) |
Code change required | No | No (script only) | Yes (mod-roles-keycloak PR) |
Audit logging | Manual (SQL output) | Script logging | Application logging |
Multi-realm support | Yes (all at once) | Yes (iterates) | Yes (iterates entitled realms) |
Conclusion
The analysis considered three implementation options for cleaning up orphaned Keycloak authorization records: a direct SQL script, a bash script using the Keycloak Admin REST API, and an integrated cleanup task inside mod-roles-keycloak. Each option addresses the same functional goal but differs in coupling, operational complexity, and impact on existing services.
For the FOLIO deployment, the bash script using the Keycloak Admin REST API is the most pragmatic solution. It runs outside of the module lifecycle, so it does not affect mod-roles-keycloak startup time, does not require feature flags or additional configuration toggles, and does not introduce new runtime behaviours into existing modules. It leverages Keycloak’s supported admin endpoints, which ensures proper cache invalidation and avoids tight coupling to the internal Keycloak database schema, at the cost of a slower, per-item deletion process. Operationally, the script can be executed on demand (for example, as part of a maintenance runbook) whenever orphaned records need to be cleaned, without any code changes or module redeployments.