PTF- Testing folio-keyclock cluster mode
Overview
This test aims to validate the stability and High Availability (HA) capabilities of the folio-keycloak module running in clustered mode. Specifically, we are testing the transition from the custom JDBC_PING2 discovery protocol to the standard jdbc-ping. This is an experimental and regression test utilizing an unreleased folio-keycloak image built from a development branch.
The testing will be conducted on the MOBIUS environment, configured with 60 tenants and the MOBIUS dataset, to simulate a realistic multi-tenant Keycloak/Eureka architecture.
Testing Goals:
Baseline Establishment: Run a mixed-workload scenario (Authorization and roles creation) on the existing
JDBC_PING2setup to capture baseline metrics.A/B Comparison: Deploy the custom branch (
KEYCLOAK-111) withjdbc-pingand execute the same load. The goal is to verify cluster mode stability, confirm the new mechanism is active, and ensure no undesirable side effects are introduced.Failover & Scaling Verification: Observe how Keycloak behaves during dynamic scaling under load. This includes scaling up (adding nodes), scaling down (gracefully removing nodes), and failover testing (forcefully stopping/killing nodes) to verify HA functionality.
Defined SLAs:
Performance: 0 regression. Response times (RT) with the new
jdbc-pingconfiguration must not degrade compared to the baseline.Failover Reliability: Aiming for zero dropped connections (0% errors) or minimal, acceptable momentary spikes in 5xx errors/RT during container termination and cluster recovery.
Jira Ticket: https://folio-org.atlassian.net/browse/PERF-1387
Summary
High Availability (HA) & Failover: During the dynamic scaling and failover tests, the
jdbc-pingcluster met the SLA. When forcefully terminating a Keycloak node, the system recovered successfully, resulting in an error rate of 0% before traffic.In test 2,3,4,5 Folio-keycloak works in cluster mode as expected.
Error rate were 0% in all FailOver tests.
We conducted three test runs for each configuration and analyzed the results. The requirement of 0 regression was met. The implementation of the
Keycloak-111branch introduces no systemic degradation to response times. Furthermore, high-volume core transactions demonstrated noticeable improvements in both average response times and 90th percentile (pct90) metrics compared to the baseline.
Service Level Agreement (SLA) Compliance & A/B Comparison
Goal: 0 Regression Response times (90th percentile) on the custom branch (jdbc-ping) must not degrade compared to the JDBC_PING2 baseline.
1. Core Authentication & Role Management Lifecycle
Transaction | SLA Status |
TC_PARENT_LOGIN | ✅ Met |
TC_Create role | ✅ Met |
TC_Assign role to user | ✅ Met |
TC_Delete role | ✅ Met |
2. Failover Reliability
Goal: Minimal disruption during dynamic scaling and failover. Expectation is 0 dropped connections, or only momentary, acceptable spikes in 5xx errors while the cluster recovers.
Test Scenario | Defined SLA Target | Actual Observation | SLA Status | Notes |
Scale Up (Add 2 nodes) | 0 dropped connections / 0% error rate | [e.g., 0% errors] | ✅ Met | Cluster successfully rebalanced without dropping active sessions. See screenshot |
Scale Down (Remove 2 nodes) | 0 dropped connections / 0% error rate | [e.g., 0% errors] | ✅ Met | Existing sessions were successfully handed over. See screenshot |
Failover (Force kill 2 node, 1 by 1) | 0 dropped connections / 0% error rate | [e.g., 0% errors] | ✅ Met | Cluster successfully rebalanced. See screenshot |
Test Runs
Test run | Folio-keycloak module version | Test duration | Notes |
|---|---|---|---|
Test 1 |
| 15 minutes | Baseline test |
Test 2 |
| 15 minutes | A/B comparison test. Test 1 vs Test 2 |
Test 3 |
| 15 minutes | Failover Testing. After 7 minutes of the test, add 2 f-k containers. F-K containers at the beginning of the test - 3 |
Test 4 |
| 15 minutes | Failover Testing. After 5 minutes of the test, remove 2 f-k containers. F-K containers at the beginning of the test - 5 |
Test 5 |
| 15 minutes | Failover Testing. After 5 minutes of the test, stop 1 f-k container; after another 5 minutes, stop 2nd f-k container. F-K containers at the beginning of the test - 3 |
Load Model
Workflow 1. Authorization
Total Concurrency: 10 threads (Virtual Users).
Multi-Tenant Distribution: The load is distributed evenly across 30 distinct tenants to simulate a realistic, multi-tenant environment.
User Account: A predefined dataset of 300 unique user accounts is utilized during the test execution.
Workflow 2. Role & Access Management
Total Concurrency: 1 thread (Virtual Users).
Multi-Tenant Distribution: The load is distributed evenly across 60 distinct tenants to simulate a realistic, multi-tenant environment.
The transaction flow includes the following steps:
Create Role: Generating a new user role within the system.
Select Application: Linking the role to a specific application context.
Assign Capability-sets: Attaching specific permissions and capability-sets to the newly created role.
Assign Role to User: Mapping the configured role to a random user from the account pool.
Delete Role: Removing the role to clean up the environment and complete the lifecycle loop.
Test results
Tests | Test 1 | Test 2 | ||||
transaction | Number Of Samples | Avrg | pct90 | Number Of Samples | Avrg | pct90 |
TC_Assign capability-sets app-acquisitions-1.0.28 | 12 | 22.11 s | 25.04 s | 8 | 26.16 s | 29.13 s |
TC_Assign capability-sets app-bulk-edit-1.0.9 | 11 | 2.68 s | 3.51 s | 12 | 3.66 s | 4.25 s |
TC_Assign capability-sets app-consortia-1.2.3 | 7 | 2.12 s | 2.38 s | 10 | 2.62 s | 3.78 s |
TC_Assign capability-sets app-dcb-1.1.10 | 11 | 1.04 s | 1.74 s | 1 | 1.76 s | 1.76 s |
TC_Assign capability-sets app-edge-complete-2.0.16 | 9 | 1.16 s | 1.83 s | 13 | 1.95 s | 2.81 s |
TC_Assign capability-sets app-edge-locate-1.1.4 | 13 | 480.97 ms | 1.04 s | 3 | 739.67 ms | 1.02 s |
TC_Assign capability-sets app-fqm-1.0.15 | 7 | 1.62 s | 2.35 s | 6 | 2.51 s | 3.15 s |
TC_Assign capability-sets app-marc-migrations-2.0.5 | 4 | 997.42 ms | 1.21 s | 4 | 1.58 s | 2.86 s |
TC_Assign capability-sets app-oai-pmh-1.0.3 | 11 | 1.26 s | 1.69 s | 3 | 2.48 s | 3.28 s |
TC_Assign capability-sets app-platform-complete-2.5.6 | 6 | 50.27 s | 54.08 s | 5 | 1.11 min | 1.88 min |
TC_Assign capability-sets app-platform-minimal-2.0.51 | 8 | 8.88 s | 10.97 s | 4 | 19.29 s | 1.17 min |
TC_Assign role to user | 98 | 769.22 ms | 1.07 s | 69 | 1.47 s | 3.25 s |
TC_Create role | 99 | 681.67 ms | 1.81 s | 70 | 2.20 s | 4.21 s |
TC_Delete role | 98 | 100.28 ms | 159.40 ms | 69 | 163.41 ms | 509 ms |
TC_PARENT_LOGIN | 50147 | 182.96 ms | 289.90 ms | 55791 | 166.85 ms | 271.70 ms |
TC_Select application ${random_app_id} | 1 | 54 ms | 54 ms | 1 | 668 ms | 668 ms |
TC_Select application app-acquisitions-1.0.28 | 12 | 155.58 ms | 350 ms | 8 | 298.50 ms | 412 ms |
TC_Select application app-bulk-edit-1.0.9 | 11 | 161.27 ms | 292 ms | 12 | 259.44 ms | 473 ms |
TC_Select application app-consortia-1.2.3 | 7 | 98.78 ms | 135 ms | 10 | 166.16 ms | 386 ms |
TC_Select application app-dcb-1.1.10 | 11 | 76.89 ms | 182 ms | 1 | 90 ms | 90 ms |
TC_Select application app-edge-complete-2.0.16 | 8 | 87.84 ms | 134 ms | 13 | 166.22 ms | 357 ms |
TC_Select application app-edge-locate-1.1.4 | 13 | 116.59 ms | 299 ms | 3 | 157.17 ms | 213 ms |
TC_Select application app-fqm-1.0.15 | 7 | 71.34 ms | 152 ms | 6 | 276.04 ms | 431 ms |
TC_Select application app-marc-migrations-2.0.5 | 4 | 43.31 ms | 95 ms | 4 | 85.29 ms | 191 ms |
TC_Select application app-oai-pmh-1.0.3 | 11 | 84.28 ms | 161.10 ms | 3 | 391.44 ms | 587 ms |
TC_Select application app-platform-complete-2.5.6 | 6 | 71.31 ms | 131 ms | 5 | 152.64 ms | 199 ms |
TC_Select application app-platform-minimal-2.0.51 | 8 | 68.89 ms | 84 ms | 4 | 249.85 ms | 462 ms |
| Keycloak-111 | 26.5.1 | ||||||||||||||||
| Test 1 | |||||||||||||||||