PTF - Testing folio-keycloak cluster mode




Overview

This test aims to validate the stability and High Availability (HA) capabilities of the folio-keycloak module running in clustered mode. Specifically, we are testing the transition from the custom JDBC_PING2 discovery protocol to the standard jdbc-ping. This is an experimental and regression test utilizing an unreleased folio-keycloak image built from a development branch.

The testing will be conducted on the MOBIUS environment, configured with 60 tenants and the MOBIUS dataset, to simulate a realistic multi-tenant Keycloak/Eureka architecture.

Testing Goals:

  • Baseline Establishment: Run a mixed-workload scenario (authorization and role creation) on the existing JDBC_PING2 setup to capture baseline metrics.

  • A/B Comparison: Deploy the custom branch (KEYCLOAK-111) with jdbc-ping and execute the same load. The goal is to verify cluster mode stability, confirm the new mechanism is active, and ensure no undesirable side effects are introduced.

  • Failover & Scaling Verification: Observe how Keycloak behaves during dynamic scaling under load. This includes scaling up (adding nodes), scaling down (gracefully removing nodes), and failover testing (forcefully stopping/killing nodes) to verify HA functionality.

Defined SLAs:

  • Performance: 0 regression. Response times (RT) with the new jdbc-ping configuration must not degrade compared to the baseline.

  • Failover Reliability: Aiming for zero dropped connections (0% errors) or minimal, acceptable momentary spikes in 5xx errors/RT during container termination and cluster recovery.

Jira Ticket: https://folio-org.atlassian.net/browse/PERF-1387

 

Summary

  1. High Availability (HA) & Failover: During the dynamic scaling and failover tests, the jdbc-ping cluster met the SLA. When a Keycloak node was forcefully terminated, the system recovered successfully with a 0% error rate.

  2. In Tests 2, 3, 4, and 5, folio-keycloak worked in cluster mode as expected.

  3. The error rate was 0% in all failover tests.

  4. We conducted three test runs for each configuration and analyzed the results. The 0-regression requirement was met: the KEYCLOAK-111 branch introduces no systemic degradation in response times. Furthermore, high-volume core transactions showed improvements in both average response time and 90th percentile (pct90) compared to the baseline; for example, TC_PARENT_LOGIN improved from 182.96 ms to 166.85 ms on average and from 289.90 ms to 271.70 ms at pct90 between Test 1 and Test 2.



Service Level Agreement (SLA) Compliance & A/B Comparison

Goal: 0 regression. Response times (90th percentile) on the custom branch (jdbc-ping) must not degrade compared to the JDBC_PING2 baseline.

1. Core Authentication & Role Management Lifecycle

| Transaction | SLA Status |
|---|---|
| TC_PARENT_LOGIN | Met |
| TC_Create role | Met |
| TC_Assign role to user | Met |
| TC_Delete role | Met |
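The 0-regression goal can be expressed as a simple pct90 comparison per transaction. The following is a minimal sketch, not the tooling used for this report: the function name `pct90_regressions` and the `tolerance_pct` parameter are assumptions introduced for illustration, and the sample values are the TC_PARENT_LOGIN pct90 figures from the Test 1 and Test 2 results.

```python
def pct90_regressions(baseline_ms, candidate_ms, tolerance_pct=0.0):
    """Return transactions whose candidate pct90 is worse than the baseline.

    Both arguments map a transaction name to its pct90 in milliseconds.
    tolerance_pct is a hypothetical allowance; 0.0 means strict 0 regression.
    """
    regressions = {}
    for name, base in baseline_ms.items():
        cand = candidate_ms.get(name)
        if cand is None:
            continue  # transaction missing from the candidate run
        change_pct = (cand - base) / base * 100.0
        if change_pct > tolerance_pct:
            regressions[name] = round(change_pct, 2)
    return regressions


# pct90 values for TC_PARENT_LOGIN taken from the Test 1 / Test 2 results
baseline = {"TC_PARENT_LOGIN": 289.90}   # JDBC_PING2
candidate = {"TC_PARENT_LOGIN": 271.70}  # jdbc-ping
print(pct90_regressions(baseline, candidate))
```

An empty result means no transaction in the input exceeded its baseline pct90. Low-sample transactions are noisy, so in practice such a check is more meaningful when applied to values aggregated over several runs.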

2. Failover Reliability

Goal: Minimal disruption during dynamic scaling and failover. Expectation is 0 dropped connections, or only momentary, acceptable spikes in 5xx errors while the cluster recovers.

| Test Scenario | Defined SLA Target | Actual Observation | SLA Status | Notes |
|---|---|---|---|---|
| Scale Up (add 2 nodes) | 0 dropped connections / 0% error rate | 0% errors | Met | Cluster successfully rebalanced without dropping active sessions. See screenshot |
| Scale Down (remove 2 nodes) | 0 dropped connections / 0% error rate | 0% errors | Met | Existing sessions were successfully handed over. See screenshot |
| Failover (force-kill 2 nodes, one at a time) | 0 dropped connections / 0% error rate | 0% errors | Met | Cluster successfully rebalanced. See screenshot |
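For reference, the error-rate figure behind the failover SLA can be computed from per-request HTTP status codes. This is a minimal sketch under the assumption that only 5xx responses count as errors, as the SLA wording suggests; the function name `error_rate_pct` and the sample data are hypothetical and are not taken from the test logs.

```python
def error_rate_pct(status_codes):
    """Percentage of responses with a 5xx status code."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code >= 500)
    return errors / len(status_codes) * 100.0


# Hypothetical sample: a run with no server errors satisfies the 0% target
sample = [200, 201, 204, 200]
print(error_rate_pct(sample) == 0.0)
```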

Test Runs

| Test run | folio-keycloak module version | Test duration | Notes |
|---|---|---|---|
| Test 1 | folio-keycloak:26.5.1 | 15 minutes | Baseline test |
| Test 2 | folio-keycloak:26.5.7.269 | 15 minutes | A/B comparison test (Test 1 vs Test 2) |
| Test 3 | folio-keycloak:26.5.7.269 | 15 minutes | Failover testing. After 7 minutes, add 2 folio-keycloak containers. Containers at the beginning of the test: 3 |
| Test 4 | folio-keycloak:26.5.7.269 | 15 minutes | Failover testing. After 5 minutes, remove 2 folio-keycloak containers. Containers at the beginning of the test: 5 |
| Test 5 | folio-keycloak:26.5.7.269 | 15 minutes | Failover testing. After 5 minutes, stop 1 folio-keycloak container; after another 5 minutes, stop a second container. Containers at the beginning of the test: 3 |
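The scaling schedule of Tests 3 to 5 can be written down as data, which makes the intended container count at any minute of the 15-minute run easy to look up. This is a minimal sketch: the `SCENARIOS` structure and the `containers_at` helper are illustrative names, and the model assumes a stopped container is not automatically replaced by the orchestrator, which the report does not state.

```python
# scenario -> (containers at start, [(minute of event, change in container count)])
SCENARIOS = {
    "Test 3": (3, [(7, +2)]),             # scale up after 7 minutes
    "Test 4": (5, [(5, -2)]),             # scale down after 5 minutes
    "Test 5": (3, [(5, -1), (10, -1)]),   # stop one container at 5 and another at 10 minutes
}


def containers_at(scenario, minute):
    """Intended number of folio-keycloak containers at a given minute of the test."""
    start, events = SCENARIOS[scenario]
    return start + sum(delta for at, delta in events if minute >= at)


for name in SCENARIOS:
    print(name, [containers_at(name, m) for m in (0, 5, 7, 10, 15)])
```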

Load Model

Workflow 1. Authorization

  • Total Concurrency: 10 threads (Virtual Users).

  • Multi-Tenant Distribution: The load is distributed evenly across 30 distinct tenants to simulate a realistic, multi-tenant environment.

  • User Account: A predefined dataset of 300 unique user accounts is utilized during the test execution.
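One way to obtain an even distribution across tenants is round-robin selection. The sketch below is an assumption about how such a distribution could be produced, not a description of the actual test script: the tenant names are hypothetical, and the report does not state how the 300 accounts map to tenants.

```python
from collections import Counter
from itertools import cycle


def assign_tenants(tenant_ids, iterations):
    """Pick a tenant for each iteration in round-robin order."""
    picker = cycle(tenant_ids)
    return [next(picker) for _ in range(iterations)]


# Hypothetical tenant names; the report only gives the count (30)
tenants = [f"tenant_{i:02d}" for i in range(1, 31)]
counts = Counter(assign_tenants(tenants, 300))
print(len(counts), set(counts.values()))
```

With 300 iterations over 30 tenants, each tenant is selected the same number of times, so no single tenant dominates the login load.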

Workflow 2. Role & Access Management

  • Total Concurrency: 1 thread (Virtual User).

  • Multi-Tenant Distribution: The load is distributed evenly across 60 distinct tenants to simulate a realistic, multi-tenant environment.

The transaction flow includes the following steps:

  1. Create Role: Generating a new user role within the system.

  2. Select Application: Linking the role to a specific application context.

  3. Assign Capability-sets: Attaching specific permissions and capability-sets to the newly created role.

  4. Assign Role to User: Mapping the configured role to a random user from the account pool.

  5. Delete Role: Removing the role to clean up the environment and complete the lifecycle loop.




Test results

| Transaction | Test 1: Samples | Test 1: Avg | Test 1: pct90 | Test 2: Samples | Test 2: Avg | Test 2: pct90 |
|---|---|---|---|---|---|---|
| TC_Assign capability-sets app-acquisitions-1.0.28 | 12 | 22.11 s | 25.04 s | 8 | 26.16 s | 29.13 s |
| TC_Assign capability-sets app-bulk-edit-1.0.9 | 11 | 2.68 s | 3.51 s | 12 | 3.66 s | 4.25 s |
| TC_Assign capability-sets app-consortia-1.2.3 | 7 | 2.12 s | 2.38 s | 10 | 2.62 s | 3.78 s |
| TC_Assign capability-sets app-dcb-1.1.10 | 11 | 1.04 s | 1.74 s | 1 | 1.76 s | 1.76 s |
| TC_Assign capability-sets app-edge-complete-2.0.16 | 9 | 1.16 s | 1.83 s | 13 | 1.95 s | 2.81 s |
| TC_Assign capability-sets app-edge-locate-1.1.4 | 13 | 480.97 ms | 1.04 s | 3 | 739.67 ms | 1.02 s |
| TC_Assign capability-sets app-fqm-1.0.15 | 7 | 1.62 s | 2.35 s | 6 | 2.51 s | 3.15 s |
| TC_Assign capability-sets app-marc-migrations-2.0.5 | 4 | 997.42 ms | 1.21 s | 4 | 1.58 s | 2.86 s |
| TC_Assign capability-sets app-oai-pmh-1.0.3 | 11 | 1.26 s | 1.69 s | 3 | 2.48 s | 3.28 s |
| TC_Assign capability-sets app-platform-complete-2.5.6 | 6 | 50.27 s | 54.08 s | 5 | 1.11 min | 1.88 min |
| TC_Assign capability-sets app-platform-minimal-2.0.51 | 8 | 8.88 s | 10.97 s | 4 | 19.29 s | 1.17 min |
| TC_Assign role to user | 98 | 769.22 ms | 1.07 s | 69 | 1.47 s | 3.25 s |
| TC_Create role | 99 | 681.67 ms | 1.81 s | 70 | 2.20 s | 4.21 s |
| TC_Delete role | 98 | 100.28 ms | 159.40 ms | 69 | 163.41 ms | 509 ms |
| TC_PARENT_LOGIN | 50147 | 182.96 ms | 289.90 ms | 55791 | 166.85 ms | 271.70 ms |
| TC_Select application ${random_app_id} | 1 | 54 ms | 54 ms | 1 | 668 ms | 668 ms |
| TC_Select application app-acquisitions-1.0.28 | 12 | 155.58 ms | 350 ms | 8 | 298.50 ms | 412 ms |
| TC_Select application app-bulk-edit-1.0.9 | 11 | 161.27 ms | 292 ms | 12 | 259.44 ms | 473 ms |
| TC_Select application app-consortia-1.2.3 | 7 | 98.78 ms | 135 ms | 10 | 166.16 ms | 386 ms |
| TC_Select application app-dcb-1.1.10 | 11 | 76.89 ms | 182 ms | 1 | 90 ms | 90 ms |
| TC_Select application app-edge-complete-2.0.16 | 8 | 87.84 ms | 134 ms | 13 | 166.22 ms | 357 ms |
| TC_Select application app-edge-locate-1.1.4 | 13 | 116.59 ms | 299 ms | 3 | 157.17 ms | 213 ms |
| TC_Select application app-fqm-1.0.15 | 7 | 71.34 ms | 152 ms | 6 | 276.04 ms | 431 ms |
| TC_Select application app-marc-migrations-2.0.5 | 4 | 43.31 ms | 95 ms | 4 | 85.29 ms | 191 ms |
| TC_Select application app-oai-pmh-1.0.3 | 11 | 84.28 ms | 161.10 ms | 3 | 391.44 ms | 587 ms |
| TC_Select application app-platform-complete-2.5.6 | 6 | 71.31 ms | 131 ms | 5 | 152.64 ms | 199 ms |
| TC_Select application app-platform-minimal-2.0.51 | 8 | 68.89 ms | 84 ms | 4 | 249.85 ms | 462 ms |
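The results mix three units (ms, s, min), so values should be normalized before they are compared. The following is a minimal sketch of that normalization; the helper names `to_ms` and `pct_change` are assumptions for illustration, and the input format is taken to be exactly the "value unit" strings used in the results.

```python
UNIT_TO_MS = {"ms": 1.0, "s": 1000.0, "min": 60000.0}


def to_ms(text):
    """Convert a duration such as '1.11 min' or '480.97 ms' to milliseconds."""
    value, unit = text.split()
    return float(value) * UNIT_TO_MS[unit]


def pct_change(baseline, candidate):
    """Relative change of candidate vs baseline in percent (negative means faster)."""
    base = to_ms(baseline)
    return (to_ms(candidate) - base) / base * 100.0


# TC_PARENT_LOGIN pct90, Test 1 vs Test 2
print(round(pct_change("289.90 ms", "271.70 ms"), 2))  # -6.28
```

Normalizing first avoids misreading rows where the unit changes between tests, for example a Test 1 value in seconds next to a Test 2 value in minutes.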

 

Keycloak-111

26.5.1

 

Test 1