PTF - Test Hybrid DB Deployment on multiple (30) tenants [Galileo environment]. Provision DB + Serverless V2. Part 2
Overview
The primary objective of this testing initiative is to evaluate the viability, performance, and cost-efficiency of a hybrid database configuration. By combining a standard provisioned database (db.r7g.4xlarge or db.r7g.2xlarge) for steady-state operations with an Aurora Serverless V2 cluster (configured for 0.5-32 ACUs) for unpredictable, spiky workloads, we aim to maintain system stability while significantly reducing infrastructure costs.
Ticket:
https://folio-org.atlassian.net/browse/PERF-1229
Summary
Test 1:
Architecture: Provisioned db.r7g.4xlarge paired with an Aurora Serverless V2 cluster (1-4 ACUs).
Routing: 2 modules (mod-source-record-manager and mod-source-record-storage) were isolated and routed to the Serverless DB.
Observation: The primary provisioned database (4xlarge) was significantly over-provisioned, using only ~35% of its available CPU capacity during the run, while the Serverless V2 cluster reached 100% CPU utilization.
Action Taken: To better align with actual compute demand and improve cost efficiency, we scaled the provisioned database down to a db.r7g.2xlarge instance for subsequent testing phases.
Test 2:
Architecture: Provisioned db.r7g.2xlarge paired with Aurora Serverless V2 (initially configured for 1-4 ACUs).
Routing: 2 modules (mod-source-record-manager and mod-source-record-storage) routed to the Serverless DB.
Provisioned DB Performance: The downscaled 2xlarge instance operated at approximately 65% CPU utilization. We also observed a minor reduction in Response Times (RT) for certain workflows, which suggests this size is well matched to the main workload.
Serverless DB Bottleneck: The Aurora Serverless cluster peaked at 100% CPU utilization. This indicates that a maximum of 4 ACUs is insufficient to handle the load.
Action Taken: We decided to increase the maximum capacity of the Serverless cluster by 2 units, adjusting the configuration to 0.5-6 ACUs for the next testing phase.
Test 3:
Architecture: Provisioned db.r7g.2xlarge paired with an expanded Aurora Serverless V2 cluster. We adjusted the configuration to 0.5–6 ACUs, lowering the minimum from 1 to 0.5 to reduce cost during idle periods and raising the maximum from 4 to 6 to handle peak loads.
Routing: mod-source-record-manager and mod-source-record-storage remained routed to the Serverless DB.
System Performance Improvements: Despite the ongoing serverless constraints, expanding the maximum capacity yielded significant improvements in overall Response Times (RT) compared to Test 2. Most notably, Data Import execution time dropped from 52 minutes to 34 minutes, and circulation workflows (Check-In/Check-Out) saw latency reductions of about 2 seconds.
Serverless DB Bottleneck: Despite the increased capacity limits and the observed workflow improvements, the Aurora Serverless cluster still peaked at 100% CPU utilization.
Conclusion & Next Steps: While the 0.5–6 ACU configuration improved transaction times, the persistent 100% CPU bottleneck on the serverless cluster suggests that further capacity scaling is needed.
Test 4:
Architecture: Provisioned db.r7g.2xlarge paired with Aurora Serverless V2 (0.5–6 ACUs).
Routing: mod-source-record-manager and mod-source-record-storage remained routed to the Serverless DB.
Workload Optimization (Action Taken): In previous iterations, 3 separate 50K Data Imports were initiated at exactly the same time, creating a sharp load spike on the database. To smooth this spike, we introduced a 120-second ramp-up period that staggers the start times of the DI jobs.
Performance Comparison (Test 3 vs. Test 4):
Data Import: Execution time improved further, dropping from 34 minutes to 31 minutes.
Bulk Edit Improvements: Upload times improved noticeably across all record types (e.g., Holdings upload dropped from 81 sec to 67 sec). The Bulk Edit (Edit) for Users also saw a major improvement, dropping from 62 sec to 41 sec.
Circulation (CICO) response times saw minor improvements (Check-Out dropped from 8.7 sec to 8.2 sec).
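The staggering introduced in Test 4 can be sketched as follows. This is a minimal illustration, assuming a linear ramp-up in which the 120-second window is divided evenly across the 3 Data Import jobs; the function name and the even spacing are assumptions and are not taken from the actual test plan.

```python
def staggered_start_offsets(job_count: int, ramp_up_seconds: float) -> list[float]:
    """Return the start offset (in seconds) of each job under a linear ramp-up.

    Assumption: the ramp-up window is split evenly across jobs, so job i
    starts at i * ramp_up_seconds / job_count. A ramp-up of 0 starts every
    job at the same instant.
    """
    if job_count <= 0:
        raise ValueError("job_count must be positive")
    if ramp_up_seconds < 0:
        raise ValueError("ramp_up_seconds must not be negative")
    step = ramp_up_seconds / job_count
    return [i * step for i in range(job_count)]


# Tests 1-3: three 50K Data Imports start simultaneously.
print(staggered_start_offsets(3, 0))    # [0.0, 0.0, 0.0]
# Test 4 onward: the same three jobs spread over a 120-second ramp-up.
print(staggered_start_offsets(3, 120))  # [0.0, 40.0, 80.0]
```

Under this assumption the database sees the three imports begin 40 seconds apart instead of all at once, which is the spike-smoothing effect the test was designed to produce.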
Test 5:
Architecture: Provisioned db.r7g.2xlarge paired with Aurora Serverless V2 (0.5–6 ACUs), retaining the staggered Data Import optimizations introduced in Test 4.
Routing: In an attempt to further offload the primary 2xlarge database, we routed a third heavy module, mod-inventory-storage, to the Serverless DB (joining srm and srs).
System Instability & Errors: We executed this scenario twice to validate the behavior. During both iterations, the system experienced severe instability, returning HTTP 5xx response codes across multiple workflows.
Serverless DB Bottleneck: The test results demonstrate that the 0.5–6 ACU capacity limit is too restrictive to handle the combined concurrent connections and compute demands of these 3 modules.
Conclusion: Routing mod-inventory-storage to a serverless cluster capped at 6 ACUs is not a viable configuration under heavy load. We need to increase the Serverless capacity limit to 128 ACUs for the subsequent test execution (Test 6).
Test 6:
Architecture: Provisioned db.r7g.2xlarge paired with an Aurora Serverless V2 cluster with an expanded upper limit (0.5–128 ACUs).
Routing: 3 modules (mod-source-record-manager, mod-source-record-storage, and mod-inventory-storage) routed to the Serverless DB.
Serverless Scaling Behavior: Removing the restrictive 6 ACU cap eliminated the HTTP 5xx errors and system instability observed in Test 5. Under peak concurrent load, the serverless cluster dynamically scaled to a maximum of approximately 30 ACUs.
Performance Impact:
Data Import (50K - Create): Execution time dropped from 31 minutes to 18 minutes.
Data Export: The Custom profile completed in 16.2 minutes (down from 17.5 min), and the Default profile finished in just 3.5 minutes (down from 5.2 min).
Next Steps: Because the serverless cluster peaked at ~30 ACUs, maintaining a maximum limit of 128 ACUs is unnecessary. To establish a cost-optimized configuration, our next step is to reduce the Serverless capacity ceiling down to 32 ACUs.
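To compare the improvements reported for Test 6 on a common scale, the before/after durations can be expressed as relative changes. The sketch below uses only the figures quoted above; the helper name is hypothetical.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent (negative = faster)."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before * 100.0


# Durations in minutes, as quoted in the Test 6 summary (previous run -> Test 6).
test_6_results = {
    "Data Import (50K - Create)": (31.0, 18.0),
    "Data Export (Custom)": (17.5, 16.2),
    "Data Export (Default)": (5.2, 3.5),
}

for workflow, (before, after) in test_6_results.items():
    print(f"{workflow}: {percent_change(before, after):+.1f}%")
```

On these numbers Data Import shows the largest relative gain, roughly a 42% reduction in execution time.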
Test 7:
Architecture: Provisioned db.r7g.2xlarge paired with an Aurora Serverless V2 cluster. Based on the ~30 ACU peak observed in Test 6, we lowered the maximum capacity from 128 to 32 ACUs (0.5–32 ACU range).
Routing: 3 modules (mod-source-record-manager, mod-source-record-storage, and mod-inventory-storage) routed to the Serverless DB.
Performance Comparison (Test 6 vs. Test 7): Capping the Serverless cluster at 32 ACUs did not negatively impact performance.
Conclusion: The 0.5–32 ACU Serverless configuration represents the optimal "sweet spot" for this hybrid architecture.
Test 8:
Architecture: Provisioned db.r7g.2xlarge paired with Aurora Serverless V2 (0.5–32 ACUs).
Routing: 4 modules. mod-orders-storage was routed to the Serverless cluster, joining mod-source-record-manager, mod-source-record-storage, and mod-inventory-storage.
Conclusion: While core workflows remained stable, routing the fourth module to the 32 ACU cluster introduced an average 5-15% performance degradation across most workflows, indicating that the compute ceiling on the Serverless DB has been reached.
Test 9:
Architecture: Provisioned db.r7g.2xlarge paired with Aurora Serverless V2 (capped at 0.5–32 ACUs).
Routing: 5 modules (mod-source-record-manager, mod-source-record-storage, mod-inventory-storage, mod-orders-storage, and mod-data-export) routed to the Serverless DB.
Performance Impact: Data Export (Custom profile) improved significantly from 16 minutes down to 11 minutes.
Data Import (50K - Create) execution time increased from 19 minutes to 21 minutes.
General workflows (Circulation and Bulk Edit) experienced noticeable degradation due to resource contention.
Conclusion: The 32 ACU capacity ceiling is over-saturated with five heavy modules. While Data Export accelerated, the resulting compute starvation across other workflows indicates the ACU limit must be raised to support this routing configuration.
Test 10:
Architecture: Provisioned db.r7g.2xlarge paired with Aurora Serverless V2 (0.5–32 ACUs).
Routing: 7 modules (mod-source-record-manager, mod-source-record-storage, mod-inventory-storage, mod-orders-storage, mod-data-export, mod-lists, and mod-fqm-manager) routed to the Serverless DB.
Performance Impact (Test 9 vs. Test 10): Several workflows slowed down significantly. Data Import increased from 21 min to 24 min, and Custom Data Export lost its previous gains, regressing from 11 min back to 18 min.
Resource Utilization:
Capacity Ceiling Hit: The ServerlessDatabaseCapacity graph shows the cluster scaling up and flatlining exactly at the maximum 32 ACU limit.
High Compute Load: Serverless CPU utilization spiked to ~80%.
Conclusion: The 32 ACU capacity ceiling is definitely insufficient for 7 heavy modules. To alleviate this compute starvation, the next step is to double the Serverless maximum capacity to 0.5–64 ACUs (Test 11).
Test 11:
Architecture: Provisioned db.r7g.2xlarge paired with an expanded Aurora Serverless V2 cluster (0.5–64 ACUs).
Routing: 7 modules (srm, srs, inv-strg, ord-strg, data-export, lists, and fqm-manager) routed to the Serverless DB.
Performance Impact: Despite doubling the capacity to 64 ACUs, the metrics show the cluster only dynamically scaled to a peak of ~30 ACUs. The database did not utilize the extra resources provided.
Data Export showed a notable recovery (Custom profile dropped from 18 min to 15 min), and Data Import marginally improved from 24 min to 23 min, though it remains far above the 18-minute optimal baseline.
Persistent Bottlenecks: Raising the ACU cap did not fix tail latencies.
Conclusion: Expanding the limit to 64 ACUs shows that adding compute capacity to the Serverless cluster does not resolve the performance degradation caused by routing 7 heavy modules together, because the cluster still peaked at only ~30 ACUs and left the extra capacity unused.
Test 12:
Architecture: Provisioned db.r7g.2xlarge paired with an expanded Aurora Serverless V2 cluster. The configuration was adjusted to a 5–64 ACU range, raising the minimum capacity from 0.5 to 5 ACUs.
Routing: 7 modules (srm, srs, inv-strg, ord-strg, data-export, lists, and fqm-manager) routed to the Serverless DB.
Objective: This test was executed to determine whether raising the minimum ACU baseline mitigates auto-scaling lag (the time it takes for the database to provision new resources from a near-zero state) and improves overall system responsiveness.
Performance Impact: Establishing a "warm" minimum baseline of 5 ACUs resulted in improvements compared to Test 11:
Data Import (50K - Create): Execution time dropped from 23 minutes back to the optimal 18-minute mark.
CICO: Check-In dropped from 4.42 sec to 2.78 sec, and Check-Out improved to 4.02 sec.
Conclusion: Raising the minimum capacity to 5 ACUs eliminates the database "cold start" delay. This warm baseline provides immediate resources, allowing the system to handle heavy sudden workloads (like Data Import and CICO) much faster.
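The capacity ranges varied across these tests correspond to the Serverless V2 scaling configuration of the Aurora cluster. The sketch below only builds and validates the request parameters for the Test 12 range; the cluster identifier is a placeholder, the half-unit validation reflects the values used in this report, and the commented boto3 call is an assumption about how such a change might be applied rather than the procedure used in these tests.

```python
def serverless_v2_scaling_params(cluster_id: str, min_acu: float, max_acu: float) -> dict:
    """Build keyword arguments for an RDS ModifyDBCluster request.

    ACU values are expected in half-unit steps (for example 0.5, 5, 32, 64).
    """
    for value in (min_acu, max_acu):
        if value < 0 or (value * 2) != int(value * 2):
            raise ValueError(f"ACU value {value} is not a non-negative multiple of 0.5")
    if min_acu > max_acu:
        raise ValueError("min_acu must not exceed max_acu")
    return {
        "DBClusterIdentifier": cluster_id,
        "ServerlessV2ScalingConfiguration": {
            "MinCapacity": min_acu,
            "MaxCapacity": max_acu,
        },
        "ApplyImmediately": True,
    }


# Test 12 range: warm minimum of 5 ACUs, ceiling of 64 ACUs.
# "example-serverless-cluster" is a hypothetical identifier.
params = serverless_v2_scaling_params("example-serverless-cluster", 5, 64)
print(params)

# Applying the change would look roughly like this (not executed here):
#   import boto3
#   boto3.client("rds").modify_db_cluster(**params)
```

Keeping the range in one validated structure makes it easier to record exactly which minimum and maximum were in effect for each test run.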
Tests 10, 11, and 12 were executed with custom sidecar code optimizations developed by Olamide to address severe Check-In/Check-Out (CICO) latency observed in heavy multi-tenant environments. Trace analysis had previously revealed an authorization cache thrashing issue: token expiration timestamps (exp) were unnecessarily included in cache keys, which caused constant cache misses on every 10-minute token refresh and heavy cache pollution from duplicate service tokens across 50+ tenants. https://folio-org.atlassian.net/browse/MODSIDECAR-185
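The effect of including exp in a cache key can be illustrated with a simplified model. This is not the sidecar's actual code: the claim names other than exp, the key layout, and the function names are assumptions made for illustration, and expiry checks and eviction are deliberately left out.

```python
def key_with_exp(claims: dict) -> tuple:
    """Problematic key: includes exp, so every refreshed token yields a new key."""
    return (claims["tenant"], claims["sub"], claims["exp"])


def key_without_exp(claims: dict) -> tuple:
    """Stable key: identity fields only, unaffected by token refreshes."""
    return (claims["tenant"], claims["sub"])


def simulate(key_fn, refreshes: int, interval_s: int = 600) -> tuple[int, int]:
    """Return (cache misses, cache entries) after a number of token refreshes."""
    cache: dict = {}
    misses = 0
    for i in range(refreshes):
        # Each refresh issues a token for the same tenant and user with a later exp.
        claims = {"tenant": "tenant-a", "sub": "service-user", "exp": (i + 1) * interval_s}
        key = key_fn(claims)
        if key not in cache:
            misses += 1
        cache[key] = claims["exp"]  # store or update the entry under this key
    return misses, len(cache)


print(simulate(key_with_exp, 6))     # (6, 6): a miss and a new entry per refresh
print(simulate(key_without_exp, 6))  # (1, 1): one miss, one reused entry
```

In this model the exp-based key misses on every 10-minute refresh and leaves a stale entry behind each time, which multiplied across many tenants matches the thrashing and pollution pattern described above.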
Test 13: Baseline Retest (Without Sidecar CICO Optimizations)
Architecture: Provisioned db.r7g.2xlarge paired with an Aurora Serverless V2 cluster (5–64 ACUs).
Routing: 7 modules (srm, srs, inv-strg, ord-strg, data-export, lists, and fqm-manager) routed to the Serverless DB.
Objective: This test was executed as a direct control comparison (A/B test) against Test 12. It uses the same 5 ACU minimum baseline but removes Olamide's custom sidecar code optimizations.
Performance Impact (Test 12 vs. Test 13): Reverting the sidecar optimizations caused immediate and severe performance degradation across almost all workflows, indicating that the unoptimized sidecar authorization cache is a significant systemic resource drain:
CICO: Without the cache fix, real-time circulation suffered immediately. Check-In slowed from 2.78 sec to 4.2 sec, and Check-Out regressed from 4.02 sec to 5.89 sec.
Data Import (50K - Create): Execution time rose from 18 minutes to 26 minutes.
Conclusion: Test 13 shows that the custom sidecar CICO optimization is required for this environment.
Recommendations & Jiras
Next testing scenarios: https://folio-org.atlassian.net/browse/PERF-1334
Authorization cache thrashing issue: https://folio-org.atlassian.net/browse/MODSIDECAR-185
Test Runs
Baseline:
Database: 4xl Provisioned.
Note: Main configuration without adding Serverless DB.
Test 1 (r7g.4xl, srs and srm -> 1-4 ACU):
Database: db.r7g.4xlarge Provisioned + Serverless (1-4 ACU).
Routing: mod-source-record-manager + mod-source-record-storage modules connected to Serverless DB.
Test 2 (r7g.2xl, srs and srm -> 1-4 ACU):
Database: 2xl Provisioned + Serverless (1-4 ACU).
Routing: mod-source-record-manager + mod-source-record-storage modules connected to Serverless DB.
Test 3 (r7g.2xl, srs and srm -> 0.5-6 ACU):
Database: 2xl Provisioned + Serverless (0.5–6 ACU).
Routing: mod-source-record-manager + mod-source-record-storage modules connected to Serverless DB.
Test 4 (r7g.2xl, srs and srm -> 0.5-6 ACU and Data Import Optimization):
Database: 2xl Provisioned + Serverless (0.5–6 ACU) + DI optimization (Data Import ramp-up period changed to 120 sec; default = 0 sec).
Routing: mod-source-record-manager + mod-source-record-storage modules connected to Serverless DB.
Test 5 (r7g.2xl, srs, srm and inv-strg -> 0.5-6 ACU):
Database: 2xl Provisioned + Serverless (0.5–6 ACU) + DI.
Routing: mod-source-record-manager + mod-source-record-storage + mod-inventory-storage modules connected to Serverless DB.
Test 6 (r7g.2xl, srs, srm and inv-strg -> 0.5-128 ACU):
Database: 2xl Provisioned + Serverless (0.5–128 ACU) + DI.
Routing: mod-source-record-manager + mod-source-record-storage + mod-inventory-storage modules connected to Serverless DB.
Test 7 (r7g.2xl, srs, srm and inv-strg -> 0.5-32 ACU):
Database: 2xl Provisioned + Serverless (0.5–32 ACU) + DI.
Routing: mod-source-record-manager + mod-source-record-storage + mod-inventory-storage modules connected to Serverless DB.
Test 8 (r7g.2xl, srs, srm, inv-strg and ord-strg -> 0.5-32 ACU):
Database: 2xl Provisioned + Serverless (0.5–32 ACU) + DI.
Routing: mod-source-record-manager + mod-source-record-storage + mod-inventory-storage + mod-orders-storage modules connected to Serverless DB.
Test 9 (r7g.2xl, srs, srm, inv-strg, data-export and ord-strg -> 0.5-32 ACU):
Database: 2xl Provisioned + Serverless (0.5–32 ACU) + DI.
Routing: mod-source-record-manager + mod-source-record-storage + mod-inventory-storage + mod-orders-storage + mod-data-export modules connected to Serverless DB.
Test 10 (r7g.2xl, srs, srm, inv-strg, data-export, ord-strg, lists and fqm-manager -> 0.5-32 ACU):
Database: 2xl Provisioned + Serverless (0.5–32 ACU) + DI.
Routing: mod-source-record-manager + mod-source-record-storage + mod-inventory-storage + mod-orders-storage + mod-data-export + mod-lists + mod-fqm-manager modules connected to Serverless DB.
Test 11 (r7g.2xl, srs, srm, inv-strg, data-export, ord-strg, lists and fqm-manager -> 0.5-64 ACU):
Database: 2xl Provisioned + Serverless (0.5–64 ACU) + DI.
Routing: mod-source-record-manager + mod-source-record-storage + mod-inventory-storage + mod-orders-storage + mod-data-export + mod-lists + mod-fqm-manager modules connected to Serverless DB.
Test 12 (r7g.2xl, srs, srm, inv-strg, data-export, ord-strg, lists and fqm-manager -> 5-64 ACU):
Database: 2xl Provisioned + Serverless (5–64 ACU) + DI.
Routing: mod-source-record-manager + mod-source-record-storage + mod-inventory-storage + mod-orders-storage + mod-data-export + mod-lists + mod-fqm-manager modules connected to Serverless DB.
Test 13 (r7g.2xl, srs, srm, inv-strg, data-export, ord-strg, lists and fqm-manager -> 5-64 ACU):
Database: 2xl Provisioned + Serverless (5–64 ACU) + DI.
Routing: mod-source-record-manager + mod-source-record-storage + mod-inventory-storage + mod-orders-storage + mod-data-export + mod-lists + mod-fqm-manager modules connected to Serverless DB.
Comments: Retest of Test 12 with sidecar version 3.0.18nb (without Olamide's fix).
Test scenario:
8 Tenants: Check-In/Check-Out (CICO);
3 Tenants: Executed Data Import (50K, Profile: PTF-Create-3);
3 Tenants: Bulk Edit for holdings, users, and items with upload and edit operations;
2 Tenants: Executed Data Export workflows with Custom and Default profiles;
2 Tenants: Executed Harvesting workflows (OAI-PMH);
2 Tenants: Executed Refresh Lists workflows.
Results
SCENARIO | Baseline | Test 1 | Test 2 | Test 3 | Test 4 | Test 5 | Test 6 | Test 7 | Test 8 | Test 9 | Test 10 | Test 11 | Test 12 | Test 13
Check-In | 5.4 sec | 2.7 sec | 7.8 sec | 6.3 sec | 6.1 sec | failed | 8 sec | 6.2 sec | 7.12 sec | 7.63 sec | 4.56 sec | | |