PTF-Test Hybrid DB Deployment on multiple tenants [Galileo environment]. Provisioned primary DB db.r7g.2xlarge paired with Dedicated Offload DB db.r8g.2xlarge. Part 3
Overview
This report details the performance evaluation of a Dual-Provisioned Database architecture within a multi-tenant FOLIO environment (Galileo release). The tested architecture features a Provisioned Primary DB (db.r7g.2xlarge) paired with a Dedicated Offload DB (db.r8g.2xlarge). Seven highly intensive, I/O-heavy modules (mod-source-record-manager, mod-source-record-storage, mod-inventory-storage, mod-orders-storage, mod-data-export, mod-lists, and mod-fqm-manager) were routed to the Dedicated Offload DB (db.r8g.2xlarge). The objective of this testing phase is to determine whether a dual-provisioned hardware baseline can stabilize heavy batch workflows (Data Import, Data Export) without degrading real-time Circulation (CICO) and ListApp workflows across 60 tenants.
Ticket: https://folio-org.atlassian.net/browse/PERF-1334
Summary:
Test 3 - Medium Load: Under the medium load profile, the Dual-Provisioned setup easily outperforms the previous Serverless baseline. Data Import (50K) matched its optimal 18-minute benchmark, Data Export (Custom profile) completed in 17 minutes, and CICO response times improved by about 1 second. This confirms that the r8g.2xlarge instance can handle the concurrent load of all 7 heavy modules under normal operating conditions.
Test 4 applied a 3x load multiplier to stress-test the architecture's breaking point. CICO achieved its fastest times yet (Check-In at 2.40 seconds; Check-Out at 3.50 seconds), while Data Import execution degraded to 57 minutes. Based on the Performance Insights graph for segcon-perf1334-2nddb (Dedicated Offload DB, db.r8g.2xlarge), the primary bottleneck identified is IO:DataFileRead, which dominated the wait events throughout the test, accounting for approximately 70–80% of Average Active Sessions (AAS). CPU utilization remained negligible throughout, confirming that compute was not the limiting factor. The degradation observed in Test 4, particularly the Data Import regression (18 min → 57 min), is consistent with this I/O saturation pattern on the Offload DB.
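For reference, the wait-event breakdown can be reproduced outside the Performance Insights console. The following is a minimal sketch only, assuming Performance Insights is enabled on the Offload DB; the region and DbiResourceId below are placeholders, not values from this environment.

```python
# Minimal sketch: pull Average Active Sessions (db.load.avg) grouped by wait event
# from RDS Performance Insights for the Dedicated Offload DB.
# The Identifier below is a placeholder DbiResourceId, not the real one.
from datetime import datetime, timedelta, timezone

import boto3

pi = boto3.client("pi", region_name="us-east-1")  # region is an assumption

end = datetime.now(timezone.utc)
start = end - timedelta(hours=2)  # roughly a test window

resp = pi.get_resource_metrics(
    ServiceType="RDS",
    Identifier="db-EXAMPLERESOURCEID",            # placeholder DbiResourceId
    StartTime=start,
    EndTime=end,
    PeriodInSeconds=300,
    MetricQueries=[{
        "Metric": "db.load.avg",                   # AAS
        "GroupBy": {"Group": "db.wait_event", "Limit": 10},
    }],
)

# Print the average AAS contribution per wait event; IO:DataFileRead is expected
# to dominate if the run is I/O bound.
for series in resp["MetricList"]:
    dims = series["Key"].get("Dimensions", {})
    points = [p["Value"] for p in series["DataPoints"] if "Value" in p]
    if points:
        name = dims.get("db.wait_event.name", "total")
        print(f"{name}: avg AAS = {sum(points) / len(points):.2f}")
```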
Test 5 applied a vertically scaled Dedicated Offload DB (db.r8g.4xlarge) under the high load profile (60 tenants) to determine whether doubling the hardware capacity could mitigate the bottlenecks observed in Test 4. Despite the increased CPU and memory, vertical scaling yielded no meaningful performance recovery: CICO remained stable (Check-In at 3.12 seconds; Check-Out at 4.37 seconds), Data Import still took 59 minutes, and ListApp did not fail but its latency reached 8 minutes. This confirms that adding more physical hardware to the Offload DB does not resolve the degradation.
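As a cross-check of the I/O-bound diagnosis independent of instance size, active-session wait events can also be sampled directly on the PostgreSQL side. This is a minimal sketch only; it assumes direct connectivity to the Offload DB, and the connection details are placeholders.

```python
# Minimal sketch: sample wait events of active sessions on the Offload DB to
# cross-check the Performance Insights picture. Connection details are placeholders.
import time

import psycopg2

conn = psycopg2.connect(
    host="offload-db.example.internal",  # placeholder endpoint
    dbname="folio",
    user="folio_admin",
    password="change-me",
)
conn.autocommit = True

QUERY = """
    SELECT coalesce(wait_event_type, 'CPU') AS wait_type,
           coalesce(wait_event, 'CPU')      AS wait_event,
           count(*)                         AS sessions
    FROM pg_stat_activity
    WHERE state = 'active' AND pid <> pg_backend_pid()
    GROUP BY 1, 2
    ORDER BY sessions DESC;
"""

# Take a few samples one second apart; during an I/O-bound import the
# IO / DataFileRead row is expected to dominate.
with conn.cursor() as cur:
    for _ in range(5):
        cur.execute(QUERY)
        for wait_type, wait_event, sessions in cur.fetchall():
            print(f"{wait_type:>10} | {wait_event:<20} | {sessions}")
        print("-" * 40)
        time.sleep(1)

conn.close()
```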
Recommendations & Jiras
Authorization cache thrashing issue: https://folio-org.atlassian.net/browse/MODSIDECAR-185
Previous reports:
https://folio-org.atlassian.net/wiki/spaces/FOLIJET/pages/1807286320
https://folio-org.atlassian.net/wiki/spaces/FOLIJET/pages/1404010497
Test Runs
The first two tests (Test 1 and Test 2), shown in the test results table, are from previous testing (see the previous reports linked above):
Test 1. Baseline r7g 4xl;
Test 2. Baseline r7g 2xl + SL 5-64ACU;
Test 3.
Architecture: Provisioned primary DB db.r7g.2xlarge paired with Dedicated Offload DB db.r8g.2xlarge.
Load model: Test scenario 1: Medium load.
Routing: 7 modules (srm, srs, inv-strg, ord-strg, data-export, lists, and fqm-manager) routed to Dedicated Offload DB db.r8g.2xlarge.
Number of tenants: 60.
Test 4.
Architecture: Provisioned primary DB db.r7g.2xlarge paired with Dedicated Offload DB db.r8g.2xlarge.
Load model: Test scenario 2: High load.
Routing: 7 modules (srm, srs, inv-strg, ord-strg, data-export, lists, and fqm-manager) routed to Dedicated Offload DB db.r8g.2xlarge.
Number of tenants: 60.
Test 5.
Architecture: Provisioned primary DB db.r7g.2xlarge paired with Dedicated Offload DB db.r8g.4xlarge.
Load model: Test scenario 2: High load.
Routing: 7 modules (srm, srs, inv-strg, ord-strg, data-export, lists, and fqm-manager) routed to Dedicated Offload DB db.r8g.4xlarge.
Number of tenants: 60.
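For clarity, the module-to-database split used in Tests 3–5 can be summarized as a routing map. The sketch below uses the module list from this report; the endpoint hostnames and the DB_HOST/DB_PORT override mechanism are assumptions about how the deployment injects connection settings, not a confirmed configuration.

```python
# Minimal sketch of the module-to-database split used in Tests 3-5.
# Endpoint hostnames are placeholders; the DB_HOST/DB_PORT override mechanism
# is an assumption about the deployment, not taken from this environment.
PRIMARY_DB = "primary-db.example.internal"    # db.r7g.2xlarge
OFFLOAD_DB = "offload-db.example.internal"    # db.r8g.2xlarge (4xlarge in Test 5)

OFFLOADED_MODULES = {
    "mod-source-record-manager",
    "mod-source-record-storage",
    "mod-inventory-storage",
    "mod-orders-storage",
    "mod-data-export",
    "mod-lists",
    "mod-fqm-manager",
}

def db_env_for(module: str) -> dict[str, str]:
    """Return the DB connection overrides a module's task/pod would receive."""
    host = OFFLOAD_DB if module in OFFLOADED_MODULES else PRIMARY_DB
    return {"DB_HOST": host, "DB_PORT": "5432"}

if __name__ == "__main__":
    # Example: one offloaded module and one module that stays on the primary DB.
    for mod in ("mod-inventory-storage", "mod-circulation-storage"):
        print(mod, db_env_for(mod))
```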
Test scenarios:
Test scenario 1: Medium load
8 Tenants: Check-In/Check-Out (CICO);
3 Tenants: Executed Data Import (50K, Profile: PTF-Create-3);
3 Tenants: Bulk Edit for holdings, users, and items with upload and edit operations;
2 Tenants: Executed Data Export workflows with Custom and Default profiles;
2 Tenants: Executed Harvesting workflows (OAI-PMH);
2 Tenants: Executed Refresh Lists workflows.
Test scenario 2: High load (x3)
24 Tenants: Check-In/Check-Out (CICO);
9 Tenants: Executed Data Import (50K, Profile: PTF-Create-3);
9 Tenants: Bulk Edit for holdings, users, and items with upload and edit operations;
6 Tenants: Executed Data Export workflows with Custom and Default profiles;
6 Tenants: Executed Harvesting workflows (OAI-PMH);
2 Tenants: Executed Refresh Lists workflows.
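For reference, the two tenant-allocation profiles above can be captured as data, for example to parameterize a test run. In the sketch below the structure and key names are assumptions; the tenant counts come from the scenario definitions above.

```python
# Minimal sketch: the two load profiles as data, e.g. for parameterizing a test run.
# Tenant counts are taken from the scenario definitions above; note that the
# Refresh Lists workload is 2 tenants in both profiles.
MEDIUM_LOAD = {
    "cico": 8,
    "data_import_50k": 3,
    "bulk_edit": 3,
    "data_export": 2,
    "oai_pmh_harvesting": 2,
    "refresh_lists": 2,
}

HIGH_LOAD = {
    "cico": 24,
    "data_import_50k": 9,
    "bulk_edit": 9,
    "data_export": 6,
    "oai_pmh_harvesting": 6,
    "refresh_lists": 2,   # not tripled in the high load scenario
}

if __name__ == "__main__":
    print(f"Medium load tenants in use: {sum(MEDIUM_LOAD.values())}")  # 20 of 60
    print(f"High load tenants in use:   {sum(HIGH_LOAD.values())}")    # 56 of 60
```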
Results
| Scenario | Detail | Baseline r7g 4xl | Baseline r7g 2xl + SL 5-64ACU | Test 3 | Test 4 | Test 5 | 1 VU |
|---|---|---|---|---|---|---|---|
| Check-In | | 5.4 sec | 4.2 sec | 3.02 sec | 2.40 sec | 3.12 sec | 3.3 sec |
| Check-Out | | 7 sec | 5.89 sec | 4.67 sec | 3.50 sec | 4.37 sec | 4.1 sec |
| Data Import | | 35 min | 26 min | 18 min | 57 min | 59 min | |
| Data Export | Custom profile | 23 min | 17 min | 17 min | 9 min | 9 min | 8.6 min |
| Data Export | Default profile | 5 min | 4.2 min | 5 min | 4.6 min | 3.8 min | 4.37 min |
| OAI-PMH | | 0.5 sec | 0.456 sec | 0.34 sec | 0.536 sec | 0.55 sec | 0.372 sec |
| Bulk Edit Upload | holdings | 66 sec | 95 sec | 71 sec | 96 sec | 93 sec | 66 sec |
| Bulk Edit Upload | users | 31 sec | 40 sec | 36 sec | 53 sec | 45 sec | 32 sec |
| Bulk Edit Upload | items | 76 sec | 120 sec | 95 sec | 152 sec | 174 sec | 231 sec |
| Bulk Edit Edit | holdings | 250 sec | 192 sec | 212 sec | 234 sec | 227 sec | 141 sec |
| Bulk Edit Edit | users | 57 sec | 39 sec | 42 sec | 42 sec | 44 sec | 37 sec |
| Bulk Edit Edit | items | 112 sec | 84 sec | 97 sec | 96 sec | 104 sec | 53 sec |
| ListApp | | 4 sec - 120 sec | 3 sec - 9 min | 3 sec - 4 min | failed | 14 sec - 8 min | The comparison table is below |
Service CPU utilization
Because all test iterations used the same JMeter script, the resulting CPU utilization profiles were highly consistent across runs. For clarity and to avoid redundancy, only the Service CPU graph from the final test execution is included below as a representative sample.
Test 3.
Test 4.