PTF-Test Heavy Workflows on multiple tenants concurrently [Galileo environment].
- 1 Overview
- 2 Summary
- 3 Test Runs
- 4 Results
- 4.1 Heavy Workflows
- 4.2 Data Import
- 5 Resource utilization graphs
- 5.1 CPU utilization graphs for Heavy Workflows
- 5.2
- 5.3 RDS CPU utilization graphs for Heavy Workflows
- 5.4 Database connection graphs for Heavy Workflows tests
- 5.5 Database load graphs for Heavy Workflows
- 5.6 Kafka brokers resource utilization for Heavy Workflows
- 5.7 CPU utilization graphs for DI 50K tests
- 5.8 DB CPU utilization for DI 50K tests
- 5.9 RDS DB connections during DI 50K tests
- 5.10 Aurora Estimates Shared memory graphs for tests DI-9P-T1, DI-9P-T2, DI-9P-T3
- 5.11 Freeable Memory for DI 50K tests
- 5.12 Database load for DI 50K tests
- 5.13 Kafka brokers resource utilization for DI 50K tests
- 5.14 Additional testing results graphs for DI 50K tests
- 5.15 Errors during DI-9P-T5 and DI-9P-T6
- 6 Appendix
- 6.1 Infrastructure
- 6.2 Cluster Resources - SEGCON
- 6.3
- 6.4 Methodology/approach
Overview
This performance testing initiative is designed to assess the system's stability and performance degradation when handling multiple concurrent, resource-intensive workflows across an increasing number of tenants. The primary goal is to identify the saturation point at which system resources become critical, leading to unacceptable performance or instability. All tests were performed on the Sunflower Eureka Galileo environment (SEGCON).
Ticket: PTF-Test Heavy Workflows on multiple tenants concurrently [Galileo environment].
Summary
Heavy Workflows summary
The performance assessment involved executing a set of Heavy Workflows (Data Import, Data Export, Bulk Edit, and OAI-PMH) with the number of virtual users scaled from 1 to 9, alongside 8 VUs performing CICO, on configurations that scaled both the DB (from 2xl to 4xl) and the Kafka broker (from xlarge to 2xlarge). To improve performance we increased the load limits (x4 and x8 from the default values) and the shared buffer (+30%), and allocated 8 tasks to the main modules mod-inventory, mod-inventory-storage, mod-srm, and mod-srs. As the number of VUs grew, each VU worked on a separate tenant, as shown in the Test Runs table.
Data Import Performance
With increasing workload (HW1 → HW6), Data Import execution time grew from ~10 min to ~90 min. A clear correlation emerges: doubling the number of concurrent DI jobs roughly doubles the duration.
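The near-linear scaling claim can be checked with quick arithmetic (a sketch; durations are approximated from the results table below):

```python
# Observed DI durations (minutes) vs. number of concurrent DI jobs,
# approximated from the results table (HW1, HW2, HW4, HW5, HW6).
durations = {1: 10.3, 2: 19.7, 4: 38.0, 6: 43.0, 9: 91.0}

# Ratio of duration growth to concurrency growth relative to the 1-VU baseline;
# a value near 1.0 means duration scales linearly with the number of DI jobs.
base_dur = durations[1]
for vu, dur in durations.items():
    scaling = (dur / base_dur) / vu
    print(f"{vu} DI job(s): {dur:.0f} min, linear-scaling factor {scaling:.2f}")
```

The factor stays close to 1.0 at 2 and 4 jobs and again at 9, which is consistent with the "twofold load, twofold duration" observation.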
Tests HW6 and HW7 indicated a bottleneck in the Kafka broker, as its CPU utilization reached 70%. Consequently, we decided to increase the instance type in the subsequent test to address this limitation.
After upgrading the Kafka broker and DB instances (HW7–HW9), execution time stabilized at the HW5 level, matching the DI duration observed with 6 VUs.
Significant performance improvements were observed after increasing the DB instance sizes and increasing the Kafka broker instance sizes.
Raising loadLimit (x4 and x8), increasing shared_buffers by 30%, and enabling 8 parallel tasks did not show improvements.
The Check-In/Check-Out (CICO) workflow, consistently executed with 8 virtual users (VU) across all testing phases (HW1 through HW11), served as the critical metric for measuring the impact of heavy background workflows on a key user-facing workflow. CICO response times degraded as the concurrency of the Heavy Workflows (DI, DE, BE, OAI-PMH) increased.
Baseline (HW1): CICO transactions had very good response times, with Check-In at 1.9 sec and Check-Out at 2.4 sec.
Saturation Point (HW6–HW8): As the number of concurrent heavy workflows reached the maximum of 9 per type across multiple tenants, CICO latency increased roughly fourfold: Check-In peaked at 7.0 sec and Check-Out at 10.0 sec (HW8).
DB Bottleneck (HW8): The peak latencies of 7.0 sec (Check-In) and 10.0 sec (Check-Out) occurred in HW8, where the system was under maximum load (9 tenants) with a scaled Kafka broker (2xl) but the initial DB size (2xl). This reinforces the finding that database load (Bulk Edit, Data Export, OAI-PMH, and Data Import) was the primary constraint, significantly impacting CICO performance.
Scaling and Tuning (HW9–HW11): CICO performance saw a substantial recovery once both the DB and the application were optimized:
In HW9 (DB 4xl, Kafka 2xl), latencies recovered to 1.7 sec (Check-In) and 3.4 sec (Check-Out).
In the final, most optimized tests (HW10/HW11), where application limits and task allocation were tuned, CICO stabilized at a near-baseline level: 1.7–2.1 sec (Check-In) and 3.7–4.7 sec (Check-Out).
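The degradation factors quoted above follow from simple ratios (a sketch; latencies are taken from this summary, with HW11 represented by its upper bound):

```python
# CICO latencies in seconds, from the summary above.
baseline = {"Check-In": 1.9, "Check-Out": 2.4}   # HW1
peak     = {"Check-In": 7.0, "Check-Out": 10.0}  # HW8
tuned    = {"Check-In": 2.1, "Check-Out": 4.7}   # HW11 (upper bound of range)

# Degradation relative to the HW1 baseline.
for op in baseline:
    print(f"{op}: peak degradation x{peak[op] / baseline[op]:.1f}, "
          f"after tuning x{tuned[op] / baseline[op]:.1f}")
```

Check-In peaks at ~3.7x and Check-Out at ~4.2x the baseline, which supports the "roughly fourfold" characterization; after tuning, Check-In returns to near-baseline while Check-Out remains about twice the baseline.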
The Bulk Edit (BE) workflow, particularly the Holdings Edit operation, proved to be the most demanding process, exhibiting a nearly 13-fold increase in latency (up to 1027 seconds). Data Export (DE) performance degraded severely, with Custom Profile times surging five-fold and eventually failing to start, underscoring its vulnerability to resource exhaustion (we attribute this to CPU utilization of 1000% and higher on the data-export module). Meanwhile, the OAI-PMH process did not experience extreme latency, but its key requests still degraded roughly threefold.
Data Import (DI) Performance Analysis (9 Concurrent Tenants)
This performance testing focuses on the impact of infrastructure scaling and application tuning, under high load, on the execution time of 50K-record Data Imports running concurrently across 9 tenants (DI-9P-T1 through DI-9P-T6).
Initial Infrastructure Baseline (DI-9P-T1 & T2)
Tests DI-9P-T1 and DI-9P-T2 established the baseline performance under high load (9 parallel DI runs) but with different database (DB) and Kafka broker configurations.
- Average Performance: Execution times were consistently high, averaging 73 minutes in DI-9P-T1 (DB r7g.2xl, Kafka m7g.xlarge) and 71-72 minutes in DI-9P-T2 (DB r7g.4xl, Kafka m7g.xlarge).
- Conclusion on Scaling: The upgrade from the r7g.2xl DB (T1) to the r7g.4xl DB (T2) provided only a small improvement (1–2 minutes), or possibly none in relative terms, indicating that the DB was not the bottleneck once scaled up to r7g.4xl. The constraint was likely elsewhere, such as the Kafka broker or module limitations.
- DI-9P-T3 (30% shared_buffers & loadLimit x2): Performance remained stable at approximately 70 minutes. The slight increase in loadLimit proved ineffective at this concurrency level (though the logs show fewer messages sent by consumers to Kafka; graphs are in the Additional testing results graphs for DI 50K tests section).
- DI-9P-T4 (30% shared_buffers & loadLimit x4): Increasing loadLimit to x4 brought no meaningful improvement. Most tenants ran in 76 minutes, with one outlier (cs0006 at 64 minutes); the average across the 9 tenants still remained around 70 minutes.
- DI-9P-T5 (Kafka m7g.2xlarge + DB 4xl + 4x module tasks): Execution time dropped to an average of 47–48 minutes (with cs0006 hitting an excellent 30 minutes). This ~40% time reduction (from ~76 min to ~48 min) was achieved by combining the Kafka broker upgrade (increasing message throughput capacity) with the larger DB instance (4xl) and a 4x increase in task allocation for the key processing modules (inv, srs). This combination successfully eliminated the primary bottleneck.
- DI-9P-T6 (Kafka m7g.2xlarge + DB 4xl + 8x module tasks): DI execution times further improved to the 38–41 minute range. Increasing module task allocation to 8x yielded an additional ~7-minute gain, showing that Data Import throughput scales with the number of module tasks working in parallel.
- The initial increases of the loadLimit (x2 and x4) showed no performance gain.
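As a quick sanity check on the quoted reduction figures (a sketch; average times are taken from the bullets above):

```python
# Average 50K DI execution times (minutes) per test run, from the bullets above.
avg_minutes = {
    "DI-9P-T1": 73.0, "DI-9P-T2": 71.5, "DI-9P-T3": 70.0,
    "DI-9P-T4": 76.0, "DI-9P-T5": 47.5, "DI-9P-T6": 39.5,
}

def reduction(before: float, after: float) -> float:
    """Relative time reduction, as a percentage."""
    return (before - after) / before * 100

print(f"T4 -> T5: {reduction(avg_minutes['DI-9P-T4'], avg_minutes['DI-9P-T5']):.0f}% faster")
print(f"T5 -> T6: {reduction(avg_minutes['DI-9P-T5'], avg_minutes['DI-9P-T6']):.0f}% faster")
```

The T4 → T5 step works out to roughly 37–40% depending on which endpoints are rounded, matching the ~40% figure; the T5 → T6 step adds a further ~17%.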
Test Runs
Test setup Heavy Workflows #1
| Test # | Data Import VU | Data Export VU | OAI-PMH (incremental harvesting) VU | Bulk Edit (holdings) VU | Bulk Edit (users) VU | Bulk Edit (items) VU | CICO VU | Duration, sec | Tenants | Comment | DB | Kafka broker | Additional configuration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HW1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 | 1200 | 0002 | 1 user on 0002 tenant | 2xl | kafka.m7g.xlarge | |
| HW2 | 2 | 2 | 2 | 2 | 2 | 2 | 8 | 1200 | 0002 | 2 users on 0002 tenant | 2xl | kafka.m7g.xlarge | |
| HW3 | 2 | 2 | 2 | 2 | 2 | 2 | 8 | 1200 | 0002 & 0003 | 1 user on 0002 tenant and 1 user on 0003 tenant | 2xl | kafka.m7g.xlarge | |
| HW4 | 4 | 4 | 4 | 4 | 4 | 4 | 8 | 1200 | 0002 & 0003 & 0004 & 0006 | 1 user on each tenant | 2xl | kafka.m7g.xlarge | |
| HW5 | 6 | 6 | 6 | 6 | 6 | 6 | 8 | 1200 | 0002 & 0003 & 0004 & 0006 & 0007 & 0008 | 1 user on each tenant | 2xl | kafka.m7g.xlarge | |
| HW6 | 9 | 9 | 9 | 9 | 9 | 9 | 8 | 1200 | 0002 & 0003 & 0004 & 0006 & 0007 & 0008 & 0009 & 00012 & 00013 | Tests 6 and 7 with increased load showed the same Kafka broker CPU utilization, so for the next step the broker instance type was changed to 2xl | 2xl | kafka.m7g.xlarge | |
| HW7 | 9 | 9 | 9 | 9 | 9 | 9 | 8 | 1200 | same as HW6 | | 4xl | kafka.m7g.xlarge | |
| HW8 | 9 | 9 | 9 | 9 | 9 | 9 | 8 | 1200 | same as HW6 | | 2xl | kafka.m7g.2xlarge | |
| HW9 | 9 | 9 | 9 | 9 | 9 | 9 | 8 | 1200 | same as HW6 | | 4xl | kafka.m7g.2xlarge | |
| HW10 | 9 | 9 | 9 | 9 | 9 | 9 | 8 | 1200 | same as HW6 | | 4xl | kafka.m7g.2xlarge | + x4 loadLimit & + 30% shared buffer |
| HW11 | 9 | 9 | 9 | 9 | 9 | 9 | 8 | 1200 | same as HW6 | | 4xl | kafka.m7g.2xlarge | + x4 loadLimit & + 30% shared buffer |
Test setup Data Import #2
Data Import (50K create imports) in parallel on 9 tenants
Results
Heavy Workflows
This table contains testing results for "Test setup Heavy Workflows #1".
| Duration | HW1 | HW2 | HW3 | HW4 | HW5 | HW6 | HW7 | HW8 | HW9 | HW10 | HW11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Data Import | 10 min 20 sec | 19 min 41 sec | 22 min | 38 min | 43 min | 1 hour 31 min | 1:29(*), 2:02, 1:29(*), 2:02, 1:29(*), 2:02, 2:02, 2:02, 2:02, 2:02, 2:01 | 98 min | 73 min | 77 min | Didn't start |
| DE 100k, custom profile | 358 sec | 398 sec | 870 sec | 1255 sec | 1710 sec | failed to start | 1532 sec | 1657 sec | 1265 sec | 1330 sec | 1451 sec |
| DE 100k, default profile | 260 sec | 193 sec | 420 sec | failed to start | 803 sec | 1103 sec | 712 sec | 774 sec | 562 sec | 738 sec | 698 sec |
| OAI-PMH (incremental harvesting) | 0.44 sec | 0.398 sec | 0.53 sec | 1.5 sec | 1.2 sec (failed on 50% on the tenant) | 1.6 sec | 1.8 sec | 1 sec | failed to start | 1.7 sec | 1.65 sec |
| Bulk Edit upload, holdings | 80 sec | 56 sec | 71 sec | 83 sec | 66 sec | 97 sec | 96 sec | 110 sec | | | |