PTF - Test Heavy Workflows on multiple tenants concurrently [Galileo environment]


Overview

This performance testing initiative assesses the system's stability and performance degradation when handling multiple concurrent, resource-intensive workflows across an increasing number of tenants. The primary goal is to identify the saturation point at which system resources become critical, leading to unacceptable performance or instability. All tests were performed on the Sunflower Eureka Galileo environment (SEGCON).
Ticket: PTF - Test Heavy Workflows on multiple tenants concurrently [Galileo environment].

Summary

Heavy Workflows summary

  1. The performance assessment involved executing a set of Heavy Workflows (Data Import, Data Export, Bulk Edit, and OAI-PMH) with the number of virtual users scaled from 1 to 9, alongside 8 VU performing CICO, on configurations that scaled both the DB (from 2xl to 4xl) and the Kafka broker (from xlarge to 2xlarge). To improve performance we increased the load limits (x4 and x8 from the default values), increased the shared buffers (+30%), and allocated 8 tasks to the main modules mod-inventory, mod-inventory-storage, mod-srm, and mod-srs. As the number of VU grew, each VU worked on a separate tenant, as shown in the Test Runs table.

  • Data Import Performance

    • With increasing workload (HW1 → HW6), execution time grew from 10 min to ~90 min, showing a clear correlation: a twofold increase in the number of concurrent DI jobs causes a twofold increase in duration.

    • Tests HW6 and HW7 indicated a bottleneck in the Kafka broker, as its CPU utilization reached 70%. Consequently, we decided to increase the instance type in the subsequent test to address this limitation.

    • After upgrading the Kafka broker and DB instances (HW7–HW9), execution time stabilized at the HW5 level, equal to the DI duration with 6 VU.

    • Significant performance improvements were observed after increasing the DB instance size and the Kafka broker instance size.

    • Raising loadLimit x4 and x8, increasing shared_buffers by 30%, and enabling 8 parallel tasks did not show improvements.

  • The Check-In/Check-Out (CICO) workflow, consistently executed with 8 virtual users (VU) across all testing phases (HW1 through HW11), served as the key metric for measuring the impact of heavy background workflows on an interactive workflow. CICO response times degraded as the concurrency of the Heavy Workflows (DI, DE, BE, OAI-PMH) increased.

    • Baseline (HW1): CICO transactions had very good response times, with Check-In at 1.9 sec and Check-Out at 2.4 sec.

    • Saturation point (HW6–HW8): As the number of concurrent heavy workflows reached the maximum of 9 per type across multiple tenants, CICO latency increased roughly 4-fold: Check-In peaked at 7.0 sec and Check-Out at 10.0 sec (HW8).

    • DB bottleneck (HW8): The peak latencies of 7.0 sec (Check-In) and 10.0 sec (Check-Out) occurred in HW8, where the system was under maximum load (9 tenants) with a scaled Kafka broker (2xlarge) but the initial DB size (2xl). This reinforces the finding that database load (Bulk Edit, Data Export, OAI-PMH, and Data Import) was the primary constraint, significantly impacting CICO performance.

    • Scaling and Tuning (HW9-HW11): CICO performance saw a substantial recovery when both the DB and application were optimized:

      • In HW9 (DB 4xl, Kafka 2xl), latencies improved to 1.7 sec (Check-In) and 3.4 sec (Check-Out).

      • In the final, most optimized tests (HW10/HW11), where application limits and task allocation were tuned, CICO stabilized at a near-baseline level: 1.7–2.1 sec (Check-In) and 3.7–4.7 sec (Check-Out).

    • The Bulk Edit (BE) workflow, particularly the Holdings Edit operation, proved to be the most demanding process, exhibiting a nearly 13-fold increase in latency (up to 1027 seconds). Data Export (DE) performance degraded severely, with Custom Profile times surging five-fold and eventually failing to start, underscoring its vulnerability to resource exhaustion (we attribute this to CPU utilization of 1000% and higher on the data-export module). Meanwhile, the OAI-PMH process, though not showing extreme latency, still degraded roughly 3-fold on key requests.
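The headline degradation factors above can be verified with simple arithmetic on the reported numbers. A minimal sketch (the variable names and rounding are ours; all input numbers come from the results tables in this document):

```python
# Sanity-check arithmetic for the degradation factors quoted in the summary.
# All input numbers are taken from the results reported in this document.

# CICO response times: baseline (HW1) vs. saturation peak (HW8), in seconds.
check_in_factor = 7.0 / 1.9     # Check-In degradation, ~3.7x
check_out_factor = 10.0 / 2.4   # Check-Out degradation, ~4.2x

# Bulk Edit holdings: baseline 80 sec vs. worst observed 1027 sec.
bulk_edit_factor = 1027 / 80    # ~12.8x, the "nearly 13-fold" increase

# Data Import: minutes per virtual user stay roughly flat as load doubles,
# which is what "twofold VU increase -> twofold duration" implies.
di_minutes = {1: 10.3, 2: 19.7, 4: 38.0}          # VUs -> total minutes
di_per_vu = {vu: round(m / vu, 1) for vu, m in di_minutes.items()}

print(round(check_in_factor, 1), round(check_out_factor, 1),
      round(bulk_edit_factor, 1), di_per_vu)
```

The roughly constant minutes-per-VU figure is what supports the linear-scaling observation for Data Import.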

Data Import (DI) Performance Analysis (9 Concurrent Tenants)

  • This performance testing focuses on the impact of infrastructure scaling and application tuning, under high load, on the execution time of 50K Data Imports running concurrently across 9 tenants (DI-9P-T1 through DI-9P-T6).

    Initial Infrastructure Baseline (DI-9P-T1 & T2)

Tests DI-9P-T1 and DI-9P-T2 established the baseline performance under high load (9 parallel DI runs) but with different database (DB) and Kafka broker configurations.

- Average Performance: Execution times were consistently high, averaging 73 minutes in DI-9P-T1 (DB r7g.2xl, Kafka m7g.xlarge) and 71-72 minutes in DI-9P-T2 (DB r7g.4xl, Kafka m7g.xlarge).

- Conclusion on Scaling: Upgrading from the r7g.2xl DB (T1) to the r7g.4xl DB (T2) provided only a small improvement (1–2 minutes), or possibly none when recalculated as a percentage, indicating that the DB was not the bottleneck once scaled to r7g.4xl. The constraint was likely elsewhere, such as the Kafka broker or module limitations.

- DI-9P-T3 (+30% shared_buffers & loadLimit x2): Performance remained stable at approximately 70 minutes. The slight increase in loadLimit proved ineffective at this concurrency level (although the logs show fewer messages sent by the consumer to Kafka; graphs are in the "Additional testing results graphs for DI 50K tests" section).

- DI-9P-T4 (+30% shared_buffers & loadLimit x4): Increasing loadLimit to x4 brought no meaningful improvement: most tenants ran in 76 minutes, with one outlier (cs0006 at 64 minutes), and the average across the 9 tenants remained around the 70-minute mark.

- DI-9P-T5 (Kafka m7g.2xlarge + DB 4xl + 4x module tasks): Execution time dropped to an average of 47–48 minutes (with cs0006 hitting an excellent 30 minutes). This ~40% time reduction (from ~76 min to ~48 min) was achieved by combining the Kafka broker upgrade (increasing message throughput capacity) with the larger DB instance (4xl) and a 4x increase in task allocation for the key processing modules (inv, srs). This combination successfully eliminated the primary bottleneck.

- DI-9P-T6 (Kafka m7g.2xlarge + DB 4xl + 8x module tasks): DI execution times further improved to the 38–41 minute range. Increasing module task allocation to 8x yielded an additional ~7-minute gain, indicating that Data Import throughput scales with the number of module tasks working in parallel.

- The initial loadLimit increases (x2 and x4) showed no performance gain.
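The percentage gains in the DI-9P series follow directly from the reported averages. A small sketch (the labels are ours; the average minutes per run are those reported above):

```python
# Relative improvements across the DI-9P test series (average minutes per run,
# as reported above for 50K Data Imports on 9 concurrent tenants).
avg_minutes = {
    "T1 (DB 2xl, Kafka xlarge)":       73,
    "T2 (DB 4xl, Kafka xlarge)":       71.5,
    "T3 (+30% shared_buffers, LL x2)": 70,
    "T4 (+30% shared_buffers, LL x4)": 76,
    "T5 (Kafka 2xlarge, 4x tasks)":    47.5,
    "T6 (Kafka 2xlarge, 8x tasks)":    39.5,
}

def reduction(before, after):
    """Percent time reduction going from `before` to `after` minutes."""
    return round((before - after) / before * 100, 1)

# T4 -> T5: the jump attributed to the Kafka upgrade plus 4x module tasks.
print(reduction(avg_minutes["T4 (+30% shared_buffers, LL x4)"],
                avg_minutes["T5 (Kafka 2xlarge, 4x tasks)"]))

# T5 -> T6: the extra gain from doubling module tasks to 8x.
print(reduction(avg_minutes["T5 (Kafka 2xlarge, 4x tasks)"],
                avg_minutes["T6 (Kafka 2xlarge, 8x tasks)"]))
```

The T4 → T5 step works out to about 37.5%, consistent with the "~40% time reduction" quoted above.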

Test Runs 

Test setup Heavy Workflows #1

The workflow columns give the number of virtual users per workflow; CICO always ran with 8 VU for a duration of 1200 sec.

| Test # | Data Import (50K PTF Create-3) | Data Export (100K BIB exports) | OAI-PMH (incremental harvesting) | Bulk Edit (holdings) | Bulk Edit (users) | Bulk Edit (items) | CICO | Duration, sec | Tenants | Comment | DB | Kafka broker | Additional configuration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HW1 | 1 | 1 | 1 | 1 | 1 | 1 | 8 | 1200 | 0002 | 1 user on 0002 tenant | 2xl | kafka.m7g.xlarge | |
| HW2 | 2 | 2 | 2 | 2 | 2 | 2 | 8 | 1200 | 0002 | 2 users on 0002 tenant | 2xl | kafka.m7g.xlarge | |
| HW3 | 2 | 2 | 2 | 2 | 2 | 2 | 8 | 1200 | 0002 & 0003 | 1 user on 0002 tenant and 1 user on 0003 tenant | 2xl | kafka.m7g.xlarge | |
| HW4 | 4 | 4 | 4 | 4 | 4 | 4 | 8 | 1200 | 0002 & 0003 & 0004 & 0006 | 1 user on each tenant | 2xl | kafka.m7g.xlarge | |
| HW5 | 6 | 6 | 6 | 6 | 6 | 6 | 8 | 1200 | 0002 & 0003 & 0004 & 0006 & 0007 & 0008 | 1 user on each tenant | 2xl | kafka.m7g.xlarge | |
| HW6 | 9 | 9 | 9 | 9 | 9 | 9 | 8 | 1200 | 0002 & 0003 & 0004 & 0006 & 0007 & 0008 & 0009 & 00012 & 00013 | Tests 6 and 7 with increased load showed the same Kafka broker CPU utilization, so for the next step the broker instance type was changed to 2xlarge | 2xl | kafka.m7g.xlarge | |
| HW7 | 9 | 9 | 9 | 9 | 9 | 9 | 8 | 1200 | same as HW6 | | 4xl | kafka.m7g.xlarge | |
| HW8 | 9 | 9 | 9 | 9 | 9 | 9 | 8 | 1200 | same as HW6 | | 2xl | kafka.m7g.2xlarge | |
| HW9 | 9 | 9 | 9 | 9 | 9 | 9 | 8 | 1200 | same as HW6 | | 4xl | kafka.m7g.2xlarge | |
| HW10 | 9 | 9 | 9 | 9 | 9 | 9 | 8 | 1200 | same as HW6 | | 4xl | kafka.m7g.2xlarge | + x4 loadLimit & +30% shared buffers |
| HW11 | 9 | 9 | 9 | 9 | 9 | 9 | 8 | 1200 | same as HW6 | | 4xl | kafka.m7g.2xlarge | + x4 loadLimit & +30% shared buffers and 8 tasks for mod-inventory, mod-inventory-storage, mod-srs, mod-srm |

Test setup Data Import #2
Data Import (50K create imports) in parallel on 9 tenants
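The parallel-tenant setup above can be sketched as a simple test harness. This is a hypothetical illustration only: the tenant names mirror those used in this report, and `run_data_import` is a placeholder, not a real Data Import client.

```python
# Illustrative only: how "50K create imports in parallel on 9 tenants" can be
# driven from a test harness. run_data_import() is a placeholder for whatever
# client actually starts a Data Import job and waits for its completion.
import time
from concurrent.futures import ThreadPoolExecutor

TENANTS = ["cs0002", "cs0003", "cs0004", "cs0006", "cs0007",
           "cs0008", "cs0009", "cs00012", "cs00013"]

def run_data_import(tenant: str) -> float:
    """Placeholder: start a 50K-record import for `tenant`, return minutes taken."""
    start = time.monotonic()
    # ... call the real Data Import client here ...
    return (time.monotonic() - start) / 60

# One worker per tenant, so all nine imports run concurrently,
# mirroring the DI-9P test setup.
with ThreadPoolExecutor(max_workers=len(TENANTS)) as pool:
    durations = dict(zip(TENANTS, pool.map(run_data_import, TENANTS)))

print(durations)
```

Collecting per-tenant durations this way is how outliers such as the cs0006 runs noted above can be spotted.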

 

Results

Heavy Workflows

This table contains testing results for “Test setup Heavy Workflows #1”.

| Duration | HW1 | HW2 | HW3 | HW4 | HW5 | HW6 | HW7 | HW8 | HW9 | HW10 | HW11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Data Import (50K create imports) | 10 min 20 sec | 19 min 41 sec | 22 min | 38 min | 43 min | 1 h 31 min | 1:01–2:02 per tenant (most 1:29(*)–2:02) | 98 min | 73 min | 77 min | Didn't start |
| Data Export 100K, Custom profile | 358 sec | 398 sec | 870 sec | 1255 sec | 1710 sec | Failed to start | 1532 sec | 1657 sec | 1265 sec | 1330 sec | 1451 sec |
| Data Export 100K, Default profile | 260 sec | 193 sec | 420 sec | Failed to start | 803 sec | 1103 sec | 712 sec | 774 sec | 562 sec | 738 sec | 698 sec |
| OAI-PMH (incremental harvesting) | 0.44 sec | 0.398 sec | 0.53 sec | 1.5 sec | 1.2 sec (failed at 50% on the tenant) | 1.6 sec | 1.8 sec | 1 sec | Failed to start | 1.7 sec | 1.65 sec |
| Bulk Edit upload (holdings) | 80 sec | 56 sec | 71 sec | 83 sec | 66 sec | 97 sec | 96 sec | 110 sec | | | |