PTF - Performance testing of CPU=0 and tasks placement strategy for Services (QCP1)
Overview
In this report, PTF investigates the impact of setting CPU=0 and disabling the distinctInstance task placement strategy on FOLIO’s performance, with and without New Relic/OpenTelemetry enabled. Previous observations suggested that this configuration negatively affects system performance. The goal of these tests was to reproduce the issue and evaluate whether performance could be improved by adjusting CPU allocations and enabling distinctInstance placement. Through a series of experiments, we analyzed how these settings influence system behavior, ensuring that CPU allocation and task distribution strategies are optimized for stability and efficiency.
PERF-1071: Test with various scenarios for CPU and task placement strategy (Closed)
Summary
The performance tests conducted in this report indicate that neither the distinctInstance placement strategy nor the use of New Relic/OpenTelemetry had a significant impact on system performance. Across all tests, performance variations remained within a 5% margin. The slight difference observed in Test №3 was attributed to a lower instance count rather than the placement strategy itself. Further tests (№4 and №5) confirmed these findings, showing consistent results regardless of the number of instances used. Additionally, disabling New Relic did not affect performance, suggesting that monitoring overhead was negligible. Overall, the experiments demonstrate that the tested configurations do not introduce meaningful performance differences.
It appears that increasing the number of virtual users in our test led to contention in the database, specifically due to a high volume of concurrent updates on the same row in the auth_attempts table by mod-login. The query UPDATE fs09000000_mod_login.auth_attempts SET jsonb = $1::jsonb WHERE id='9883ca16-ef27-41f7-81d7-6693b79cddad' suggests that multiple sessions are attempting to modify the same record simultaneously, leading to row-level locking. As a result, transactions are waiting in the background for the lock to be released, potentially causing performance degradation or request timeouts.
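A back-of-the-envelope model shows why concurrent updates to a single row serialize: if every UPDATE holds the row lock for a fixed time, each queued session waits for all sessions ahead of it. The sketch below is a simplification, assuming a first-in-first-out queue and a constant hold time; the function names, the 5 ms hold time, and the session counts are hypothetical and are not measurements from these tests.

```python
def lock_wait_times(sessions, hold_ms):
    """Wait time of each session queued on one row lock (FIFO, fixed hold time)."""
    # The session at queue position p waits for the p sessions ahead of it.
    return [position * hold_ms for position in range(sessions)]


def summarize(sessions, hold_ms):
    """Maximum and average wait for a given number of concurrent sessions."""
    waits = lock_wait_times(sessions, hold_ms)
    return {"max_wait_ms": max(waits), "avg_wait_ms": sum(waits) / len(waits)}


# Hypothetical hold time of 5 ms per UPDATE; wait grows linearly with concurrency.
for sessions in (1, 10, 50):
    print(sessions, summarize(sessions, hold_ms=5))
```

In this model the worst-case wait grows linearly with the number of sessions that target the same row, which is consistent with contention appearing only once the number of virtual users was increased.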
Test Runs
| Test # | Description | Status |
|---|---|---|
| Test 1 | New Relic/OpenTelemetry enabled, OTEL value for mod-inventory set TRUE, distinctInstance placement strategy turned ON, CPU values set for list of modules | Completed |
| Test 2 | New Relic/OpenTelemetry enabled, OTEL value for mod-inventory set TRUE, distinctInstance placement strategy turned OFF, CPU values set for list of modules | Completed |
| Test 3 | New Relic/OpenTelemetry enabled, OTEL value for mod-inventory set TRUE, distinctInstance placement strategy turned OFF, CPU values set 0 for all services | Completed |
| Test 4 | New Relic/OpenTelemetry enabled, OTEL value for mod-inventory and mod-circulation set TRUE, distinctInstance placement strategy turned OFF, CPU values set 0 for all services | Completed |
| Test 5 | New Relic/OpenTelemetry enabled, OTEL value for mod-inventory and mod-circulation set TRUE, distinctInstance placement strategy turned OFF, CPU values set for list of modules | Completed |
| Test 6 | New Relic/OpenTelemetry disabled, distinctInstance placement strategy turned OFF, CPU values set 0 for all services | Completed |
| Test 7 | New Relic/OpenTelemetry disabled, distinctInstance placement strategy turned OFF, CPU values set for list of modules | Completed |
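The two switches varied across these runs can be pictured as configuration fragments. The sketch below assumes the services run on Amazon ECS, where distinctInstance is a placement constraint on the service and cpu is a per-container reservation in CPU units in the task definition. The report does not show the actual definitions, so the helper name, the choice of mod-rtac, and the value 128 are hypothetical.

```python
def build_service_settings(module, cpu_units, distinct_instance):
    """Return ECS-style config fragments for one module (illustrative only)."""
    container_definition = {
        "name": module,
        # cpu=0 means no CPU units are reserved for this container.
        "cpu": cpu_units,
    }
    # distinctInstance asks the scheduler to put each task on a different instance.
    placement_constraints = (
        [{"type": "distinctInstance"}] if distinct_instance else []
    )
    return {
        # In ECS this part belongs to the task definition...
        "containerDefinitions": [container_definition],
        # ...and this part belongs to the service.
        "placementConstraints": placement_constraints,
    }


# Test 1 style: CPU value set (128 is a made-up value), distinctInstance ON.
test1 = build_service_settings("mod-rtac", cpu_units=128, distinct_instance=True)
# Test 3 style: CPU=0, distinctInstance OFF.
test3 = build_service_settings("mod-rtac", cpu_units=0, distinct_instance=False)
print(test1)
print(test3)
```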
Test Results
This table contains the average response times for the Check-In/Check-Out and RTAC tests.
| Requests | Test №1 (Average) | Test №2 (Average) | Test №3 (Average) | Test №4 (Average) | Test №5 (Average) | Test №6 (Average) | Test №7 (Average) |
|---|---|---|---|---|---|---|---|
| Check-In Controller | 995 | 965 | 1057 | 886 | 922 | 929 | 917 |
| Check-Out Controller | 1518 | 1502 | 1637 | 1352 | 1392 | 1431 | 1447 |
| RTAC | 1174 | 1136 | 1258 | 1087 | 1069 | 1085 | 1091 |
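The relative differences between runs can be recomputed from the averages in the table. The sketch below is one way to do that; the numbers are copied from the table, while the function names and the choice of which tests to pair are assumptions made for illustration.

```python
# Average response times from the table, in test order №1 to №7.
averages = {
    "Check-In Controller": [995, 965, 1057, 886, 922, 929, 917],
    "Check-Out Controller": [1518, 1502, 1637, 1352, 1392, 1431, 1447],
    "RTAC": [1174, 1136, 1258, 1087, 1069, 1085, 1091],
}


def pct_change(baseline, value):
    """Percent change of value relative to baseline."""
    return (value - baseline) / baseline * 100


def compare(test_a, test_b):
    """Percent change per request type from test_a to test_b (1-based test numbers)."""
    return {
        request: round(pct_change(values[test_a - 1], values[test_b - 1]), 1)
        for request, values in averages.items()
    }


# Pairs of runs compared in the sections of this report.
for pair in [(1, 2), (2, 3), (4, 5), (6, 7)]:
    print(pair, compare(*pair))
```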
Test №1-2-3
Test №1: Modules are assigned specific CPU values, and the distinctInstance placement strategy is ON.
Goal: Establish baseline performance metrics for comparison with subsequent configurations.
Test №2: Modules are assigned specific CPU values, and the distinctInstance placement strategy is OFF.
Goal: Evaluate whether disabling distinctInstance changes performance compared to the baseline (Test №1).
Test №3: Modules have CPU = 0, and the distinctInstance placement strategy is OFF.
Goal: Assess whether setting CPU = 0 changes performance compared to the previous tests.
Results: Performance remained nearly the same between Tests №1 and №2, with differences of less than 5%. Test №3 was about 6% to 8% slower than Test №1, which is attributed to it running on 5 instances instead of the 6 used in Tests №1 and №2. This indicates no negative impact from disabling the distinctInstance placement strategy.
Service CPU Utilization
Test №1 with CPU=VALUE: the mod-rtac module used 112% CPU.
Test №2 with CPU=VALUE: the mod-rtac module used 109% CPU.
Test №3 with CPU=0: the mod-nginx-okapi and okapi modules used 12% of instance CPU.
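Values above 100% with CPU=VALUE and much smaller values with CPU=0 are plausible if the two percentages use different denominators. The sketch below assumes ECS-style accounting, where 1 vCPU equals 1024 CPU units, a service's utilization is measured against its reserved units, and a container may burst above its reservation. With no reservation, the natural reference is the whole instance. The reservation and usage figures are hypothetical; only the instance size (r7g.2xlarge, 8 vCPUs) comes from the environment description.

```python
VCPU_UNITS = 1024  # ECS counts 1 vCPU as 1024 CPU units


def utilization_pct(used_units, reference_units):
    """CPU utilization as a percentage of a reference capacity."""
    return used_units / reference_units * 100


reserved_units = 1024            # hypothetical reservation for one service
instance_units = 8 * VCPU_UNITS  # r7g.2xlarge has 8 vCPUs
used_units = 1147                # hypothetical measured usage

# Same usage, two denominators.
print(round(utilization_pct(used_units, reserved_units)))  # 112
print(round(utilization_pct(used_units, instance_units)))  # 14
```

Under these assumptions, percentages from CPU=VALUE runs and CPU=0 runs are not directly comparable, because they are measured against different capacities.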
Service Memory Utilization
Memory utilization shows a stable trend for all modules.
Kafka metrics
OpenSearch Data Nodes metrics
DB CPU Utilization
Average DB CPU utilization was 85%.
DB Connections
Max number of DB connections was 950.
DB load
Top SQL-queries
Test №4-5
Goal: Repeat Test №2 and Test №3 to validate previous results.
Results: The results remained the same even though Test №4 ran with 5 instances and Test №5 with 6 instances, confirming no negative impact from disabling the distinctInstance placement strategy.
Service CPU Utilization
Test №4 with CPU=0: the mod-nginx-okapi and okapi modules used 10% of instance CPU.
Test №5 with CPU=VALUE: the mod-rtac module used 131% CPU.
Service Memory Utilization
No module shows any sign of a memory leak. Memory utilization shows a stable trend.
Kafka metrics
OpenSearch Data Nodes metrics
DB CPU Utilization
DB CPU utilization was 85%.
DB Connections
Max number of DB connections was 1600.
DB load
Top SQL-queries
Test №6-7
Goal: Disable New Relic and repeat Test №4 and Test №5 to observe any effects.
Results: The results remained the same compared to all previous tests, showing no impact from using or not using New Relic.
Service CPU Utilization
Test №6 with CPU=0: the mod-nginx-okapi and okapi modules used 12% of instance CPU.
Test №7 with CPU=VALUE: the mod-rtac module used 101% CPU.
Service Memory Utilization
No module shows any sign of a memory leak. Memory utilization shows a stable trend.
Kafka metrics
OpenSearch Data Nodes metrics
DB CPU Utilization
Maximum DB CPU utilization was 85%.
DB Connections
Max number of DB connections was 1050.
DB load
Top SQL-queries
Appendix
Infrastructure
PTF - QCP1 environment configuration (was changed during testing)
5-6 r7g.2xlarge EC2 instances located in US East (N. Virginia) us-east-1
1 database instance, writer
OpenSearch ptf-test
Data nodes
Instance type - r6g.2xlarge.search
Number of nodes - 4
Version: OpenSearch_2_7_R20240502
Dedicated master nodes
Instance type - r6g.large.search
Number of nodes - 3
MSK fse-tenant
2 kafka.m7g.xlarge brokers in 2 zones
Apache Kafka version 3.7.x
EBS storage volume per broker 300 GiB
auto.create.topics.enable=true
log.retention.minutes=480
default.replication.factor=3
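For readers who want to work with the Kafka settings above programmatically, the sketch below parses them as key=value properties and converts the retention period to hours. The parser is a minimal illustration (the function name is hypothetical) and handles only simple key=value lines, not the full Java properties syntax.

```python
# Broker settings copied from the list above.
properties_text = """
auto.create.topics.enable=true
log.retention.minutes=480
default.replication.factor=3
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


config = parse_properties(properties_text)
retention_hours = int(config["log.retention.minutes"]) / 60
print(retention_hours)  # 8.0
```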