PTF - Performance testing of CPU=0 and tasks placement strategy for Services (QCP1)
Overview
- In this report, PTF investigates the impact of setting CPU=0 and disabling the distinctInstance task placement strategy on FOLIO’s performance, with and without New Relic/OpenTelemetry enabled. Previous observations suggested that this configuration negatively affects system performance. The goal of these tests was to reproduce the issue and evaluate whether performance could be improved by adjusting CPU allocations and enabling distinctInstance placement. Through a series of experiments, we analyzed how these settings influence system behavior, ensuring that CPU allocation and task distribution strategies are optimized for stability and efficiency.
- PERF-1071
Summary
- The performance tests conducted in this report indicate that neither the distinctInstance placement strategy nor the use of New Relic/OpenTelemetry had a significant impact on system performance. Across all tests, performance variations remained within a 5% margin. The slight difference observed in Test №3 was attributed to a lower instance count rather than the placement strategy itself. Further tests (№4 and №5) confirmed these findings, showing consistent results regardless of the number of instances used. Additionally, disabling New Relic did not affect performance, suggesting that monitoring overhead was negligible. Overall, the experiments demonstrate that the tested configurations do not introduce meaningful performance differences.
- Increasing the number of virtual users in our test appears to have caused contention in the database, specifically a high volume of concurrent updates by mod-login to the same row in the auth_attempts table. The query UPDATE fs09000000_mod_login.auth_attempts SET jsonb = $1::jsonb WHERE id='9883ca16-ef27-41f7-81d7-6693b79cddad' suggests that multiple sessions are attempting to modify the same record simultaneously, leading to row-level locking. As a result, transactions wait for the lock to be released, potentially causing performance degradation or request timeouts.
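One way to confirm this kind of row-level lock waiting while a test is running is to ask PostgreSQL which sessions are currently blocked and by whom. The sketch below is not part of the original test tooling. It assumes PostgreSQL 9.6 or later (for pg_blocking_pids) and any DB-API style cursor, for example one from psycopg2. The helper name fetch_blocked_sessions is hypothetical.

```python
# Diagnostic sketch (assumption: PostgreSQL 9.6+; not taken from the PTF tooling).
# Lists sessions that are waiting on a lock held by another session.
BLOCKED_SESSIONS_SQL = """
SELECT a.pid,
       pg_blocking_pids(a.pid)   AS blocked_by,
       a.wait_event_type,
       a.wait_event,
       now() - a.query_start     AS waiting_for,
       left(a.query, 120)        AS query
FROM pg_stat_activity AS a
WHERE cardinality(pg_blocking_pids(a.pid)) > 0
ORDER BY waiting_for DESC;
"""


def fetch_blocked_sessions(cursor):
    """Run the query on a DB-API cursor and return one dict per blocked session."""
    cursor.execute(BLOCKED_SESSIONS_SQL)
    columns = [column[0] for column in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]
```

If many rows come back with the auth_attempts UPDATE in the query column, that would be consistent with the contention described above.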
Test Runs
Test # | Description | Status |
---|---|---|
Test 1 | New Relic/OpenTelemetry enabled, OTEL value for mod-inventory set TRUE, distinctInstance placement strategy turned ON, CPU values set for list of modules | Completed |
Test 2 | New Relic/OpenTelemetry enabled, OTEL value for mod-inventory set TRUE, distinctInstance placement strategy turned OFF, CPU values set for list of modules | Completed |
Test 3 | New Relic/OpenTelemetry enabled, OTEL value for mod-inventory set TRUE, distinctInstance placement strategy turned OFF, CPU values set 0 for all services | Completed |
Test 4 | New Relic/OpenTelemetry enabled, OTEL value for mod-inventory and mod-circulation set TRUE, distinctInstance placement strategy turned OFF, CPU values set 0 for all services | Completed |
Test 5 | New Relic/OpenTelemetry enabled, OTEL value for mod-inventory and mod-circulation set TRUE, distinctInstance placement strategy turned OFF, CPU values set for list of modules | Completed |
Test 6 | New Relic/OpenTelemetry disabled, distinctInstance placement strategy turned OFF, CPU values set 0 for all services | Completed |
Test 7 | New Relic/OpenTelemetry disabled, distinctInstance placement strategy turned OFF, CPU values set for list of modules | Completed |
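For readers less familiar with ECS, the two settings toggled in the table above can be pictured roughly as follows. This is an illustrative sketch only: the function name service_settings and the containerDefinition grouping key are hypothetical, and the report does not say how the settings were actually applied. The distinctInstance constraint type and the per-container cpu field are the standard ECS names.

```python
def service_settings(distinct_instance: bool, cpu_units: int) -> dict:
    """Return illustrative ECS-style settings for one module under test.

    distinct_instance: True adds the distinctInstance placement constraint,
                       so each task of the service runs on a different instance.
    cpu_units:         CPU units reserved for the container (0 = no reservation).
    """
    constraints = [{"type": "distinctInstance"}] if distinct_instance else []
    return {
        "placementConstraints": constraints,
        # Hypothetical grouping key; in ECS the cpu field lives in the
        # container definition inside the task definition.
        "containerDefinition": {"cpu": cpu_units},
    }


# Test 1 style: distinctInstance ON, explicit CPU value (mod-inventory: 3072)
test1_mod_inventory = service_settings(distinct_instance=True, cpu_units=3072)

# Test 3 style: distinctInstance OFF, CPU set to 0
test3_mod_inventory = service_settings(distinct_instance=False, cpu_units=0)
```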
Test Results
This table contains the average response time, in ms, for the Check-In/Check-Out and RTAC tests.
Requests | Test №1 | Test №2 | Test №3 | Test №4 | Test №5 | Test №6 | Test №7 |
---|---|---|---|---|---|---|---|
Check-In Controller | 995 | 965 | 1057 | 886 | 922 | 929 | 917 |
Check-Out Controller | 1518 | 1502 | 1637 | 1352 | 1392 | 1431 | 1447 |
RTAC | 1174 | 1136 | 1258 | 1087 | 1069 | 1085 | 1091 |
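To compare any two runs from the table, the relative difference can be computed directly from the averages. The sketch below copies the numbers from the table above. The helper pct_diff and the choice of pairs (Test 2 vs Test 1, Test 4 vs Test 5, Test 6 vs Test 7) are illustrative assumptions, not part of the original analysis.

```python
# Average response times in ms, copied from the results table (Tests 1..7).
RESPONSE_MS = {
    "Check-In Controller": [995, 965, 1057, 886, 922, 929, 917],
    "Check-Out Controller": [1518, 1502, 1637, 1352, 1392, 1431, 1447],
    "RTAC": [1174, 1136, 1258, 1087, 1069, 1085, 1091],
}


def pct_diff(request: str, test: int, reference: int) -> float:
    """Percentage difference of `test` relative to `reference` (1-based test numbers)."""
    value = RESPONSE_MS[request][test - 1]
    base = RESPONSE_MS[request][reference - 1]
    return round((value - base) / base * 100, 1)


# Illustrative pairs: each compares two runs that differ in one main setting.
PAIRS = [(2, 1), (4, 5), (6, 7)]

for request in RESPONSE_MS:
    for test, reference in PAIRS:
        diff = pct_diff(request, test, reference)
        print(f"{request}: Test {test} vs Test {reference}: {diff:+.1f}%")
```

Note that some pairs also differ in instance count (5 versus 6), so a difference between them cannot be attributed to a single setting.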
Test №1-2-3
Test №1: Modules are assigned specific CPU values, and distinctInstance placement strategy is ON.
Goal: Establish baseline performance metrics for comparison with subsequent configurations.
Test №2: Modules are assigned specific CPU values, and distinctInstance placement strategy is OFF.
Goal: Evaluate whether disabling distinctInstance changes performance compared to the baseline (Test №1).
Test №3: Modules have CPU = 0, and distinctInstance placement strategy is OFF.
Goal: Assess whether setting CPU = 0 changes performance compared to the previous tests.
Results: Performance remained nearly the same across all tests, with differences of less than 5%. Test №3 showed a slight variance due to having 5 instances instead of 6, as used in Tests №1 and №2, indicating no negative impact from disabling the distinctInstance placement strategy.
Service CPU Utilization
In Test №1 (CPU=VALUE), the mod-rtac module reached 112% CPU utilization.
In Test №2 (CPU=VALUE), the mod-rtac module reached 109% CPU utilization.
In Test №3 (CPU=0), the mod-nginx-okapi and okapi modules used 12% of instance CPU.
Service Memory Utilization
Here we can see that all modules show a stable trend.
Kafka metrics
OpenSearch Data Nodes metrics
DB CPU Utilization
Average DB CPU utilization was 85%.
DB Connections
Max number of DB connections was 950.
DB load
Top SQL-queries
Test №4-5
Goal: Repeat Test №2 and Test №3 to validate previous results.
Results: The results remained the same even though Test №4 ran with 5 instances and Test №5 with 6 instances, confirming no negative impact from disabling the distinctInstance placement strategy.
Service CPU Utilization
In Test №4 (CPU=0), the mod-nginx-okapi and okapi modules used 10% of instance CPU.
In Test №5 (CPU=VALUE), the mod-rtac module reached 131% CPU utilization.
Service Memory Utilization
No module shows any sign of a memory leak. Memory usage shows a stable trend.
Kafka metrics
OpenSearch Data Nodes metrics
DB CPU Utilization
DB CPU utilization was 85%.
DB Connections
Max number of DB connections was 1600.
DB load
Top SQL-queries
Test №6-7
Goal: Disable New Relic and repeat Test №4 and Test №5 to observe any effects.
Results: The results remained the same compared to all previous tests, showing no impact from using or not using New Relic.
Service CPU Utilization
In Test №6 (CPU=0), the mod-nginx-okapi and okapi modules used 12% of instance CPU.
In Test №7 (CPU=VALUE), the mod-rtac module reached 101% CPU utilization.
Service Memory Utilization
No module shows any sign of a memory leak. Memory usage shows a stable trend.
Kafka metrics
OpenSearch Data Nodes metrics
DB CPU Utilization
Maximum DB CPU utilization was 85%.
DB Connections
Max number of DB connections was 1050.
DB load
Top SQL-queries
Appendix
Infrastructure
PTF - QCP1 environment configuration (was changed during testing)
- 5-6 r7g.2xlarge EC2 instances located in US East (N. Virginia), us-east-1
- 1 database instance, writer
Name | Memory GiB | vCPUs |
---|---|---|
db.r6g.xlarge | 32 GB | 4 vCPUs |
- OpenSearch ptf-test
  - Data nodes
    - Instance type - r6g.2xlarge.search
    - Number of nodes - 4
    - Version: OpenSearch_2_7_R20240502
  - Dedicated master nodes
    - Instance type - r6g.large.search
    - Number of nodes - 3
- MSK fse-tenant
  - 2 brokers, kafka.m7g.xlarge brokers in 2 zones
  - Apache Kafka version 3.7.x
  - EBS storage volume per broker 300 GiB
  - auto.create.topics.enable=true
  - log.retention.minutes=480
  - default.replication.factor=3
Methodology/Approach
CICO test scenarios were run for 100 users, concurrently with RTAC for 10 users, using a JMeter script from the load generator.
QCP1 configuration and steps to configure ECS infrastructure:
Instance type r7g.2xlarge
When New Relic/OpenTelemetry is enabled, the OTEL value in mod-inventory's task definition is set to true (OTEL_INSTRUMENTATION_RUNTIME_TELEMETRY_ENABLED=true)
- The mod-oa-b service was turned off for all tests
- During testing, CPU values were applied to the following list of modules:
- mod-inventory: 3072
- mod-inventory-storage: 2048
- mod-circulation: 1536
- mod-circulation-storage: 1536
- mod-feesfines: 256
- mod-orders: 1024
- mod-orders-storage: 512
- mod-login: 1024
- mod-source-record-storage: 2048
- mod-rtac: 128
- mod-patron: 128
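The values above are ECS CPU units, where 1024 units correspond to 1 vCPU. The following sketch converts the reservations to vCPUs and shows how a utilization figure above 100% can be read, assuming the service CPU utilization is reported relative to the reserved units. That assumption, and the helper names reserved_vcpu and used_vcpu, are not stated in the report.

```python
CPU_UNITS_PER_VCPU = 1024  # ECS convention: 1024 CPU units = 1 vCPU

# CPU units per task, copied from the module list above.
RESERVED_UNITS = {
    "mod-inventory": 3072,
    "mod-inventory-storage": 2048,
    "mod-circulation": 1536,
    "mod-circulation-storage": 1536,
    "mod-feesfines": 256,
    "mod-orders": 1024,
    "mod-orders-storage": 512,
    "mod-login": 1024,
    "mod-source-record-storage": 2048,
    "mod-rtac": 128,
    "mod-patron": 128,
}


def reserved_vcpu(module: str) -> float:
    """vCPUs reserved for one task of the given module."""
    return RESERVED_UNITS[module] / CPU_UNITS_PER_VCPU


def used_vcpu(module: str, utilization_pct: float) -> float:
    """vCPUs actually used, assuming utilization is relative to the reservation."""
    return reserved_vcpu(module) * utilization_pct / 100


for module, units in RESERVED_UNITS.items():
    print(f"{module}: {units} units = {reserved_vcpu(module):.3f} vCPU per task")

# Example: mod-rtac reserves 128 units, so 112% utilization is a small absolute amount.
print(f"mod-rtac at 112%: {used_vcpu('mod-rtac', 112):.2f} vCPU")
```

Under that assumption, a module with a small reservation such as mod-rtac can exceed 100% while using only a fraction of one vCPU.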
Test 1:
The QCP1 environment was configured with 6 Instances, New Relic/OpenTelemetry enabled, OTEL value for mod-inventory set TRUE, distinctInstance placement strategy turned ON, CPU values set for list of modules, 4 tasks assigned to mod-inventory, mod-inventory-storage, mod-circulation, and mod-circulation-storage, with Check-In/Check-Out tests running 100 users and RTAC tests running 10 users concurrently on a main tenant for 35 minutes.
Test 2:
The QCP1 environment was configured with 6 Instances, New Relic/OpenTelemetry enabled, OTEL value for mod-inventory set TRUE, distinctInstance placement strategy turned OFF, CPU values set for list of modules, 4 tasks assigned to mod-inventory, mod-inventory-storage, mod-circulation, and mod-circulation-storage, with Check-In/Check-Out tests running 100 users and RTAC tests running 10 users concurrently on a main tenant for 35 minutes.
Test 3:
The QCP1 environment was configured with 5 Instances, New Relic/OpenTelemetry enabled, OTEL value for mod-inventory set TRUE, distinctInstance placement strategy turned OFF, CPU values set 0 for all services, 4 tasks assigned to mod-inventory, mod-inventory-storage, mod-circulation, and mod-circulation-storage, with Check-In/Check-Out tests running 100 users and RTAC tests running 10 users concurrently on a main tenant for 35 minutes.
Test 4:
The QCP1 environment was configured with 5 Instances, New Relic/OpenTelemetry enabled, OTEL value for mod-inventory and mod-circulation set TRUE, distinctInstance placement strategy turned OFF, CPU values set 0 for all services, 4 tasks assigned to mod-inventory, mod-inventory-storage, mod-circulation, and mod-circulation-storage, with Check-In/Check-Out tests running 100 users and RTAC tests running 10 users concurrently on a main tenant for 35 minutes.
Test 5:
The QCP1 environment was configured with 6 Instances, New Relic/OpenTelemetry enabled, OTEL value for mod-inventory and mod-circulation set TRUE, distinctInstance placement strategy turned OFF, CPU values set for list of modules, 4 tasks assigned to mod-inventory, mod-inventory-storage, mod-circulation, and mod-circulation-storage, with Check-In/Check-Out tests running 100 users and RTAC tests running 10 users concurrently on a main tenant for 35 minutes.
Test 6:
The QCP1 environment was configured with 5 Instances, New Relic/OpenTelemetry disabled, distinctInstance placement strategy turned OFF, CPU values set 0, 4 tasks assigned to mod-inventory, mod-inventory-storage, mod-circulation, and mod-circulation-storage, with Check-In/Check-Out tests running 100 users and RTAC tests running 10 users concurrently on a main tenant for 35 minutes.
Test 7:
The QCP1 environment was configured with 6 Instances, New Relic/OpenTelemetry disabled, distinctInstance placement strategy turned OFF, CPU values set for list of modules, 4 tasks assigned to mod-inventory, mod-inventory-storage, mod-circulation, and mod-circulation-storage, with Check-In/Check-Out tests running 100 users and RTAC tests running 10 users concurrently on a main tenant for 35 minutes.
Test artifacts: