PTF - Performance testing of CPU=0 and tasks placement strategy for Services (QCP1)

Overview

  • In this report, PTF investigates the impact of setting CPU=0 and disabling the distinctInstance task placement strategy on FOLIO’s performance, with and without New Relic/OpenTelemetry enabled. Previous observations suggested that this configuration negatively affects system performance. The goal of these tests was to reproduce the issue and evaluate whether performance could be improved by adjusting CPU allocations and enabling distinctInstance placement. Through a series of experiments, we analyzed how these settings influence system behavior, ensuring that CPU allocation and task distribution strategies are optimized for stability and efficiency.

PERF-1071: Test with various scenarios for CPU and task placement strategy (Closed)

Summary

  • The performance tests conducted in this report indicate that neither the distinctInstance placement strategy nor the use of New Relic/OpenTelemetry had a significant impact on system performance. Across all tests, performance variations remained within a 5% margin. The slight difference observed in Test №3 was attributed to a lower instance count rather than the placement strategy itself. Further tests (№4 and №5) confirmed these findings, showing consistent results regardless of the number of instances used. Additionally, disabling New Relic did not affect performance, suggesting that monitoring overhead was negligible. Overall, the experiments demonstrate that the tested configurations do not introduce meaningful performance differences.

  • Increasing the number of virtual users in our test appears to have caused contention in the database, specifically a high volume of concurrent updates by mod-login to the same row in the auth_attempts table. The query:

    UPDATE fs09000000_mod_login.auth_attempts SET jsonb = $1::jsonb WHERE id='9883ca16-ef27-41f7-81d7-6693b79cddad'

    suggests that multiple sessions are attempting to modify the same record simultaneously, which leads to row-level locking. As a result, transactions wait for the lock to be released, potentially causing performance degradation or request timeouts.
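The way waits accumulate when every session needs the same row lock can be illustrated with a small queueing sketch. It does not model mod-login or PostgreSQL internals; the function name `lock_wait_times`, the arrival times, and the constant hold time are hypothetical and chosen only to show the serialization effect.

```python
def lock_wait_times(arrivals_ms, hold_ms):
    """Return how long each transaction waits for a single row lock.

    arrivals_ms: sorted arrival times of UPDATE statements targeting one row.
    hold_ms: time each transaction holds the row lock (hypothetical constant).
    """
    waits = []
    lock_free_at = 0  # time at which the row lock is next available
    for arrival in arrivals_ms:
        # A transaction starts immediately only if no other session holds the lock.
        start = max(arrival, lock_free_at)
        waits.append(start - arrival)
        lock_free_at = start + hold_ms
    return waits


# Hypothetical load: ten sessions arrive 1 ms apart, each holding the lock for 5 ms.
same_row = lock_wait_times(list(range(10)), hold_ms=5)
print(same_row)  # [0, 4, 8, 12, 16, 20, 24, 28, 32, 36]

# When arrivals are spaced wider than the hold time, nobody waits.
print(lock_wait_times([0, 10, 20], hold_ms=5))  # [0, 0, 0]
```

In this toy model each additional concurrent session waits longer than the previous one, which is the pattern that adding virtual users would be expected to produce on a single hot row.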

Test Runs

| Test # | Description | Status |
|--------|-------------|--------|
| Test 1 | New Relic/OpenTelemetry enabled, OTEL value for mod-inventory set TRUE, distinctInstance placement strategy turned ON, CPU values set for list of modules | Completed |
| Test 2 | New Relic/OpenTelemetry enabled, OTEL value for mod-inventory set TRUE, distinctInstance placement strategy turned OFF, CPU values set for list of modules | Completed |
| Test 3 | New Relic/OpenTelemetry enabled, OTEL value for mod-inventory set TRUE, distinctInstance placement strategy turned OFF, CPU values set 0 for all services | Completed |
| Test 4 | New Relic/OpenTelemetry enabled, OTEL value for mod-inventory and mod-circulation set TRUE, distinctInstance placement strategy turned OFF, CPU values set 0 for all services | Completed |
| Test 5 | New Relic/OpenTelemetry enabled, OTEL value for mod-inventory and mod-circulation set TRUE, distinctInstance placement strategy turned OFF, CPU values set for list of modules | Completed |
| Test 6 | New Relic/OpenTelemetry disabled, distinctInstance placement strategy turned OFF, CPU values set 0 for all services | Completed |
| Test 7 | New Relic/OpenTelemetry disabled, distinctInstance placement strategy turned OFF, CPU values set for list of modules | Completed |
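The two settings varied across these runs map onto ECS concepts roughly as sketched below. This assumes the services run on Amazon ECS with the EC2 launch type, which the task definition revisions and CPU units under Cluster Resources suggest but the report does not state outright. The helper `build_service_settings` and the value 1024 are hypothetical; the dictionaries only mirror the shape of the container-level `cpu` field and the service-level `placementConstraints` parameter and do not call AWS. Note that ECS exposes `distinctInstance` as a placement constraint type.

```python
def build_service_settings(module, cpu_units, distinct_instance):
    """Build the ECS fragments that differ between test runs (hypothetical helper)."""
    container_definition = {
        "name": module,
        "cpu": cpu_units,  # 0 means no CPU units are reserved for the container
    }
    # distinctInstance places each task of the service on a different instance.
    placement_constraints = (
        [{"type": "distinctInstance"}] if distinct_instance else []
    )
    return {
        "task_definition": {"containerDefinitions": [container_definition]},
        "service": {"placementConstraints": placement_constraints},
    }


# Test 1 style: CPU value set (1024 is illustrative only), distinctInstance ON.
test1 = build_service_settings("mod-inventory", cpu_units=1024, distinct_instance=True)
# Test 3 style: CPU=0, distinctInstance OFF.
test3 = build_service_settings("mod-inventory", cpu_units=0, distinct_instance=False)
print(test1)
print(test3)
```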

Test Results

This table contains average response times (ms) for the Check-In/Check-Out and RTAC tests.

| Requests (average response time, ms) | Test №1 | Test №2 | Test №3 | Test №4 | Test №5 | Test №6 | Test №7 |
|--------------------------------------|---------|---------|---------|---------|---------|---------|---------|
| Check-In Controller                  | 995     | 965     | 1057    | 886     | 922     | 929     | 917     |
| Check-Out Controller                 | 1518    | 1502    | 1637    | 1352    | 1392    | 1431    | 1447    |
| RTAC                                 | 1174    | 1136    | 1258    | 1087    | 1069    | 1085    | 1091    |
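To make the comparison between runs concrete, the averages reported above can be turned into relative differences. The sketch below uses only the figures from this table; the helper name `pct_diff` and the choice of which runs to pair (each CPU=0 run against the CPU=VALUE run it is compared with) are assumptions made for illustration.

```python
# Average response times (ms) from the results table, Tests 1-7 in order.
results = {
    "Check-In Controller": [995, 965, 1057, 886, 922, 929, 917],
    "Check-Out Controller": [1518, 1502, 1637, 1352, 1392, 1431, 1447],
    "RTAC": [1174, 1136, 1258, 1087, 1069, 1085, 1091],
}


def pct_diff(baseline, value):
    """Relative difference of value against baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# (CPU=0 run, CPU=VALUE run) pairs; the pairing is an illustrative assumption.
pairs = [(3, 2), (4, 5), (6, 7)]
for request, times in results.items():
    for a, b in pairs:
        diff = pct_diff(times[a - 1], times[b - 1])
        print(f"{request}: Test {a} -> Test {b}: {diff:+.1f}%")
```

Changing `pairs` allows any two runs to be compared, for example Test №1 against Test №2 to isolate the placement setting.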

 

 

Test №1-2-3

Test №1: Modules are assigned specific CPU values, and the distinctInstance placement strategy is ON.
Goal: Establish baseline performance metrics for comparison with subsequent configurations.

Test №2: Modules are assigned specific CPU values, with the distinctInstance placement strategy OFF.
Goal: Evaluate whether disabling distinctInstance changes performance compared to the baseline (Test №1).

Test №3: Modules have CPU = 0, with the distinctInstance placement strategy still OFF.
Goal: Assess whether setting CPU = 0 changes performance compared to the previous tests.

Results: Performance remained nearly the same across the tests, with differences of less than 5%. Test №3 showed a slight variance, attributed to running on 5 instances instead of the 6 used in Tests №1 and №2. This indicates no negative impact from disabling the distinctInstance placement strategy.

Service CPU Utilization

Test №1 (CPU=VALUE): the mod-rtac module reached 112% CPU utilization.

Test №2 (CPU=VALUE): the mod-rtac module reached 109% CPU utilization.

Test №3 (CPU=0): the mod-nginx-okapi and okapi modules used 12% of the instances' CPU.
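Values above 100% are possible when CPU units are set because, as the ECS CloudWatch metric is documented, service CPU utilization is measured against the CPU units reserved in the task definition, so tasks that burst beyond their reservation report more than 100%. A minimal sketch of that ratio follows; the function name and the unit counts are hypothetical and are not taken from these tests.

```python
def service_cpu_utilization(used_units, reserved_units_per_task, task_count):
    """Percent of reserved CPU units in use across a service's tasks."""
    reserved_total = reserved_units_per_task * task_count
    if reserved_total == 0:
        # This sketch does not model the CPU=0 case, where nothing is reserved.
        raise ValueError("no CPU units reserved; the ratio is undefined")
    return used_units / reserved_total * 100.0


# Hypothetical: 2 tasks reserving 512 units each, using 1147 units in total.
print(round(service_cpu_utilization(1147, 512, 2)))  # 112
```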

Service Memory Utilization

Memory utilization shows a stable trend for all modules.

 

Kafka metrics

OpenSearch Data Nodes metrics

DB CPU Utilization

Average DB CPU utilization was 85%.

DB Connections

Max number of DB connections was 950.

DB load

Top SQL-queries

Test №4-5

Goal: Repeat Test №2 and Test №3 to validate previous results.

Results: The results remained the same even though Test №4 ran with 5 instances and Test №5 with 6 instances, confirming that disabling the distinctInstance placement strategy has no negative impact.

Service CPU Utilization

Test №4 (CPU=0): the mod-nginx-okapi and okapi modules used 10% of the instances' CPU.

Test №5 (CPU=VALUE): the mod-rtac module reached 131% CPU utilization.

Service Memory Utilization

No module shows any sign of a memory leak. Memory utilization shows a stable trend.

 

Kafka metrics

 

OpenSearch Data Nodes metrics

DB CPU Utilization

DB CPU utilization was 85%.

DB Connections

Max number of DB connections was 1600.

DB load

Top SQL-queries

 

Test №6-7

Goal: Disable New Relic and repeat Test №4 and Test №5 to observe any effects.

Results: The results remained the same as in all previous tests, showing that enabling or disabling New Relic has no impact.

Service CPU Utilization

Test №6 (CPU=0): the mod-nginx-okapi and okapi modules used 12% of the instances' CPU.

Test №7 (CPU=VALUE): the mod-rtac module reached 101% CPU utilization.

Service Memory Utilization

No module shows any sign of a memory leak. Memory utilization shows a stable trend.

 

Kafka metrics

OpenSearch Data Nodes metrics

DB CPU Utilization

Maximum DB CPU utilization was 85%.

DB Connections

Max number of DB connections was 1050.

DB load

Top SQL-queries

Appendix

Infrastructure

PTF - QCP1 environment configuration (changed during testing)

  • 5-6 r7g.2xlarge EC2 instances located in US East (N. Virginia), us-east-1

  • 1 database instance, writer

  • OpenSearch ptf-test

    • Data nodes

      • Instance type - r6g.2xlarge.search

      • Number of nodes - 4

      • Version: OpenSearch_2_7_R20240502

    • Dedicated master nodes

      • Instance type - r6g.large.search

      • Number of nodes - 3

  • MSK fse-tenant

    • brokers, kafka.m7g.xlarge brokers in 2 zones

    • Apache Kafka version 3.7.x 

    • EBS storage volume per broker 300 GiB

    • auto.create.topics.enable=true

    • log.retention.minutes=480

    • default.replication.factor=3

 

Cluster Resources

qcp1-pvt (Thu Feb 20 09:58:24 UTC 2025)

| Module | Task Definition Revision | Module Version | Task Count | Mem Hard Limit | Mem Soft Limit | CPU Units | Xmx | Metaspace Size | Max Metaspace Size |
|--------|--------------------------|----------------|------------|----------------|----------------|-----------|-----|----------------|--------------------|
| mod-remote-storage | 13 | 5mod-remote-storage:3.2.0 | 2 | 4920 | 4472 | 0 | 3960 | 512 | 512 |
| mod-finance-storage | 13 | 5mod-finance-storage:8.6.1 | 2 | 1024 | 896 | 0 | 700 | 88 | 128 |
| mod-sudoc | 5 | 5mod-sudoc:1.0 | 2 | 1024 | 896 | 0 | 768 | 88 | 128 |
| mod-ebsconet | 13 | 5mod-ebsconet:2.2.0 | 2 | 1248 | 1024 | 0 | 700 | 128 | 256 |
| edge-sip2 | 11 | 5edge-sip2:3.2.7 | 2 | 1024 | 896 | 0 | 768 | 88 | 128 |
| mod-tags | 13 | 5mod-tags:2.2.0 | 2 | 1024 | 896 | 0 | 768 | 88 | 128 |
| edge-courses | 13 | 5edge-courses:1.4.5 | 2 | 1024 | 896 | 0 | 768 | 88 | 128 |

mod-authtoken

18

5mod-authtoken:2.15.2

2

1440

1152