[TLS][Eureka] Sidecars: Establishing Connection and Disconnecting Consumes Too Many Resources
Description
CSP Request Details
CSP Rejection Details
Potential Workaround
Attachments
Martin Tran February 26, 2025 at 1:51 PM
Investigation complete, observations are shared in the ticket.
Leonid Kolesnykov February 19, 2025 at 9:02 AM (Edited)
Our observations after deploying the fix:
CPU usage of modules in idle state decreased significantly on the module side compared with the ticket description, and it has stabilized.
However, we still see a high rate of connections/disconnections on sidecars. For example: 2 actions per minute in idle state (rebftls/sidecar-mod-data-import) and 2600 per minute (rebftls/sidecar-mod-inventory-storage/) during CI/CO.
Resource utilization during CI/CO tests is higher than in previous reports (recp1), probably because of those reconnections.
Comparing CI/CO with 8 and 20 vUsers, response times (RT) degraded in the 20 vUsers test (Check-out RT almost doubled).
Observed spikes for some modules (200%-300% for mod-remote-storage) and constant high usage for some modules in idle state, 100%-110% (mod-tlr, mod-requests-mediates).
At the same time, instances' CPU utilization did not exceed 35% during the tests.
CC: @Martin Tran @Hongwei Ji @Craig McNally @Maksym Sinichenko
Denis February 17, 2025 at 1:47 PM
The fix for the sidecars' continuous reconnection was prepared by the FSE team.
Today, the modules of the TLS versions of the Eureka R bugfest were redeployed with that fix by @Maksym Sinichenko.
The PTF team and the e2e force will be checking whether the issue is fully resolved.
cc @Martin Tran @Hongwei Ji @Leonid Kolesnykov @Artem Akimov @Craig McNally
The Sidecar service reestablishes TLS (Transport Layer Security) connections every 15 seconds (when IDLE) to key components such as modules, databases, Kafka brokers, etc. This frequent reconnection process is highly resource-consuming, particularly due to the TLS handshake, which involves a series of computationally expensive steps like key exchange, certificate validation, and encryption/decryption. These operations significantly affect system performance and can introduce delays, making the service inefficient and prone to resource exhaustion.
Observed Behavior:
Sidecars initiate a new TLS connection every 15 seconds (when IDLE), impacting various connected components.
The TLS handshake process during connection establishment increases CPU and memory usage due to computationally expensive cryptographic operations.
This recurring reconnection pattern places heavy strain on system resources and causes noticeable performance degradation.
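To make the handshake cost concrete, below is a minimal, self-contained Java sketch (illustrative only, not taken from the sidecar code) that opens a fresh TLS socket in a loop and forces a full handshake each time, which is roughly the per-connection price paid by the observed 15-second reconnect cycle. example.org is a placeholder endpoint, not one of the affected services.

```java
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class TlsHandshakeCost {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; substitute any TLS service reachable from the environment.
        String host = "example.org";
        int port = 443;
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();

        for (int i = 0; i < 5; i++) {
            long start = System.nanoTime();
            try (SSLSocket socket = (SSLSocket) factory.createSocket(host, port)) {
                // Forces the full handshake: key exchange, certificate validation, cipher negotiation.
                socket.startHandshake();
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("Handshake %d took %d ms (connection then discarded)%n", i + 1, elapsedMs);
        }
    }
}
```

Multiplying the measured per-handshake cost by the reconnect frequency and the number of sidecars gives a rough lower bound on the CPU time spent without any useful work.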
Performance Impact:
🔴 Performance Overhead – Establishing a new TLS connection every 15 seconds increases CPU and memory usage due to the expensive handshake process (key exchange, certificate validation, etc.). This results in inefficient resource utilization and can cause system bottlenecks.
🔴 Latency – Each TLS handshake introduces additional latency before data transmission. For latency-sensitive applications, this delay could become a significant bottleneck, resulting in slow data processing and unresponsiveness.
🔴 Resource Consumption – With multiple sidecars and high-frequency connections, the system is under heavy resource load. This constant connection cycling can exhaust system resources (CPU, memory, network bandwidth), leading to a drop in overall system performance, especially under heavy traffic.
🔴 Rate Limiting – Some backend services (e.g., Kafka, databases) may have rate limits on the frequency of new connections. Repeated connection attempts every 15 seconds could trigger rate limiting, causing connection failures or timeouts, which would further degrade the application’s stability.
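Most of the overhead listed above disappears once the client keeps pooled TLS connections alive between requests instead of closing them while idle. The sketch below is illustrative only and assumes a Vert.x WebClient; it is not the FSE fix and not the sidecar's actual configuration.

```java
import io.vertx.core.Vertx;
import io.vertx.ext.web.client.WebClient;
import io.vertx.ext.web.client.WebClientOptions;

public class KeepAliveClientSketch {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();

        // Reuse pooled TLS connections between requests instead of re-handshaking while idle.
        WebClientOptions options = new WebClientOptions();
        options.setSsl(true);
        options.setKeepAlive(true);        // keep connections open across requests
        options.setKeepAliveTimeout(300);  // keep idle connections for 5 minutes (value in seconds)
        options.setIdleTimeout(0);         // do not force-close idle connections

        WebClient client = WebClient.create(vertx, options);
        // Subsequent client.get(...) / client.post(...) calls reuse pooled connections,
        // so the TLS handshake is paid once per connection rather than once per idle cycle.
    }
}
```

The key point is that the handshake cost then scales with the number of distinct connections, not with the number of idle-reconnect cycles.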
The resource consumption is similar to a service with a lost database because both scenarios involve frequent connection establishment and teardown. In the case of a lost database, the service may continually try to reconnect, just like in the scenario where TLS connections are repeatedly re-established every 15 seconds.
Here's why:
Constant Reconnection Attempts: When the database is lost or unreachable, the service repeatedly tries to reconnect, which consumes CPU and memory. Similarly, repeatedly re-establishing TLS connections incurs the same overhead, even if no actual data transfer is occurring.
Handshake Overhead: Both scenarios involve costly operations like connection handshakes (TLS or database authentication), which are resource-intensive.
Idle Connection Management: Whether trying to connect to a lost database or frequently re-establishing TLS connections, both cases result in system resources being used inefficiently (e.g., CPU and memory for connection setup, retries, and waiting for responses).
In short, in both cases, the service is consuming resources unnecessarily through constant, resource-heavy connection attempts without achieving meaningful work, leading to similar performance impacts.
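For the database side of this analogy, the usual remedy is a connection pool that authenticates once and keeps connections warm with periodic validation. A hedged sketch with HikariCP follows; the JDBC URL, credentials, and timeout values are illustrative assumptions, not settings from the affected environments.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PooledDbConnectionSketch {
    public static void main(String[] args) {
        HikariConfig config = new HikariConfig();
        // Placeholder connection details; substitute the real database coordinates.
        config.setJdbcUrl("jdbc:postgresql://db-host:5432/folio");
        config.setUsername("folio");
        config.setPassword("change-me");

        // Keep a small set of warm, already-authenticated connections instead of
        // paying the connect/authenticate cost on every use.
        config.setMinimumIdle(2);
        config.setMaximumPoolSize(10);
        config.setIdleTimeout(600_000);    // retire idle connections only after 10 minutes
        config.setKeepaliveTime(120_000);  // validate idle connections every 2 minutes
        config.setMaxLifetime(1_800_000);  // recycle connections after 30 minutes, not every 15 seconds

        try (HikariDataSource dataSource = new HikariDataSource(config)) {
            // dataSource.getConnection() now hands out pooled connections; the pool, not the
            // caller, pays the reconnect cost, and only when a connection actually expires.
        }
    }
}
```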
Environment:
https://tls-eureka-bugfest-ramsons-consortium.int.aws.folio.org/
https://tls-eureka-bugfest-ramsons.int.aws.folio.org/
https://tls-eureka-bugfest-ramsons-plus.int.aws.folio.org
https://tls-eureka-bugfest-ramsons-aqa.int.aws.folio.org
relctls1 (PTF TLS Eureka)