[TLS][Eureka] Sidecars: Establishing Connections and Disconnecting Consumes Too Many Resources

Description

The Sidecar service reestablishes TLS (Transport Layer Security) connections every 15 seconds (when IDLE) to key components such as modules, databases, Kafka brokers, etc. This frequent reconnection process is highly resource-consuming, particularly due to the TLS handshake, which involves a series of computationally expensive steps like key exchange, certificate validation, and encryption/decryption. These operations significantly affect system performance and can introduce delays, making the service inefficient and prone to resource exhaustion.
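To make the handshake cost concrete, here is a minimal Java sketch (assuming a plain JDK environment; the host and port are placeholders, not values from this ticket) that times repeated TLS handshakes against one endpoint:

    import javax.net.ssl.SSLSocket;
    import javax.net.ssl.SSLSocketFactory;

    public class TlsHandshakeCost {
        public static void main(String[] args) throws Exception {
            // Placeholder endpoint for illustration only; not taken from this ticket.
            String host = "example.org";
            int port = 443;
            SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();

            for (int i = 0; i < 5; i++) {
                long start = System.nanoTime();
                try (SSLSocket socket = (SSLSocket) factory.createSocket(host, port)) {
                    // Performs the TLS handshake: full on the first connection;
                    // later ones may be abbreviated via the JDK's default session cache.
                    socket.startHandshake();
                }
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("handshake %d: %d ms%n", i, elapsedMs);
            }
        }
    }

The first, full handshake is typically much slower than later, resumed ones; tearing connections down every 15 seconds without reusing connections or sessions means paying the full cost over and over.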

Observed Behavior:

  • Sidecars initiate a new TLS connection every 15 seconds (when IDLE), impacting various connected components (a keep-alive configuration sketch follows this list).

  • The TLS handshake process during connection establishment increases CPU and memory usage due to computationally expensive cryptographic operations.

  • This recurring reconnection pattern places heavy strain on system resources and causes noticeable performance degradation.
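For comparison, the sketch below shows keep-alive settings on a Vert.x HTTP client (the toolkit underneath Quarkus-based sidecars). This is an illustrative configuration only, not the sidecar's actual code or option values:

    import io.vertx.core.Vertx;
    import io.vertx.core.http.HttpClient;
    import io.vertx.core.http.HttpClientOptions;

    public class KeepAliveClient {
        public static void main(String[] args) {
            Vertx vertx = Vertx.vertx();

            HttpClientOptions options = new HttpClientOptions()
                .setSsl(true)
                .setKeepAlive(true)        // reuse pooled TLS connections between requests
                .setKeepAliveTimeout(300)  // keep an idle connection for 5 minutes (seconds)
                .setIdleTimeout(0);        // 0 = never close a connection just for being idle

            HttpClient client = vertx.createHttpClient(options);
            // Requests issued through 'client' share already-established TLS connections,
            // so the handshake is paid once per connection instead of every 15 seconds.
        }
    }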

Performance Impact:

  • 🔴 Performance Overhead – Establishing a new TLS connection every 15 seconds increases CPU and memory usage due to the expensive handshake process (key exchange, certificate validation, etc.). This results in inefficient resource utilization and can cause system bottlenecks.

  • 🔴 Latency – Each TLS handshake introduces additional latency before data transmission. For latency-sensitive applications, this delay can become a significant bottleneck, resulting in slow data processing and unresponsiveness (a session-resumption sketch follows this list).

  • 🔴 Resource Consumption – With multiple sidecars and high-frequency connections, the system is under heavy resource load. This constant connection cycling can exhaust system resources (CPU, memory, network bandwidth), leading to a drop in overall system performance, especially under heavy traffic.

  • 🔴 Rate Limiting – Some backend services (e.g., Kafka, databases) may have rate limits on the frequency of new connections. Repeated connection attempts every 15 seconds could trigger rate limiting, causing connection failures or timeouts, which would further degrade the application’s stability.
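Where some reconnection is unavoidable, TLS session resumption at least skips the most expensive handshake steps. The following JDK-only sketch enables a client-side session cache; the cache size and timeout are arbitrary example values, not recommendations from this ticket:

    import javax.net.ssl.SSLContext;
    import javax.net.ssl.SSLSessionContext;

    public class SessionResumptionSetup {
        public static void main(String[] args) throws Exception {
            SSLContext context = SSLContext.getInstance("TLSv1.3");
            context.init(null, null, null); // default key/trust managers

            // A populated client session cache lets a new connection resume an earlier
            // session instead of repeating the full key exchange and certificate validation.
            SSLSessionContext sessions = context.getClientSessionContext();
            sessions.setSessionCacheSize(100); // number of cached sessions
            sessions.setSessionTimeout(3600);  // keep sessions resumable for 1 hour (seconds)

            // Sockets created from context.getSocketFactory() can now attempt resumption
            // on reconnect rather than paying for a full handshake every time.
        }
    }

Resumption still costs a network round trip per reconnect, so it complements connection reuse rather than replacing it.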

Screenshots: chrome_B9izULj9JU.png, chrome_Z8TtbaiE1m.png, chrome_GV0A26aORh.png

The resource consumption is similar to that of a service that has lost its database, because both scenarios involve frequent connection establishment and teardown. In the case of a lost database, the service may continually try to reconnect, just as TLS connections are repeatedly re-established every 15 seconds here.

Here's why:

  1. Constant Reconnection Attempts: When the database is lost or unreachable, the service repeatedly tries to reconnect, which consumes CPU and memory. Similarly, repeatedly re-establishing TLS connections incurs the same overhead, even if no actual data transfer is occurring.

  2. Handshake Overhead: Both scenarios involve costly operations like connection handshakes (TLS or database authentication), which are resource-intensive.

  3. Idle Connection Management: Whether trying to connect to a lost database or frequently re-establishing TLS connections, both cases result in system resources being used inefficiently (e.g., CPU and memory for connection setup, retries, and waiting for responses).

In short, in both cases, the service is consuming resources unnecessarily through constant, resource-heavy connection attempts without achieving meaningful work, leading to similar performance impacts.
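The same remedy applies on the database side: a pooled connection that is kept alive and validated is far cheaper than reconnect churn. The sketch below uses HikariCP purely as a familiar Java example; the JDBC URL and timing values are placeholders and do not describe the sidecar's actual data sources:

    import com.zaxxer.hikari.HikariConfig;
    import com.zaxxer.hikari.HikariDataSource;

    public class PooledDatabase {
        public static void main(String[] args) {
            HikariConfig config = new HikariConfig();
            config.setJdbcUrl("jdbc:postgresql://db.example.org:5432/folio"); // placeholder URL
            config.setMaximumPoolSize(10);
            config.setMinimumIdle(2);         // keep a few warm connections instead of reconnecting
            config.setIdleTimeout(600_000);   // retire idle connections after 10 minutes, not 15 seconds
            config.setKeepaliveTime(300_000); // lightweight keepalive check every 5 minutes

            try (HikariDataSource dataSource = new HikariDataSource(config)) {
                // Connections handed out by the pool reuse the same authenticated
                // (and, if configured, TLS-encrypted) database session.
            }
        }
    }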

Environment:

https://tls-eureka-bugfest-ramsons-consortium.int.aws.folio.org/
https://tls-eureka-bugfest-ramsons.int.aws.folio.org/
https://tls-eureka-bugfest-ramsons-plus.int.aws.folio.org
https://tls-eureka-bugfest-ramsons-aqa.int.aws.folio.org

relctls1 (PTF TLS Eureka)

CSP Request Details

None

CSP Rejection Details

None

Potential Workaround

None

Attachments (11)


Activity

Martin Tran February 26, 2025 at 1:51 PM

Investigation complete, observations are shared in the ticket.

Leonid Kolesnykov February 19, 2025 at 9:02 AM
Edited

Our observations after fix deploy:

  1. CPU usage of modules in the idle state decreased significantly on the module side compared with the ticket description, and it has stabilized.

  2. However, we still see a high rate of connections/disconnections on sidecars, for example 2 actions per minute in the idle state (rebftls/sidecar-mod-data-import) and 2,600 per minute (rebftls/sidecar-mod-inventory-storage/) during CI/CO.

  3. Resource utilization during CI/CO tests is higher than in previous reports (recp1), probably because of those reconnections:

    2025-02-18_15h01_00-20250218-130133.png
    2025-02-18_15h08_46-20250218-131108.png
  4. Comparing CI/CO with 8 and 20 vUsers, response times (RT) degraded in the 20-vUser test (Check-out RT almost doubled).

  5. We observed spikes for some modules (200%-300% for mod-remote-storage) and constantly high usage for some modules in the idle state, 100%-110% (mod-tlr, mod-requests-mediates).

    image-20250219-085453.png
  6. At the same time, the instances' CPU utilization was not higher than 35% during the tests.

    image-20250219-085404.png

CC:

Denis February 17, 2025 at 1:47 PM

The fix for the sidecars' continuous reconnection was prepared by the FSE team.

Today, the modules of the TLS versions of the Eureka R bugfest were redeployed using that fix by the PTF team, and the e2e force will be checking whether the issue is fully resolved.

cc

Done

Details

Assignee

Reporter

Priority

Development Team

EBSCO - FSE

RCA Group

TBD


Created February 11, 2025 at 12:03 PM
Updated February 26, 2025 at 1:51 PM
Resolved February 26, 2025 at 1:51 PM