[TLS][Eureka] Sidecars: Establishing Connection and Disconnecting Consumes Too Many Resources
Description
CSP Request Details
CSP Rejection Details
Potential Workaround
Attachments
Martin Tran February 26, 2025 at 1:51 PM
Investigation complete, observations are shared in the ticket.
Leonid Kolesnykov February 19, 2025 at 9:02 AM (Edited)
Our observations after deploying the fix:
CPU usage of modules in idle state decreased significantly on the module side compared with the ticket description, and it has stabilized.
However, we still see a high rate of connections/disconnections on sidecars. For example: 2 actions per minute in idle state (rebftls/sidecar-mod-data-import) and 2600 per minute (rebftls/sidecar-mod-inventory-storage/) during CI/CO.
Resource utilization during CI/CO tests is higher than in previous reports (recp1), probably because of those reconnections.
Comparing CI/CO with 8 and 20 vUsers, response times (RT) degraded in the 20 vUsers test (Check-out RT almost doubled).
Observed spikes for some modules (200%-300% for mod-remote-storage) and constant high usage for some modules in idle state, 100%-110% (mod-tlr, mod-requests-mediates).
At the same time, instances' CPU utilization did not exceed 35% during the tests.
CC: @Martin Tran @Hongwei Ji @Craig McNally @Maksym Sinichenko
Denis February 17, 2025 at 1:47 PM
The fix for the sidecars' continuous reconnection was prepared by the FSE team.
Today, the modules of the TLS versions of the Eureka R bugfest were redeployed with that fix by @Maksym Sinichenko.
The PTF team and the e2e force will be checking whether the issue is fully resolved.
cc @Martin Tran @Hongwei Ji @Leonid Kolesnykov @Artem Akimov @Craig McNally
The Sidecar service reestablishes TLS (Transport Layer Security) connections every 15 seconds (when IDLE) to key components such as modules, databases, Kafka brokers, etc. This frequent reconnection process is highly resource-consuming, particularly due to the TLS handshake, which involves a series of computationally expensive steps like key exchange, certificate validation, and encryption/decryption. These operations significantly affect system performance and can introduce delays, making the service inefficient and prone to resource exhaustion.
Observed Behavior:
Sidecars initiate a new TLS connection every 15 seconds (when IDLE), impacting various connected components.
The TLS handshake process during connection establishment increases CPU and memory usage due to computationally expensive cryptographic operations.
This recurring reconnection pattern places heavy strain on system resources and causes noticeable performance degradation.
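To make the handshake cost concrete, below is a minimal, self-contained Java sketch (illustrative only, not taken from the sidecar code) that opens a fresh TLS socket in a loop and forces a full handshake each time, which is roughly the per-connection price paid by the observed 15-second reconnect cycle. example.org is a placeholder endpoint, not one of the affected services.

```java
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class TlsHandshakeCost {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; substitute any TLS service reachable from the environment.
        String host = "example.org";
        int port = 443;
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();

        for (int i = 0; i < 5; i++) {
            long start = System.nanoTime();
            try (SSLSocket socket = (SSLSocket) factory.createSocket(host, port)) {
                // Forces the full handshake: key exchange, certificate validation, cipher negotiation.
                socket.startHandshake();
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("Handshake %d took %d ms (connection then discarded)%n", i + 1, elapsedMs);
        }
    }
}
```

Multiplying the measured per-handshake cost by the reconnect frequency and the number of sidecars gives a rough lower bound on the CPU time spent without any useful work.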
Performance Impact:
🔴 Performance Overhead – Establishing a new TLS connection every 15 seconds increases CPU and memory usage due to the expensive handshake process (key exchange, certificate validation, etc.). This results in inefficient resource utilization and can cause system bottlenecks.
🔴 Latency – Each TLS handshake introduces additional latency before data transmission. For latency-sensitive applications, this delay could become a significant bottleneck, resulting in slow data processing and unresponsiveness.
🔴 Resource Consumption – With multiple sidecars and high-frequency connections, the system is under heavy resource load. This constant connection cycling can exhaust system resources (CPU, memory, network bandwidth), leading to a drop in overall system performance, especially under heavy traffic.
🔴 Rate Limiting – Some backend services (e.g., Kafka, databases) may have rate limits on the frequency of new connections. Repeated connection attempts every 15 seconds could trigger rate limiting, causing connection failures or timeouts, which would further degrade the application’s stability.
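Most of the overhead listed above disappears once the client keeps pooled TLS connections alive between requests instead of closing them while idle. The sketch below is illustrative only and assumes a Vert.x WebClient; it is not the FSE fix and not the sidecar's actual configuration.

```java
import io.vertx.core.Vertx;
import io.vertx.ext.web.client.WebClient;
import io.vertx.ext.web.client.WebClientOptions;

public class KeepAliveClientSketch {
    public static void main(String[] args) {
        Vertx vertx = Vertx.vertx();

        // Reuse pooled TLS connections between requests instead of re-handshaking while idle.
        WebClientOptions options = new WebClientOptions();
        options.setSsl(true);
        options.setKeepAlive(true);        // keep connections open across requests
        options.setKeepAliveTimeout(300);  // keep idle connections for 5 minutes (value in seconds)
        options.setIdleTimeout(0);         // do not force-close idle connections

        WebClient client = WebClient.create(vertx, options);
        // Subsequent client.get(...) / client.post(...) calls reuse pooled connections,
        // so the TLS handshake is paid once per connection rather than once per idle cycle.
    }
}
```

The key point is that the handshake cost then scales with the number of distinct connections, not with the number of idle-reconnect cycles.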
The resource consumption is similar to a service with a lost database because both scenarios involve frequent connection establishment and teardown. In the case of a lost database, the service may continually try to reconnect, just like in the scenario where TLS connections are repeatedly re-established every 15 seconds.
Here's why:
Constant Reconnection Attempts: When the database is lost or unreachable, the service repeatedly tries to reconnect, which consumes CPU and memory. Similarly, repeatedly re-establishing TLS connections incurs the same overhead, even if no actual data transfer is occurring.
Handshake Overhead: Both scenarios involve costly operations like connection handshakes (TLS or database authentication), which are resource-intensive.
Idle Connection Management: Whether trying to connect to a lost database or frequently re-establishing TLS connections, both cases result in system resources being used inefficiently (e.g., CPU and memory for connection setup, retries, and waiting for responses).
In short, in both cases, the service is consuming resources unnecessarily through constant, resource-heavy connection attempts without achieving meaningful work, leading to similar performance impacts.
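For the database side of this analogy, the usual remedy is a connection pool that authenticates once and keeps connections warm with periodic validation. A hedged sketch with HikariCP follows; the JDBC URL, credentials, and timeout values are illustrative assumptions, not settings from the affected environments.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PooledDbConnectionSketch {
    public static void main(String[] args) {
        HikariConfig config = new HikariConfig();
        // Placeholder connection details; substitute the real database coordinates.
        config.setJdbcUrl("jdbc:postgresql://db-host:5432/folio");
        config.setUsername("folio");
        config.setPassword("change-me");

        // Keep a small set of warm, already-authenticated connections instead of
        // paying the connect/authenticate cost on every use.
        config.setMinimumIdle(2);
        config.setMaximumPoolSize(10);
        config.setIdleTimeout(600_000);    // retire idle connections only after 10 minutes
        config.setKeepaliveTime(120_000);  // validate idle connections every 2 minutes
        config.setMaxLifetime(1_800_000);  // recycle connections after 30 minutes, not every 15 seconds

        try (HikariDataSource dataSource = new HikariDataSource(config)) {
            // dataSource.getConnection() now hands out pooled connections; the pool, not the
            // caller, pays the reconnect cost, and only when a connection actually expires.
        }
    }
}
```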
Environment:
https://tls-eureka-bugfest-ramsons-consortium.int.aws.folio.org/
https://tls-eureka-bugfest-ramsons.int.aws.folio.org/
https://tls-eureka-bugfest-ramsons-plus.int.aws.folio.org
https://tls-eureka-bugfest-ramsons-aqa.int.aws.folio.org
relctls1 (PTF TLS Eureka)