MODSIDECAR-105 - Investigate slow performance when in TLS mode
Jira ticket: MODSIDECAR-105: Investigate slow performance when in TLS mode. Status: In bugfix review
Spike Overview
Performance Impact of TLS Mode
During our analysis, we observed that enabling TLS mode significantly impacts performance, with some workflows running 3-5 times slower compared to non-TLS mode. The primary reason for this performance degradation appears to be cryptographic processing, handled by the Bouncy Castle library. Additionally, the system does not reuse TLS sessions, resulting in frequent full TLS handshakes, which further slow down operations.
To improve performance, two key areas need to be addressed:
Reducing CPU consumption when the cluster is idle.
Enhancing performance when sidecars interact with each other.
Problem Statement: Frequent Reconnection When Cluster is Idle
Observations from Log Analysis
While investigating logs, we found that even when the cluster is unused, all sidecars continuously connect to and disconnect from third-party services such as SSM and Kafka. We also noticed a pattern of incoming connections from external sources.
To address this, we categorized the reconnection issues and implemented solutions accordingly.
Kafka Reconnection Problem (Resolved)
Before applying configuration changes, Kafka reconnected approximately every 30 seconds. After implementing the optimizations, reconnections now occur every 5 minutes.
To minimize frequent Kafka reconnections, the following parameters were adjusted:
Kafka Connection Optimization Parameters
# Keep TCP connections alive
MP_MESSAGING_INCOMING_DISCOVERY_KEEP_ALIVE = true
MP_MESSAGING_INCOMING_ENTITLEMENT_KEEP_ALIVE = true
MP_MESSAGING_INCOMING_LOGOUT_KEEP_ALIVE = true
# Increase session timeout to prevent frequent reconnects
MP_MESSAGING_INCOMING_DISCOVERY_SESSION_TIMEOUT_MS = 30000 # 30 seconds
MP_MESSAGING_INCOMING_ENTITLEMENT_SESSION_TIMEOUT_MS = 30000 # 30 seconds
MP_MESSAGING_INCOMING_LOGOUT_SESSION_TIMEOUT_MS = 30000 # 30 seconds
# Configure heartbeat intervals
MP_MESSAGING_INCOMING_DISCOVERY_HEARTBEAT_INTERVAL_MS = 10000 # 10 seconds
MP_MESSAGING_INCOMING_ENTITLEMENT_HEARTBEAT_INTERVAL_MS = 10000 # 10 seconds
MP_MESSAGING_INCOMING_LOGOUT_HEARTBEAT_INTERVAL_MS = 10000 # 10 seconds
# Increase max poll interval to prevent consumer disconnection
MP_MESSAGING_INCOMING_DISCOVERY_MAX_POLL_INTERVAL_MS = 60000 # 60 seconds
MP_MESSAGING_INCOMING_ENTITLEMENT_MAX_POLL_INTERVAL_MS = 60000 # 60 seconds
MP_MESSAGING_INCOMING_LOGOUT_MAX_POLL_INTERVAL_MS = 60000 # 60 seconds
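The variable names above follow the MicroProfile Config rule for environment variables: every character of the property name that is not a letter or digit becomes `_`, and the result is upper-cased. The sketch below applies that rule and also checks the usual Kafka guidance that the heartbeat interval should be no more than one third of the session timeout. The class name, the method names, and the property name `mp.messaging.incoming.discovery.session.timeout.ms` (inferred from the variable name) are assumptions made for illustration.

```java
import java.util.Locale;

// Illustrative helper, not part of the sidecar code base.
final class KafkaEnvNames {
    // MicroProfile Config rule: non-alphanumeric characters become '_', then upper-case.
    static String toEnvVar(String propertyName) {
        return propertyName.replaceAll("[^A-Za-z0-9]", "_").toUpperCase(Locale.ROOT);
    }

    // Kafka guidance: heartbeat.interval.ms should be at most 1/3 of session.timeout.ms.
    static boolean heartbeatFitsSession(long heartbeatMs, long sessionTimeoutMs) {
        return heartbeatMs > 0 && heartbeatMs * 3 <= sessionTimeoutMs;
    }

    public static void main(String[] args) {
        System.out.println(toEnvVar("mp.messaging.incoming.discovery.session.timeout.ms"));
        System.out.println(heartbeatFitsSession(10_000, 30_000));
    }
}
```

With the values listed above (10-second heartbeat, 30-second session timeout) the check passes exactly at the one-third boundary.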
SSM Reconnection Issue (Pending Investigation)
Each time parameters are retrieved, a new connection is established with SSM. The following log snippets (newest entry first) illustrate the frequent connection establishment and disconnection cycles:
Log Example
2025-02-11T11:54:32.354+0000 [865026] INFO ProvTlsClient [client #7235] established connection with ssm.us-west-2.amazonaws.com:443
2025-02-11T11:54:32.343+0000 [081598] INFO ProvTlsClient [client #7235] opening connection to ssm.us-west-2.amazonaws.com:443
2025-02-11T11:54:32.339+0000 [081598] INFO ProvTlsClient [client #7231] disconnected from ssm.us-west-2.amazonaws.com:443
A possible solution is to cache the retrieved parameters to reduce unnecessary reconnections. This has already been implemented but requires further investigation.
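As a rough illustration of the caching idea, the sketch below keeps each retrieved value for a fixed time-to-live so that repeated reads do not reach the parameter store. It is not the implementation referred to above: `CachedParameterStore`, the `loader` function (which stands in for the real SSM lookup), and the TTL value are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical helper: caches parameter values for a fixed time-to-live.
final class CachedParameterStore {
    private static final class Entry {
        final String value;
        final long expiresAtMillis;

        Entry(String value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final Function<String, String> loader; // stands in for the real SSM lookup
    private final long ttlMillis;
    private final LongSupplier clock;              // injectable so expiry can be tested

    CachedParameterStore(Function<String, String> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    String get(String name) {
        long now = clock.getAsLong();
        Entry cached = cache.get(name);
        if (cached != null && now < cached.expiresAtMillis) {
            return cached.value; // served from memory, no connection opened
        }
        // Concurrent callers may both load here; acceptable for a sketch.
        String fresh = loader.apply(name);
        cache.put(name, new Entry(fresh, now + ttlMillis));
        return fresh;
    }
}
```

A caller would typically pass `System::currentTimeMillis` as the clock. The TTL bounds how stale a rotated secret can be, so it is a trade-off between connection count and freshness.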
Load Balancer TLS Health Checks
The load balancer performs health checks on TLS endpoints every 15 seconds (depending on the environment). Each sidecar gets a minimum of two health check requests per cycle.
Proposed Solution
Increase the health check interval to reduce TLS handshake overhead.
Exclude the health check endpoint from TLS, allowing the load balancer to use a non-TLS endpoint.
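To get a feel for the overhead, the small calculation below estimates how many handshakes the health checks cause per sidecar. It assumes that every health check opens a new TLS connection with a full handshake, which is an assumption and not something measured in this spike; the class name is illustrative.

```java
// Illustrative arithmetic only.
final class HealthCheckLoad {
    // Handshakes per hour for one sidecar, assuming one full handshake per health check.
    static long handshakesPerHour(long intervalSeconds, long checksPerCycle) {
        return 3600 * checksPerCycle / intervalSeconds;
    }

    public static void main(String[] args) {
        System.out.println(handshakesPerHour(15, 2));
        System.out.println(handshakesPerHour(60, 2));
    }
}
```

With a 15-second interval and two checks per cycle this gives 480 handshakes per hour per sidecar; raising the interval to 60 seconds would reduce it to 120.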
Module-to-Module Communication Issue (Resolved)
Analysis of logs revealed that during some workflows, module-to-module calls repeatedly establish new TLS connections instead of reusing existing sessions. Without TLS session resumption, each new connection performs a full TLS handshake, which can significantly impact performance.
Log Example: Repeated TLS Handshakes
2025-03-17T09:29:13.735 INFO ProvTlsClient [client #1873] disconnected from mod-inventory-storage-b
2025-03-17T09:29:13.729 INFO ProvTlsClient [client #1874] established connection with mod-inventory-storage-b
2025-03-17T09:29:13.722 INFO ProvTlsClient [client #1872] disconnected from mod-inventory-storage-b
What is TLS Session Resumption?
TLS session resumption allows a client and server to reuse an existing TLS session, avoiding the need for a full TLS handshake. This significantly reduces the computational overhead involved in key exchange, certificate verification, and random number generation.
How TLS Session Resumption Works
Key Mechanisms
There are two primary methods to resume TLS sessions:
Session IDs (Stateful Resumption):
Server stores session state (e.g., encryption keys) and assigns a unique Session ID to the client.
Client sends the Session ID in subsequent requests to resume the session.
Stateful: The server must maintain a cache of session IDs and their associated keys.
Obsolete in TLS 1.3, where resumption is based on pre-shared keys (PSK) delivered through session tickets.
Session Tickets (Stateless Resumption):
Server encrypts session state into a Session Ticket and sends it to the client.
Client stores the ticket and includes it in subsequent requests to resume the session.
Stateless: The server does not need to store session state (ideal for scalability).
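On the client side, the standard JSSE API exposes the session cache that makes resumption possible. The sketch below sizes that cache and shows one way to check whether two connections shared a session by comparing their session IDs, which is meaningful for TLS 1.2 session-ID resumption. It uses only `javax.net.ssl` classes; the class name, method names, and the chosen cache values are illustrative, and behaviour with a specific provider such as Bouncy Castle may differ.

```java
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSessionContext;

// Illustrative helper built on the standard JSSE API.
final class SessionReuse {
    static SSLContext defaultContext() {
        try {
            return SSLContext.getDefault();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
    }

    // Sizes the client-side session cache; cached sessions are candidates for resumption.
    static SSLSessionContext configureClientCache(SSLContext context, int maxSessions, int timeoutSeconds) {
        SSLSessionContext sessions = context.getClientSessionContext();
        sessions.setSessionCacheSize(maxSessions);
        sessions.setSessionTimeout(timeoutSeconds);
        return sessions;
    }

    // Two connections shared a session when their non-empty session IDs match,
    // for example the values returned by SSLSession.getId() on each socket.
    static boolean sameSession(byte[] firstId, byte[] secondId) {
        return firstId != null && firstId.length > 0 && Arrays.equals(firstId, secondId);
    }
}
```

In a manual test, one could open two connections to the same peer and pass `socket.getSession().getId()` from each to `sameSession` to see whether the second handshake was abbreviated.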
DNS-based load balancing across Availability Zones (AZs) can impact TLS session resumption depending on the resumption mechanism (session IDs or tickets) and configuration. Here’s how:
Session Resumption Mechanisms
Mechanism | Description | Impact of DNS-Based Load Balancing |
---|---|---|
Session IDs | Server stores session state (keys, cipher suite). | Breaks resumption if servers don’t share state. |
Session Tickets | Session state encrypted and stored client-side. Requires shared server keys. | Works if servers share Session Ticket Encryption Keys (STEK). |
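The effect described in the table for session IDs can be shown with a toy model: a server can resume only sessions found in its own cache, so resumption fails when DNS sends the client to a different server unless the cache is shared. This is a simplified simulation with hypothetical class names, not real TLS code.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of stateful (session ID) resumption behind DNS-based load balancing.
final class ResumptionSimulation {
    static final class Server {
        private final Set<String> sessionCache;

        Server(Set<String> sessionCache) {
            this.sessionCache = sessionCache;
        }

        // Full handshake: the server remembers the session it issued.
        void issue(String sessionId) {
            sessionCache.add(sessionId);
        }

        // Resumption succeeds only if this server can find the session.
        boolean canResume(String sessionId) {
            return sessionCache.contains(sessionId);
        }
    }

    public static void main(String[] args) {
        Server a = new Server(new HashSet<>());
        Server b = new Server(new HashSet<>());
        a.issue("session-1");
        System.out.println(b.canResume("session-1")); // separate caches

        Set<String> shared = new HashSet<>();
        Server c = new Server(shared);
        Server d = new Server(shared);
        c.issue("session-2");
        System.out.println(d.canResume("session-2")); // shared cache
    }
}
```

Session tickets behave like the shared-cache case as long as all servers hold the same ticket encryption keys.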
AWS-Specific Considerations
Application Load Balancer (ALB):
ALBs handle TLS termination and manage STEK internally, so session tickets work across AZs.
Potential Solutions Under Investigation
Bouncy Castle in FIPS mode supports TLS session resumption with session tickets, but with specific constraints:
Support for Session Resumption in FIPS Mode
Session Tickets:
Bouncy Castle's FIPS-compliant implementation (BCJSSE in FIPS mode) supports TLS session resumption via session tickets when configured properly.
FIPS-Approved Algorithms: Session tickets are encrypted using FIPS-approved algorithms (e.g., AES-GCM).
TLS Versions: Works with TLS 1.2 and TLS 1.3 (if enabled).
Session IDs:
Session IDs (stateful resumption) are not supported in FIPS mode due to scalability and compliance concerns.
Bouncy Castle FIPS-Specific Behavior
Bouncy Castle’s FIPS provider enforces cryptographic restrictions, affecting how TLS session resumption operates. It allows only approved algorithms and prohibits certain usages, such as using the same RSA key for both signing and encryption, which affects RSA key exchange.
Supported TLS Ciphers in Bouncy Castle FIPS Mode
TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
Use FIPS-Compliant Cipher Suites
Only cipher suites approved for FIPS are allowed. For example:
TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
TLS_AES_128_GCM_SHA256 (TLS 1.3)
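One way to apply such a restriction is to intersect the suites a TLS engine supports with an allow-list before enabling them. The sketch below does this with plain string arrays so it works with any JSSE provider; the allow-list contains only the suites named in this document and is an assumption, since the real FIPS list depends on the provider version and configuration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative helper: keeps only allow-listed cipher suites.
final class CipherSuiteFilter {
    // Suites named in this document; not an authoritative FIPS list.
    static final List<String> ALLOWED = List.of(
            "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
            "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
            "TLS_AES_128_GCM_SHA256");

    // Returns the allow-listed suites that the engine actually supports, in allow-list order.
    static String[] restrict(String[] supported) {
        List<String> supportedList = Arrays.asList(supported);
        List<String> kept = new ArrayList<>();
        for (String suite : ALLOWED) {
            if (supportedList.contains(suite)) {
                kept.add(suite);
            }
        }
        return kept.toArray(new String[0]);
    }
}
```

With a standard `SSLEngine` this could be used as `engine.setEnabledCipherSuites(CipherSuiteFilter.restrict(engine.getSupportedCipherSuites()))`.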
Next Steps
Investigate how to enable TLS session resumption within Bouncy Castle.
Delivery
After investigating Bouncy Castle and reviewing its documentation, we found that TLS 1.3 session resumption is based on Pre-Shared Keys (PSK), so the TLS provider must support PSK for resumption to work.
What is PSK?
PSK (Pre-Shared Key) is a shared secret used in cryptographic systems, particularly in symmetric key algorithms, where both parties have exchanged the secret through a secure channel beforehand.
Key Aspects of PSK:
Usage: PSKs are used in various security protocols, including Wi-Fi encryption (WPA-PSK), Extensible Authentication Protocol (EAP-PSK), and TLS 1.3 session resumption.
Security: The security of PSKs depends on their secrecy and randomness. If compromised, all communications using the key could be exposed.
Key Derivation: PSKs are often used with key derivation functions to generate session keys for encrypting data.
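To make the key derivation point concrete, the sketch below implements plain HKDF (RFC 5869) with HMAC-SHA256 using only JDK classes and derives two different keys from one shared secret. TLS 1.3 builds its key schedule on HKDF but wraps it in its own label format, so this is an illustration of the principle rather than the TLS 1.3 derivation itself; the class name and labels are hypothetical.

```java
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Minimal HKDF (RFC 5869) with HMAC-SHA256, for illustration only.
final class Hkdf {
    private static final int HASH_LENGTH = 32;

    private static byte[] hmac(byte[] key, byte[] data) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return mac.doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }

    // HKDF-Extract: PRK = HMAC(salt, inputKeyMaterial); an absent salt is HASH_LENGTH zero bytes.
    static byte[] extract(byte[] salt, byte[] inputKeyMaterial) {
        byte[] effectiveSalt = (salt == null || salt.length == 0) ? new byte[HASH_LENGTH] : salt;
        return hmac(effectiveSalt, inputKeyMaterial);
    }

    // HKDF-Expand: T(i) = HMAC(PRK, T(i-1) | info | i), concatenated and truncated to length.
    static byte[] expand(byte[] prk, byte[] info, int length) {
        if (length < 0 || length > 255 * HASH_LENGTH) {
            throw new IllegalArgumentException("invalid output length");
        }
        byte[] out = new byte[length];
        byte[] previous = new byte[0];
        int written = 0;
        for (int counter = 1; written < length; counter++) {
            byte[] input = new byte[previous.length + info.length + 1];
            System.arraycopy(previous, 0, input, 0, previous.length);
            System.arraycopy(info, 0, input, previous.length, info.length);
            input[input.length - 1] = (byte) counter;
            previous = hmac(prk, input);
            int n = Math.min(previous.length, length - written);
            System.arraycopy(previous, 0, out, written, n);
            written += n;
        }
        return out;
    }
}
```

Calling `expand` with the same pseudorandom key but different `info` values (for example one label per direction) yields independent session keys from a single shared secret.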
Bouncy Castle & PSK Support
Bouncy Castle supports the PSK standard, but this support is not included in the BouncyCastleJsseProvider, which is the provider we use as the standard Java JSSE provider to override the default one and maintain a unified SSL context across all connections.
Issue Reference:
Details about this limitation can be found in GitHub Issue #1604.
Example Code from the Library:
JsseSessionParameters jsseSessionParameters = new JsseSessionParameters(
    sslParameters.getEndpointIdentificationAlgorithm(), matchedSNIServerName);

// TODO[tls13] Resumption/PSK
boolean addToCache = provServerEnableSessionResumption && !TlsUtils.isTLSv13(context);

this.sslSession = sslSessionContext.reportSession(peerHost, peerPort, connectionTlsSession,
    jsseSessionParameters, addToCache);
Conclusion
Currently, with the Bouncy Castle JSSE provider, session resumption works only with TLS 1.2. Sessions negotiated with TLS 1.3 are not added to the session cache, so they cannot be resumed.
Supported TLS Ciphers in Bouncy Castle FIPS Mode:
TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
What Was Changed?
Added an environment variable:
QUARKUS_HTTP_SSL_PROTOCOLS: "TLSv1.2"
Applied changes from the relevant branch.
Communication with Keycloak Issue (Resolved)
The same approach as in the previous topic (restricting the protocol to TLSv1.2) was applied to all web and HTTP clients across sidecars.
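For an HTTP client, the same restriction can also be expressed programmatically. The sketch below uses the JDK's `java.net.http.HttpClient` purely as an illustration; the sidecar's actual clients and the way they were configured are not shown in this document, so the class name and the choice of client are assumptions.

```java
import java.net.http.HttpClient;
import javax.net.ssl.SSLParameters;

// Illustrative only: an HTTP client that offers TLSv1.2 and nothing else.
final class Tls12HttpClient {
    static HttpClient create() {
        SSLParameters parameters = new SSLParameters();
        parameters.setProtocols(new String[] {"TLSv1.2"});
        return HttpClient.newBuilder()
                .sslParameters(parameters)
                .build();
    }
}
```

Because the client never offers TLSv1.3, the handshake settles on TLS 1.2, which is the version for which session resumption works with the current provider.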
Communication with Secret Storage (In Progress)
Currently, secret information such as system user passwords is retrieved from the AWS SSM service by a separate library outside the sidecar implementation. Based on the logs, this library still negotiates TLSv1.3, for which Bouncy Castle does not support session resumption. To resolve this issue, the change needs to be applied to the library as well.
2025-03-24T13:12:18.427Z 2025-03-24 13:12:18,427 INFO [org.bou.jss.pro.ProvTlsClient] (executor-thread-4) [client #16 @4d2c76f5] opening connection to ssm.us-east-1.amazonaws.com:443
2025-03-24T13:12:18.428Z 2025-03-24 13:12:18,427 INFO [org.bou.jss.pro.ProvTlsClient] (executor-thread-2) [client #13 @4359ff27] established connection with ssm.us-east-1.amazonaws.com:443
2025-03-24T13:12:18.431Z 2025-03-24 13:12:18,431 INFO [org.bou.jss.pro.ProvTlsClient] (executor-thread-3) [client #14 @4f5a5ce6] established connection with ssm.us-east-1.amazonaws.com:443
2025-03-24T13:12:18.444Z 2025-03-24 13:12:18,440 FINE [org.bou.jss.pro.ProvTlsClient] (executor-thread-4) [client #16 @4d2c76f5] notified of selected protocol version: TLSv1.3
2025-03-24T13:12:18.444Z 2025-03-24 13:12:18,440 FINE [org.bou.jss.pro.ProvTlsClient] (executor-thread-4) [client #16 @4d2c76f5]: Server did not specify a session ID
2025-03-24T13:12:18.444Z 2025-03-24 13:12:18,440 FINE [org.bou.jss.pro.ProvTlsClient] (executor-thread-4) [client #16 @4d2c76f5] notified of selected cipher suite: TLS_AES_128_GCM_SHA256
Additionally, I noticed that we are not using the correct FIPS endpoints for communication with the service. According to the documentation, the endpoint should be updated from ssm.us-east-1.amazonaws.com to ssm-fips.us-east-1.amazonaws.com.
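The naming pattern of the FIPS endpoint can be captured in a small helper so that the region is not hard-coded. This is a sketch with a hypothetical class name; it only builds the URI following the pattern shown above and does not verify that a FIPS endpoint exists for the given region.

```java
import java.net.URI;

// Illustrative helper: builds the SSM FIPS endpoint for a region.
final class SsmEndpoints {
    static URI fipsEndpoint(String region) {
        return URI.create("https://ssm-fips." + region + ".amazonaws.com");
    }

    public static void main(String[] args) {
        System.out.println(fipsEndpoint("us-east-1"));
    }
}
```

The resulting URI could then be passed to whichever endpoint override the SSM client library offers.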
Summary
The performance impact of TLS, frequent reconnections, and inefficient session management contribute to unnecessary system overhead. By implementing session reuse, optimizing connection settings, and refining module interactions, we aim to improve system stability and efficiency.
Spike Status: IN PROGRESS