MODSIDECAR-85 - Investigation into the sidecar's interaction with the AWS Parameter Store.
Jira ticket - MODSIDECAR-85: Spike - investigate options for reducing number of Secret Store requests (Closed)
Spike Overview
Sidecars are sometimes unable to retrieve parameters from the AWS Parameter Store, such as passwords for system users. Without these parameters the sidecar cannot work properly.
Keycloak Resource and Sidecar Issues
Keycloak struggles to store all resources required for authorization when we have more than 10 realms. Each realm contains approximately 1,500 resources, and the default cache size of 10,000 was insufficient. To address this, we increased the cache size to 80,000 items.
Additionally, we observed excessive overhead caused by requests to Keycloak using incorrect credentials. This occurs because sidecars fail to retrieve passwords from the AWS Parameter Store.
Problem Statement
During our investigation into the sidecar performance issue related to authorization in Keycloak, we observed that some sidecars cannot authorize with Keycloak to retrieve the system user token. Upon further analysis, we discovered that these sidecars fail to retrieve passwords for system users from the AWS Parameter Store because they exceed the allowed rate limit, resulting in a "Rate Exceeded" exception.
2024-12-27 08:34:14,774 WARN [org.fol.sid.ser.tok.SystemUserTokenProvider] (ForkJoinPool.commonPool-worker-3) Failed to obtain system user token: message = org.folio.tools.store.exception.NotFoundException: software.amazon.awssdk.services.ssm.model.SsmException: Rate exceeded (Service: Ssm, Status Code: 400, Request ID: d56a44a2-3184-4833-ae13-b73e32fd2852): java.util.concurrent.ExecutionException: org.folio.tools.store.exception.NotFoundException: software.amazon.awssdk.services.ssm.model.SsmException: Rate exceeded (Service: Ssm, Status Code: 400, Request ID: d56a44a2-3184-4833-ae13-b73e32fd2852)
Below are the statistics from the last 24 hours showing the sidecars that couldn’t access the AWS Parameter Store to authorize requests.
Envnt_count indicates the number of times the system failed to reach SSM to retrieve the password.
It appears that around 10 modules are unstable, causing significant issues with bulk edits, circulation, data exports, data imports, and other workflows.
Deliverables
Interim solution
Increase Throughput:
Increase the AWS Parameter Store throughput from 40 to 10,000 requests per second. However, this would raise the cost for the Bugfest cluster, with an estimated additional expense of $30–$60 per day.
We need to reduce the number of calls made to the AWS Parameter Store to optimize performance and improve the code. These calls should only be made when retrieving the system user password is absolutely necessary.
Feedback from Dima:
The mod-scheduler module does not use a system user, making these requests unnecessary; however, they are still being performed.
The sidecar currently has a condition to determine whether it should request a system user token. This condition is not functioning as intended. It would make more sense to implement a condition like this:
Proposed Improvement:
Updating the validation logic to account for scenarios like mod-scheduler could help. Specifically, modifying the validation in this way would prevent unnecessary requests from mod-scheduler’s sidecar, because it already receives an x-okapi-token with each timer request.
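A minimal sketch of the kind of check described here, assuming the decision can be made from the incoming request headers alone. The function name and the exact rule are hypothetical and are not taken from the sidecar code; only the header name x-okapi-token comes from the text. The sidecar's real condition may need to consider more than this.

```python
from typing import Mapping

OKAPI_TOKEN_HEADER = "x-okapi-token"


def needs_system_user_token(headers: Mapping[str, str]) -> bool:
    """Return True only when the request carries no non-empty x-okapi-token.

    Hypothetical rule: if a token is already present (as for the timer
    requests handled by mod-scheduler's sidecar), there is no reason to fetch
    a system user password and log in again.
    """
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    token = normalized.get(OKAPI_TOKEN_HEADER, "")
    return not token.strip()


# A timer request that already carries a token skips the password lookup.
print(needs_system_user_token({"X-Okapi-Token": "token-value"}))
# A request without a token still falls back to the system user.
print(needs_system_user_token({"x-okapi-tenant": "tenant1"}))
```

With a check of this shape, the Parameter Store is only touched on the fallback path, which is the behavior the feedback asks for.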
Fixes Implemented:
SSM Issues:
Problem: Sidecars attempt to retrieve passwords for system users even when it's unnecessary. This happens because the x-okapi-user-id header does not include a user ID.
Root Cause: Modules fail to copy incoming request headers into outgoing requests.
Solution: Updated system user validation logic in the sidecar to avoid unnecessary password retrievals.
Potential Impact: This fix may affect other flows, requiring additional testing.
Unnecessary Password Retrievals for Modules:
Problem: Sidecars attempt to retrieve system user passwords for modules like mod-scheduler, where such requests are unnecessary.
AWS Parameter Store Limits:
Problem: Sidecars are unable to retrieve passwords from SSM due to the rate limit of 40 requests per second. Each sidecar makes password requests every 300 seconds for all system users across all tenants, often at the same time.
Example Calculation:
Number of sidecars: ~70
Number of system users per tenant: ~16
Number of tenants: ~15
Total requests: ~16,800 requests every 300 seconds
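The calculation above can be checked with a few lines of arithmetic. The sketch below uses only the approximate figures from this section and compares the resulting load with the 40 requests per second limit; the variable names are illustrative.

```python
# Approximate figures from the example calculation above.
SIDECARS = 70
SYSTEM_USERS_PER_TENANT = 16
TENANTS = 15
REFRESH_INTERVAL_SECONDS = 300
SSM_RATE_LIMIT_PER_SECOND = 40

# Every sidecar requests every system user password for every tenant.
total_requests = SIDECARS * SYSTEM_USERS_PER_TENANT * TENANTS

# Average load if the requests were spread evenly over the interval.
average_rate = total_requests / REFRESH_INTERVAL_SECONDS

# Time needed to serve one full round at exactly the rate limit.
seconds_to_drain = total_requests / SSM_RATE_LIMIT_PER_SECOND

print(f"total requests per interval: {total_requests}")      # 16800
print(f"average requests per second: {average_rate}")        # 56.0
print(f"seconds to serve one round:  {seconds_to_drain}")    # 420.0
```

Even perfectly spread out, the load averages about 56 requests per second, above the limit of 40, and one round would take about 420 seconds to serve at the limit, longer than the 300-second interval. Because the sidecars often fire at the same time, the actual peaks are higher than this average.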
Strategic Solution
Option 1: Deploy Vault in the Cluster
Deploy Vault within the cluster and reconfigure all sidecars to use this setup. This solution incurs no additional costs as we can reuse the existing host. However, we will need to take full responsibility for managing availability, maintenance, backups, and recovery plans moving forward.
Pros:
Backward compatible with old sidecars, making it easy to implement within the local cluster.
Requires no additional development from the Eureka team, which currently lacks sufficient capacity.
Cons:
Estimated effort for the DevOps team is approximately 2 months.
Option 2: Add an In-Memory Cache for Parameters, Shared Across Sidecars
In this approach, once a parameter is retrieved by one sidecar, it is stored in a shared in-memory database with a configurable TTL. Subsequent requests from other sidecars will access the parameter from memory, reducing interactions with AWS Parameter Store.
Pros:
Can be implemented quickly.
AWS recommends a similar caching approach for Lambda functions to optimize parameter retrieval. (Reference: AWS Blog)
Creates a shared component accessible by all sidecars.
Cons:
Requires additional development effort from the Eureka team.
May not apply to clusters using older sidecars.
Adds the cost of an additional AWS service.
Option 3: Develop a Dedicated Service as a Secure Key Store
Create a custom service to act as a secure key store, storing data in memory and functioning as a proxy for the AWS SSM Parameter Store. The service would implement the same protocol as AWS SSM, ensuring no changes are required on the sidecar's end.
Pros:
Compatible with older sidecar versions without requiring modifications.
Removing the AWS and Vault libraries from the sidecar would free up some sidecar resources.
Cons:
Requires significant development effort.
Option 4: Add Internal Sidecar In-Memory Cache
Currently, the sidecar caches tokens for service and system users, typically for 5 minutes, depending on the token expiration period. Once a token expires, the sidecar reauthorizes the user, which requires retrieving the password from AWS SSM again. Since system and service user passwords are set during the entitlement process and only change upon request by the DevOps team, it is unnecessary to fetch them from AWS SSM every time. Instead, we can store the password in the sidecar's memory to reduce the number of calls to the AWS SSM Parameter Store. The cache duration can be configured via an environment variable.
Pros:
Significantly reduces the number of external calls to the AWS SSM Parameter Store.
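The idea behind Option 4 can be sketched as a small time-based cache placed in front of the secret store lookup. This is an illustration only, not the sidecar's implementation: the class name, the environment variable name SECRET_CACHE_TTL_SECONDS, the one-hour default, and the key format are all assumptions, and the fetch function stands in for the real call to AWS SSM.

```python
import os
import threading
import time
from typing import Callable, Dict, Optional, Tuple

# Hypothetical variable name and default; the text only says the cache
# duration is configurable via an environment variable.
TTL_ENV_VAR = "SECRET_CACHE_TTL_SECONDS"
DEFAULT_TTL_SECONDS = 3600.0


class PasswordCache:
    """Keeps fetched passwords in memory until their TTL expires."""

    def __init__(self, fetch: Callable[[str], str],
                 ttl_seconds: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if ttl_seconds is None:
            ttl_seconds = float(os.environ.get(TTL_ENV_VAR, DEFAULT_TTL_SECONDS))
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}
        self._lock = threading.Lock()

    def get(self, key: str) -> str:
        # The lock is held during the fetch so concurrent callers do not
        # issue duplicate requests for the same key (simple, not optimal).
        with self._lock:
            now = self._clock()
            entry = self._entries.get(key)
            if entry is not None and now < entry[1]:
                return entry[0]  # still fresh: no external call
            value = self._fetch(key)
            self._entries[key] = (value, now + self._ttl)
            return value

    def invalidate(self, key: str) -> None:
        # Drop a cached password, e.g. after a login with it was rejected.
        with self._lock:
            self._entries.pop(key, None)


# Demonstration with a fake secret store and a controllable clock.
calls = []

def fake_fetch(key: str) -> str:
    calls.append(key)
    return f"secret-for-{key}"

fake_now = [0.0]
cache = PasswordCache(fake_fetch, ttl_seconds=600, clock=lambda: fake_now[0])

first = cache.get("tenant1_system-user")   # miss: fetched from the store
second = cache.get("tenant1_system-user")  # hit: served from memory
fake_now[0] = 601.0                        # move past the TTL
third = cache.get("tenant1_system-user")   # expired: fetched again

print(f"external calls made: {len(calls)}")
```

Because passwords change only on request by the DevOps team, a long TTL is plausible. The invalidate method is included so that a password rejected by Keycloak can be refetched immediately instead of waiting for the TTL to expire.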
Conclusion:
I suggest adopting a hybrid approach by combining Option 4 with Option 3. However, they should be implemented separately. Option 4 can be implemented quickly and provide immediate benefits in the short term.
Spike Status: COMPLETED