
Investigation into the sidecar's interaction with the AWS Parameter Store.


Spike Overview

We have observed that sidecars are sometimes unable to retrieve parameters from the AWS Parameter Store, such as passwords for system users. Without this information, the affected sidecars cannot work properly.

Problem Statement

During our investigation into the sidecar performance issue related to authorization in Keycloak, we observed that some sidecars cannot authorize with Keycloak to retrieve the system user token. Upon further analysis, we discovered that these sidecars fail to retrieve passwords for system users from the AWS Parameter Store because they exceed the allowed rate limit, resulting in a "Rate Exceeded" exception. 

2024-12-27 08:34:14,774 WARN  [org.fol.sid.ser.tok.SystemUserTokenProvider] (ForkJoinPool.commonPool-worker-3) Failed to obtain system user token: message = org.folio.tools.store.exception.NotFoundException: software.amazon.awssdk.services.ssm.model.SsmException: Rate exceeded (Service: Ssm, Status Code: 400, Request ID: d56a44a2-3184-4833-ae13-b73e32fd2852): java.util.concurrent.ExecutionException: org.folio.tools.store.exception.NotFoundException: software.amazon.awssdk.services.ssm.model.SsmException: Rate exceeded (Service: Ssm, Status Code: 400, Request ID: d56a44a2-3184-4833-ae13-b73e32fd2852)

Below are the statistics from the last 24 hours showing the sidecars that could not access the AWS Parameter Store to authorize requests. Envnt_count indicates the number of times the system failed to reach SSM to retrieve the password.

image (2).png

It appears that around 10 modules are unstable, causing significant issues with bulk edits, circulation, data exports, data imports, and other workflows.

Deliverables

Interim solution

  1. Increase Throughput:
    Increase the AWS Parameter Store throughput from 40 to 10,000 requests per second. However, this would raise the cost for the Bugfest cluster, with an estimated additional expense of $30–$60 per day.

  2. Deploy Vault in the Cluster:
    Deploy Vault within the cluster and reconfigure all sidecars to use it. This approach does not require additional costs because we can reuse the existing host. However, it would require us to manage the availability, maintenance, backup, and recovery plans ourselves moving forward.

Strategic solution

To optimize performance and improve the code, we need to reduce the number of calls made to the AWS Parameter Store. These calls should only be made when it is absolutely necessary to retrieve the system user password.
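One way to limit these calls is to keep an already retrieved value in memory and go back to the Parameter Store only when the value is missing or has been invalidated. The following is a minimal sketch of that idea, not the sidecar's actual implementation: the class name CachingSecretLookup, the key format, and the remoteLookup function (which stands in for the SSM-backed store) are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/**
 * Hypothetical wrapper that remembers values already fetched from the secret store,
 * so repeated token refreshes do not trigger repeated Parameter Store calls.
 */
final class CachingSecretLookup {

  // Stand-in for the real remote call (assumption: it takes a key and returns the secret).
  private final Function<String, String> remoteLookup;
  private final Map<String, String> cache = new ConcurrentHashMap<>();

  CachingSecretLookup(Function<String, String> remoteLookup) {
    this.remoteLookup = remoteLookup;
  }

  /** Returns the cached value; the remote store is called only when the key is not cached. */
  String get(String key) {
    // If remoteLookup throws, nothing is cached and the next call tries again.
    return cache.computeIfAbsent(key, remoteLookup);
  }

  /** Drops a cached value, for example after a login with the stored password is rejected. */
  void invalidate(String key) {
    cache.remove(key);
  }

  public static void main(String[] args) {
    AtomicInteger remoteCalls = new AtomicInteger();
    CachingSecretLookup lookup = new CachingSecretLookup(key -> {
      remoteCalls.incrementAndGet();
      return "secret-for-" + key;
    });

    lookup.get("hypothetical-key");
    lookup.get("hypothetical-key");
    System.out.println(remoteCalls.get()); // prints 1: the second get is served from memory
  }
}
```

The invalidate method matters if a system user password is rotated: without it, a sidecar would keep using a stale value until restart. Whether an in-memory cache is acceptable for secrets is a design decision this sketch does not settle.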

Feedback from Dima:

  • The mod-scheduler module does not use a system user, making these requests unnecessary; however, they are still being performed.

  • The sidecar currently has a condition to determine whether it should request a system user token.

image003.png

This condition is not functioning as intended. It would make more sense to implement a condition like this:

image007.png

Proposed Improvement:

  • Updating the validation logic to account for scenarios like the mod-scheduler could help.

  • Specifically, modifying the validation as follows would prevent unnecessary requests from mod-scheduler’s sidecar, as it already receives an x-okapi-token for timer requests. 

image008.png

This works because mod-scheduler’s sidecar receives an x-okapi-token with each timer request:

image001.png
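Since the screenshots are not reproduced in text, the sketch below illustrates the kind of check described above. It is hypothetical and does not reproduce the actual sidecar code: the class name, the method name, the moduleHasSystemUser flag, and the assumption that header names arrive lower-cased are all introduced here for illustration. The idea is that a system user token (and therefore a Parameter Store lookup) is requested only when the module actually has a system user and the incoming request does not already carry an x-okapi-token.

```java
import java.util.Map;

/** Hypothetical check for whether a system user token needs to be requested. */
final class SystemUserTokenCondition {

  static final String TOKEN_HEADER = "x-okapi-token";

  private SystemUserTokenCondition() {
  }

  /**
   * @param headers request headers, assumed to be keyed by lower-case header name
   * @param moduleHasSystemUser assumed flag telling whether the module defines a system user
   * @return true only if a system user exists and the request has no usable token
   */
  static boolean shouldRequestSystemUserToken(Map<String, String> headers, boolean moduleHasSystemUser) {
    String token = headers.get(TOKEN_HEADER);
    boolean tokenPresent = token != null && !token.isBlank();
    return moduleHasSystemUser && !tokenPresent;
  }

  public static void main(String[] args) {
    // A timer request that already carries a token, for a module without a system user.
    Map<String, String> timerRequest = Map.of(TOKEN_HEADER, "some-token");
    System.out.println(shouldRequestSystemUserToken(timerRequest, false)); // prints false
  }
}
```

With a check of this shape, a sidecar for a module such as mod-scheduler would never reach the password lookup, because both parts of the condition rule it out.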

Conclusion

Refining the logic for system user password retrieval and ensuring proper validation will significantly reduce unnecessary calls to the AWS Parameter Store, improving overall efficiency.

Spike Status: IN PROGRESS
