EUREKA-211: Spike - Revisit routing of mod-scheduler requests

Spike Overview

User Story: https://folio-org.atlassian.net/browse/EUREKA-211 Spike - Revisit routing of mod-scheduler requests

Objective: Revisit the decisions made in the wake of EUREKA-89: Spike - Design solution for scheduled system calls

See spike findings here: EUREKA-89 Spike - Design solution for scheduled system calls

The reason to revisit this is that Kong may not know how to route requests to these system interfaces in cases where multiple modules have timers that call the same system interface. It’s also not great that Kong is even aware of system interfaces, given that it only routes traffic from outside the system.

We may want to consider alternative solutions that avoid these problems. For example, we could make the mod-scheduler sidecar aware of all discovery and other relevant info or be able to retrieve it on an as-needed basis.

Problem Statement

Kong's inability to route requests correctly to system interfaces and its awareness of these interfaces are problematic. This routing confusion arises when multiple modules' timers call the same system interface, which is beyond Kong's intended functionality of managing external traffic.
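The ambiguity can be illustrated with a small sketch. The module names, the path, and the data structures below are hypothetical and do not reflect Kong's configuration model; they only show why a path alone does not identify a target once two modules expose the same system interface.

```python
from collections import defaultdict

# Hypothetical timer declarations: (module that owns the timer, path it calls).
timers = [
    ("mod-a", "/_/timer/cleanup"),
    ("mod-b", "/_/timer/cleanup"),
]

# Routing keyed by path alone: both modules end up under the same key.
by_path = defaultdict(list)
for module, path in timers:
    by_path[path].append(module)

# Routing keyed by (module, path): every timer call has exactly one target.
by_module_and_path = {(module, path): module for module, path in timers}

# Paths that more than one module claims, i.e. the ambiguous routes.
ambiguous = {path: mods for path, mods in by_path.items() if len(mods) > 1}
```

Any solution therefore needs the calling context (which module's timer fired, for which tenant) in addition to the path.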

Scope

This spike will explore alternative solutions to the current design to improve request routing for system interfaces. Potential approaches include enhancing the mod-scheduler sidecar to be aware of or retrieve necessary discovery and other relevant information dynamically. The goal is to devise a solution that bypasses the limitations faced by Kong in its current role.

Deliverables

The main idea is to work in detail on option 3, described in the EUREKA-89 Spike - Design solution for scheduled system calls.

There are several possible solutions here (options 3.1 to 3.4 below), but all of them originate from the same idea: mgr-applications and mgr-tenant-entitlements contain all the information needed to identify the correct route for a particular “timer call” from mod-scheduler. So these “timer” routes should not be configured in Kong, and the discovery information can be retrieved dynamically or made available when it is needed.

Option 3.1

In general, this option repeats the security ideas and principles described in option 1.2 of the spike EUREKA-89 Spike - Design solution for scheduled system calls, but removes Kong from the timer flows.

The following diagram displays the main components and actors involved in the flow.

Description

This option introduces a new component, mgr-discovery, that acts as a facade for the Sidecars and provides a single point of interaction between them and the mgr-* components. Each Sidecar will use mgr-discovery to read the bootstrap info for the FOLIO module it serves and to dynamically retrieve the discovery information used for timer routing.

Timer Flow

  1. mod-scheduler prepares a request and puts an impersonated token for the system user into the x-okapi-token request header. Then the request goes from mod-scheduler to its Sidecar.

  2. Because the Sidecar (mod-scheduler) has dynamic routing enabled and the requested route is not known, it calls mgr-discovery to resolve the discovery information.

  3. mgr-discovery calls mgr-Tenant-Entitlements to resolve the actual module's version.

  4. mgr-discovery calls mgr-Applications to retrieve the actual discovery information that points to the Sidecar (Module A).

  5. Sidecar (mod-scheduler) dynamically routes the request to the Sidecar (Module A).

  6. Sidecar (Module A) authorizes the request to Module A.

    1. get a token from a request header.

    2. parse the token.

    3. call Keycloak to evaluate permissions. Since the system user has access to all resources, authorization will be successful.

  7. Sidecar (Module A) forwards the request to Module A for processing.
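Steps 3 and 4 of the flow above can be sketched as a two-stage lookup. This is a minimal sketch under stated assumptions: the function name, the callable parameters, the module ID format, and the sample tenant and locations are all hypothetical and do not describe the real mgr-tenant-entitlements or mgr-applications APIs.

```python
from typing import Callable, Optional


def resolve_timer_route(
    tenant: str,
    path: str,
    find_entitled_module: Callable[[str, str], Optional[str]],
    find_sidecar_location: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Hypothetical mgr-discovery lookup for one timer route (steps 3 and 4)."""
    # Step 3: which module version is entitled for this tenant and serves the path?
    module_id = find_entitled_module(tenant, path)
    if module_id is None:
        return None
    # Step 4: where does the Sidecar for that module version live?
    return find_sidecar_location(module_id)


# Stand-ins for the data held by mgr-tenant-entitlements and mgr-applications.
entitlements = {("diku", "/timer-urlA"): "mod-a-1.2.0"}
locations = {"mod-a-1.2.0": "http://sc-mod-a:8082"}

location = resolve_timer_route(
    "diku",
    "/timer-urlA",
    lambda tenant, path: entitlements.get((tenant, path)),
    locations.get,
)
```

Because the lookup starts from the tenant's entitlements, two tenants running different versions of the same module can resolve to different Sidecar locations for the same path.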

Key Functions

  1. Bootstrap info reading:

    • When a request for bootstrap data is received, the module mgr-discovery performs calls to mgr-Tenant-Entitlements and mgr-Applications to collect and compile all the information needed for the Sidecar.

  2. Routes Resolution and Cache with Expiration Policy:

    • Purpose: The module mgr-discovery maintains a cache of routes that includes information about various system/timer interfaces and their respective routes. This cache has an expiration policy to ensure that the information remains up-to-date and valid.

    • Functionality: When a request for a specific route (e.g., /timer-urlA) is received, the module mgr-discovery checks its cache for the corresponding route. If the route information is available and valid, it proceeds to use it. If the information is expired or not found, the module queries mgr-Tenant-Entitlements and mgr-Applications to fetch and update the route information.

  3. Lightweight and Simple: The mgr-discovery module does not use any persistent storage and integrates only with mgr-Tenant-Entitlements and mgr-Applications.

  4. Single Integration Point for Sidecars: Sidecars will interact with the mgr-discovery only, which will simplify the Sidecar implementation as it no longer has to integrate with different mgr-* modules.
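The cache-with-expiration behavior described in item 2 could look roughly like the sketch below. The class name, the `loader` callable (a stand-in for the calls to the mgr-* modules), and the injectable clock are assumptions made for illustration; the spike does not prescribe a cache implementation or a TTL value.

```python
import time
from typing import Callable, Dict, Tuple


class RouteCache:
    """Hypothetical in-memory route cache with a time-to-live per entry."""

    def __init__(self, loader: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader          # stand-in for the mgr-* lookups
        self._ttl = ttl_seconds
        self._clock = clock            # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[str, float]] = {}

    def get(self, path: str) -> str:
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and entry[1] > now:
            return entry[0]            # cached and still valid
        location = self._loader(path)  # missing or expired: resolve again
        self._entries[path] = (location, now + self._ttl)
        return location
```

A short TTL keeps stale routes short-lived at the cost of more frequent lookups; the right value depends on how often entitlements change.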

Pros

  • Kong is excluded from the timer workflow. The system/timer interfaces are not configured in Kong and are not available from the outside.

  • The implementation is relatively simple. It will affect only the Sidecar and a new mgr-discovery module.

  • The Sidecar’s codebase will be simplified, as it will interact with the mgr-discovery module only and the rest of the complex logic will be encapsulated in mgr-discovery.

  • No need to track the timer interface changes as the discovery info will be resolved dynamically and stored in the cache with a short retention time.

  • No need for the sidecars to get ALL discovery information, only what’s required, when it’s required.

  • Protects timer interfaces in a standard way via Keycloak authorization, like all other resources

  • Can be extended later to support more accurate resource-based access as opposed to “all resources” access.

Cons

  • Multiple ways for sidecars to get discovery information when “Dynamic Routing” is enabled.

  • Multiple modules for dynamic routing resolution increase the overall complexity of the system.

  • Dynamic routing can potentially introduce additional latency due to the extra steps required to resolve discovery info. The usage of the cache should mitigate the issue.

  • mgr-discovery becomes a critical component. If it fails, it could disrupt the entire routing and bootstrapping mechanism. HA and redundancy should be used to mitigate this.

Option 3.2

Description

This option is similar to option 3.1 but removes the Sidecar from the routing process. The mod-scheduler uses mgr-discovery to resolve the actual route based on a provided URL, module name, and tenant ID.

Timer Flow:

  1. mod-scheduler prepares a request and puts an impersonated token for the system user into the x-okapi-token request header. The request is then sent directly to mgr-discovery.

  2. mgr-discovery receives the request and checks its cache for the corresponding route information. If not found or expired, it proceeds to resolve the route.

  3. mgr-discovery calls mgr-Tenant-Entitlements to resolve the actual module's version.

  4. mgr-discovery calls mgr-Applications to retrieve the actual discovery information that points to the Sidecar (Module A).

  5. mod-scheduler routes the request directly to the Sidecar (Module A).

  6. Sidecar (Module A) authorizes the request to Module A:

    1. get a token from a request header.

    2. parse the token.

    3. call Keycloak to evaluate permissions. Since the system user has access to all resources, authorization will be successful.

  7. Sidecar (Module A) forwards the request to Module A for processing.
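The three authorization sub-steps in step 6 can be sketched as follows. This is an illustration only: the function names are hypothetical, the `evaluate` callable is a stand-in for the call to Keycloak rather than a real Keycloak client API, the check for a `sub` claim is an assumed sanity check, and signature verification of the token is deliberately left out.

```python
import base64
import json
from typing import Callable, Mapping


def parse_token_claims(token: str) -> dict:
    """Decode the payload of a JWT without verifying its signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)   # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def authorize(headers: Mapping[str, str], path: str,
              evaluate: Callable[[str, str], bool]) -> bool:
    # 1. Get the token from the request header.
    token = headers.get("x-okapi-token")
    if not token:
        return False
    # 2. Parse the token; a token without a subject is rejected (assumed check).
    claims = parse_token_claims(token)
    if "sub" not in claims:
        return False
    # 3. Ask Keycloak to evaluate permissions; `evaluate` stands in for that call.
    return evaluate(token, path)
```

Since the system user has access to all resources, the stand-in for Keycloak would return a positive decision for every timer path in this flow.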

Key Functions

  1. Central Routing: mod-scheduler acts as the central routing component for all timer requests, eliminating the need for routing logic in the Sidecars.

  2. Routes Resolution and Cache with Expiration Policy: Similar to option 3.1, mgr-discovery maintains a cache of routes with an expiration policy to ensure up-to-date routing information.

  3. Direct Integration with mod-scheduler: mod-scheduler integrates directly with mgr-discovery for all timer requests, simplifying the request flow.

  4. Lightweight and Simple: The mgr-discovery module does not use any persistent storage and integrates only with mgr-tenant-entitlements and mgr-applications.

Pros

  • No changes in the Sidecar implementation.

  • Centralizes routing decisions in mgr-discovery, potentially making it easier to manage and update routing logic.

  • Reduces the number of hops in the request flow, potentially improving performance.

  • Maintains the benefits of dynamic routing and caching from option 3.1.

Cons

  • mod-scheduler becomes aware of mgr-* components, which breaks the separation of concerns and potentially introduces tighter coupling between these components.

  • Increases complexity in mod-scheduler as it now needs to handle route resolution logic.

  • Could potentially increase the load on mod-scheduler for frequent route resolutions.

  • Deviates from the standard pattern where modules interact with the system through their Sidecars.

Option 3.3

Description

This option modifies the behavior of the Sidecar to read all timer/all routes during the bootstrap phase, eliminating the need for dynamic routing resolution during runtime. The mgr-discovery component still plays a role in providing bootstrap information, including timer routes.

Timer Flow

  1. During the bootstrap phase, the Sidecar reads all timer routes from mgr-discovery, which retrieves this information from mgr-Tenant-Entitlements and mgr-Applications.

  2. mod-scheduler prepares a request and puts an impersonated token for the system user into the x-okapi-token request header. The request is then sent to its Sidecar.

  3. The Sidecar (mod-scheduler) already has the routing information for timer requests and directly forwards the request to the appropriate Sidecar (Module A).

  4. Sidecar (Module A) authorizes the request to Module A:

    1. get a token from a request header.

    2. parse the token.

    3. call Keycloak to evaluate permissions. Since the system user has access to all resources, authorization will be successful.

  5. Sidecar (Module A) forwards the request to Module A for processing.

Key Functions

  1. Bootstrap Timer Route Loading: During the bootstrap phase, the Sidecar loads all timer routes from mgr-discovery, which compiles this information from mgr-tenant-entitlements and mgr-applications.

  2. Static Routing: The Sidecar uses the pre-loaded routing information for timer requests, eliminating the need for dynamic routing resolution during runtime.

  3. Lightweight mgr-discovery: The mgr-discovery module primarily serves bootstrap information and doesn't need to handle runtime route resolution requests.
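A minimal sketch of the bootstrap-then-static behavior, assuming a hypothetical `load_bootstrap_routes` callable in place of the call to mgr-discovery; the class name and the route shape (path to Sidecar URL) are also assumptions.

```python
from typing import Callable, Dict


class StaticTimerRouter:
    """Hypothetical Sidecar router whose timer routes are fixed at bootstrap."""

    def __init__(self, load_bootstrap_routes: Callable[[], Dict[str, str]]) -> None:
        # One call at startup; the result is copied and never refreshed.
        self._routes = dict(load_bootstrap_routes())

    def target_for(self, path: str) -> str:
        # Runtime lookup never leaves the process.
        try:
            return self._routes[path]
        except KeyError:
            raise LookupError(f"no timer route loaded for {path}") from None


source = {"/timer-urlA": "http://sc-module-a:8082"}
router = StaticTimerRouter(lambda: source)

# A route that appears after bootstrap is invisible to the router.
source["/timer-urlB"] = "http://sc-module-b:8082"
```

The last two lines show the trade-off of this option: lookups are cheap, but anything entitled after startup is not routable until the routes are reloaded.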

Pros:

  • Simplifies runtime routing by eliminating the need for dynamic resolution.

  • Potentially improves performance by reducing the number of interactions required for each timer request.

  • Reduces load on mgr-discovery during normal operation.

Cons:

  • Lacks a clear mechanism for updating routing information when changes occur (e.g., during the entitlements process for different tenants and applications).

  • May lead to outdated routing information if changes occur after the bootstrap phase.

  • Increases memory usage in Sidecars as they need to store all timer routes.

  • Less flexible than dynamic routing options when dealing with frequent changes or a large number of routes.

Gap: The main gap in this option is the lack of a clear mechanism to notify the Sidecar about routing changes made by the mgr-tenant-entitlements module during the entitlements process for different tenants and applications. This is represented by the bold red arrow between mgr-tenant-entitlements and Sidecar in the diagram.

Potential solutions to this gap could include:

  1. Implementing a notification system where mgr-tenant-entitlements pushes updates to affected Sidecars. (Kafka?)

  2. Having Sidecars periodically poll for updates to their routing information.

Resolving this gap is crucial for ensuring that the Sidecars always have up-to-date routing information, especially in a dynamic multi-tenant environment where entitlements may change frequently.
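Both potential solutions come down to the Sidecar replacing its route table after startup. The sketch below assumes a hypothetical `fetch` callable that returns the current routes; `refresh()` could be driven by a periodic poll or by a handler for a pushed notification. Neither trigger is implemented here.

```python
from typing import Callable, Dict


class RefreshableRoutes:
    """Hypothetical route table that can be reloaded after bootstrap."""

    def __init__(self, fetch: Callable[[], Dict[str, str]]) -> None:
        self._fetch = fetch
        self._routes = dict(fetch())

    def refresh(self) -> bool:
        """Re-read the routes; return True if anything changed."""
        latest = dict(self._fetch())
        changed = latest != self._routes
        self._routes = latest          # swap the whole table at once
        return changed

    def target_for(self, path: str) -> str:
        return self._routes[path]
```

With polling, the staleness window is bounded by the poll interval; with push, it is bounded by notification delivery time, at the cost of an extra messaging dependency.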

Option 3.4

Description

In this option, the mgr-discovery component is removed, and its functionality is fully integrated into the Sidecar. This allows the Sidecar to handle all dynamic route discovery, caching, and forwarding processes internally. The architecture simplifies the overall system by reducing external dependencies while still enabling dynamic routing through an explicit configuration mechanism.

A key aspect of this option is the introduction of a Dynamic Routing Enabled/Disabled switch within the Sidecar. This switch must be explicitly enabled via a configuration parameter for the Sidecar to support dynamic routing. When dynamic routing is disabled, the Sidecar relies only on its preloaded, preconfigured routes. When it is enabled and a route is not present in the preconfigured set, the Sidecar discovers the route dynamically, using its internal cache and calls to external components such as mgr-Tenant-Entitlements and mgr-Applications.
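The switch semantics can be sketched as below. The parameter name `DYNAMIC_ROUTING_ENABLED` is a hypothetical placeholder (the spike does not name the configuration parameter), and `discover` stands in for the dynamic resolution path.

```python
from typing import Callable, Dict, Mapping, Optional


def dynamic_routing_enabled(env: Mapping[str, str]) -> bool:
    # Hypothetical parameter name; disabled unless explicitly switched on.
    return env.get("DYNAMIC_ROUTING_ENABLED", "false").strip().lower() == "true"


def route(path: str, preconfigured: Dict[str, str], enabled: bool,
          discover: Callable[[str], Optional[str]]) -> Optional[str]:
    if path in preconfigured:
        return preconfigured[path]      # preloaded routes are checked first
    if not enabled:
        return None                     # switch off: no dynamic lookup
    return discover(path)               # switch on: resolve dynamically
```

Defaulting to disabled means existing Sidecars keep their current behavior, and only the Sidecars that need it (such as the one serving mod-scheduler) opt in.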

Components Involved:

  1. mod-Scheduler: Handles timer-based jobs that initiate requests for various timer interfaces.

  2. Sidecar (mod-Scheduler): Now equipped with additional functionality for dynamic routing. It manages route discovery, caching with expiration policies, and directly interacts with other system modules.

  3. Sidecar (Module A): Receives the routed request from the Sidecar (mod-Scheduler) and forwards it to the appropriate internal modules after performing authorization.

  4. Keycloak: Provides authorization services, ensuring that requests are authenticated with a system token. The system user has access to all resources.

  5. mgr-Tenant-Entitlements: Provides module version information required for routing.

  6. mgr-Applications: Supplies the discovery information necessary for dynamic routing and module interactions.

  7. mod-Users-Keycloak: Manages the creation and enablement of system users for authentication purposes.

Flow:

  1. The mod-Scheduler prepares a request (POST to a timer URL) and includes an impersonated token for a system user in the request header. It sends this request to the Sidecar (mod-Scheduler).

  2. The Sidecar (mod-Scheduler) checks its internal Routes Cache for the required route. If the route information is available and still valid, it proceeds to step 5. If the route is expired or missing, it dynamically resolves the route by retrieving data from external modules.

  3. The Sidecar (mod-Scheduler) calls mgr-Tenant-Entitlements to resolve the version of the module (Module A) that is enabled for the current tenant and has timers associated with the requested URL.

  4. The Sidecar (mod-Scheduler) then queries mgr-Applications to obtain the actual route for the resolved module version.

  5. Once the route is determined, the Sidecar (mod-Scheduler) forwards the request to the appropriate Sidecar (Module A).

  6. The Sidecar (Module A) verifies the request by retrieving and validating the system token. It uses Keycloak to evaluate the permissions, confirming that the system user has access to all resources.

  7. After successful authorization, the Sidecar (Module A) forwards the request to Module A for further processing.
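Step 5 of this flow, forwarding to the resolved Sidecar, can be sketched with the Python standard library. The function name and the sample location and token are hypothetical; the sketch only builds the request and does not send it.

```python
import urllib.request
from typing import Mapping


def build_forward_request(location: str, path: str, body: bytes,
                          incoming_headers: Mapping[str, str]) -> urllib.request.Request:
    """Build (but do not send) the request for the target Sidecar."""
    url = location.rstrip("/") + path
    # Carry the system-user token through unchanged so that
    # Sidecar (Module A) can authorize the call in step 6.
    headers = {"x-okapi-token": incoming_headers["x-okapi-token"]}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_forward_request(
    "http://sc-module-a:8082/", "/timer-urlA", b"", {"x-okapi-token": "token-value"}
)
```

Only the host part of the URL changes between the incoming and the forwarded request; the path, the method, and the token are passed through as received.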

Key Features:

  1. Internal Routes Cache with Expiration Policy: The Sidecar (mod-Scheduler) maintains a cache of routes with an expiration policy to ensure up-to-date and valid routing information. When a route expires, the Sidecar automatically refreshes the route information by querying mgr-Tenant-Entitlements and mgr-Applications.

  2. Dynamic Route Discovery: The Sidecar is responsible for dynamically resolving routes for module versions and interfaces by querying the required information from mgr-Tenant-Entitlements and mgr-Applications. This eliminates the need for external mgr-discovery.

  3. Authorization via Keycloak: All interactions with Module A are secured using Keycloak, ensuring that only authenticated system users with appropriate permissions can access the resources.

  4. Self-Sufficient Sidecar: The Sidecar (mod-Scheduler) is now a self-sufficient component that handles all aspects of route discovery, caching, and dynamic routing without relying on a separate discovery component. It still queries mgr-Tenant-Entitlements and mgr-Applications, but no intermediate module is needed, which reduces system complexity and increases the Sidecar's autonomy in managing requests.

Pros:

  1. Reduced System Complexity: By eliminating the mgr-discovery component and integrating its functionality into the Sidecar, the overall system architecture is simplified. This reduces the need for external dependencies and minimizes potential points of failure.

  2. Centralized Routing Logic: The Sidecar now handles all aspects of routing, including dynamic route resolution and caching, reducing the need for multiple components and simplifying management.

  3. Improved Performance: The internal route cache with an expiration policy allows the Sidecar to quickly resolve frequently used routes, reducing the latency associated with querying external components repeatedly.

  4. Greater Autonomy for Sidecars: Each Sidecar becomes more self-sufficient, handling route discovery and caching independently, which can make the system more scalable and less reliant on centralized routing services.

  5. Easier Maintenance: With fewer moving parts (no need for a separate mgr-discovery component), maintaining and troubleshooting the system becomes easier, with a more focused logic inside the Sidecar.

Cons:

  1. Increased Sidecar Complexity: By integrating route discovery and caching functionalities into the Sidecar, the component becomes more complex to develop, maintain, and debug. This increased complexity could lead to longer development cycles and more potential bugs.

  2. Higher Memory and Resource Usage: The Sidecar now stores route information in a cache, potentially increasing its memory footprint. For systems with many routes, this could require more memory and CPU resources to manage effectively.

  3. Potential Cache Staleness: While the cache has an expiration policy, there is still a risk that the route information could become outdated before the cache refreshes. This could lead to routing errors until the cache is updated.

  4. Lack of Centralized Route Management: Since each Sidecar now independently handles route discovery, ensuring consistency across different instances of Sidecars could be more challenging, especially in multi-tenant environments.

  5. Greater Dependency on Sidecar Robustness: The Sidecar becomes a critical point of failure. If a Sidecar encounters an issue, routing could be disrupted for that particular service, affecting its ability to process requests effectively.

  6. Potential Latency from Cache Expiration: When the cache expires, the Sidecar needs to query external modules to refresh routing information. This could introduce additional latency during cache refreshes, particularly in high-traffic deployments.

Conclusion

During several rounds of discussion, option 3.4 was selected because it best suits the requirements. We’d like to consolidate the mechanisms used by the sidecars to obtain discovery information.

Spike Status: Completed