EUREKA-89 Spike - Design solution for scheduled system calls

Spike Overview

User Story: EUREKA-89 Spike - Design solution for scheduled system calls

Objective: Select an approach for routing mod-scheduler calls to system interfaces that are invoked on a predefined schedule (aka timers).

Background

Timers defined in module descriptors sometimes specify endpoints that are not defined as part of regular/standard interfaces. As a result, information about these endpoints is not included in the bootstrap/discovery information that sidecars use to route module-to-module calls. At present, the mod-scheduler sidecar sends timer requests directly to Kong to keep things simple, and mgr-tenant-entitlements registers timer interfaces as routes in Kong. Without that, sidecars would need to either get all discovery information or obtain new discovery information whenever a new timer is created.

Problem Statement

We discovered that some _timer interfaces point to system interfaces which, for good reason, should not be routed through Kong, or at least should not be made publicly available in Kong. This means that calls from mod-scheduler to these endpoints will fail. We must find a solution that works for requests to both public and system interfaces.

Scope

In Scope

  • Egress request routing from mod-scheduler

  • Requests to both public and system interfaces

  • Changes/enhancements to Eureka core components (i.e., module sidecars, mod-scheduler, and/or Kong)

Research Questions

  1. How should mod-scheduler calls be routed to system interfaces?

  2. What is the relative effort and complexity for each of the solutions?

Deliverables

There are several options, but all of them originate from two ideas:

  1. Route system interfaces in Kong, but in such a way that mod-scheduler is the only valid source of these requests.

  2. Have mod-scheduler send egress requests to its sidecar like every other module, and extend the sidecar with the ability to retrieve the routing information needed to handle these requests.

Option 1.1

Route all timer requests (for regular and system interfaces) through Kong but, in the case of system interfaces, allow only requests coming from the internal subnet and block any calls to system interfaces from the outside. This type of barrier can be enforced with a custom Kong plugin; let’s call it the Private Resources Barrier (PRB) plugin.

The plugin should be aware of the network boundaries within which the Eureka cluster with Folio modules is deployed. How exactly this information can be provided to the plugin is TBD. With that knowledge, requests to system interfaces can be filtered on Kong’s side by the requestor’s IP address:

  • request comes from internal node (including a node with mod-scheduler) → let it pass through;

  • request comes from external network (internet) → forbid and return "404 Route not found", as it's done in case of unknown route.
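The filtering rule above can be sketched as follows. This is a minimal Python illustration of the decision logic only, not plugin code (an actual Kong plugin would typically be written in Lua). The CIDR range, the “private” tag name, and the function names are assumptions made for the example.

```python
import ipaddress
from typing import Optional, Sequence

# Assumption: the internal network is described by CIDR ranges (hypothetical value).
INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]
# Assumption: system routes carry this tag; the final tag name is still undecided.
PRIVATE_TAG = "private"


def is_internal(client_ip: str) -> bool:
    """Return True if the caller's address falls inside an internal range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in INTERNAL_NETWORKS)


def barrier_status(route_tags: Sequence[str], client_ip: str) -> Optional[int]:
    """Return None to let the request through, or 404 to hide the route."""
    if PRIVATE_TAG not in route_tags:
        return None  # public route: the barrier adds no extra check
    if is_internal(client_ip):
        return None  # internal caller, e.g. the mod-scheduler sidecar
    return 404  # external caller: respond as if the route did not exist
```

How the caller's address is obtained (direct peer address versus a forwarded header) would depend on how Kong is deployed, and is part of the open question about network boundaries.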

The following diagram displays main components and actors involved in the flow

Folio modules and their sidecars are deployed inside a cluster with a predefined private network (marked as “Intranet” on the diagram). Among other modules, the deployment contains mod-scheduler, which runs scheduled jobs, and a business module named “Module A”. Module A provides the regular interface /regular-url-A along with the _timer interface /timer-url-A. Detailed information about the Module A interfaces is contained, as usual, in its Module Descriptor.

  1. First, the Tenant Entitlement manager (MTE) enables an application containing Module A for a tenant. During this process MTE gets the Module Descriptor of Module A and:

a. creates routes in Kong for all of the module’s public interfaces from the "provides" section of the Module Descriptor

b. creates routes in Kong for the _timer interfaces. Each newly created route is marked with a special tag, “private” (or “internal”, or “system”; the naming is up to us), to identify the interface as available internally only

image-20240502-122538.png

c. creates (via Kafka) scheduled jobs for Module A _timer endpoints in mod-scheduler

  2. A scheduled job is triggered for the /timer-url-A endpoint. The request goes from mod-scheduler to its sidecar, which in turn forwards the request to Kong, since the request URL is not registered in the sidecar’s egress routing table.

    1. Kong receives the request and searches for known routes associated with the given URL and method.

    2. Once the route is found, the PRB plugin checks whether it has the “private” tag assigned.

      • For /timer-url-A the tag is assigned, so the plugin additionally checks whether the request comes from the internal network.

      • For a call from mod-scheduler this is true, and the plugin lets the request be forwarded to the target Module A sidecar.

    3. Finally, the Module A sidecar calls the /timer-url-A interface of the module and the chain of calls succeeds.

  3. An external actor (UI or another system) attempts to request the /timer-url-A system interface. Kong performs the same steps to handle the request as it does for a call from mod-scheduler:

    1. the route is searched in the list of registered routes

    2. then PRB plugin is called to apply additional validation:

      • since the route has “private“ tag assigned, the request is also tested to check if the caller belongs to internal network

      • external actor doesn’t belong to internal network so the plugin rejects this call with "404 Route not found" error.
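Step 1 of the flow, where MTE turns a Module Descriptor into Kong routes, might look roughly like the sketch below. The descriptor is a simplified, assumed shape; the route dictionaries only loosely mirror Kong route fields (paths, methods, tags); `build_routes` is a hypothetical helper; and skipping non-timer system interfaces is an assumption based on the problem statement.

```python
PRIVATE_TAG = "private"  # assumed tag name, still to be decided


def build_routes(descriptor: dict) -> list:
    """Derive route definitions for public and _timer interfaces of a module."""
    routes = []
    for interface in descriptor.get("provides", []):
        is_timer = interface.get("id") == "_timer"
        is_system = interface.get("interfaceType") == "system"
        if is_system and not is_timer:
            continue  # assumption: other system interfaces get no Kong route
        for handler in interface.get("handlers", []):
            routes.append({
                "paths": [handler["pathPattern"]],
                "methods": list(handler["methods"]),
                # Timer routes are tagged so the barrier can recognise them.
                "tags": [PRIVATE_TAG] if is_timer else [],
            })
    return routes


# Hypothetical, heavily simplified descriptor for Module A.
MODULE_A = {
    "id": "mod-a-1.0.0",
    "provides": [
        {"id": "regular-a", "handlers": [
            {"methods": ["GET"], "pathPattern": "/regular-url-A"}]},
        {"id": "_timer", "interfaceType": "system", "handlers": [
            {"methods": ["POST"], "pathPattern": "/timer-url-A"}]},
        {"id": "_tenant", "interfaceType": "system", "handlers": [
            {"methods": ["POST"], "pathPattern": "/_/tenant"}]},
    ],
}
```

With this input, `build_routes(MODULE_A)` yields one untagged route for /regular-url-A and one route tagged “private” for /timer-url-A.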

Open questions

  1. what are the options for defining boundaries of internal network and providing this information to PRB plugin

  2. what are the ways to inject the plugin into Kong request processing flow

    1. is it possible to run the plugin just right after Kong has found a route

    2. will the selected route be available to the plugin

Pros

  • the implementation is expected to be reasonably simple; it will affect only MTE and Kong

  • mod-scheduler and folio-module-sidecar remain unchanged

  • the approach can also address Public/Private API problem

Cons

  • plugin development in Kong requires knowledge of the Lua language, with which the Eureka team has limited experience

Option 1.2

At the moment the scheduler executes jobs created from timer interfaces on behalf of a system user, which is granted access to all resources in the system. This system user is created automatically by mod-users-keycloak during the tenant entitlement process. In addition, a sidecar doesn’t perform any authorization checks when it receives a request to call a timer endpoint of its module, even though the authorization context contains the system user’s token.

The idea is to continue using this system user but to enable authorization checks in the sidecar for timer endpoints. This way only users that have been granted access will be allowed to call timer interfaces. The existing system user already has access to all resources, including timer ones, so its authorization will succeed. On the other hand, if a timer interface is called by an arbitrary user without the proper rights, the sidecar will reject the request as forbidden.

The following diagram displays main components and actors involved in the flow (note that most of the interactions are already in place)

The diagram contains mod-scheduler, which runs scheduled jobs, and a business module named “Module A”. Module A provides the regular interface /regular-url-A along with the _timer interface /timer-url-A. Detailed information about the Module A interfaces is contained, as usual, in its Module Descriptor.

  1. First, the Tenant Entitlement manager (MTE) enables an application containing Module A for a tenant. During this process MTE gets the Module Descriptor of Module A and:

a. creates routes in Kong for all of the module’s public interfaces from the "provides" section of the Module Descriptor and for all _timer interfaces, so that they also become available for routing by the API Gateway

b. creates resources in Keycloak, but only for the public interfaces of the module. No resources are created for timer interfaces, so Keycloak is still not aware of any system interfaces and has no specific rules (permissions) to authorize access to those resources

c. enables the mod-users-keycloak module for the tenant. During this process the module creates a special system user, named <tenant>-system-user (for example, diku2-system-user), with the "System" role, which grants access to all resources in the system

image-20240516-085853.png
diku2-system-user in Keycloak, created by mod-users-keycloak

d. creates (via Kafka) scheduled jobs for Module A _timer endpoints in mod-scheduler. Scheduler associates the system user with jobs created from timer interfaces. As a result these jobs will be executed on behalf of the system user

  2. A scheduled job is triggered for the /timer-url-A endpoint. mod-scheduler prepares a request and puts an impersonated token for the system user into the x-okapi-token request header. The request then goes from mod-scheduler to its sidecar, which in turn forwards the request to Kong, since the request URL is not registered in the sidecar’s egress routing table.

    1. Kong receives the request and searches for known routes associated with the given URL and method.

      • Once the route is found, the request is forwarded to the target Module A sidecar.

    2. The Module A sidecar receives the request and, as usual, performs several steps to authorize it:

      • get the token from the request header

      • parse the token

      • call Keycloak to evaluate permissions. Since the system user has access to all resources, authorization will be successful

    3. Finally, the Module A sidecar calls the /timer-url-A interface of the module and the chain of calls succeeds.

  3. An external actor (UI or another system) attempts to request the /timer-url-A system interface.

    1. Kong performs the same steps to handle the request as it does in case of a call from mod-scheduler:

      • the route is searched in the list of registered routes

      • once the route is found, the request will be forwarded to target Module A sidecar

    2. Module A sidecar receives the request and tries to authorize request in Keycloak

      • external actor has no permission to access /timer-url-A system interface, thus Keycloak will reject the call with "403 Forbidden" error.
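The two outcomes of this flow can be illustrated with a small sketch. It stands in for the real components: the `tokens` mapping replaces token parsing, `evaluate_permissions` replaces the call to Keycloak, and the 401 response for a missing or unknown token is an assumption. All names other than x-okapi-token and the "System" role are hypothetical.

```python
from typing import Dict, Optional, Set

SYSTEM_ROLE = "System"  # role held by <tenant>-system-user, per the text above


def evaluate_permissions(roles: Set[str], path: str,
                         grants: Dict[str, Set[str]]) -> bool:
    """Stand-in for the Keycloak permission evaluation."""
    if SYSTEM_ROLE in roles:
        return True  # the System role grants access to every resource
    return any(path in grants.get(role, set()) for role in roles)


def authorize(headers: Dict[str, str], path: str,
              tokens: Dict[str, Set[str]],
              grants: Dict[str, Set[str]]) -> int:
    """Return the HTTP status the sidecar would answer with."""
    token = headers.get("x-okapi-token")
    # Looking the token up in a dict replaces real JWT parsing in this sketch.
    roles: Optional[Set[str]] = tokens.get(token) if token else None
    if roles is None:
        return 401  # assumption: missing or unknown token is unauthenticated
    return 200 if evaluate_permissions(roles, path, grants) else 403
```

A request carrying the system user's token is allowed on /timer-url-A even though no Keycloak resource exists for it, while a regular user's token yields 403 on the same path.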

Implementation details

The system already supports the above flow except for one piece: request authorization for _timer interfaces in the sidecar. Currently timer interfaces are treated as system interfaces, along with _tenant interfaces. For system interfaces in general the sidecar skips token analysis and authorization, but from now on it should no longer do so for timer interfaces.

The logic that verifies the user token and access grants resides in classes called filters. Each filter has a shouldSkip(RoutingContext rc) method that determines whether the filter should be applied to a particular request; KeycloakAuthorizationFilter is one example.

So, to enable authorization for timer interfaces, the shouldSkip(RoutingContext rc) method should be modified in several filters:

  • KeycloakJwtFilter

  • KeycloakTenantFilter

  • KeycloakAuthorizationFilter (the main one responsible for sending authorization request to Keycloak)
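The intended change to the skip decision can be illustrated with a small sketch. It is written in Python for brevity and does not reproduce the sidecar's Java filters; the function names and the way an interface is described (`interface_id`, `interface_type`) are assumptions.

```python
def should_skip_current(interface_id: str, interface_type: str) -> bool:
    """Current behaviour as described above: every system interface bypasses the filter."""
    return interface_type == "system"


def should_skip_proposed(interface_id: str, interface_type: str) -> bool:
    """Proposed behaviour: _timer is no longer skipped; other system interfaces still are."""
    return interface_type == "system" and interface_id != "_timer"
```

Under the proposed rule a _timer request goes through the JWT, tenant, and authorization filters like a regular request, while _tenant requests keep bypassing them.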

Pros

  • the implementation is simple; it will affect only sidecar code

  • protects timer interfaces in a standard way via Keycloak authorization, like all other resources

  • can be extended later to support more fine-grained, resource-based access as opposed to “all resources” access

Cons

  • the system user has unrestricted access to all system resources. Ideally there would be a dedicated user and role (like “Run scheduled jobs”) with access to timer interfaces only. Right now that is not easy to implement, because timers usually do not require any permissions, which makes it tricky to build the right capabilities for them and assign those automatically to a role

Option 2

Have mod-scheduler send egress requests to its sidecar like every other module, and add a switch to the module sidecar which indicates that it should process all scheduled job events to build and maintain routing information. This information should be stored on a per-tenant basis. It also has to be saved to a DB so that job requests can still be handled properly after a sidecar restart.
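A rough sketch of such a per-tenant routing table is shown below. The event fields (`type`, `tenant`, `method`, `path`, `moduleId`) are a hypothetical shape, not the actual scheduled job event format, and JSON serialisation stands in for the database persistence mentioned above.

```python
import json
from collections import defaultdict
from typing import Optional


class TimerRoutingTable:
    """Per-tenant map of "METHOD path" -> target module, built from job events."""

    def __init__(self) -> None:
        self._routes = defaultdict(dict)  # tenant -> {"METHOD path": module id}

    def on_job_event(self, event: dict) -> None:
        """Apply one (hypothetical) scheduled job event to the table."""
        key = f'{event["method"]} {event["path"]}'
        if event["type"] == "DELETE":
            self._routes[event["tenant"]].pop(key, None)
        else:  # CREATE or UPDATE
            self._routes[event["tenant"]][key] = event["moduleId"]

    def resolve(self, tenant: str, method: str, path: str) -> Optional[str]:
        """Return the target module for a request, or None if unknown."""
        return self._routes.get(tenant, {}).get(f"{method} {path}")

    def dump(self) -> str:
        """Serialise the table; stands in for writing it to a DB."""
        return json.dumps(self._routes)

    @classmethod
    def load(cls, payload: str) -> "TimerRoutingTable":
        """Rebuild the table after a restart from previously dumped state."""
        table = cls()
        for tenant, routes in json.loads(payload).items():
            table._routes[tenant].update(routes)
        return table
```

Even in this reduced form the sidecar has to own event handling, tenant scoping, and persistence, which is the extra responsibility listed in the cons below.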

Pros

  • No special handling required in Kong

  • No security concerns

Cons

  • the most complicated solution

  • the sidecar needs to retrieve and manage discovery and interface/endpoint information for all scheduled job APIs (system / public interfaces) in the system.

  • scheduler specific logic introduced in sidecar code which is of general purpose

  • potential growth of memory consumption due to the increased volume of routing information

  • sidecar would need to manage permanent storage

  • partial overlap of responsibilities with Kong, which already manages routes for each tenant. The sidecar would need to do the same, but for a smaller group of routes

Option 3

Sidecars dynamically retrieve discovery information on an as-needed basis. If this fails, they fall back to routing the request to Kong. This may or may not require a new endpoint in mgr-tenant-entitlements.
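The lookup-with-fallback behaviour could be sketched as follows. The `lookup` callable is a placeholder for whatever discovery call the sidecar would make, and the gateway address and class name are hypothetical.

```python
from typing import Callable, Dict, Optional

KONG_URL = "http://kong:8000"  # hypothetical gateway address


class LazyDiscovery:
    """Resolve egress targets on demand, falling back to Kong on failure."""

    def __init__(self, lookup: Callable[[str], Optional[str]]) -> None:
        self._lookup = lookup  # placeholder for the real discovery request
        self._cache: Dict[str, str] = {}

    def target_for(self, path: str) -> str:
        """Return the location to send a request for `path` to."""
        if path in self._cache:
            return self._cache[path]
        try:
            location = self._lookup(path)
        except Exception:
            location = None  # treat lookup errors like "not found"
        if location is None:
            # Fallback is not cached, so a later lookup can still succeed.
            return KONG_URL
        self._cache[path] = location
        return location
```

Only the paths that are actually requested end up in the cache, which matches the "only what's required, when it's required" advantage listed below.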

Pros

  • No special handling required in Kong

  • No security concerns

  • No need for the sidecars to get ALL discovery information, only what’s required, when it’s required.

Cons

  • Multiple ways for sidecars to get discovery information

    • Ask during startup, then get updates via Kafka

    • Retrieve on an as-needed basis

  • Possibly larger memory footprint required for the mod-scheduler sidecar

    • Could be mitigated by adding a trait to endpoints indicating whether or not they’re eligible for being scheduled

Future Considerations

The use of an all-powerful system user is a potential risk which we’d like to address at some point. The high-level idea is to instead provision and use system users with distinct capabilities. For this to happen, module descriptors would need to change, since many of these timers are defined without requiredPermissions. One potential gotcha here is backward compatibility: we need to ensure that specifying required permissions on these system timer interfaces will not break things when using Okapi.

Conclusion

Let’s go with option 1.2, and in parallel look into how we could adopt option 3 as well. We’d like to consolidate the mechanisms used by the sidecars for obtaining discovery information; we don’t want two or three different ways of doing this.

Spike Status: Completed
