Spike - Investigate options for mod-scheduler handling changes in the system

Spike Overview

JIRA ID: EUREKA-212 Investigate options for mod-scheduler handling changes in the system

Objective: When mod-scheduler triggers a timer, it impersonates the user who created that timer when making the API request. This ensures that a user cannot create a timer that performs actions they aren't authorized to do. However, if the user’s role or capability assignments change, or if the user is disabled or removed, the timer will no longer function.

Another potential issue arises if a timer is created for an API in the foo interface and the application providing the foo interface is later disabled, upgraded to a new version that no longer includes that API, or upgraded to a version that changes the API contract.

There are two main aspects to this:

Gracefully handling error scenarios to reduce unnecessary requests, resource consumption, and log spam.
Raising awareness/visibility of these issues to system operators / administrators.

Deliverables

Option 1 Event handler when the system has changed

The approach needs to handle various events, such as application updates, tenant disabling, and user capability updates. This requires complex business logic to check how each event might affect timers and disable them accordingly. Additionally, we need a dashboard displaying disabled timers with error explanations and reasons for the disablement so that administrators can resolve the issues.

Pros

The logic for disabling timers will be handled automatically, allowing the administrator to review the reason for the timer being disabled and decide whether to fix the issue or remove the timer.

Cons

The business logic may become complex because it has to map each event to the timers it could affect. This mapping might itself introduce errors that need to be identified and addressed.

Option 2 Circuit Breaker pattern

The approach will still allow timers to fail, but we will handle this more proactively. Based on the circuit breaker status, we will build a dashboard where administrators can see which timers have failed, for how long, and the cause of the error (e.g., route not found, connection timeout, access denied). This means administrators can check the capability settings for the user who ran the timer or remove the timer from the tenant if the required application is disabled/uninstalled.

The Circuit Breaker pattern has three states: Open, Closed, and Half-Open.

Closed State: The circuit breaker functions normally when the timer runs successfully. However, if timer failures exceed a threshold (to be decided), the circuit breaker trips and switches to the "open" state.

Open State: In this state, the timer will not try to execute in the next scheduled period.

Half-Open State: After a specified duration (to be decided), the circuit breaker transitions to the "half-open" state, allowing the timer to run in the next scheduled period. If that run succeeds, the circuit breaker resets to the "closed" state, and the timer will operate as usual. If that run fails, the circuit breaker returns the timer to the "open" state until the next timeout period.
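The three states above can be sketched as a small per-timer state machine. This is only an illustration under stated assumptions, not the mod-scheduler implementation: the class name TimerCircuitBreaker, its methods, and the threshold and open duration values are hypothetical, since both values are still to be decided.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative per-timer circuit breaker; all names and values are assumptions. */
final class TimerCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;  // "to be decided" in this spike
    private final Duration openDuration; // "to be decided" in this spike
    private State state = State.CLOSED;
    private int failureCount = 0;
    private Instant openedAt;

    TimerCircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    /** Returns true if the timer may run in the scheduled period starting at 'now'. */
    boolean allowExecution(Instant now) {
        if (state == State.OPEN && !now.isBefore(openedAt.plus(openDuration))) {
            state = State.HALF_OPEN; // open duration elapsed: allow one trial run
        }
        return state != State.OPEN;
    }

    /** A successful run closes the breaker and clears the failure count. */
    void recordSuccess() {
        failureCount = 0;
        state = State.CLOSED;
    }

    /** A failed trial run re-opens at once; otherwise trip when failures exceed the threshold. */
    void recordFailure(Instant now) {
        failureCount++;
        if (state == State.HALF_OPEN || failureCount > failureThreshold) {
            state = State.OPEN;
            openedAt = now;
        }
    }

    State state() {
        return state;
    }
}
```

With a threshold of 2, for example, the breaker stays closed for the first two failures and opens on the third. The dashboard could read the state and failure count from such an object, or from the equivalent persisted fields.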

Pros:

Simple implementation

Provides transparency for administrators to see which timers failed and for how long

Cons:

Users must resolve or disable the timers themselves and investigate the cause of the failures

Option 3 Email notification when a timer fails

We can send an email each time a timer fails to notify the administrator. This email should include extra information such as the endpoint the timer tried to call, the user who ran it, and the error that caused the failure. This way, the administrator can review the emails and address the failed timers.

NOTE: When coupled with the circuit breaker solution (see Option 2 above), these notifications could be sent only when a timer is paused or unpaused, reducing the number of emails being sent. If we’re still spamming system operators, we could consider creating a digest which is sent out periodically (daily? twice daily? weekly? etc.)

Pros

Simple implementation

Cons

It can flood the inbox with repeated emails when timers run frequently (see the note above about mitigating this).

Option 4 Event handling approach during application uninstall

The solution will include functionality to remove all system timers associated with applications that have just been removed. We can obtain a list of these timers from the application descriptor and remove them accordingly. However, we cannot remove user timers linked to the uninstalled application. To determine which user timers need to be removed, we should create a new section in the module descriptor/application descriptor to define all possible endpoints that can be used for timers within the module. This approach allows us to map these endpoints to the timers and remove them as necessary. However, if an application is just updated to a new version, user timers should not be deleted.
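The endpoint mapping described above could look roughly like the following sketch. It assumes the new descriptor section can be reduced to a set of "METHOD path" strings and that each user timer exposes its target method and path. UserTimer, UninstallTimerCleanup, and the string format are hypothetical names chosen for illustration.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical minimal view of a user timer: its id and the endpoint it calls. */
record UserTimer(String id, String method, String path) {}

final class UninstallTimerCleanup {
    private UninstallTimerCleanup() {}

    /**
     * Returns the user timers whose target endpoint is declared as timer-eligible
     * by the application that was just uninstalled. This is meant for uninstall
     * events only; on an upgrade, user timers are kept.
     */
    static List<UserTimer> timersToRemove(List<UserTimer> userTimers,
                                          Set<String> declaredEndpoints) {
        return userTimers.stream()
            .filter(t -> declaredEndpoints.contains(t.method() + " " + t.path()))
            .collect(Collectors.toList());
    }
}
```

Exact string matching is the simplest choice here. Paths with parameters would need a pattern match instead, which this sketch does not attempt.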

Delivery

Since we’ve decided to go with Option 2 and Option 4, the main focus of the technical delivery will be on how to integrate the circuit breaker pattern into the mod-scheduler service and which approaches we should use. For the future, we have also decided to create a dashboard that can display timers along with their statuses. To achieve this, we will introduce timer statuses in the first iteration of implementation, which the UI will then use to display them on the dashboard.

Firstly, I think we don't need to introduce a new framework to handle the circuit breaker pattern; we should try to implement it using the existing Quartz framework. Since Quartz is already established for the service and supports distributed job handling, the idea is to pause the timer job each time it fails and create a circuit breaker job, which will be scheduled to execute based on the number of timer failures. We will store the number of failures for the timer job in the timer descriptor, along with the timer statuses.

Extend the Timer Descriptor with two new properties: the date of the last failure (lastFailedDate) and the failure count (failureCount)

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "TimerDescriptor",
  "description": "Timer",
  "type": "object",
  "properties": {
    "id": {
      "description": "Timer identifier",
      "type": "string",
      "format": "uuid"
    },
    "modified": {
      "description": "Whether modified",
      "type": "boolean"
    },
    "routingEntry": {
      "$ref": "routingEntry.json",
      "description": "Proxy routing entry"
    },
    "enabled": {
      "description": "Whether enabled",
      "type": "boolean"
    },
    "moduleName": {
      "description": "Module name timer belongs to",
      "type": "string"
    },
    "lastFailedDate": {
      "description": "The date of the last failure.",
      "type": "string",
      "format": "date-time"
    },
    "failureCount": {
      "description": "Failure count",
      "type": "integer"
    }
  },
  "required": [ "routingEntry", "enabled" ]
}

Each time the timer fails, we will increment the failure count by 1, and if it runs successfully, we will reset the count to 0. This value will determine when to schedule the circuit breaker job to resume the timer. To schedule the circuit breaker job, we will use a SimpleTrigger that fires at a specific time without repeating.
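How the failure count maps to a resume time is not fixed by this spike. One possible policy, shown here purely as an assumption, is a capped exponential delay: each consecutive failure doubles the wait before the circuit breaker job resumes the timer. ResumeDelayPolicy, baseDelay, and maxDelay are hypothetical names and parameters.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative policy: the delay doubles per consecutive failure, up to a cap. */
final class ResumeDelayPolicy {
    private static final int MAX_EXPONENT = 20; // keeps the shift below from overflowing

    private final Duration baseDelay; // assumed tuning parameter
    private final Duration maxDelay;  // assumed tuning parameter

    ResumeDelayPolicy(Duration baseDelay, Duration maxDelay) {
        this.baseDelay = baseDelay;
        this.maxDelay = maxDelay;
    }

    /** Delay before resuming a timer that has failed failureCount times in a row. */
    Duration delayFor(int failureCount) {
        if (failureCount < 1) {
            throw new IllegalArgumentException("failureCount must be >= 1");
        }
        int exponent = Math.min(failureCount - 1, MAX_EXPONENT);
        Duration delay = baseDelay.multipliedBy(1L << exponent);
        return delay.compareTo(maxDelay) > 0 ? maxDelay : delay;
    }

    /** Instant at which the one-shot circuit breaker job should fire. */
    Instant resumeAt(Instant failedAt, int failureCount) {
        return failedAt.plus(delayFor(failureCount));
    }
}
```

The instant returned by resumeAt would become the start time of the non-repeating trigger, converted with java.util.Date.from because Quartz triggers take a Date start time.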

We add a job data item to each circuit breaker job that holds the timer job key, so that the job can resume the timer. No additional interaction with the database is required. Because the circuit breaker job is scheduled with a non-repeating trigger, it should be deleted once it has fired.
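The resume step can be sketched without Quartz on the classpath. In the sketch below, JobControl is a hypothetical stand-in for the Quartz Scheduler operations pauseJob, resumeJob, and deleteJob, keyed here by plain strings instead of JobKey. CircuitBreakerJob, the "timerJobKey" data key, and InMemoryJobControl are likewise assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the scheduler operations this flow needs. */
interface JobControl {
    void pauseJob(String jobKey);
    void resumeJob(String jobKey);
    void deleteJob(String jobKey);
}

/** One-shot job: resumes the timer named in its job data, then removes itself. */
final class CircuitBreakerJob {
    static final String TIMER_JOB_KEY = "timerJobKey"; // assumed job data key

    private final String ownKey;
    private final Map<String, String> jobData;

    CircuitBreakerJob(String ownKey, String timerJobKey) {
        this.ownKey = ownKey;
        this.jobData = Map.of(TIMER_JOB_KEY, timerJobKey);
    }

    void execute(JobControl control) {
        control.resumeJob(jobData.get(TIMER_JOB_KEY)); // key comes from job data, no database lookup
        control.deleteJob(ownKey);                     // the job has fired, so clean it up
    }
}

/** In-memory implementation used only to exercise the flow. */
final class InMemoryJobControl implements JobControl {
    final Set<String> paused = new HashSet<>();
    final Set<String> deleted = new HashSet<>();

    @Override public void pauseJob(String jobKey) { paused.add(jobKey); }
    @Override public void resumeJob(String jobKey) { paused.remove(jobKey); }
    @Override public void deleteJob(String jobKey) { deleted.add(jobKey); }
}
```

On a timer failure the service would call pauseJob for the timer and schedule a CircuitBreakerJob for the computed resume time. When that job fires, the timer is resumed and the circuit breaker job is gone.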

An additional REST endpoint needs to be added to the scheduler module to allow manual resumption of timers. This will enable the operations team to monitor timers via the dashboard, identify and resolve issues that caused failures, and resume the timers immediately, without waiting for the circuit breaker job to resume them in the next cycle.

Conclusion

I recommend choosing Option 2, as it provides transparency for administrators by displaying which timers have failed and for how long, and allows them to make decisions based on this dashboard.

After discussing with the architecture team, we decided to pursue Option 2 as a first step. This addresses the first concern (graceful handling of error scenarios) and paves the way for later work to address the visibility concern.

Spike Status: COMPLETED