...

Another potential issue arises if a timer is created for an API in the foo interface, and the application providing the foo interface is later disabled, upgraded to a new version that no longer includes that API, or changed in a way that alters the API contract.

There are two main aspects to this:

  1. Gracefully handling error scenarios to reduce unnecessary requests, resource consumption, and log spam.

  2. Raising awareness/visibility of these issues to system operators / administrators.

Deliverables

Option 1 Event handler when the system has changed

...

We can send an email each time a timer fails to notify the administrator. This email should include extra information such as the endpoint the timer tried to call, the user who ran it, and the error that caused the failure. This way, the administrator can review the emails and address the failed timers.

NOTE: When coupled with the circuit breaker solution (see Option 2 above), these notifications could be sent only when a timer is paused or unpaused, reducing the number of emails being sent. If we’re still spamming system operators, we could consider creating a digest which is sent out periodically (daily? twice daily? weekly? etc.)

Pros

Simple implementation

...

It can flood the inbox with repeated emails when timers run frequently. (see notes above about mitigating this)

Option 4 Event handling approach during application uninstall

...

Since we’ve decided to go with Option 2 and Option 4, and to implement a circuit breaker pattern, the main focus of the technical delivery will be on how to integrate this into the mod_scheduler service and what approaches we should use. For the future, we have also decided to create a dashboard that can display timers along with their statuses. To achieve this, we will introduce timer statuses in the first iteration of implementation, which the UI will then use to display them on the dashboard.

Own implementation

Firstly, I think we don't need to introduce a new framework to handle the circuit breaker pattern; we should try to implement it using the existing Quartz framework. Since Quartz is already established for the service and supports distributed job handling, the idea is to pause the timer job each time it fails and create a circuit breaker job, which will be scheduled to execute based on the number of timer failures. We will store the number of failures for the timer job in the timer descriptor, along with the timer statuses.

mod_scheduler.png

Hystrix

The most popular circuit breaker pattern library, built by Netflix, is Hystrix. The implementation will be straightforward: we'll wrap our call to the timer endpoint with a HystrixCommand, running asynchronously and managing all statuses if the call fails. Each HystrixCommand should have a unique key, which should match the timer job key.

Pros

Popular and easy implementation

Cons

"Multi-instance" implementation: each node has completely separate circuit breakers and there is no shared data between nodes

resilience4j

Another popular library that implements the circuit breaker pattern. It provides a circuit breaker registry in which we can register our circuit breakers. It can be customized to use not only in-memory storage but also shared services such as an in-memory database or a traditional database.

Pros:

  • Easy to implement

  • The registry can be shared across instances

Cons

Extend the Timer Descriptor to add two new properties: failed status and failure count

Code Block
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "TimerDescriptor",
  "description": "Timer",
  "type": "object",
  "properties": {
    "id": {
      "description": "Timer identifier",
      "type": "string",
      "format": "uuid"
    },
    "modified": {
      "description": "Whether modified",
      "type": "boolean"
    },
    "routingEntry": {
      "$ref": "routingEntry.json",
      "description": "Proxy routing entry"
    },
    "enabled": {
      "description": "Whether enabled",
      "type": "boolean"
    },
    "moduleName": {
      "description": "Module name timer belongs to",
      "type": "string"
    },
    "lastFailedDate": {
      "description": "The date of the last failure.",
      "type": "string",
      "format": "date-time"
    },
    "failureCount": {
      "description": "Failure count",
      "type": "integer"
    }
  },
  "required": [ "routingEntry", "enabled" ]
}

Each time the timer fails, we will increment the failure count by 1, and if it runs successfully, we will reset the count to 0. This value will determine when to schedule the circuit breaker job to resume the timer. To schedule the circuit breaker job, we will use a SimpleTrigger that fires at a specific time without repeating.
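The counting rule above can be sketched in plain Java. The text does not specify how the failure count maps to a resume time, so the exponential backoff with a cap used here is only an assumption, and the class name `TimerFailureTracker` and its methods are hypothetical, not part of mod_scheduler.

```java
import java.time.Duration;

/** Tracks consecutive failures for one timer and derives how long to pause it. */
final class TimerFailureTracker {
    private final Duration baseDelay;
    private final Duration maxDelay;
    private int failureCount = 0;

    TimerFailureTracker(Duration baseDelay, Duration maxDelay) {
        this.baseDelay = baseDelay;
        this.maxDelay = maxDelay;
    }

    /** Called after a failed run: increments the count by 1 and returns the pause length. */
    Duration recordFailure() {
        failureCount++;
        return resumeDelay();
    }

    /** Called after a successful run: resets the count to 0. */
    void recordSuccess() {
        failureCount = 0;
    }

    int failureCount() {
        return failureCount;
    }

    /** Assumed policy: baseDelay * 2^(failureCount - 1), capped at maxDelay. */
    Duration resumeDelay() {
        if (failureCount == 0) {
            return Duration.ZERO;
        }
        int exponent = Math.min(failureCount - 1, 20); // bound the shift to keep the multiplier small
        Duration delay = baseDelay.multipliedBy(1L << exponent);
        return delay.compareTo(maxDelay) > 0 ? maxDelay : delay;
    }
}
```

The returned duration would be added to the current time to obtain the fire time of the one-shot trigger. Any other monotonic mapping (for example a fixed table of delays) would fit the same shape.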

We add a job data item for each circuit breaker job that includes the timer job key, so that the timer job can be resumed. No additional interaction with the database is required. Once the circuit breaker job has been triggered at least once, it should be deleted.

An additional REST endpoint needs to be added to the scheduler module to allow manual resumption of timers. This will enable the operations team to monitor timers via the dashboard, identify and resolve issues that caused failures, and resume the timers immediately, without waiting for the circuit breaker job to resume them in the next cycle.
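The pause, scheduled resume, and manual resume described above can be modelled as a small in-memory sketch. This is a stand-in written with the JDK only, not Quartz code: in the service, pausing, resuming, and deleting would go through the Quartz scheduler, and the manual resume would sit behind the new REST endpoint. The class `CircuitBreakerModel`, its method names, and the `cb-` key prefix are all hypothetical.

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

/** In-memory model of the circuit breaker cycle for timer jobs. */
final class CircuitBreakerModel {

    /** One-shot resume job; its only data item is the key of the timer job to resume. */
    static final class ResumeJob {
        final String timerJobKey;
        final Instant fireAt;

        ResumeJob(String timerJobKey, Instant fireAt) {
            this.timerJobKey = timerJobKey;
            this.fireAt = fireAt;
        }
    }

    private final Set<String> pausedTimers = new HashSet<>();
    private final Map<String, ResumeJob> resumeJobs = new HashMap<>();

    /** On failure: pause the timer job and register a resume job that fires once at fireAt. */
    void onTimerFailure(String timerJobKey, Instant fireAt) {
        pausedTimers.add(timerJobKey);
        resumeJobs.put("cb-" + timerJobKey, new ResumeJob(timerJobKey, fireAt));
    }

    /** Fires every due resume job: resumes its timer, then deletes the job. */
    void tick(Instant now) {
        Iterator<ResumeJob> it = resumeJobs.values().iterator();
        while (it.hasNext()) {
            ResumeJob job = it.next();
            if (!job.fireAt.isAfter(now)) {
                pausedTimers.remove(job.timerJobKey);
                it.remove(); // the job never repeats, so it is removed after firing
            }
        }
    }

    /** Manual resume: unpause immediately and drop the pending resume job. */
    void resumeManually(String timerJobKey) {
        pausedTimers.remove(timerJobKey);
        resumeJobs.remove("cb-" + timerJobKey);
    }

    boolean isPaused(String timerJobKey) {
        return pausedTimers.contains(timerJobKey);
    }

    int pendingResumeJobs() {
        return resumeJobs.size();
    }
}
```

Because the resume job carries the timer job key itself, resuming needs no lookup beyond the scheduler's own state, which matches the point that no extra database interaction is required.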

...

I recommend choosing Option 2, as it provides transparency for the administrator by displaying which timers have failed and for how long, and allows them to make decisions based on this dashboard.

After discussing this with the architecture team, we decided to pursue Option 2 as a first step. This addresses the first concern and paves the way for later work to address the visibility concern.

Spike Status: Completed