...
Another potential issue arises when a timer is created for an API in the foo interface and, later, the application providing the foo interface is disabled or upgraded to a new version that no longer includes that API, or the API contract changes.
There are two main aspects to this:
Gracefully handling error scenarios to reduce unnecessary requests, resource consumption, and log spam.
Raising the visibility of these issues to system operators and administrators.
Deliverables
Option 1 Event handler when the system has changed
...
We can notify the administrator by email each time a timer fails. The email should include extra information such as the endpoint the timer tried to call, the user who ran it, and the error that caused the failure. This way, the administrator can review the emails and address the failed timers.
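As a rough illustration of the notification content described above, the following Python sketch composes (but does not send) one email per failed run. The `TimerFailure` type, its field names, and `build_failure_email` are hypothetical names introduced for this example; they are not part of any existing API in the system.

```python
from dataclasses import dataclass
from email.message import EmailMessage


@dataclass
class TimerFailure:
    # Hypothetical record of one failed timer run (names are assumptions).
    timer_id: str
    endpoint: str     # endpoint the timer tried to call
    run_as_user: str  # user who ran the timer
    error: str        # error that caused the failure


def build_failure_email(failure: TimerFailure, admin_address: str) -> EmailMessage:
    """Compose a notification for one failed timer run; sending is out of scope."""
    msg = EmailMessage()
    msg["To"] = admin_address
    msg["Subject"] = f"Timer {failure.timer_id} failed"
    msg.set_content(
        f"Endpoint: {failure.endpoint}\n"
        f"User: {failure.run_as_user}\n"
        f"Error: {failure.error}\n"
    )
    return msg
```

Keeping composition separate from delivery would let the same message body be reused if the notifications are later batched rather than sent individually.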
NOTE: When coupled with the circuit breaker solution (see Option 2 above), these notifications could be sent only when a timer is paused or unpaused, reducing the number of emails being sent. If we’re still spamming system operators, we could consider a digest that is sent out periodically (for example daily, twice daily, or weekly).
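The idea in the note above, notifying only on pause and unpause transitions, can be sketched as follows. This is a minimal sketch under stated assumptions, not the Option 2 design itself: the `TimerBreaker` class, the `failure_threshold` setting, and the `notify` callback are hypothetical, and how a paused timer gets a successful run (a probe call, a manual retry) is left open.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TimerBreaker:
    # failure_threshold is an assumed setting, not taken from the design.
    failure_threshold: int = 3
    # notify is a hypothetical hook; in practice it might send an email.
    notify: Callable[[str], None] = print
    consecutive_failures: int = 0
    paused: bool = False

    def record_failure(self) -> None:
        """Count a failed run; notify only when the timer becomes paused."""
        self.consecutive_failures += 1
        if not self.paused and self.consecutive_failures >= self.failure_threshold:
            self.paused = True
            self.notify("paused")

    def record_success(self) -> None:
        """Reset the count; notify only when a paused timer is unpaused."""
        self.consecutive_failures = 0
        if self.paused:
            self.paused = False
            self.notify("unpaused")
```

With this shape, a timer that fails on every run produces one notification when it is paused instead of one per failure, which is the reduction the note is after.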
Pros
Simple implementation
...
It can flood the inbox with repeated emails when timers run frequently (see the notes above about mitigating this).
Option 4 Event handling approach during application uninstall
...
I recommend choosing Option 2, as it provides transparency for the administrator by displaying which timers have failed and their durations, and allows them to make decisions based on this dashboard.
After discussing with the architecture team, we decided to pursue Option 2 as a first step. This addresses the first concern (graceful error handling) and paves the way for later work to address the visibility concern.
Spike Status:
Status | ||||
---|---|---|---|---|
|