Batch Importer (Bib/Acq) (UXPROD-47)

[UXPROD-3074] Create monitoring task for Data Import app Created: 14/May/21  Updated: 15/Jun/21  Resolved: 15/Jun/21

Status: Closed
Project: UX Product
Components: None
Affects versions: None
Fix versions: R1 2021
Parent: Batch Importer (Bib/Acq)

Type: New Feature Priority: P2
Reporter: Ann-Marie Breaux (Inactive) Assignee: Khalilah Gambrell
Resolution: Done Votes: 0
Labels: data-import
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Issue links:
Defines
defines UXPROD-47 Batch Importer (Bib/Acq) Analysis Complete
is defined by MODDATAIMP-420 [Spike] Create monitoring task for Da... Closed
is defined by MODDATAIMP-432 Create notifications Closed
is defined by MODSOURMAN-458 Support monitoring table creation and... Closed
is defined by MODSOURMAN-459 Support monitoring table update on jo... Closed
is defined by MODSOURMAN-460 Implement watchdog timer to monitor t... Closed
Release: R1 2021 Hot Fix #2
Epic Link: Batch Importer (Bib/Acq)
Development Team: Spitfire
PO Rank: 0

 Description   

Update: Approved as an R1 2021 Hot Fix at the Capacity Planning Team meeting on May 17, 2021. If there isn't time, this will be included in Juniper.

Requirement

  • Health check manager for whole Data Import process
  • This task extends that work to alert a configurable e-mail list if the monitor finds that Data Import is no longer running
  • E-mail list (configurable)
    • [decide exactly who at PTF standup Weds]
  • Message should include
    • Message date and time stamp
    • Job number
    • File name
    • Job profile
    • Job start date and time
    • Job stop date and time (if it stopped)
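
As an illustration only, the message fields listed above could be assembled as in the following Python sketch. The StalledJob record, its field names, and format_alert are hypothetical and are not taken from mod-data-import or mod-source-record-manager; the sketch only shows the stop time being included when the job actually stopped.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class StalledJob:
    # Hypothetical record; field names are illustrative, not the module's schema.
    job_number: str
    file_name: str
    job_profile: str
    started_at: datetime
    stopped_at: Optional[datetime] = None  # None if the job never reported a stop


def format_alert(job: StalledJob, now: datetime) -> str:
    """Render an alert body containing the fields listed in the requirement."""
    lines = [
        f"Message date and time: {now.isoformat()}",
        f"Job number: {job.job_number}",
        f"File name: {job.file_name}",
        f"Job profile: {job.job_profile}",
        f"Job start date and time: {job.started_at.isoformat()}",
    ]
    # The stop time is only included when the job actually stopped.
    if job.stopped_at is not None:
        lines.append(f"Job stop date and time: {job.stopped_at.isoformat()}")
    return "\n".join(lines)
```

Delivery to the configurable e-mail list is left out on purpose, since the recipients and transport were still undecided when this was written.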

NOTE: Alert e-mails would go to hosting providers for live libraries, or library staff for self-hosted, or TBD for hosted ref envs and Bugfest

Other questions/topics from 20200421 Folijet standup:

  • Allow an external monitoring system to pick up messages from the logs and notify users
  • MODSOURMAN-426 (Closed) only monitors individual jobs that stop; monitoring the whole DI flow and notifying if it breaks or stops is a larger effort
  • Maybe just use hooks to pull AWS info
  • Topics to decide with DI Libraries
    • How frequently should it poll and can it be customized
    • Being able to notify a list of e-mails (customizable) when something breaks or stops
    • Should this automatically stop jobs and log them as failed if they still show as running on the landing page but have had no activity for a certain period of time?
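
To make the polling questions above concrete, here is a minimal Python sketch of an inactivity check. It assumes the monitor keeps a last-activity timestamp per running job; find_stuck_jobs, the last_activity mapping, and inactivity_limit are hypothetical names, and the threshold is a parameter because whether it should be customizable is still open.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def find_stuck_jobs(
    last_activity: Dict[str, datetime],
    now: datetime,
    inactivity_limit: timedelta,
) -> List[str]:
    """Return IDs of running jobs with no recorded activity within the limit."""
    # A job counts as stuck only when its silence exceeds the configured limit.
    return sorted(
        job_id
        for job_id, seen in last_activity.items()
        if now - seen > inactivity_limit
    )
```

A caller would run this once per polling interval and then decide whether to only notify or also mark the returned jobs as failed, which is the question the last bullet raises.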

Week of 26 April: Oleksii Kuzminov will arrange for Folijet and Spitfire to meet with Vladimir Shalaev to decide on a strategy and create stories. Discuss with Khalilah Gambrell and Ann-Marie Breaux and confirm team(s) by the end of the week.



 Comments   
Comment by Ann-Marie Breaux (Inactive) [ 14/May/21 ]

Discussed with Khalilah Gambrell and decided to turn this into a feature, assigned to Spitfire. Copied the spike details into the description, and linked the stories to it. Also created a task to account for additional work that will govern the notifications.

Comment by Khalilah Gambrell [ 17/May/21 ]

The goal is to have this work complete by Hotfix Iris Release #2, but it can be pushed to Juniper.

Comment by Mike Gorrell [ 25/May/21 ]

Will this health check be accessible via an API so that we may call it from our existing monitoring tools/processes? Will there be any granularity in what is flagged as a problem - i.e. "stuck job" vs "nothing running"?

Comment by Vladimir Shalaev [ 25/May/21 ]

This is a message in a log like "Job XXX seems to be stuck". It has no recovery mechanism (though we can add one with relative ease - send another message if a job that was previously reported comes back to life).

So in general it's a 'signal' that any monitoring toolset can parse and react to.
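
For illustration, an external monitoring tool could pick up that signal with something like the Python sketch below. The pattern is an assumption based only on the example wording "Job XXX seems to be stuck" in this comment; the real log line may differ in format and surrounding fields, and stuck_job_ids is a hypothetical helper.

```python
import re
from typing import Iterable, List

# Assumed wording, taken from the example in the comment; the real log line
# may differ in format and surrounding fields.
STUCK_PATTERN = re.compile(r"Job (?P<job_id>\S+) seems to be stuck")


def stuck_job_ids(log_lines: Iterable[str]) -> List[str]:
    """Collect job IDs from log lines that match the assumed 'stuck' signal."""
    ids = []
    for line in log_lines:
        match = STUCK_PATTERN.search(line)
        if match:
            ids.append(match.group("job_id"))
    return ids
```

If a recovery message were added later, a second pattern could clear previously reported job IDs in the same way.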

Generated at Fri Feb 09 00:29:08 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.