Spike MODDATAIMP-420 - Create monitoring task for Data Import app

V.1

MODDATAIMP-420

Spike Purpose:

Problem:

The current implementation of data-import does not provide an automated way to determine whether a job is "stuck", and there is no way to notify responsible people about such a situation.

Identifying this kind of situation is currently a manual task.

Goal:

Provide a solution that automatically monitors job executions whose progress has stopped (stuck jobs) and notifies users about this situation.


Phase 1 JobExecution Monitoring

Database table structure:

The proposed solution is to create a separate table, with the sample name "job_monitoring", to accumulate information about jobExecutions, with the following structure:

job_execution_id                     | last_event_timestamp         | notification_sent
5a289466-8ae4-446e-9718-74c21845cee3 | 2021-04-29T18:12:15.607+0000 | false
49d97e22-61ae-452b-8497-5ff6d68ba9f4 | 2021-04-29T18:07:12.607+0000 | true

where

job_execution_id - UUID identifier of the job

last_event_timestamp - timestamp of the last update for the jobExecution

notification_sent - boolean value that indicates whether a notification was sent or not

The example above is based on the following sample assumptions:

current date - 2021-04-29T18:14:17.607+0000

time threshold to designate a stuck execution - 5 min
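
For illustration, a minimal sketch of this table as plain JDBC/PostgreSQL code (the class name, column types, and connection parameters here are assumptions, not the module's actual schema mechanism):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateJobMonitoringTable {

    // Hypothetical DDL for the proposed "job_monitoring" table; the column
    // types are assumptions derived from the sample rows above
    private static final String DDL =
        "CREATE TABLE IF NOT EXISTS job_monitoring ("
        + " job_execution_id     UUID PRIMARY KEY,"
        + " last_event_timestamp TIMESTAMPTZ NOT NULL,"
        + " notification_sent    BOOLEAN NOT NULL DEFAULT FALSE"
        + ")";

    public static void main(String[] args) throws Exception {
        // Connection parameters are placeholders for this sketch
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/folio", "folio", "folio");
             Statement stmt = conn.createStatement()) {
            stmt.execute(DDL);
        }
    }
}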

Database table maintenance:

"job_monitoring" table will be populated with data on following actions:

INSERT:

 - on initialization of a JobExecution, in the initializeJobExecutionProgress method of JobExecutionProgressDaoImpl.

UPDATE:

 - on any update of job progress, in the updateByJobExecutionId method of JobExecutionProgressDaoImpl.

On update we also need to reset the "notification_sent" flag in the database to allow future notifications for the given job. If the flag is set when an entry is updated, this also allows us to send a 'recovery' notification once the job is alive again.

DELETE:

 - once the jobExecution is finished (i.e. its status is COMPLETED/ERROR), the associated row should be deleted from the monitoring table. In the case of data-import this is equivalent to the condition: total number of chunks for the jobExecution = getCurrentlySucceeded + getCurrentlyFailed. This check is performed by the updateJobExecutionIfAllRecordsProcessed method in the RecordProcessedEventHandlingServiceImpl class. All three maintenance actions are illustrated in the sketch below.
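
A minimal sketch of the three maintenance actions, assuming the JDBC-style schema sketched above (the JobMonitoringDao class and its method names are illustrative only; in the module these operations would be wired into the methods named above):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.UUID;

public class JobMonitoringDao {

    private final Connection connection;

    public JobMonitoringDao(Connection connection) {
        this.connection = connection;
    }

    // INSERT: called when a JobExecution is initialized
    public void insert(UUID jobExecutionId) throws SQLException {
        execute("INSERT INTO job_monitoring "
            + "(job_execution_id, last_event_timestamp, notification_sent) "
            + "VALUES (?, now(), false)", jobExecutionId);
    }

    // UPDATE: called on any job progress update; resetting notification_sent
    // re-arms notifications and enables a 'recovery' notification for a job
    // that was previously reported as stuck
    public void touch(UUID jobExecutionId) throws SQLException {
        execute("UPDATE job_monitoring "
            + "SET last_event_timestamp = now(), notification_sent = false "
            + "WHERE job_execution_id = ?", jobExecutionId);
    }

    // DELETE: called once the jobExecution is finished (COMPLETED/ERROR)
    public void delete(UUID jobExecutionId) throws SQLException {
        execute("DELETE FROM job_monitoring WHERE job_execution_id = ?",
            jobExecutionId);
    }

    private void execute(String sql, UUID jobExecutionId) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setObject(1, jobExecutionId);
            ps.executeUpdate();
        }
    }
}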

JobExecution monitoring

The main idea of monitoring is to have a job/timer/task inside the system that watches for specific conditions. It is proposed to use a watchdog timer for this purpose.

When the system detects that a job has stopped, it writes this information (including job_execution_id, and possibly more parameters, depending on the requested notification template) to the log, using a fixed predefined pattern and log level.

For example: log level = ERROR and message is "Data Import Job with jobExecutionId = %job_execution_id% not progressing"

After this step the monitoring job updates the boolean "notification_sent" flag in the database to "true".
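
A minimal sketch of such a watchdog, assuming a plain ScheduledExecutorService, SLF4J, and the schema above (the polling period, threshold handling, and class names are assumptions for this sketch; the actual module would likely reuse its own timer facilities). Marking a row as notified and selecting it happen in a single UPDATE ... RETURNING statement, so each stuck job is reported only once until it progresses again:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class JobMonitoringWatchdog {

    private static final Logger LOG =
        LoggerFactory.getLogger(JobMonitoringWatchdog.class);

    // Finds rows whose last event is older than the threshold and that have
    // not been notified yet, marks them as notified, and returns their ids
    private static final String MARK_STUCK =
        "UPDATE job_monitoring SET notification_sent = true "
        + "WHERE notification_sent = false "
        + "AND last_event_timestamp < now() - make_interval(mins => ?) "
        + "RETURNING job_execution_id";

    private final Connection connection;
    private final int thresholdMinutes; // e.g. the 20 min default for v.1

    public JobMonitoringWatchdog(Connection connection, int thresholdMinutes) {
        this.connection = connection;
        this.thresholdMinutes = thresholdMinutes;
    }

    public void start() {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::checkStuckJobs, 1, 1, TimeUnit.MINUTES);
    }

    private void checkStuckJobs() {
        try (PreparedStatement ps = connection.prepareStatement(MARK_STUCK)) {
            ps.setInt(1, thresholdMinutes);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Fixed predefined pattern that the external monitoring
                    // tool (AWS, Kibana, etc.) is configured to match
                    LOG.error("Data Import Job with jobExecutionId = {} not progressing",
                        rs.getObject("job_execution_id"));
                }
            }
        } catch (Exception e) {
            LOG.error("Failed to check for stuck jobs", e);
        }
    }
}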

The environment (AWS, Kibana, or whatever is used to monitor the installation) is set up to track these messages in the log. When the message is detected, it is split into tokens (e.g. jobExecutionId) and an alert is generated.

The monitoring system should have a recipient/recipient group set up to send an email on this alert.
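
For illustration, extracting the jobExecutionId token from such a log line could look like the sketch below; in practice the external monitoring tool would do this with its own pattern-matching rules rather than Java code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogTokenExtractor {

    // Matches the fixed log pattern and captures the jobExecutionId token
    private static final Pattern STUCK_JOB_PATTERN = Pattern.compile(
        "Data Import Job with jobExecutionId = ([0-9a-fA-F-]{36}) not progressing");

    public static String extractJobExecutionId(String logLine) {
        Matcher m = STUCK_JOB_PATTERN.matcher(logLine);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "ERROR Data Import Job with jobExecutionId = "
            + "5a289466-8ae4-446e-9718-74c21845cee3 not progressing";
        System.out.println(extractJobExecutionId(line)); // prints the UUID
    }
}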

Requirements:

  • configurable start time (time unit TBD)

Question: Time unit to monitor?
Answer: For v.1 the default time is 20 min.

Question: Log message?
Answer: level = ERROR; message = "Data Import Job with jobExecutionId = %job_execution_id% not progressing"

Question: Notification channel?
Answer: For v.1 it depends on the customer's external monitoring tool (AWS, Kibana, etc.)
Task                                                                    | Jira           | High-level estimation (story points)
Create monitoring table and insert record when jobExecution is created  | MODSOURMAN-458 | 5
Update monitoring table (updated_date) on change of jobExecution        | MODSOURMAN-459 | 5
Remove row from monitoring table once jobExecution is finished          |                | 3
Implement watchdog timer to monitor table                               | MODSOURMAN-460 | 8

For v.1 of "Spike: Monitoring for data-import" it has been decided that notification (sending emails, maintaining the receivers list) will be covered by external tools (AWS, Kibana, etc.), depending on the customer setup.


Phase 2 Receivers list

Requirements:

  • E-mail list (configurable)


Question: How will the receivers list be populated?
Answer (options):
  • DevOps set them up
  • some endpoint in the UI
  • etc.

Task | Jira | High-level estimation (story points)




Phase 3 Sending emails

Requirements: Ann-Marie Breaux (Deactivated) to finalize requirements with librarians and add them here, along with a sample e-mail mockup.

The message should include:

  • Message date and time stamp (localized date and time)
  • Job number
  • User who started the job
  • File name
  • Job profile
  • Job start date and time
  • Job stop date and time (if it stopped)


Question: Sample mockup for email?
Answer: TBD


Task | Jira | High-level estimation (story points)



Configuration: 

  • No UI planned at this time
  • Vladimir Shalaev is documenting configuration details for implementation in various environments
  • Next steps: Will review with DevOps and then link here
  • Should this be directed at the hosting provider only (as a first iteration)? Still TBD; will take additional work if customized for individual tenants

Improvements:

  • per Oleksii Kuzminov - different levels of WARN messages might be considered in the next implementation version
  • per Vladimir Shalaev - stuck-job analyzer improvement: if all other jobs have also not been updated within the predefined unit of time, this might indicate that some module is about to restart.

Related Documents: