Spike MODDATAIMP-420 - Create monitoring task for Data Import app

V.1

MODDATAIMP-420

Spike Purpose:

Problem:

The current implementation of data-import does not provide an automated way to determine whether a job is "stuck", and there is no way to notify responsible people about such a situation.

Identifying this kind of situation is currently a manual task.

Goal:

Provide a solution that automatically monitors job executions whose progress has stopped (stuck jobs) and notifies users about this situation.


Phase 1 JobExecution Monitoring

Database table structure:

The proposed solution is to create a separate table, with the sample name "job_monitoring", to accumulate information about jobExecutions, with the following structure:

job_execution_id                     | last_event_timestamp         | notification_sent
5a289466-8ae4-446e-9718-74c21845cee3 | 2021-04-29T18:12:15.607+0000 | false
49d97e22-61ae-452b-8497-5ff6d68ba9f4 | 2021-04-29T18:07:12.607+0000 | true

where

job_execution_id - UUID identifier of the job

last_event_timestamp - timestamp of the last update for the jobExecution

notification_sent - boolean value that indicates whether a notification was sent or not

The example above is based on the following sample assumptions:

current date - 2021-04-29T18:14:17.607+0000

time threshold to designate a stuck execution - 5 min
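
For illustration, a minimal sketch of this table as plain JDBC/PostgreSQL code (the class name, column types, and connection parameters here are assumptions, not the module's actual schema mechanism):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateJobMonitoringTable {

    // Hypothetical DDL for the proposed "job_monitoring" table; the column
    // types are assumptions derived from the sample rows above
    private static final String DDL =
        "CREATE TABLE IF NOT EXISTS job_monitoring ("
        + " job_execution_id     UUID PRIMARY KEY,"
        + " last_event_timestamp TIMESTAMPTZ NOT NULL,"
        + " notification_sent    BOOLEAN NOT NULL DEFAULT FALSE"
        + ")";

    public static void main(String[] args) throws Exception {
        // Connection parameters are placeholders for this sketch
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/folio", "folio", "folio");
             Statement stmt = conn.createStatement()) {
            stmt.execute(DDL);
        }
    }
}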

Database table maintenance:

"job_monitoring" table will be populated with data on following actions:

INSERT:

 - on initialization of a JobExecution, in the initializeJobExecutionProgress method of JobExecutionProgressDaoImpl.

UPDATE:

 - on any update of job progress, in the updateByJobExecutionId method of JobExecutionProgressDaoImpl.

On update we also need to reset the "notification_sent" flag in the database to allow future notifications for the given job. If the flag is set when an entry is updated, this also allows us to send a 'recovery' notification once the job is alive again.

DELETE:

 - once the jobExecution is finished (i.e. its status is COMPLETED/ERROR), the associated row should be deleted from the monitoring table. In the case of data-import this is equivalent to the condition: total number of chunks for the jobExecution = getCurrentlySucceeded + getCurrentlyFailed. This check is performed by the updateJobExecutionIfAllRecordsProcessed method in the RecordProcessedEventHandlingServiceImpl class. All three maintenance actions are illustrated in the sketch below.
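
A minimal sketch of the three maintenance actions, assuming the JDBC-style schema sketched above (the JobMonitoringDao class and its method names are illustrative only; in the module these operations would be wired into the methods named above):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.UUID;

public class JobMonitoringDao {

    private final Connection connection;

    public JobMonitoringDao(Connection connection) {
        this.connection = connection;
    }

    // INSERT: called when a JobExecution is initialized
    public void insert(UUID jobExecutionId) throws SQLException {
        execute("INSERT INTO job_monitoring "
            + "(job_execution_id, last_event_timestamp, notification_sent) "
            + "VALUES (?, now(), false)", jobExecutionId);
    }

    // UPDATE: called on any job progress update; resetting notification_sent
    // re-arms notifications and enables a 'recovery' notification for a job
    // that was previously reported as stuck
    public void touch(UUID jobExecutionId) throws SQLException {
        execute("UPDATE job_monitoring "
            + "SET last_event_timestamp = now(), notification_sent = false "
            + "WHERE job_execution_id = ?", jobExecutionId);
    }

    // DELETE: called once the jobExecution is finished (COMPLETED/ERROR)
    public void delete(UUID jobExecutionId) throws SQLException {
        execute("DELETE FROM job_monitoring WHERE job_execution_id = ?",
            jobExecutionId);
    }

    private void execute(String sql, UUID jobExecutionId) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setObject(1, jobExecutionId);
            ps.executeUpdate();
        }
    }
}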

JobExecution monitoring

The main idea of monitoring is to have a job/timer/task inside the system that watches for specific conditions. It is proposed to use a watchdog timer for this purpose.

When the system detects that a job has stopped, it writes this information (including job_execution_id, and possibly more parameters, depending on the requested notification template) to the log, using a fixed predefined pattern and log level.

For example: log level = ERROR and message is "Data Import Job with jobExecutionId = %job_execution_id% not progressing"

After this step the monitoring job updates the boolean "notification_sent" flag in the database to "true".
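
A minimal sketch of such a watchdog, assuming a plain ScheduledExecutorService, SLF4J, and the schema above (the polling period, threshold handling, and class names are assumptions for this sketch; the actual module would likely reuse its own timer facilities). Marking a row as notified and selecting it happen in a single UPDATE ... RETURNING statement, so each stuck job is reported only once until it progresses again:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class JobMonitoringWatchdog {

    private static final Logger LOG =
        LoggerFactory.getLogger(JobMonitoringWatchdog.class);

    // Finds rows whose last event is older than the threshold and that have
    // not been notified yet, marks them as notified, and returns their ids
    private static final String MARK_STUCK =
        "UPDATE job_monitoring SET notification_sent = true "
        + "WHERE notification_sent = false "
        + "AND last_event_timestamp < now() - make_interval(mins => ?) "
        + "RETURNING job_execution_id";

    private final Connection connection;
    private final int thresholdMinutes; // e.g. the 20 min default for v.1

    public JobMonitoringWatchdog(Connection connection, int thresholdMinutes) {
        this.connection = connection;
        this.thresholdMinutes = thresholdMinutes;
    }

    public void start() {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::checkStuckJobs, 1, 1, TimeUnit.MINUTES);
    }

    private void checkStuckJobs() {
        try (PreparedStatement ps = connection.prepareStatement(MARK_STUCK)) {
            ps.setInt(1, thresholdMinutes);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Fixed predefined pattern that the external monitoring
                    // tool (AWS, Kibana, etc.) is configured to match
                    LOG.error("Data Import Job with jobExecutionId = {} not progressing",
                        rs.getObject("job_execution_id"));
                }
            }
        } catch (Exception e) {
            LOG.error("Failed to check for stuck jobs", e);
        }
    }
}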

The environment (AWS, Kibana, or whatever is used to monitor the installation) is set up to track these messages in the log. When the message is detected, it is split into tokens (e.g. jobExecutionId) and an alert is generated.

The monitoring system should have a recipient/recipient group set up to send an email on this alert.
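
For illustration, extracting the jobExecutionId token from such a log line could look like the sketch below; in practice the external monitoring tool would do this with its own pattern-matching rules rather than Java code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogTokenExtractor {

    // Matches the fixed log pattern and captures the jobExecutionId token
    private static final Pattern STUCK_JOB_PATTERN = Pattern.compile(
        "Data Import Job with jobExecutionId = ([0-9a-fA-F-]{36}) not progressing");

    public static String extractJobExecutionId(String logLine) {
        Matcher m = STUCK_JOB_PATTERN.matcher(logLine);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "ERROR Data Import Job with jobExecutionId = "
            + "5a289466-8ae4-446e-9718-74c21845cee3 not progressing";
        System.out.println(extractJobExecutionId(line)); // prints the UUID
    }
}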

Requirements:

  • configurable start time (time unit TBD)

Question: Time unit to monitor?
Answer: For v.1 the default time is 20 min.

Question: Log message?
Answer: level = ERROR; message = "Data Import Job with jobExecutionId = %job_execution_id% not progressing"

Question: Notification channel?
Answer: For v.1 it depends on the customer's external monitoring tool (AWS, Kibana, etc.)
Task                                                                    | Jira           | High-level estimation (story points)
Create monitoring table and insert record when jobExecution is created  | MODSOURMAN-458 | 5
Update monitoring table (updated_date) on change of jobExecution        | MODSOURMAN-459 | 5
Remove row from monitoring table once jobExecution is finished          |                | 3
Implement watchdog timer to monitor table                               | MODSOURMAN-460 | 8

For v.1 of "Spike: Monitoring for data-import" it has been decided that notification (sending emails, maintaining the receivers list) will be covered by external tools (AWS, Kibana, etc.), depending on the customer setup.


Phase 2 Receivers list

Requirements:

  • E-mail list (configurable)


Question: How will the receivers list be populated?
Answer (options):
  • DevOps set them up
  • some endpoint in the UI
  • etc.

Task | Jira | High-level estimation (story points)




Phase 3 Sending emails

Requirements: Ann-Marie Breaux (Deactivated) to finalize requirements with librarians and add them here, along with a sample e-mail mockup.

The message should include:

  • Message date and time stamp (localized date and time)
  • Job number
  • User who started the job
  • File name
  • Job profile
  • Job start date and time
  • Job stop date and time (if it stopped)


Question: Sample mockup for email?
Answer: TBD


Task | Jira | High-level estimation (story points)



Configuration: 

  • No UI planned at this time
  • Vladimir Shalaev is documenting configuration details for implementation in various environments
  • Next steps: Will review with DevOps and then link here
  • Should this be directed at the hosting provider only (as a first iteration)? Still TBD; will take additional work if customized for individual tenants

Improvements:

  • per Oleksii Kuzminov - different levels of WARN messages might be considered in the next implementation version
  • per Vladimir Shalaev - stuck-job analyzer improvement: if all other jobs have also not been updated within the predefined unit of time, this might indicate that some module is about to restart.

Related Documents: