Spike: Complete jobs that cannot update DB after failure

Description

Purpose/Overview:

Requirements/Scope:
During tests in the bugfest environment, the database was restarted shortly after the data export started. As a result, the job was stuck in the Running state. The log showed:

After this exception, the logs indicate that the job in fact completed with an error saying there were no records to export, because neither SRS nor Inventory returned any records due to the above exception. However, as the DB was not available, the status in the job_execution table could not be updated, thereby leaving the job in the IN_PROGRESS state.

Approach:

Acceptance criteria:

  • Come up with a way to change the status of a job if it is stuck in the IN_PROGRESS state for too long

  • In cases where the DB is not reachable, the job should fail and the error logs should contain appropriate information

  • Ideally, the job should attempt to get a new token and continue (if this is possible)

  • Document findings

Environment

None

Potential Workaround

None

Checklist


TestRail: Results

Activity


Kruthi Vuppala July 24, 2020 at 2:55 PM

Kruthi Vuppala July 24, 2020 at 2:33 PM
Edited

Had a discussion with the available backend devs; we agreed on the 2nd approach. I will create an implementation story that will just add unit tests to the PR created in the scope of this story. Also, one good point brought up was that migration scripts need to be written to add the new field to existing jobs. That will also be done in the scope of the implementation story.

Illia Borysenko July 24, 2020 at 12:51 PM

I also prefer the second approach of using the Okapi timer interface; approved from my side.

Kruthi Vuppala July 21, 2020 at 7:10 PM
Edited

Also, here is the PR with the code for the 2nd approach. If this approach seems OK, I will create a user story to formalize this process:
https://github.com/folio-org/mod-data-export/pull/127

Kruthi Vuppala July 21, 2020 at 3:43 PM
Edited

Observations/recommendations:
To complete a job stuck in the IN_PROGRESS state, we can run a periodic process that checks whether any jobs are stuck and then FAILs them.
This can be done in two ways (a sketch of the stuck-job check is shown after this list):
1. Using the vertx.setPeriodic() method
We can set the timer to one or two hours, check for any jobs that haven't been updated in 1-2 hours, and then update the job status to FAIL. We already use this mechanism to clean up the uploaded and generated files every hour; this would be an additional timer.

2. Using the Okapi timer interface
Okapi provides a timer interface (https://github.com/folio-org/okapi/blob/master/doc/guide.md#timer-interface) where Okapi calls the endpoint periodically; the schedule values are specified in the module descriptor. This endpoint can be set to run every 1 or 2 hours to update the stuck jobs.
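
For illustration, here is a minimal sketch of what the stuck-job check could look like with approach 1. The JobExecutionDao interface, its failJobsNotUpdatedSince method, and the interval/threshold values are hypothetical stand-ins, not the module's actual API:

```java
import io.vertx.core.AbstractVerticle;
import io.vertx.core.Future;

// Sketch of approach 1: an additional periodic timer inside the module itself.
public class StuckJobCheckVerticle extends AbstractVerticle {

  private static final long CHECK_INTERVAL_MS = 60L * 60 * 1000;      // run every hour (example)
  private static final long STUCK_THRESHOLD_MS = 2L * 60 * 60 * 1000; // 2 hours without update (example)

  // Hypothetical DAO abstraction: marks IN_PROGRESS rows in job_execution that were
  // not updated since the cutoff as FAIL and reports how many rows were changed.
  public interface JobExecutionDao {
    Future<Integer> failJobsNotUpdatedSince(long cutoffEpochMillis);
  }

  private final JobExecutionDao jobExecutionDao;

  public StuckJobCheckVerticle(JobExecutionDao jobExecutionDao) {
    this.jobExecutionDao = jobExecutionDao;
  }

  @Override
  public void start() {
    // Additional timer alongside the existing hourly file-cleanup timer
    vertx.setPeriodic(CHECK_INTERVAL_MS, timerId ->
        jobExecutionDao.failJobsNotUpdatedSince(System.currentTimeMillis() - STUCK_THRESHOLD_MS)
            .onComplete(ar -> {
              if (ar.succeeded()) {
                System.out.println("Marked " + ar.result() + " stuck job(s) as FAIL");
              } else {
                System.err.println("Stuck-job check failed: " + ar.cause().getMessage());
              }
            }));
  }
}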

Advantages of using the Okapi interface over a periodic job:
1. More control: if we need to change the cleanup frequency in the future, we do not need to touch the code; only the module descriptor needs to be changed.
2. The endpoint can also be called manually, instead of having to wait for the timer to run, in case that is required in some environments.

Disadvantages of using the Okapi interface over a periodic job:
1. We have had instances in the past where a timer job failed to trigger; however, that was due to Okapi issues. In this case we would also have a separate API as a workaround if needed.
2. The Vert.x job only needs a piece of code added, whereas the timer interface requires an API wrapper around it so that it can be called.

The suggestion is to use the 2nd approach (the Okapi timer interface).
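
For approach 2, the same check would sit behind an HTTP endpoint that Okapi invokes on the schedule declared in the module descriptor's "_timer" interface. A rough sketch assuming a Vert.x Web router; the path, class name, and threshold are illustrative, not the module's actual API:

```java
import io.vertx.core.Vertx;
import io.vertx.ext.web.Router;
import io.vertx.ext.web.RoutingContext;

// Sketch of the endpoint Okapi's timer would call periodically; it can also be
// invoked manually when an immediate cleanup is needed.
public class StuckJobCleanupResource {

  public Router createRouter(Vertx vertx) {
    Router router = Router.router(vertx);
    // The path and schedule would be declared in the module descriptor's "_timer" interface
    router.post("/data-export/clean-up-stuck-jobs").handler(this::cleanUpStuckJobs);
    return router;
  }

  private void cleanUpStuckJobs(RoutingContext ctx) {
    long cutoff = System.currentTimeMillis() - 2L * 60 * 60 * 1000; // assumed 2-hour threshold
    // ... mark IN_PROGRESS rows in job_execution not updated since `cutoff` as FAIL ...
    ctx.response().setStatusCode(204).end();
  }
}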

Done

Details

Assignee

Reporter

Priority

Story Points

Sprint

Development Team

Concorde

TestRail: Cases


TestRail: Runs


Created June 23, 2020 at 5:46 PM
Updated November 4, 2021 at 8:36 PM
Resolved July 24, 2020 at 6:38 PM