Spike: Complete jobs that cannot update DB after failure
Description
Environment
Potential Workaround
relates to
Checklist
Activity

Kruthi Vuppala July 24, 2020 at 2:55 PM
Implementation story: https://folio-org.atlassian.net/browse/MDEXP-246

Kruthi Vuppala July 24, 2020 at 2:33 PM (edited)
Had a discussion with the available backend devs; we agreed on the 2nd approach. I will create an implementation story that just adds unit tests to the PR created in the scope of this spike. Another good point brought up was that migration scripts need to be written to add the new field to existing jobs; that will also be done in the scope of the implementation story.

Illia Borysenko July 24, 2020 at 12:51 PM
I also prefer the second approach of using the Okapi timer interface; approved from my side.

Kruthi Vuppala July 21, 2020 at 7:10 PM (edited)
Also, here is the PR with the code for the 2nd approach. If this approach seems OK, I will create a user story to formalize this process:
https://github.com/folio-org/mod-data-export/pull/127

Kruthi Vuppala July 21, 2020 at 3:43 PM (edited)
Observations/recommendations:
To complete jobs stuck in the IN_PROGRESS state, we can run a periodic process that checks for stuck jobs and then fails them.
This can be done in two ways:
1. Using the vertx.setPeriodic() method
We can set the timer to an hour or two, check for any jobs that have not been updated in that window, and then update the job status to FAIL. We already use this mechanism to clean up the uploaded and generated files every hour; this would be an additional timer.
2. Using the okapi timer interface
Okapi provides a timer interface (https://github.com/folio-org/okapi/blob/master/doc/guide.md#timer-interface) through which Okapi calls an endpoint periodically; the interval values are specified in the module descriptor. This endpoint can be configured to run every one or two hours to update the stuck jobs.
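As a sketch, a module descriptor entry for the Okapi timer interface might look like the following. The path and interval here are illustrative assumptions, not the module's actual endpoint; the real values would be defined in the implementation story:

```json
{
  "id": "_timer",
  "version": "1.0",
  "interfaceType": "system",
  "handlers": [
    {
      "methods": ["POST"],
      "pathPattern": "/data-export/clean-up-stuck-jobs",
      "unit": "hour",
      "delay": "1"
    }
  ]
}
```

Okapi calls the configured path on the module once per `delay` `unit`s, so changing the frequency is a descriptor change only.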
Advantages of using okapi interface over periodic job:
1. More control: if we need to change the cleanup frequency in the future, we do not need to touch the code; only the module descriptor needs to change.
2. The endpoint can also be called manually, instead of waiting for the timer to run, in case that is required in some environments.
Disadvantages of using okapi interface over periodic job:
1. We have had instances in the past where a timer job failed to trigger; however, that was due to Okapi issues. In this case we would also have a separate API as a workaround if needed.
2. The vertx job just needs a piece of code added, whereas the timer interface requires an API wrapper around it to be called.
The suggestion is to use the 2nd approach (the Okapi timer interface).
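Whichever trigger is used, the core check is the same: find jobs that have been IN_PROGRESS with no update for longer than a threshold, then fail them. Below is a minimal, self-contained sketch of that selection logic; the class, record, and field names are illustrative assumptions, not the module's actual API:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

public class StuckJobChecker {

    // Illustrative stand-in for a row of the job_execution table.
    public record JobExecution(String id, String status, Instant lastUpdated) {}

    // Jobs not updated for longer than this are considered stuck (assumed value).
    static final Duration STUCK_THRESHOLD = Duration.ofHours(1);

    /** Returns ids of jobs still IN_PROGRESS whose last update is older than the threshold. */
    public static List<String> findStuckJobIds(List<JobExecution> jobs, Instant now) {
        return jobs.stream()
                .filter(j -> "IN_PROGRESS".equals(j.status()))
                .filter(j -> Duration.between(j.lastUpdated(), now).compareTo(STUCK_THRESHOLD) > 0)
                .map(JobExecution::id)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2020-07-21T12:00:00Z");
        List<JobExecution> jobs = List.of(
                new JobExecution("job-1", "IN_PROGRESS", now.minus(Duration.ofHours(3))),
                new JobExecution("job-2", "IN_PROGRESS", now.minus(Duration.ofMinutes(10))),
                new JobExecution("job-3", "COMPLETED", now.minus(Duration.ofHours(5))));
        // Only job-1 is both IN_PROGRESS and older than the threshold.
        System.out.println(findStuckJobIds(jobs, now)); // prints [job-1]
    }
}
```

In mod-data-export this check would be invoked either from a `vertx.setPeriodic` callback (approach 1) or from the handler behind the Okapi timer endpoint (approach 2), followed by a DB update setting each returned job's status to FAIL.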
Details
Assignee: Kruthi Vuppala
Reporter: Magda Zacharska
Labels:
Priority: P3
Story Points: 3
Sprint: None
Development Team: Concorde
Purpose/Overview:
Requirements/Scope:
During the tests in the bugfest environment, the database was restarted shortly after the data export started. As a result, the job was stuck in the Running state. The log showed
After this exception, the logs indicate that the job in fact completed, with an error saying there were no records to export, because neither SRS nor inventory returned any records due to the above exception. However, since the DB was not available, the status in the job_execution table could not be updated, thereby leaving the job in the IN_PROGRESS state.
Approach:
Acceptance criteria:
Come up with a way to change the status of a job that has been stuck in the IN_PROGRESS state for too long
In cases where the DB is not reachable, the job should fail and the error logs should contain valid information.
Ideally, the job should attempt to get a new token and continue (if this is possible)
Document findings