[MODOAIPMH-492] Re-work asynchronous code for harvesting Created: 21/Mar/23 Updated: 08/Feb/24 Resolved: 22/Sep/23 |
|
| Status: | Closed |
| Project: | mod-oai-pmh |
| Components: | None |
| Affects versions: | None |
| Fix versions: | 3.12.0 |
| Type: | Story | Priority: | P3 |
| Reporter: | Viachaslau Khandramai (Inactive) | Assignee: | Oleksandr Bozhko |
| Resolution: | Done | Votes: | 0 |
| Labels: | back-end, firebird-release-notes-poppy | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original estimate: | Not Specified | ||
| Attachments: |
|
| Issue links: |
|
| Sprint: | Firebird - Sprint 172, Firebird - Sprint 170, Firebird - Sprint 171, Firebird - Sprint 173, Firebird - Sprint 174 |
| Story Points: | 5 |
| Development Team: | Firebird |
| Release: | Poppy (R2 2023) |
| RCA Group: | TBD |
| Description |
|
Purpose/Overview: Requirements/Scope:
Approach: Acceptance criteria:
|
| Comments |
| Comment by Magda Zacharska [ 03/May/23 ] |
|
Viachaslau Khandramai to provide more details. |
| Comment by Oleksandr Bozhko [ 08/Aug/23 ] |
|
Hi Magda Zacharska, while testing a full harvest on the perf environment I noticed the following cases that slow down the process: |
| Comment by Oleksandr Bozhko [ 08/Aug/23 ] |
|
Follow-up question: if some records in a response have to be skipped for some reason so that, for example, only 197 of 200 can be returned, is it acceptable to return 197 and avoid collecting 3 additional records? |
| Comment by Magda Zacharska [ 08/Aug/23 ] |
|
Hi Oleksandr Bozhko, all three issues you mentioned are considered bad data, and we see such examples in production as well. I understand that logging the issues slows the harvest down, but if we don't report them, libraries will not be able to address them. In general, it is OK to have fewer records than the value of MAX_RECORDS_PER_RESPONSE. We don't need to collect additional records so that the response contains the exact number of records. Re: 1. Are the duplicates disregarded, or are they added to the skipped records and to the metadata? That would be the preferred behavior. I will try to gather some data on how often such cases exist in production so that we have a better understanding of how often the issues occur. |
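The rule agreed in this comment (return fewer records than MAX_RECORDS_PER_RESPONSE instead of backfilling, and keep track of what was skipped) can be sketched roughly as follows. This is only an illustration, not the mod-oai-pmh implementation: `build_page`, the record dictionaries, the `is_valid` callback, and the limit of 200 are all hypothetical.

```python
MAX_RECORDS_PER_RESPONSE = 200  # hypothetical value, for illustration only


def build_page(batch, is_valid):
    """Split one fetched batch into returned and skipped records.

    The page is never topped up: if some records are skipped, it simply
    contains fewer than MAX_RECORDS_PER_RESPONSE records.
    """
    returned, skipped, seen_ids = [], [], set()
    for record in batch[:MAX_RECORDS_PER_RESPONSE]:
        instance_id = record["instance_id"]
        if instance_id in seen_ids:
            # Duplicate of a record already on this page: report it, don't return it.
            skipped.append((instance_id, "duplicate"))
        elif not is_valid(record):
            skipped.append((instance_id, "invalid"))
        else:
            seen_ids.add(instance_id)
            returned.append(record)
    return returned, skipped
```

With a batch of 200 records in which 3 are duplicates or invalid, `returned` holds 197 records and `skipped` holds the 3 reasons, which could then be logged or attached to the response metadata.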
| Comment by Oleksandr Bozhko [ 21/Aug/23 ] |
|
This story was verified on different environments, including folio-testing-sprint, snapshot, snapshot-2, and folio-perf. It now supports instances that were deleted through the API and dumped into the audit table (both FOLIO and MARC sources). The following are the results of different test runs: Test run:
Test run with more than 8 million records (260k FOLIO), of which 70k were skipped:
There was also a test run with 5 parallel full harvest executions on the folio-testing-sprint environment (3 marc21_withholdings + 2 marc21). During testing it was found that running more than 1 instance of the mod-oai-pmh/edge-oai-pmh modules improves the overall performance of full harvesting. This option needs additional testing. Please note that this approach does not support the bad-data case where 1 MARC instance in inventory storage has more than 1 record in SRS. In that case, only one MARC record is harvested and the rest are skipped. |
| Comment by Taras Spashchenko [ 22/Aug/23 ] |
|
To achieve better performance, I would propose the following approach to handling invalid instances/MARC records. Rather than implementing extra logic and checks during harvesting, we can develop an SQL script that reports all invalid instances/MARC records and share it with the administrators and hosting teams, so they can find the records and then fix them, or remove them if they are not needed. |
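As a rough sketch of what such a report could look like, the snippet below runs a grouping query against an in-memory SQLite table and lists instances that have more than one linked record, one of the bad-data cases mentioned in this thread. Everything here is an assumption made for illustration: the `srs_records` table, its columns, and the `report_duplicate_records` helper are hypothetical, and the real FOLIO schemas and database differ.

```python
import sqlite3

# Hypothetical, simplified table: one row per SRS record linked to an instance.
REPORT_SQL = """
SELECT instance_id, COUNT(*) AS record_count
FROM srs_records
GROUP BY instance_id
HAVING COUNT(*) > 1
ORDER BY instance_id
"""


def report_duplicate_records(rows):
    """Return (instance_id, record_count) for instances with more than one record.

    `rows` is an iterable of (record_id, instance_id) pairs loaded into a
    throwaway in-memory database so the query can be tried out.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE srs_records (record_id TEXT, instance_id TEXT)")
        conn.executemany("INSERT INTO srs_records VALUES (?, ?)", rows)
        return conn.execute(REPORT_SQL).fetchall()
    finally:
        conn.close()
```

Running the report outside the harvest keeps the per-record checks out of the hot path, which is the point of the proposal above.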
| Comment by Yauheniya Kryshtafovich [ 28/Aug/23 ] |
|
Hi Magda Zacharska and Oleksandr Bozhko. During verification of the story, I was able to run a successful harvest on a non-consortial tenant of the sprint testing environment for: |
| Comment by Yauheniya Kryshtafovich [ 28/Aug/23 ] |
|
Magda Zacharska and Oleksandr Bozhko, the issue can't be reproduced after restarting edge-oai-pmh and mod-oai-pmh. |
| Comment by Oleksandr Bozhko [ 20/Sep/23 ] |
|
GetRecord with source CONSORTIUM-FOLIO was verified on the perf environment: 1) Find any FOLIO record and change its source to CONSORTIUM-FOLIO: 2) Make a request with verb=GetRecord: Request with from/until verification: |
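For readers who want to repeat this kind of check, the requests can be assembled as shown in the sketch below. The parameter names (`verb`, `identifier`, `metadataPrefix`, `from`, `until`) come from the OAI-PMH protocol; the base URL, the identifier value, and the helper names are hypothetical, and any authentication parameters the real endpoint requires are omitted. Using ListRecords for the from/until request is an assumption, since the protocol defines `from` and `until` for list requests rather than for GetRecord.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the real OAI-PMH URL of the environment.
BASE_URL = "https://example.org/oai"


def get_record_url(identifier, metadata_prefix="marc21"):
    """Build a GetRecord request URL for a single record."""
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    }
    return f"{BASE_URL}?{urlencode(params)}"


def list_records_url(metadata_prefix, from_date, until_date):
    """Build a ListRecords request URL restricted to a from/until date range."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": from_date,
        "until": until_date,
    }
    return f"{BASE_URL}?{urlencode(params)}"
```

`urlencode` percent-encodes reserved characters in the identifier, so the value can be passed in unescaped.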
| Comment by Yauheniya Kryshtafovich [ 22/Sep/23 ] |
|
Hello Magda Zacharska, the results of the OAI-PMH harvests have been added to https://folio-org.atlassian.net/wiki/display/FOLIJET/OAI-PMH+Full+Harvests