[MODOAIPMH-492] Re-work asynchronous code for harvesting Created: 21/Mar/23  Updated: 08/Feb/24  Resolved: 22/Sep/23

Status: Closed
Project: mod-oai-pmh
Components: None
Affects versions: None
Fix versions: 3.12.0

Type: Story Priority: P3
Reporter: Viachaslau Khandramai (Inactive) Assignee: Oleksandr Bozhko
Resolution: Done Votes: 0
Labels: back-end, firebird-release-notes-poppy
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Attachments: JPEG File cons_folio.JPG     JPEG File from_until_inventory.JPG     JPEG File from_until_srs.JPG     JPEG File from_until_srs_and_inv.JPG     PNG File full_harvest_completed_ptf_host.png     JPEG File full_harvest_completed_remote_firebird_host.JPG     JPEG File get_record.JPG     PNG File image-2023-08-28-13-20-42-581.png     PNG File image-2023-08-28-13-21-19-400.png    
Issue links:
Blocks
is blocked by MODINVSTOR-1105 Create new field completeUpdatedDate ... Closed
Continues
continues MODOAIPMH-524 SRS-client with "shared" MARC records... Closed
Defines
defines UXPROD-4130 Improve OAI-PMH performance Closed
Relates
relates to FAT-7273 Review of C402333 test case (OAI_PMH ... Closed
relates to FAT-7274 Review of C402331 test case (OAI_PMH ... Closed
relates to FAT-7276 Review of C402361 test case (OAI_PMH ... Closed
relates to FAT-7277 Review of C402363 test case (OAI_PMH ... Closed
relates to FAT-7278 Review of C402367 test case (OAI_PMH ... Closed
relates to FAT-7280 Review of C402370 test case (OAI_PMH ... Closed
relates to FAT-7281 Review of C402372 test case (OAI_PMH ... Closed
relates to FAT-7282 Review of C402379 test case (OAI_PMH ... Closed
relates to FAT-7283 Review of C402375 test case (OAI_PMH ... Closed
relates to FAT-7284 Review of C402369 test case (OAI_PMH ... Closed
relates to FAT-7285 Review of C402373 test case (OAI_PMH ... Closed
relates to FAT-7286 Review of C402371 test case (OAI_PMH ... Closed
relates to FAT-7287 Review of C402364 test case (OAI_PMH ... Closed
relates to FAT-7289 Review of C405928 test case (OAI_PMH ... Closed
relates to FAT-7291 Review of C405929 test case (OAI_PMH ... Closed
relates to FAT-7292 Review of C406998 test case (OAI_PMH ... Closed
relates to MODOAIPMH-490 Inventory-client to views mechanism r... Closed
relates to MODOAIPMH-491 Implement query builder for the new a... Closed
Requires
requires MODINV-817 Shadow Instances - add support for Co... Closed
Sprint: Firebird - Sprint 172, Firebird - Sprint 170, Firebird - Sprint 171, Firebird - Sprint 173, Firebird - Sprint 174
Story Points: 5
Development Team: Firebird
Release: Poppy (R2 2023)
RCA Group: TBD

 Description   

Purpose/Overview:
The new approach to OAI-PMH harvesting uses the predefined views stored in the OAI-PMH schema through the QueryBuilder and queries the database directly to get instance data or MARC content.

Requirements/Scope:

  1. Refactor the existing logic and replace it with the new QueryBuilder-based approach.
  2. Test full harvest

Approach:
Review the existing code and integrate it with the new approach described in this document.

Acceptance criteria:

  • Full harvest works as expected
  • Records with sources MARC, FOLIO, CONSORTIUM-MARC or CONSORTIUM-FOLIO are included in incremental and full harvests
  • Unit tests are added/updated


 Comments   
Comment by Magda Zacharska [ 03/May/23 ]

Viachaslau Khandramai to provide more details.

Comment by Oleksandr Bozhko [ 08/Aug/23 ]

Hi Magda Zacharska, while testing full harvest on the perf environment I noticed the following cases that slow down the process:
1. More than one MARC record for one FOLIO record. In this case, the same instance ID is repeated more than once; the code treats the repeats as duplicates and returns only the first MARC record. For example, if MAX_RECORDS_PER_RESPONSE = 10 and 3 of the records share the same instance ID, the response will contain 8 records (not 10). Is this the correct behavior, or should all 3 MARC records be returned?
2. MARC instances without underlying SRS records. In this testing environment there are approx. 70k such instances among 8 million, and they are evenly distributed across the records. This means that for every 200 records there are 1-2 such bad instances. Currently they are treated as skipped and added to the metadata, but in the new approach such bad data degrades performance, and we agreed to remove it from the database. Is this still the solution, and can I expect that there is no such large amount of bad data in production?
3. MARC instances with an underlying SRS record and state = 'OLD'. This is like the previous issue, but with a bad state. Can I consider such instances bad data as well?

Comment by Oleksandr Bozhko [ 08/Aug/23 ]

Follow-up question: if some records in the response have to be skipped for some reason, so that, for example, only 197 of 200 can be returned, is it possible to return the 197 and avoid collecting 3 additional records?

Comment by Magda Zacharska [ 08/Aug/23 ]

Hi Oleksandr Bozhko, all three issues you mentioned are considered bad data, and we see such examples in production as well. I understand that logging the issues slows the harvest down, but if we don't report them, libraries will not be able to address them.

In general, it is OK to have fewer records than the value of MAX_RECORDS_PER_RESPONSE. We don't need to collect additional records so that the response contains the exact number of records.

Re: 1. Are the duplicates disregarded, or are they added to the skipped records and to the metadata? The latter would be the preferred behavior.
Re: 2. We cannot assume that the production data will not have such cases, and we cannot assume that we can delete them as we did in our data set. They need to be logged so that libraries can address the issues.
Re: 3. Definitely an example of bad data.
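
To illustrate the behavior discussed above (duplicates and bad data are reported as skipped, and the response is not topped up to MAX_RECORDS_PER_RESPONSE), here is a minimal sketch in Python. It is not the module's actual code: the `Page` class, the `build_page` function and the shape of the candidate rows are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field

MAX_RECORDS_PER_RESPONSE = 10  # value taken from the example in this thread


@dataclass
class Page:
    records: list = field(default_factory=list)  # records returned to the harvester
    skipped: list = field(default_factory=list)  # (instance_id, reason) pairs to report


def build_page(candidates, max_records=MAX_RECORDS_PER_RESPONSE):
    """Build one response page from at most `max_records` candidate rows.

    `candidates` is a hypothetical list of (instance_id, marc_or_none) tuples.
    Skipped rows are reported; no extra rows are fetched to replace them.
    """
    page = Page()
    seen = set()
    for instance_id, marc in candidates[:max_records]:
        if instance_id in seen:
            page.skipped.append((instance_id, "duplicate instance id"))
        elif marc is None:
            page.skipped.append((instance_id, "no underlying SRS record"))
        else:
            seen.add(instance_id)
            page.records.append((instance_id, marc))
    return page


# 10 candidates, 3 of which share the same instance ID.
candidates = [("a", "m1"), ("a", "m2"), ("a", "m3")]
candidates += [(f"i{n}", f"m{n}") for n in range(7)]
page = build_page(candidates)
print(len(page.records), len(page.skipped))
```

With the input from the example in this thread, the page holds 8 records and reports 2 skipped duplicates, so libraries still see the problem rows even though the response is shorter.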

I will try to gather some data about how often such cases occur in production so that we have a better understanding of the scale of the problem.

Comment by Oleksandr Bozhko [ 21/Aug/23 ]

This story was verified on different environments, including folio-testing-sprint, snapshot, snapshot-2 and folio-perf.

It now supports instances (with both FOLIO and MARC sources) that were deleted through the API and dumped into the audit table.

The following are the results of different test runs:

Test run:

  • recordsSource = Source records storage and Inventory
  • maxRecordsPerResponse = 500
  • verb = ListRecords
  • metadataPrefix = marc21_withholdings
  • processed 8111185 records
  • skipped 69682 records
    Total time: 15 h 25 min

Test run with more than 8 million records (260k FOLIO), of which 70k were skipped:

  • recordsSource = Source records storage and Inventory
  • maxRecordsPerResponse = 300
  • verb = ListRecords
  • metadataPrefix = marc21
  • processed 8111185 records
  • skipped 69682 records
    Total time: 10 h 09 min
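
For a rough comparison of the two runs above, the average throughput can be derived from the reported figures. The sketch below assumes that the "processed" count is the right numerator (the comment does not say whether it includes the skipped records); the function name is made up for this example.

```python
def records_per_second(processed, hours, minutes):
    """Average throughput of a harvest run in records per second."""
    return processed / (hours * 3600 + minutes * 60)


# Figures copied from the two test runs above.
runs = {
    "marc21_withholdings, maxRecordsPerResponse=500": (8111185, 15, 25),
    "marc21, maxRecordsPerResponse=300": (8111185, 10, 9),
}
for name, (processed, hours, minutes) in runs.items():
    rate = records_per_second(processed, hours, minutes)
    print(f"{name}: {rate:.0f} records/s")
```

This gives roughly 146 records/s for the marc21_withholdings run and roughly 222 records/s for the marc21 run. Since the two runs differ in both metadataPrefix and maxRecordsPerResponse, the numbers do not isolate the effect of either setting.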

There was also a test run with 5 parallel full harvest executions on the folio-testing-sprint environment (3 marc21_withholdings + 2 marc21). During the testing it was found that running more than one instance of the mod-oai-pmh/edge-oai-pmh modules improves the overall performance of full harvesting. This option needs additional testing.

Please note that this approach does not support the bad-data case where one MARC instance in the inventory storage has more than one record in SRS. In this case, only one MARC record is harvested and the rest are skipped.

Comment by Taras Spashchenko [ 22/Aug/23 ]

In order to achieve better performance, I would propose the following approach regarding handling invalid instances/MARC records.

We do not implement extra logic and checks during the harvesting. Instead, we can develop an SQL script that reports all invalid instances/MARC records and share it with the administrators and hosting teams, so that they can find these records and then fix or remove them if they are not needed.
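
As a sketch of what such a report could look like, the following Python snippet runs a query against an in-memory SQLite database. Everything about the schema is an assumption made for illustration: the `instance` and `marc_record` tables, their columns and the 'ACTUAL' state used for the healthy rows are hypothetical and much simpler than the real inventory and SRS storage. Only the three bad-data cases themselves come from this thread.

```python
import sqlite3

# Hypothetical, simplified tables; the real inventory and SRS schemas differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instance (id TEXT PRIMARY KEY, source TEXT);
    CREATE TABLE marc_record (id TEXT PRIMARY KEY, instance_id TEXT, state TEXT);
    INSERT INTO instance VALUES
        ('i1', 'MARC'), ('i2', 'MARC'), ('i3', 'MARC'), ('i4', 'MARC'), ('i5', 'FOLIO');
    INSERT INTO marc_record VALUES
        ('r1', 'i1', 'ACTUAL'),
        ('r3', 'i3', 'OLD'),
        ('r4a', 'i4', 'ACTUAL'),
        ('r4b', 'i4', 'ACTUAL');
""")

# Report MARC instances with no SRS record, more than one SRS record,
# or a single SRS record in state 'OLD'.
REPORT_SQL = """
    SELECT i.id,
           CASE
               WHEN COUNT(r.id) = 0 THEN 'no SRS record'
               WHEN COUNT(r.id) > 1 THEN 'more than one SRS record'
               ELSE 'SRS record in state OLD'
           END AS problem
    FROM instance i
    LEFT JOIN marc_record r ON r.instance_id = i.id
    WHERE i.source = 'MARC'
    GROUP BY i.id
    HAVING COUNT(r.id) <> 1 OR SUM(r.state = 'OLD') > 0
    ORDER BY i.id
"""

rows = conn.execute(REPORT_SQL).fetchall()
for instance_id, problem in rows:
    print(instance_id, problem)
```

On the sample data this lists i2, i3 and i4 with their respective problems and leaves the healthy MARC instance i1 and the FOLIO instance i5 out of the report.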

Comment by Yauheniya Kryshtafovich [ 28/Aug/23 ]

Hi Magda Zacharska and Oleksandr Bozhko. During verification of the story I was able to run a successful harvest on a tenant of the non-consortial sprint testing environment for:
1. SRS + Inventory, verb = ListRecords, metadataPrefix = marc21_withholdings:

2. Inventory, verb = ListRecords, metadataPrefix = marc21_withholdings:

But from time to time the harvest gets stuck at the very beginning and does not start at all when a user sends requests in parallel, or in parallel with another tenant.

Comment by Yauheniya Kryshtafovich [ 28/Aug/23 ]

Magda Zacharska and Oleksandr Bozhko, the issue can't be reproduced after a restart of edge-oai-pmh and mod-oai-pmh.

Comment by Oleksandr Bozhko [ 20/Sep/23 ]

GetRecord with source CONSORTIUM-FOLIO was verified on the perf environment:

1) Find any FOLIO record and change source to CONSORTIUM-FOLIO:

2) Make a request with verb=GetRecord:

Request with from/until verification:
1) Go to Settings -> OAI-PMH -> Behavior and change Records source to Source records storage and Inventory.
2) Make a request with from=2023-09-19 and until=2023-09-19:
3)
4) Go to Settings -> OAI-PMH -> Behavior and change Records source to Source records storage.
5) Make a request with from=2023-09-19 and until=2023-09-19:

6) Go to Settings -> OAI-PMH -> Behavior and change Records source to Inventory.
7) Make a request with from=2023-09-19 and until=2023-09-19:
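
The from/until requests in the steps above could be built along these lines. This is a minimal sketch: the base URL is a placeholder, the function name is made up, marc21 is used only as an example prefix (the comment does not say which one was used), and any authentication the endpoint requires is not modeled. The parameter names (verb, metadataPrefix, from, until) are the standard OAI-PMH request arguments.

```python
from urllib.parse import urlencode

# Hypothetical base URL; substitute the OAI-PMH endpoint of your deployment.
BASE_URL = "https://example.org/oai"


def list_records_url(metadata_prefix, from_date, until_date, base_url=BASE_URL):
    """Build a ListRecords request URL restricted to a from/until date range."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": from_date,
        "until": until_date,
    }
    return f"{base_url}?{urlencode(params)}"


url = list_records_url("marc21", "2023-09-19", "2023-09-19")
print(url)
```

The same URL can be reused for all three Records source settings, because the source is changed in Settings and not in the request itself.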

Comment by Yauheniya Kryshtafovich [ 22/Sep/23 ]

Hello Magda Zacharska, the results of the OAI-PMH harvests have been added to https://folio-org.atlassian.net/wiki/display/FOLIJET/OAI-PMH+Full+Harvests

Generated at Fri Feb 09 00:37:40 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.