ARCH-36 Provide a way to update MARC authority records when mapping rules have changed
- 1 Jira links
- 2 Overview
- 3 Scope
- 4 Disadvantages of existing Script
- 5 Solutions
- 5.1 Diagram of current solution
- 5.2 Solution 1 - Master-script and component affine scripts (M-Size)
- 5.3 Solution 2 - Embedded (L-Size)
- 5.3.1 Diagram
- 5.4 Implementation details for embedded solution
- 5.5 Solution 3 - Current script and codebase enhancements to deal with batches (L-Size)
- 5.5.1 Concerns
- 5.6 Solution 4 - run affected records through regular update DI flow (S-Size)
- 5.6.1 Steps
- 5.6.2 Diagram
- 5.6.3 Scalability
- 5.6.4 Concerns
- 6 Delivery Plan and LOE
- 6.1 Solution 4 (S)
- 6.2 Addressed Concerns
- 6.2.1 Confirm the MARC field used for matching
- 6.2.2 Confirm that the user will have access to any errors through DI logs
- 6.2.3 Confirm that the Job Profile will state <<Release name>> MARC authority upgrade
- 6.2.4 Confirm that a setting will be available that allows the System Admin or Hosting Provider to set the number of authority records to include in batch file to update
- 6.2.5 Confirm that we will provide documentation NOT just a release note about this process
- 6.2.6 Confirm that authority control linking to bib records and this process is support with Option 4
- 6.2.7 Also need to create another ARCH story for longer term solution that we aim to implement for Poppy or Queen Anne's Lace release.
- 7 Rationale
Jira links
https://folio-org.atlassian.net/browse/ARCH-36
Overview
According to the raised issue, updating authorities (and, once released in the near future, all linked bib records) performs poorly when authority mapping rules are updated. As the mod-source-record-manager codebase is currently written, authority mapping rules cannot be changed via the REST API; they change only during migration from an old release to a new one. A script (derived from the original one) currently exists that is tailored to update the affected authorities, but in general an embedded solution should be elaborated, with the script kept as a back-up option. The script is therefore also to be updated to support updating linked bib records (relevant starting from the Orchid to Poppy migration onward).
Scope
An embedded solution is to be elaborated to properly apply authority mapping rule updates.
The existing script is also to be updated to support the corresponding changes on linked bib records when authorities are updated after a mapping rules change.
Performance of either the existing script or the embedded solution is not, or will not be, sufficient: at 1 second per record (a good scenario), updating 2.6M records takes about one month of the script running 24/7.
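The one-month figure follows directly from the stated rate. A quick check, assuming exactly 1 second per record and strictly sequential processing with no pauses:

```python
from datetime import timedelta

RECORDS = 2_600_000        # authority records, figure from the scope above
SECONDS_PER_RECORD = 1.0   # "good scenario" rate from the scope above


def total_duration(records: int, seconds_per_record: float) -> timedelta:
    """Wall-clock time for strictly sequential, one-by-one processing."""
    return timedelta(seconds=records * seconds_per_record)


duration = total_duration(RECORDS, SECONDS_PER_RECORD)
print(duration.days, "days")  # 30 days (plus a little over 2 hours)
```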
Disadvantages of existing Script
The existing UpdateMarcToAuthorityMapping.sh script is not optimal for large updates or for handling linked bibliographic records.
Summary of disadvantages:
It must be run manually, specifying limits on the records to handle on each run.
It is slow, because it handles version updates by calling the /change-manager REST API twice for each record, one by one: first to retrieve the record (GET), then to update it (PUT). A batch approach needs to be implemented.
Future handling of linked bib records is not considered (that handling will likely live outside the script and be managed by, e.g., change-manager).
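To illustrate why the one-by-one approach is slow, the sketch below counts HTTP calls for the current script (one GET and one PUT per record) against a batched variant. The batched variant is an assumption for illustration only: it supposes a hypothetical batch API that needs one read and one write call per batch, and the batch size of 1,000 is arbitrary.

```python
import math


def http_calls_per_record(records: int) -> int:
    """Current script: one GET plus one PUT for every record."""
    return 2 * records


def http_calls_batched(records: int, batch_size: int) -> int:
    """Hypothetical batch API (assumption): one read and one write per batch."""
    return 2 * math.ceil(records / batch_size)


records = 2_600_000
print(http_calls_per_record(records))      # 5200000
print(http_calls_batched(records, 1_000))  # 5200
```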
Solutions
Solution 1: Write another master script (with component-affine scripts) that interacts directly with the databases and indexes and alters the corresponding authorities without excessive HTTP calls and DB transactions.
- Measured baseline performance of the Postgres database on a simple update of 2.6M authority records (adding or removing characters in an arbitrary column value) is roughly 10,000 records per second: the synthetic test took 3 minutes 50 seconds to update 2.6M authority records.
- With Postgres embedded JSON capabilities, such updates could be slower but still effective: even with a 10x degradation, a full update of 2.6 million records takes less than 1 hour per table. If several components are affected, it could take up to, e.g., 5 hours.
- NOTE: which components require corresponding updates (such as ES indexes and others) is still to be checked, but a direct bulk update is a significant performance boost either way.
- To be performed during non-working hours to avoid concurrency (locking) issues.
Solution 2: Write an embedded solution that performs the updates on mod-srm initialization (/postTenant callback), but in bulk (the update must be reimplemented to support batch processing).
- Advantage for clients: authorities are updated under the hood, out of the box, on module initialization.
- Concern: customers will probably require more visibility into the updates. This is feasible by showing progress in the UI, since progress is easier to track than when running external scripts.
Solution 3: Enhance the script to work in chunks (codebase enhancements are still needed to support batch processing).
- Performance is still affected by extensive HTTP calls, and Out Of Memory exceptions are possible when limited RAM is allocated for script execution and a large number of records is fetched or processed at the same time.
- Changes are required in mod-srm as well as in all modules in the update chain (mod-srs, mod-inventory-storage, mod-search) to support bulk updates.
Solution 4: Write another script or standalone application that retrieves records from mod-srs and submits them in chunks to DI via its API.
- Scalability is possible by increasing the number of srm, srs and inventory instances, with partitions matched to consumer groups (performance boost).
- Performance could still be insufficient for 10M+ records, in which case Solution 1 with direct data processing could be considered.
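The Solution 1 estimate above can be checked from the synthetic test figures. The sketch derives the throughput from the measured time and applies the 10x degradation case; only the numbers quoted above are used.

```python
RECORDS = 2_600_000
MEASURED_SECONDS = 3 * 60 + 50  # 3 min 50 s from the synthetic test

throughput = RECORDS / MEASURED_SECONDS    # records per second
degraded_seconds = MEASURED_SECONDS * 10   # 10x slower scenario

print(round(throughput))               # 11304
print(round(degraded_seconds / 60, 1)) # 38.3 (minutes, under 1 hour per table)
```

The derived rate is slightly above the rounded 10,000 records per second, and the degraded case stays well under one hour per table.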
Diagram of current solution
Solution 1 - Master-script and component affine scripts (M-Size)
The advantage of this on-top solution is that no internal codebase change is required: only external scripts deal with the corresponding components.
Diagram
Concerns
Security - Scripts have direct access to the data sources of core components (database, ES indexes). Recommendation: make sure that only an authorized admin runs the scripts.
Data consistency - Incorrect usage could leave the data in an inconsistent state. Recommendations: do not interrupt the scripts until they finish; do not change records concurrently until the scripts finish; track progress properly.
Solution 2 - Embedded (L-Size)
Diagram
Implementation details for embedded solution
New logic implementation on mod-srm to be outlined on application startup.
mod-source-record-manager implied changes
mod-source-record-storage implied changes
mod-inventory-storage implied changes
mod-search implied changes
Peculiarities of implementation (to be described after Solution 4 implementation)
Solution 3 - Current script and codebase enhancements to deal with batches (L-Size)
Implies the Solution 2 codebase enhancements, with the corresponding interactions with the newly created Bulk APIs.
The difference from Solution 2 is that this solution does not launch the authority update automatically on module initialization; instead, the script performs batch processing against the API.
Concerns
Out Of Memory errors could occur if the running script is allocated limited resources and the number of records handled simultaneously is overwhelming. Recommendation: determine proper resource allocation and batch size limits.
Solution 4 - run affected records through regular update DI flow (S-Size)
Steps
Write a new standalone Java application that takes all MARC authority records from mod-srs and passes them to the Data Import mechanism in bulk chunks (e.g., 50,000 records each). The application must also create a temporary Update Import Profile (removed once the update finishes) and collect statistics on successful and failed items.
Measure the performance of handling 1M and 10M records.
Consider improving the approach to change items directly on mod-inventory and mod-search via batch handling.
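The control flow of the first step can be sketched as follows. This is an illustration in Python rather than the planned Java application, and it does not use any real mod-srs or Data Import API: `create_profile`, `submit_chunk` and `delete_profile` are hypothetical callables standing in for those interactions, and `submit_chunk` is assumed to return the number of failed records in the chunk.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Callable, Iterable, Iterator, List


@dataclass
class Stats:
    succeeded: int = 0
    failed: int = 0


def chunks(records: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` records, reading lazily."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def run_update(
    records: Iterable[str],
    create_profile: Callable[[], str],              # hypothetical: returns a profile id
    submit_chunk: Callable[[str, List[str]], int],  # hypothetical: returns failed count
    delete_profile: Callable[[str], None],          # hypothetical
    chunk_size: int = 50_000,
) -> Stats:
    stats = Stats()
    profile_id = create_profile()
    try:
        for chunk in chunks(records, chunk_size):
            failed = submit_chunk(profile_id, chunk)
            stats.failed += failed
            stats.succeeded += len(chunk) - failed
    finally:
        # Remove the temporary profile even if a chunk submission raises.
        delete_profile(profile_id)
    return stats


# Demo with in-memory fakes: 120,000 records, one pretend failure per chunk.
deleted: List[str] = []
result = run_update(
    records=(f"authority-{i}" for i in range(120_000)),
    create_profile=lambda: "temp-profile",
    submit_chunk=lambda profile_id, chunk: 1,
    delete_profile=deleted.append,
)
print(result)  # Stats(succeeded=119997, failed=3)
```

Reading the records lazily keeps only one chunk in memory at a time, which matters for the Out Of Memory concern raised for script-based processing.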
Diagram
Scalability
Performance can be tuned further by increasing:
the database connection pool and allocated memory
the number of instances of mod-srm, mod-srs and mod-inventory
the number of Kafka partitions, to correspond to the scaled module instances
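The last item matters because, within a single Kafka consumer group, each partition is consumed by at most one consumer, so instances beyond the partition count stay idle. A minimal sketch, assuming each module instance runs one consumer per topic in the same group (the function names are illustrative):

```python
def active_consumers(instances: int, partitions: int) -> int:
    """Consumers in one group that can receive data: at most one per partition."""
    return min(instances, partitions)


def partitions_needed(instances: int, current_partitions: int) -> int:
    """Smallest partition count that keeps every instance busy."""
    return max(instances, current_partitions)


print(active_consumers(instances=4, partitions=2))           # 2
print(partitions_needed(instances=4, current_partitions=2))  # 4
```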