ARCH-36 Provide a way to update MARC authority records when mapping rules have changed
Jira links
- ARCH-36
Overview
According to the raised issue, updating authorities (and, once released in the near future, all linked bib records) performs poorly when authority mapping rules are updated. As the mod-source-record-manager codebase is written, authority mapping rules cannot be changed via the REST API; they change only during migration from an old release to a new one. Currently there is a script (derived from the original one) tailored to update the affected authorities. In general, however, an embedded solution should be elaborated, with the script kept as a back-up option. The script is therefore also to be updated to support updating linked bib records, which becomes relevant starting with the migration from Orchid to Poppy and onward.
Scope
An embedded solution is to be elaborated that properly applies authority mapping rules updates.
The existing script is also to be updated to support the corresponding changes on linked bib records when authorities are updated after a mapping rules change.
The performance of neither the existing script nor the embedded solution is or will be sufficient: at 1 second per record, which is a good scenario, updating 2.6M records takes about one month of the script running 24/7.
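The one-month figure follows from simple arithmetic. A minimal sketch, assuming the 1-second-per-record rate quoted above and strictly sequential processing; the function name is illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_update(record_count: int, seconds_per_record: float) -> float:
    """Wall-clock days needed when records are updated one after another."""
    return record_count * seconds_per_record / SECONDS_PER_DAY


estimate = days_to_update(2_600_000, 1.0)
print(f"{estimate:.1f} days")  # 30.1 days
```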
Disadvantages of existing Script
The existing UpdateMarcToAuthorityMapping.sh script is not well suited to large updates or to handling linked bibliographic records.
Summary of disadvantages:
- It must be run manually, and each run requires specifying limits for the records to handle.
- It is slow: it updates each record individually with two calls to the /change-manager REST API, a GET to retrieve the record and a PUT to update it. A batch approach needs to be implemented.
- Future handling of linked bib records is not covered (this handling will likely live outside the script and be managed by, e.g., change-manager).
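To illustrate why the one-by-one GET and PUT pattern is costly, the sketch below counts HTTP requests for a full run. It assumes a hypothetical batch API that would also need one read and one write request per batch; no such endpoint exists in the current codebase, and the 50,000 batch size is only an example.

```python
import math


def request_count(records: int, batch_size: int = 1) -> int:
    """Requests needed when every batch costs one read and one write call."""
    return 2 * math.ceil(records / batch_size)


print(request_count(2_600_000))          # 5200000 (current per-record script)
print(request_count(2_600_000, 50_000))  # 104 (hypothetical batch API)
```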
Solutions
- Write another master script (and component affine scripts) that interacts directly with the databases and indexes and alters the corresponding authorities without excessive HTTP calls and DB transactions.
  - The measured baseline performance of the Postgres database on a simple update of 2.6M authority records (adding or removing characters in an arbitrary column value) is about 10,000 records per second: the synthetic test took 3 minutes 50 seconds. Updates that rely on Postgres's embedded JSON capabilities could be slower but still effective: even with a 10-fold degradation, a full update of 2.6 million records takes less than 1 hour per table. If several components are affected, it can take up to, e.g., 5 hours.
  - NOTE: which components require corresponding updates (such as ES indexes and others) is still to be checked, but a direct bulk update is a significant performance boost in any case.
  - To be performed during non-working hours to avoid concurrency (locking) issues.
- Write an embedded solution that performs the updates on mod-srm initialization (/postTenant callback), but in bulk (the update must be reimplemented to support batch processing).
  - Advantage for clients: authorities are updated under the hood and out of the box on module initialization.
  - Concern: customers will probably need more visibility into the updates. This is feasible by showing progress in the UI, since progress is easier to track here than when running external scripts.
- Enhance the script to work with chunks (codebase enhancements are still needed to support batch processing).
  - Performance is still affected by extensive HTTP calls, and Out Of Memory errors are possible if the script has limited RAM and a large number of records is fetched or processed at the same time.
  - Changes are required in mod-srm and in every module in the update chain (mod-srs, mod-inventory-storage, mod-search) to support bulk update.
- Write another script or standalone application that retrieves records from mod-srs and submits them in chunks to DI via its API.
  - Scalability is possible by increasing the number of srm, srs and inventory instances and of partitions matched to consumer groups (performance boost).
  - Performance could still be insufficient for 10M+ records, in which case solution 1 with direct data processing could be considered.
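The Postgres estimate given for the first option can be checked with a few lines of arithmetic. This sketch uses only the figures quoted in the list (2.6M records, 3 minutes 50 seconds, a 10-fold degradation); the variable names are illustrative.

```python
records = 2_600_000
measured_seconds = 3 * 60 + 50  # synthetic test: 3 min 50 s

throughput = records / measured_seconds
degraded_minutes = measured_seconds * 10 / 60  # 10-fold slower update

print(f"{throughput:.0f} records/s")              # 11304 records/s
print(f"{degraded_minutes:.1f} min per table")    # 38.3 min per table
```

The measured run is slightly faster than the rounded 10,000 records per second, and the degraded case stays well under the 1 hour per table stated above.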
Diagram of current solution
Solution 1 - Master-script and component affine scripts (M-Size)
The advantage of this on-top solution is that no internal codebase change is required: only external scripts deal with the corresponding components.
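A minimal sketch of what a direct, batched update could look like. To stay runnable it uses an in-memory SQLite database as a stand-in for Postgres, and the `authority` table, the `heading` column, and the `upper()` transformation are all hypothetical placeholders, not the real FOLIO schema or mapping logic. The point is only the pattern: walk the table in id order and commit once per batch rather than once per record.

```python
import sqlite3


def bulk_update(conn: sqlite3.Connection, batch_size: int) -> int:
    """Rewrite every row in id-ordered batches, committing once per batch."""
    last_id, updated = 0, 0
    while True:
        ids = [row[0] for row in conn.execute(
            "SELECT id FROM authority WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size))]
        if not ids:
            return updated
        # Placeholder transformation standing in for the re-mapping step.
        conn.execute(
            "UPDATE authority SET heading = upper(heading) "
            "WHERE id > ? AND id <= ?",
            (last_id, ids[-1]))
        conn.commit()
        updated += len(ids)
        last_id = ids[-1]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authority (id INTEGER PRIMARY KEY, heading TEXT)")
conn.executemany("INSERT INTO authority (heading) VALUES (?)",
                 [(f"heading {i}",) for i in range(1000)])
conn.commit()
print(bulk_update(conn, 250))  # 1000
```

Committing per batch keeps each transaction short, which also makes it possible to record the last processed id and resume after an interruption.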
Diagram
Concerns
Security - The scripts access the data sources of core components (database, ES indexes) directly. Recommendation: make sure that an authorized admin runs the scripts.
Data consistency - Incorrect usage could leave the data in an inconsistent state. Recommendations: do not interrupt the scripts before they finish; do not change records concurrently until the scripts finish; track progress properly.
Solution 2 - Embedded (L-Size)
Diagram
Implementation details for embedded solution
New logic implementation on mod-srm to be outlined on application startup.
mod-source-record-manager implied changes
mod-source-record-storage implied changes
mod-inventory-storage implied changes
mod-search implied changes
Peculiarities of implementation (to be described after Solution 4 implementation)
Solution 3 - Current script and codebase enhancements to deal with batches (L-Size)
Implies the Solution 2 codebase enhancements, with the script interacting with the newly created Bulk APIs.
The difference from Solution 2 is that this solution does not launch the authority update automatically on module initialization; instead, the script sends batches to the API.
Concerns
Out Of Memory errors are possible if the running script is allocated limited resources and the number of records processed simultaneously is too large. Recommendation: determine proper resource allocation and batch size limits.
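One common way to keep memory bounded is to read records lazily and hold only one batch at a time. The sketch below is an assumption about how a script could do this, not part of the existing script; `in_batches` is a hypothetical helper and the sizes are examples.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(records: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most batch_size items; only one list is alive at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


sizes = [len(batch) for batch in in_batches(range(120_000), 50_000)]
print(sizes)  # [50000, 50000, 20000]
```

With this shape, peak memory depends on the batch size rather than on the total number of records, so the batch size limit can be tuned to the resources allocated to the script.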
Solution 4 - run affected records through regular update DI flow (S-Size)
Steps
- Write a new standalone Java application that takes all MARC authority records from mod-srs and passes them to the Data Import mechanism in bulk chunks (e.g., 50,000 records each). The application must also create a temporary Update Import Profile (and remove it once the update has finished) and collect statistics on successful and failed items.
- Measure the performance of handling 1M and 10M records.
- Consider improving the approach to change items directly on mod-inventory and mod-search via batch handling.
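The control flow of the first step can be sketched as follows. It is written in Python for brevity, although the application itself is planned in Java, and every callable (`create_profile`, `submit_chunk`, `delete_profile`) is a hypothetical stand-in for the real Data Import calls, which are not specified here. The sketch shows the temporary profile being removed even if a chunk fails, and success and failure counts being accumulated.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class UpdateStats:
    succeeded: int = 0
    failed: int = 0
    failed_ids: List[str] = field(default_factory=list)


def run_update(chunks: Iterable[List[str]],
               create_profile: Callable[[], str],
               submit_chunk: Callable[[str, List[str]], List[str]],
               delete_profile: Callable[[str], None]) -> UpdateStats:
    """Submit each chunk under a temporary profile and collect statistics."""
    stats = UpdateStats()
    profile_id = create_profile()
    try:
        for chunk in chunks:
            failed = submit_chunk(profile_id, chunk)  # ids that failed to update
            stats.failed_ids.extend(failed)
            stats.failed += len(failed)
            stats.succeeded += len(chunk) - len(failed)
    finally:
        delete_profile(profile_id)  # always remove the temporary profile
    return stats


deleted: List[str] = []
stats = run_update(
    chunks=[["a1", "a2", "a3"], ["a4", "a5"]],
    create_profile=lambda: "tmp-profile",
    submit_chunk=lambda profile_id, chunk: [r for r in chunk if r == "a4"],
    delete_profile=deleted.append,
)
print(stats.succeeded, stats.failed, stats.failed_ids)  # 4 1 ['a4']
```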
Diagram
Scalability
Performance can be tuned further by increasing:
- the database connection pool and allocated memory
- the number of mod-srm, mod-srs and mod-inventory instances
- the number of Kafka partitions, to match the scaled module instances
Concerns
- How to handle failed items is not defined: whether to retry the import or just generate a report that the user can retrieve.
- For scalability, Kafka partitions must be properly configured to match consumer groups 1-1. There will be no performance boost if something is misconfigured. Extra observations are required.
- Records may be modified concurrently by other processes.
- The mod-srs step is considered redundant in this case.
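The partition concern comes from how Kafka consumer groups work: within one group, each partition is consumed by at most one consumer, so instances beyond the partition count stay idle. A small sketch of that relationship for a single topic and a single consumer group; the function name and the example numbers are illustrative.

```python
from typing import Tuple


def consumer_utilisation(partitions: int, instances: int) -> Tuple[int, int]:
    """Return (active, idle) consumers for one topic and one consumer group."""
    active = min(partitions, instances)
    return active, instances - active


for partitions, instances in [(1, 4), (4, 4), (8, 4)]:
    active, idle = consumer_utilisation(partitions, instances)
    print(f"{partitions} partitions, {instances} instances -> "
          f"{active} active, {idle} idle")
# 1 partitions, 4 instances -> 1 active, 3 idle
# 4 partitions, 4 instances -> 4 active, 0 idle
# 8 partitions, 4 instances -> 4 active, 0 idle
```

Adding module instances without adding partitions therefore gives no extra throughput for that topic.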
Delivery Plan and LOE
Solution 4 (S)
- Write new Standalone Application (1 sprint)
- Measure Performance (0.5 sprint)
- Consider improvements (0.5 sprint)
Addressed Concerns
Confirm the MARC field used for matching
Matching is supposed to be done via 999 ff$s.
Confirm that the user will have access to any errors through DI logs
Because records are sent in batches via DI through the regular import flow, import jobs are created accordingly. Hence reports are created as usual.
Confirm that the Job Profile will state <<Release name>> MARC authority upgrade
During implementation of the Standalone Application, a new Story will be created to describe the details of creating the temporary Update Profile.
Confirm that a setting will be available that allows the System Admin or Hosting Provider to set the number of authority records to include in batch file to update
During implementation of the Standalone Application, a new Story will be created to describe how the batch size is set: either as a program argument passed at application launch or, in the case of a UI app, through an input text field. The default is supposed to be 50,000.
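As an illustration of the program-argument option, the sketch below parses a batch size with a default of 50,000. The `--batch-size` flag name is an assumption, since the actual argument names are left to the future Story, and Python is used only for brevity.

```python
import argparse
from typing import List, Optional


def positive_int(value: str) -> int:
    """Accept only whole numbers greater than zero."""
    number = int(value)
    if number < 1:
        raise argparse.ArgumentTypeError("batch size must be positive")
    return number


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="MARC authority update (illustrative)")
    parser.add_argument("--batch-size", type=positive_int, default=50_000,
                        help="number of authority records per batch")
    return parser.parse_args(argv)


print(parse_args([]).batch_size)                         # 50000
print(parse_args(["--batch-size", "10000"]).batch_size)  # 10000
```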
Confirm that we will provide documentation NOT just a release note about this process
During implementation of the Standalone Application, a new Story will be created to cover customer documentation in the form of a User Guide.
Confirm that authority control linking to bib records and this process is supported with Option 4
The analysis produced the following outcomes:
1) linkage (according to the schema) is done specifically on source records (mod-source-record-storage)
2) source records remain the same during a mapping rules update
3) only records within mod-inventory-storage change according to the change of mapping rules
4) linking rules on mod-entities-links also remain the same
Given the above, the stored linkage of records is expected to remain unchanged, because the records within mod-source-record-storage are kept unchanged.
Another ARCH story also needs to be created for the longer-term solution that we aim to implement for the Poppy or Queen Anne's Lace release.
Solution 2 implies codebase modifications (rather than being built on top without codebase changes). It is considered the most effort-consuming approach but is the best for end customers from a user experience perspective.
Rationale
Solution 4 appears to be the fastest to implement while still meeting performance expectations.