Tech Design: solution for migrating authority records
Summary
Upgrading a large set of MARC authority records after mapping rule changes, and migrating authority records from other systems to FOLIO, are both difficult today. Updating authority records to apply mapping rule changes is a time-consuming process that takes multiple days, and Data Import (DI) does not scale well when libraries are also using it for other work. Failure to update these records leads to data inconsistency. With manual authority control/linking introduced in Orchid, this issue may become more problematic for libraries, and it will become more challenging still once automated linking is introduced.
The proposed solution addresses issues related to remapping and initial record migration in terms of data volumes and expected performance.
Requirements
Functional requirements
https://folio-org.atlassian.net/browse/ARCH-38
https://folio-org.atlassian.net/browse/ARCH-46
Non-functional requirements
Following https://docs.google.com/spreadsheets/d/10GiFrfZee8aY8PcE0JJxf-lWtMkddFWnOYo_tiKYXrs/edit#gid=0
Maximum number of records in the source file: 12 000 000+
According to a preliminary analysis of the Library of Congress (LoC) data, the number of records could exceed 20 000 000
Processing time should scale almost linearly with the number of records
Time spent on the migration of 10 000 MARC records: TBD
Time spent on the remapping of 10 000 records: TBD
Assumptions
The solution will be used by administrative personnel and/or the hosting team, so it will not have UI forms (at least in the early stages) and will provide a RESTful API only.
The solution is expected to use all allocated resources to complete the import operation in the shortest possible time. However, there should be a configuration option that defines how many resources are allocated or how many migration processes may run concurrently.
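One way to expose such a limit is a fixed-size thread pool whose size comes from configuration. The sketch below is illustrative only: the class name, the maxWorkers parameter, and the idea of splitting the input into chunks are assumptions, not part of this design.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical sketch: processes chunks of records with a configurable number of workers. */
final class BoundedMigrationRunner {

    private BoundedMigrationRunner() {}

    /**
     * Applies chunkProcessor to every chunk using at most maxWorkers threads
     * and returns the sum of the per-chunk results (e.g. processed record counts).
     */
    static <T> long processAll(List<List<T>> chunks, int maxWorkers,
                               Function<List<T>, Integer> chunkProcessor) {
        if (maxWorkers < 1) {
            throw new IllegalArgumentException("maxWorkers must be >= 1");
        }
        ExecutorService pool = Executors.newFixedThreadPool(maxWorkers);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<T> chunk : chunks) {
                results.add(pool.submit(() -> chunkProcessor.apply(chunk)));
            }
            long total = 0;
            for (Future<Integer> result : results) {
                total += result.get(); // blocks until the chunk is done
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Migration was interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A chunk failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

With maxWorkers set to the number of available cores the run uses everything it was given, while a lower value leaves capacity for other workloads.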
Limitations
The solution will use the default mapping profile for Authority records.
The solution will not support complex, Data Import-like matching functionality for finding records that should be updated. In the case of remapping, matching will be based solely on primary key values (IDs).
Running more than one import/remapping operation simultaneously is prohibited.
For the sake of simplicity and fault tolerance, all file operations will be done outside of the solution.
Side effects
The remapping operation will trigger authority reindexing automatically, so a separate authority reindexing run will not be needed.
Implementation
The solution will be implemented as a separate FOLIO module.
The main idea is that large volumes of data can be processed in two separate steps.
The first step is to read data from the source file or database, apply the new mapping rules (using the existing data-import-processing-core library), and store the resulting entities in a file.
The next step is to deliver the file with entities to the appropriate back-end module and store them in the database one by one using the existing update implementation.
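The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, the mapping function is a stand-in for the call into data-import-processing-core (whose API is not shown here), and each entity is assumed to serialize to a single line of text.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical two-step pipeline: map records into a file, then load the file entity by entity. */
final class TwoStepMigrationSketch {

    private TwoStepMigrationSketch() {}

    /** Step 1: applies the mapping to each source record and writes one entity per line. */
    static long mapToFile(Iterator<String> sourceRecords, Function<String, String> mapping,
                          Path entityFile) {
        long written = 0;
        try (BufferedWriter writer = Files.newBufferedWriter(entityFile, StandardCharsets.UTF_8)) {
            while (sourceRecords.hasNext()) {
                // Assumption: the mapped entity contains no raw line breaks.
                writer.write(mapping.apply(sourceRecords.next()));
                writer.newLine();
                written++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }

    /** Step 2: streams the entity file and hands each entity to the storage callback. */
    static long loadFromFile(Path entityFile, Consumer<String> storeEntity) {
        long stored = 0;
        try (BufferedReader reader = Files.newBufferedReader(entityFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                storeEntity.accept(line); // stand-in for the existing update implementation
                stored++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return stored;
    }
}
```

Because both steps work through an iterator and a line-by-line reader, neither needs to hold the full record set in memory, and the intermediate file lets the second step be retried without repeating the mapping.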
Notes
To support such volumes of data, the Data Access Layer should be implemented using plain JDBC. Spring Data is not suitable here because of its high memory consumption at these volumes and its lower performance compared with plain JDBC.
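As an illustration of the plain JDBC approach, the sketch below reads rows through a forward-only, read-only result set with a fetch size hint, so rows are handled one at a time instead of being collected into a list. The class and method names are hypothetical, and the query is supplied by the caller because the actual schema is not defined in this document.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.function.Consumer;

/** Hypothetical plain-JDBC reader that streams rows instead of loading them all into memory. */
final class AuthorityRecordStreamer {

    private AuthorityRecordStreamer() {}

    /**
     * Runs the query and passes the first column of every row to the handler.
     * The fetch size is a hint asking the driver to pull rows in batches.
     * Returns the number of rows handled.
     */
    static long streamFirstColumn(Connection connection, String sql, int fetchSize,
                                  Consumer<String> rowHandler) {
        try (PreparedStatement statement = connection.prepareStatement(
                sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            statement.setFetchSize(fetchSize);
            try (ResultSet rows = statement.executeQuery()) {
                long count = 0;
                while (rows.next()) {
                    rowHandler.accept(rows.getString(1));
                    count++;
                }
                return count;
            }
        } catch (SQLException e) {
            throw new IllegalStateException("Failed to stream records", e);
        }
    }
}
```

How the fetch size hint is honored depends on the driver. With the PostgreSQL JDBC driver, for example, cursor-based fetching only takes effect when autocommit is disabled on the connection, so the caller would need to configure that before invoking the method.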
API
This module will provide the following RESTful endpoints:
start import/remapping operations
check import/remapping operations status
get operation errors
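The behavior behind these three endpoints could be modeled roughly as shown below. This is a sketch rather than the module's actual API: all type, method, and status names are hypothetical, state is kept in memory only for illustration, and the check in start reflects the limitation that only one import/remapping operation may run at a time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.UUID;

/** Hypothetical in-memory model of the endpoints; a real module would persist this state. */
final class MigrationOperationsSketch {

    enum Type { IMPORT, REMAPPING }

    enum Status { IN_PROGRESS, COMPLETED, FAILED }

    private static final class Operation {
        final String id = UUID.randomUUID().toString();
        final Type type;
        final List<String> errors = new ArrayList<>();
        Status status = Status.IN_PROGRESS;

        Operation(Type type) {
            this.type = type;
        }
    }

    private final Map<String, Operation> operations = new LinkedHashMap<>();

    /** "Start operation": rejected while another operation is still in progress. */
    synchronized String start(Type type) {
        for (Operation existing : operations.values()) {
            if (existing.status == Status.IN_PROGRESS) {
                throw new IllegalStateException("Operation " + existing.id + " is still in progress");
            }
        }
        Operation operation = new Operation(type);
        operations.put(operation.id, operation);
        return operation.id;
    }

    /** "Check operation status". */
    synchronized Status status(String id) {
        return find(id).status;
    }

    /** "Get operation errors": returns a read-only copy of the recorded errors. */
    synchronized List<String> errors(String id) {
        return Collections.unmodifiableList(new ArrayList<>(find(id).errors));
    }

    /** Called by the processing code (not an endpoint) to record a per-record failure. */
    synchronized void recordError(String id, String message) {
        find(id).errors.add(message);
    }

    /** Called by the processing code (not an endpoint) when the operation ends. */
    synchronized void finish(String id, Status finalStatus) {
        find(id).status = finalStatus;
    }

    private Operation find(String id) {
        Operation operation = operations.get(id);
        if (operation == null) {
            throw new NoSuchElementException("Unknown operation: " + id);
        }
        return operation;
    }
}
```

Returning an operation identifier from start lets the caller poll the status and fetch errors later, which suits operations that are expected to run for a long time.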
High-level operations overview
Detailed Sequence diagrams representing the Import and Remapping operations
ER diagram for the data model.
High-level component diagram