Performance improvement when ingesting MARC records into SRS, including bulk loads.

Problem statement

Data Import will be used for ingesting very large files for migration or data enrichment purposes. It needs to be faster and better at bulk loading. 

The full set of SRS requirements is captured in this spreadsheet: https://docs.google.com/spreadsheets/d/1maSexmOVNLB4xePsV382ZyNPlr7JSkfw49O2cScKpYg/edit#gid=447705499

Current state

The current solution was not designed to support very large volumes of data being imported into FOLIO. Imported records are processed one by one, potentially in parallel, but each processing step is a separate request, which adds unnecessary overhead to the whole process. This is acceptable for medium-sized data sets (thousands of records) but takes too long when millions of records must be loaded.

Requirements

Of the full set of requirements in the document linked above, the following relate to the problem considered here.

  • Data import performance should remain consistent across a large number of batches
  • Data import "Batch functionality" must be used by library staff on a regular basis
  • Data import should support a batch size of up to 10,000 records
  • Data import performance should be at least 300 records/sec (~34 sec per 10,000-record batch)

Desired state

There are several solutions that could satisfy these requirements.

PostgreSQL COPY (rejected)

COPY is a standard PostgreSQL command that moves data between PostgreSQL tables and standard file-system files.
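
As an illustration of how this could be automated, below is a minimal sketch that drives COPY from Java through the PostgreSQL JDBC driver's CopyManager instead of psql; the connection details and the records table with its columns are assumptions, not the actual SRS schema.

    import java.io.FileReader;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import org.postgresql.copy.CopyManager;
    import org.postgresql.core.BaseConnection;

    public class CopyLoadSketch {
      public static void main(String[] args) throws Exception {
        // Direct DB access with write permissions is required (see the security concern below).
        try (Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://db-host:5432/folio", "folio_admin", "secret")) {
          CopyManager copyManager = new CopyManager((BaseConnection) conn);
          // "records (id, jsonb)" is a hypothetical table; the CSV must be fully prepared beforehand.
          try (FileReader csv = new FileReader("records.csv")) {
            long rows = copyManager.copyIn(
                "COPY records (id, jsonb) FROM STDIN WITH (FORMAT csv)", csv);
            System.out.println("Loaded " + rows + " rows");
          }
        }
      }
    }

The same operation can also be run without any Java code via psql's \copy against a prepared CSV file.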

Pros:

  • The best performance
  • Batch size is not limited
  • Indexes and constraints can be removed and recreated at the end of the loading process to improve performance
  • Import to Inventory and SRS can be run in parallel

Cons:

  • There is no consistency guarantee between Inventory and SRS
  • All data must be fully prepared beforehand in the form of table rows ready to be stored in the DB
  • Prepared files (CSV) must be accessible from the host where psql is run to perform the copy operation
  • Significant manual effort is required
  • An advanced technical skill level is required to run loads
  • Security considerations (direct access to DB is required with write permissions)

Separate load (SRS and Inventory) via dedicated HTTP endpoints (rejected)

Pros:

  • SRS loading and Inventory loading are separate processes and can be run in parallel
  • Good performance
  • Implementation in SRS is simple and straightforward
  • A simple HTTP POST request supplies a batch of records that are stored in the DB with a single bulk insert statement (a sketch follows the cons list below)
  • Indexes and constraints can be removed and recreated at the end of the loading process to improve performance

Cons:

  • There is no consistency guarantee between Inventory and SRS
  • Data sets must be prepared in advance
  • The same InstanceIds and HRIDs must be provided for both SRS and Inventory
  • The average size of a MARC record (JSON) is ~12 KB, so a batch of 10,000 records is ~120 MB, which is quite expensive for a single HTTP request
  • Additional effort is required to replicate the logic already implemented in Data Import for converting MARC records into Inventory entities
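
To illustrate the bulk insert mentioned in the pros above, here is a rough sketch of the multi-row INSERT such a dedicated endpoint could execute on the SRS side; the records table, its columns, and the id generation are assumptions made for the example.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.List;
    import java.util.UUID;

    public class BulkInsertSketch {

      // Persists a whole batch of record JSON documents with a single multi-row INSERT
      // instead of one statement per record. Table and column names are hypothetical.
      static void insertBatch(Connection conn, List<String> recordsJson) throws Exception {
        StringBuilder sql = new StringBuilder("INSERT INTO records (id, jsonb) VALUES ");
        for (int i = 0; i < recordsJson.size(); i++) {
          sql.append(i > 0 ? ", " : "").append("(?, ?::jsonb)");
        }
        try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
          int param = 1;
          for (String json : recordsJson) {
            ps.setObject(param++, UUID.randomUUID()); // generated record id
            ps.setString(param++, json);              // parsed MARC as JSONB
          }
          ps.executeUpdate();
        }
      }
    }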

“All in one” load (selected)

Data Import provides complete functionality to load MARC records and create all necessary Inventory entities.

Pros:

  • Only MARC records are loaded into SRS; separate loading of Instances, Holdings, and Items is not needed
  • Data loaded into SRS and Inventory are consistent
  • Can be used by library staff through the UI app on a regular basis
  • Instances, Holdings, and Items are created automatically using “Data Import” existing functionality with a dedicated Job Profile
  • All batches can be processed in parallel to utilize all available resources
  • Batch is stored in SRS using a bulk insert statement

Cons:

  • Batch size has to be smaller than 10,000 records
  • Additional implementation complexity compared to previous options
  • Complex error handling at all stages of the import process
  • Indexes and constraints can’t be removed or disabled
  • Additional functionality to keep HRIDs in sync is needed


The third option is selected because it provides consistency and can easily be used by library staff without extra effort.

An accepted limitation, at least at the initial stage of implementation, is that data is only inserted; search-and-update operations are not supported. Any existing records (MARC records, Instances, etc.) will simply be overwritten or replaced.

Moving deeper into technical details, two solutions for the third option were identified.

Batch processing via interim files in the file system

Data files in the file system can be used as interim storage for Inventory entities before they are stored in Inventory.

Files can be created over FTP connections; another option is NFS volumes mounted to the servers where Data Import and Inventory run.

The high-level design approach for this case is as follows:

  1. Files with MARC records are created, and a volume containing those files is mounted to the host where the SRM instance resides.
  2. SRM receives an HTTP request with the location of the files.
  3. SRM uses the existing implementation to read the files and create records in SRS; this requires no additional development, since it is already implemented (only the file upload step needs to be skipped).
  4. At the same time, SRM creates Instances, Holdings, and Items in the format required by Inventory and stores them in separate files on the same mounted volume (see the sketch after this list). UUIDs, InstanceIds, and HRIDs are generated and stored both in the SRS DB and in the files.
  5. In case of errors, the affected MARC records should be stored in a separate file or DB for further analysis.
  6. When all MARC records have been processed, the volume must be mounted to the host where Inventory Storage resides.
  7. Inventory Storage receives an HTTP request with the location of the files and copies the records from the files straight into the DB.
  8. Error analysis.
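
As a sketch of step 4, SRM could append the mapped Inventory entities as newline-delimited JSON files on the shared volume; the volume path, file naming, and NDJSON layout are assumptions made for illustration.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;
    import java.util.List;
    import java.util.UUID;

    public class InstanceFileWriterSketch {

      // Writes mapped Instance JSON documents (one per line) to a new file on the shared volume.
      // Inventory Storage would later read such files from the same mounted path (step 7).
      static Path writeInstances(Path sharedVolume, List<String> instanceJson) throws IOException {
        Path out = sharedVolume.resolve("instances-" + UUID.randomUUID() + ".ndjson");
        return Files.write(out, instanceJson, StandardOpenOption.CREATE_NEW);
      }
    }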

The weak point here is the interaction through the file system.

Batch processing with streaming (over HTTP)

  1. SRM receives a batch of MARC records.
  2. SRM creates/updates Job.
  3. SRM maps MARC records to Raw Records, Parsed Records, and Instances or Holdings. It should respect the ids from the batch request DTO or generate them during mapping to establish the relationships between entities. This should be the single and only iteration over the records.
  4. SRM makes a batch create request to Inventory Storage for Instances or Holdings. This is a single request through Okapi and has proven performant.
  5. SRM makes a batch create raw-records request to SRS. SRS persists the raw records in batch, using PostgreSQL multi-row inserts or Copy In. This is a single request through Okapi (see the sketch after this list).
  6. SRM makes a batch create parsed-records request to SRS. SRS persists the parsed records in batch, using PostgreSQL multi-row inserts or Copy In. This is a single request through Okapi.
  7. SRM updates Job.
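
To make the transport layer concrete, below is a minimal sketch of how SRM could send one of these batch requests to SRS through Okapi in a single HTTP call; the endpoint path and payload shape are assumptions, while the X-Okapi-* headers follow the usual FOLIO convention.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class BatchRequestSketch {

      // Sends a whole batch of parsed records to SRS in one request routed through Okapi.
      // The path "/source-storage/batch/parsed-records" and the body layout are assumptions.
      static int postBatch(String okapiUrl, String tenant, String token, String batchJson)
          throws Exception {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create(okapiUrl + "/source-storage/batch/parsed-records"))
            .header("Content-Type", "application/json")
            .header("X-Okapi-Tenant", tenant)
            .header("X-Okapi-Token", token)
            .POST(HttpRequest.BodyPublishers.ofString(batchJson))
            .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
      }
    }

If the ~120 MB batch size noted earlier proves too heavy for a single request body, a streamed or chunked payload could replace the single string used in this sketch.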


This design has been accepted by the SRS-Performance group and is selected as the baseline for implementation.

[Figure: high-level activity diagram representing this solution]

The main question here is whether existing capabilities (Okapi, RMB) can support this approach. To answer it, it is proposed to build a PoC that implements the transport layer and demonstrates that the approach is correct and can be implemented at production scale.