Data Export redesign
| Role | Person | Comments |
| --- | --- | --- |
| Solution Architect | | |
| Java Lead | | |
| UI Lead | Vadym Shchekotilin | |
| Product Owner | | |
Summary
This design document outlines a complete redesign of the Data Export module, aimed at supporting the export of various record types (Instances, Holdings, and Authorities) in different environments: single-tenant, multi-tenant, and Consortia. The primary goal is to handle the export of shared and local instances while ensuring that holdings and item data are correctly attached. The redesigned solution is intended to be scalable, extensible, and reliable, and to satisfy the functional and non-functional requirements listed below.
Requirements
The following FOLIO features provide the source for these requirements:
Functional requirements
- Ability to export shared instance records from a tenant.
- Support for exporting with both default and custom mapping profiles.
- Generation of simplified MARC records for local and shared instances.
- Correct attachment of holdings and item data to instances during export with custom profiles.
- Ability to handle significant volumes of data (> 1,000,000 records) and export them into multiple files.
- Configurable maximum number of records per file.
- Parallel processing of multiple files of the same Data Export job.
- Retrieval of MARC records from local or central SRS storage based on the source.
- Storage of result files in AWS S3 buckets.
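As an illustration of how the configurable per-file limit and the S3 location might be exposed, the sketch below assumes a Spring-style configuration holder; the data-export prefix, the property names, and the ExportProperties class are hypothetical and not part of the existing module API.

```java
import org.springframework.boot.context.properties.ConfigurationProperties;

// Hypothetical configuration holder; prefix and property names are illustrative only.
@ConfigurationProperties(prefix = "data-export")
public class ExportProperties {

  /** Maximum number of records written to a single output file (assumed default). */
  private int maxRecordsPerFile = 100_000;

  /** Name of the S3 bucket that stores input and result files. */
  private String s3Bucket;

  public int getMaxRecordsPerFile() { return maxRecordsPerFile; }

  public void setMaxRecordsPerFile(int maxRecordsPerFile) { this.maxRecordsPerFile = maxRecordsPerFile; }

  public String getS3Bucket() { return s3Bucket; }

  public void setS3Bucket(String s3Bucket) { this.s3Bucket = s3Bucket; }
}
```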
Non-Functional requirements
- Performance: The export process should handle large volumes of data efficiently, without significant degradation of either the export itself or overall system performance. Data retrieval and enrichment operations should be parallelized to maximize throughput (see the sketch after this list).
  - Support up to 22,000,000 records.
  - TODO: agree on the maximum time allowed to perform the export of 1,000,000 records.
  - The time growth should be almost linear when increasing the number of exported records.
- Scalability: The solution should scale horizontally to handle a growing number of tenants and increasing data export volumes.
- Reliability: The export process should ensure that all data is exported accurately and completely, and possible errors should be handled gracefully.
- Security: Proper access controls should be in place to ensure data confidentiality during the export process.
- Maintainability: The solution should be modular, well-documented, and easy to maintain and enhance in the future.
- Extensibility: The solution should be flexible enough to accommodate future enhancements or changes.
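To meet the parallelism and near-linear scaling expectations above, each output file of a job can be processed as an independent task. The sketch below is only an assumption of how such dispatching might look with standard Java concurrency utilities; ParallelSliceDispatcher and the task list are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical dispatcher: every output file slice becomes an independent task, so total
// export time grows roughly linearly with the number of slices per available worker.
public class ParallelSliceDispatcher {

  private final ExecutorService pool =
      Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

  /** Runs every file-slice task in parallel and waits until all of them have finished. */
  public void processAll(List<Runnable> sliceTasks) {
    CompletableFuture.allOf(
        sliceTasks.stream()
            .map(task -> CompletableFuture.runAsync(task, pool))
            .toArray(CompletableFuture[]::new)
    ).join(); // the job can be marked COMPLETED only after every slice has finished
  }
}
```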
Constraints
- The solution should be implemented within the existing FOLIO environment and integrated with mod-source-record-storage.
- Compatibility with the current database schema and REST APIs should be maintained.
- Any necessary migration scripts should be provided to ensure a smooth transition to the new solution.
- No changes to the UI part of the Data Export app are expected.
Activity Diagram
The activity diagram represents the main steps involved in the data export process for shared and local instances in a consortia environment. It visualizes the workflow, conditions, and iteration of steps, providing a comprehensive overview of how the export process works.
The diagram is divided into two main parts: the initial data export setup and the single file processing, which can be run in parallel.
Initial Data Export Setup
- The user uploads a file with IDs or a CQL query. This file is then stored in an S3-like storage.
- The user sends an ExportRequest to initiate the export process.
- The input file is downloaded from the S3-like storage to the local file system.
- If the file contains IDs, they are stored in the job_executions_export_ids table; if it contains a CQL query, the query is executed and the retrieved IDs are stored in the same table.
- The number of output files (N) and the ID range assigned to each file are calculated based on the configured file size limit (a sketch of this calculation follows this list).
- The corresponding records are created in the job_executions_export_files table.
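A minimal sketch of the slicing step, assuming the IDs have already been collected and the per-file limit is known; the SliceCalculator helper and its method names are illustrative, not part of the actual design.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Illustrative calculation of the number of output files (N) and of the ID ranges
// assigned to each file, based on the configured per-file limit.
public class SliceCalculator {

  /** N = ceil(total number of exported IDs / maximum records per file). */
  public static int numberOfFiles(long totalIds, int maxRecordsPerFile) {
    return (int) Math.ceil((double) totalIds / maxRecordsPerFile);
  }

  /** Splits an ordered list of IDs into consecutive slices, one per output file. */
  public static List<List<UUID>> slice(List<UUID> orderedIds, int maxRecordsPerFile) {
    List<List<UUID>> slices = new ArrayList<>();
    for (int from = 0; from < orderedIds.size(); from += maxRecordsPerFile) {
      int to = Math.min(from + maxRecordsPerFile, orderedIds.size());
      slices.add(new ArrayList<>(orderedIds.subList(from, to)));
    }
    return slices;
  }
}
```

For example, 1,000,000 IDs with a limit of 100,000 records per file would yield N = 10 job_executions_export_files records.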
Single File Processing (can be run in parallel)
- The state of the job_executions_export_files record changes from "SCHEDULED" to "ACTIVE."
- A new output file is created in the local file system.
- An SQL query is executed to retrieve the entities for the file (Instances, Holdings, or Authorities); in this particular case, Instances.
- Three lists are created to hold records for data enrichment based on their source types: FOLIO, MARC, and CONSORTIUM-MARC.
- While there are more records in the ResultSet, each record is added to the enrichment list that matches its source type. When a list is full, the relevant processing step is performed: retrieving local/shared MARC records and Holdings/Items, linking Instances or MARC records with Holdings/Items, generating or enriching MARC records, and writing them to the output file (sketched after this list).
- Once the ResultSet is exhausted, the remaining records in each list (FOLIO, MARC, and CONSORTIUM-MARC) are processed.
- The output file is uploaded to the S3-like storage.
- The file location in the S3-like storage is updated in the job_executions_export_files record.
- The state of the job_executions_export_files record is changed from "ACTIVE" to "COMPLETED."
- When all files are ready, the Data Export Job is marked as "COMPLETED."
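The buffering described in the processing loop above could be sketched roughly as follows; the batch size, the SourceTypeBatcher class, and the flush placeholder are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Illustrative per-source-type buffering: records are accumulated into three lists
// (FOLIO, MARC, CONSORTIUM-MARC) and each list is flushed to the output file once full.
public class SourceTypeBatcher {

  public enum SourceType { FOLIO, MARC, CONSORTIUM_MARC }

  private static final int BATCH_SIZE = 1_000; // assumed batch size

  private final Map<SourceType, List<String>> batches = new EnumMap<>(SourceType.class);

  public void add(SourceType source, String recordId) {
    List<String> batch = batches.computeIfAbsent(source, s -> new ArrayList<>());
    batch.add(recordId);
    if (batch.size() >= BATCH_SIZE) {
      flush(source, batch);
      batch.clear();
    }
  }

  /** Called once the ResultSet is exhausted, to drain partially filled lists. */
  public void flushRemaining() {
    batches.forEach(this::flush);
  }

  private void flush(SourceType source, List<String> ids) {
    // Placeholder: retrieve local/shared MARC records and Holdings/Items, link them,
    // generate or enrich MARC records, and append the result to the output file.
  }
}
```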
ER Diagram
The Entity Relationship diagram represents the database schema design for handling data export functionality in a consortia environment.
The diagram consists of two main sections: Existing DB Schema objects and New DB Schema objects. The existing schema objects are currently being used in the system, while the new schema objects will be added to support performance requirements.
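As a rough illustration of one of the new schema objects, a job_executions_export_files row could be mapped along the lines shown below; the column set and the status values are assumptions derived from the activity description above, not the final DDL.

```java
import java.util.UUID;
import jakarta.persistence.Entity;
import jakarta.persistence.EnumType;
import jakarta.persistence.Enumerated;
import jakarta.persistence.Id;
import jakarta.persistence.Table;

// Assumed shape of a job_executions_export_files record: the ID range handled by one
// output file, the file location in the S3-like storage, and the processing state.
@Entity
@Table(name = "job_executions_export_files")
public class JobExecutionExportFilesEntity {

  // States taken from the activity description; an error state would likely be added
  // to satisfy the reliability requirement.
  public enum Status { SCHEDULED, ACTIVE, COMPLETED }

  @Id
  private UUID id;

  /** Parent Data Export job execution. */
  private UUID jobExecutionId;

  /** Inclusive range of entity IDs assigned to this output file. */
  private UUID fromId;
  private UUID toId;

  /** Location of the generated file in the S3-like storage. */
  private String fileLocation;

  @Enumerated(EnumType.STRING)
  private Status status = Status.SCHEDULED;

  // getters and setters omitted for brevity
}
```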
Conceptual class diagram
The class diagram illustrates the main interfaces and classes that will be implemented as part of the solution for handling data export of local and shared instances in a consortia environment. The diagram visualizes the interactions and relationships between these components, giving a clear picture of the overall system design.
Interfaces:
- DataExportService: This interface defines the main methods for handling data export processes.
- InputFileProcessor: This interface defines the methods for uploading a file with IDs or a CQL query and processing the input data.
- InputFileProcessorFactory: This interface defines the methods for creating instances of InputFileProcessor.
- IdSlicer: This interface defines the methods for slicing the IDs and preparing the output files based on the required settings.
- ExportStrategyFactory: This interface represents a factory for creating instances of different ExportStrategy implementations.
- ExportStrategy: This interface defines the methods for exporting instances, holdings, and authorities based on the chosen strategy.
Classes:
- DataExportServiceImpl: This class implements the DataExportService interface, providing the core functionality for handling data export processes.
- InputFileProcessorFactoryImpl: This class implements the InputFileProcessorFactory interface, providing instances of InputFileProcessor implementations (for ID and CQL-based input data).
- InputFileProcessorIdImpl: This class implements the InputFileProcessor interface and will handle the processing of input files containing IDs.
- InputFileProcessorCQLImpl: This class implements the InputFileProcessor interface and will handle the processing of input files containing CQL queries.
- IdSlicerImpl: This class implements the IdSlicer interface, preparing the output files by slicing the set of IDs defined by the source file based on the required settings.
- ExportStrategyFactoryImpl: This class implements the ExportStrategyFactory interface and provides instances of different ExportStrategy implementations.
- AuthorityExportStrategyImpl: This class implements the ExportStrategy interface, providing the logic for exporting authority records.
- HoldingExportStrategyImpl: This class implements the ExportStrategy interface, providing the logic for exporting holdings records.
- InstanceExportStrategyImpl: This class implements the ExportStrategy interface, providing the logic for exporting instance records.
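A condensed Java sketch of the core abstractions listed above; the method names and parameters are assumptions for illustration and will differ from the final API.

```java
import java.util.List;
import java.util.UUID;

// Condensed sketch of the main interfaces; signatures are illustrative only.
public interface ExportStrategy {

  /** Exports one slice of IDs (Instances, Holdings, or Authorities) into a single output file. */
  void exportSlice(UUID jobExecutionId, List<UUID> ids, String outputFileLocation);
}

interface ExportStrategyFactory {

  /** Returns the strategy matching the exported record type (e.g. Instance, Holding, Authority). */
  ExportStrategy getStrategy(String recordType);
}

interface InputFileProcessor {

  /** Reads the uploaded file (IDs or a CQL query) and stores the resolved IDs in job_executions_export_ids. */
  void readInput(UUID jobExecutionId, String inputFileLocation);
}

interface InputFileProcessorFactory {

  /** Returns the ID- or CQL-based processor depending on the uploaded file type. */
  InputFileProcessor getProcessor(boolean cqlInput);
}
```

DataExportServiceImpl would then orchestrate the flow by obtaining the appropriate InputFileProcessor and ExportStrategy from their factories, keeping the per-record-type logic isolated and extensible.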