MARC Migration Documentation
Introduction
MARC Migration involves the process of updating and transferring MARC records and related FOLIO records within the FOLIO system, particularly when mapping rules change. This documentation is intended for developers, system administrators, and library personnel involved in the management and migration of MARC records. It provides comprehensive guidelines on utilizing the MARC Migration API, handling known limitations, optimizing performance, and troubleshooting common issues.
Use Cases and Scenarios
Scenario: MARC Records Remapping
When mapping rules for MARC bibliographic or authority records are updated, a comprehensive process is initiated to ensure that all MARC source records are accurately remapped to their corresponding FOLIO records (either instances or authorities). This remapping process ensures that the records in FOLIO reflect the latest mapping rules without altering the original MARC source records. The following steps outline the procedure for remapping MARC records:
Register the MARC Migration Operation: Initiate the migration by registering a new operation via the MARC Migration API. This operation will encompass all necessary steps to process the MARC records according to the updated mapping rules.
Track the Operation: Continuously monitor the progress of the migration operation through the API. Track the operation until the data mapping phase is complete, ensuring that all MARC records have been successfully mapped to the new format as per the updated rules.
Initiate the Data Saving Phase: Once the mapping phase is verified to be complete and successful, initiate the data saving phase through the API. This step will update the corresponding FOLIO records with the new mapping data derived from the MARC sources.
Continuous Monitoring: Keep monitoring the operation until it is fully complete. This includes verifying that all FOLIO records have been updated and that the system's integrity and consistency are maintained post-migration.
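The four steps above can be sketched as a small driver. This is a minimal illustration, not the module's own client: `register`, `get_status`, and `start_saving` are hypothetical callables that the caller supplies to wrap the actual API requests, and the 30-second polling interval is an arbitrary assumption. The status strings are those returned by the tracking endpoint described in this document.

```python
import time
from typing import Callable

# Status values as listed for the tracking endpoint in this document.
MAPPING_DONE = "data_mapping_completed"
MAPPING_FAILED = "data_mapping_failed"
SAVING_DONE = "data_saving_completed"
SAVING_FAILED = "data_saving_failed"


def wait_for(get_status: Callable[[], str], done: str, failed: str,
             poll_seconds: float = 30.0,
             sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll until the operation reports `done`; raise if it reports `failed`."""
    while True:
        status = get_status()
        if status == done:
            return
        if status == failed:
            raise RuntimeError(f"operation ended with status {status!r}")
        sleep(poll_seconds)


def run_remapping(register: Callable[[], str],
                  get_status: Callable[[str], str],
                  start_saving: Callable[[str], None],
                  poll_seconds: float = 30.0,
                  sleep: Callable[[float], None] = time.sleep) -> str:
    """Run the remapping steps in order and return the operation ID.

    The three callables are hypothetical wrappers around the API requests.
    """
    operation_id = register()                       # step 1: register
    wait_for(lambda: get_status(operation_id),      # step 2: track mapping
             MAPPING_DONE, MAPPING_FAILED, poll_seconds, sleep)
    start_saving(operation_id)                      # step 3: start saving
    wait_for(lambda: get_status(operation_id),      # step 4: track saving
             SAVING_DONE, SAVING_FAILED, poll_seconds, sleep)
    return operation_id
```

Passing the API calls in as functions keeps the ordering logic separate from the HTTP details and lets it be exercised with stand-in functions before it is pointed at a real system.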
API Usage Guide
Making API Requests
To interact with the MARC Migration API, users must send HTTP requests to the appropriate endpoints. Each request must include proper authentication and adhere to the specified request format.
Authentication
Authentication is handled through the X-Okapi-Token header. Include your token in the header of each API request:
X-Okapi-Token: <your_access_token>
Ensure that the authenticated user has the required permissions to perform operations:
marc-migrations.operations.item.post: for initiating migration operations.
marc-migrations.operations.item.put: for initiating the data saving phase of migration operations.
marc-migrations.operations.item.get: for retrieving the status of migration operations.
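As an illustration of attaching the token header, the following sketch builds (but does not send) a request with Python's standard library. The base URL `https://folio.example.org` is a hypothetical placeholder, `build_request` is a helper introduced here for illustration only, and the example payload is the registration body shown in the sample calls of this guide.

```python
import json
import urllib.request
from typing import Optional


def build_request(base_url: str, token: str, method: str, path: str,
                  body: Optional[dict] = None) -> urllib.request.Request:
    """Build an authenticated request; the caller decides when to send it."""
    headers = {"X-Okapi-Token": token}
    data = None
    if body is not None:
        # JSON bodies are sent as UTF-8 bytes with a matching content type.
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=data, headers=headers, method=method)


# Hypothetical gateway URL and placeholder token.
request = build_request(
    "https://folio.example.org",
    "<your_access_token>",
    "POST",
    "/marc-migrations",
    {"entityType": "authority", "operationType": "remapping"},
)
# To send it: urllib.request.urlopen(request)
```

The same helper can build the tracking request by passing "GET" and the operation path with no body.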
Sample API Calls
Registering a Migration Operation:
POST /marc-migrations
Content-Type: application/json
X-Okapi-Token: <your_access_token>
{
  "entityType": "authority",
  "operationType": "remapping"
}
This call registers a new MARC Migration operation. The operationId needed for subsequent calls is provided in the response to this request. The entityType can be either "authority" or "instance", depending on the records being migrated.
Tracking Migration Operation:
GET /marc-migrations/{operationId}
X-Okapi-Token: <your_access_token>
Replace {operationId} with the ID received from the response of the POST call. This endpoint allows you to track the progress and status of an ongoing MARC Migration operation. Possible statuses include:
"new": The operation has been initialized but not yet started.
"data_mapping": The operation is currently mapping data.
"data_mapping_completed": Data mapping is complete.
"data_mapping_failed": Data mapping has failed.
"data_saving": Data is currently being saved.
"data_saving_completed": Data saving is complete.
"data_saving_failed": Data saving has failed.
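One way to act on these statuses is a lookup table from status to the operator's next step. The sketch below is an interpretation of the workflow described in this guide, not part of the API: the `Action` names are introduced here for illustration, and the function assumes the caller has already extracted the status string from the response.

```python
from enum import Enum


class Action(Enum):
    WAIT = "wait and poll again"
    START_SAVING = "initiate the data saving phase"
    INVESTIGATE = "inspect the failure files"
    DONE = "migration complete"


# Each documented status mapped to the operator's next step.
NEXT_ACTION = {
    "new": Action.WAIT,
    "data_mapping": Action.WAIT,
    "data_mapping_completed": Action.START_SAVING,
    "data_mapping_failed": Action.INVESTIGATE,
    "data_saving": Action.WAIT,
    "data_saving_completed": Action.DONE,
    "data_saving_failed": Action.INVESTIGATE,
}


def next_action(status: str) -> Action:
    """Return the next step for a status; reject values not listed above."""
    try:
        return NEXT_ACTION[status]
    except KeyError:
        raise ValueError(f"unknown operation status: {status!r}") from None
```

Rejecting unknown values explicitly means a status added in a later release surfaces as an error instead of being silently treated as "keep waiting".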
Initiating Data Saving Phase:
Use this call to initiate the data saving phase for a MARC Migration operation once the data mapping phase is complete. Replace {operationId} with the valid ID from your registered operation. The publishEvents field defines whether domain events should be published. On large datasets, it is recommended not to publish domain events and instead to run a re-index to index the changes introduced during migration, as this is more performant.
Known Limitations
The MARC Migration system currently has several limitations that users should be aware of when planning and executing migrations:
Migration Timing
Maintenance Hours: Migration should be executed during maintenance hours when there are no interactions with instances or authority records. This helps to avoid conflicts and ensures system stability during the migration process.
No Horizontal Scalability
The system is designed to handle migration jobs sequentially with only one application instance. This means that:
Sequential Job Processing: Only one migration job can be processed at a time. If multiple migrations are initiated, subsequent jobs will be queued until the current job is completed.
Single Instance: The system does not support running multiple instances of the MARC Migration application simultaneously, which limits the ability to scale horizontally to handle larger loads or concurrent migrations.
No Failover Mechanism
In the current setup, there is no failover mechanism. This has the following implications:
Risk of Data Loss: If the application crashes during a migration, any files being processed at that time will be lost and the operation must be restarted.
Job Interruption: A crash or failure in the application can cause the current migration job to become stuck.
Performance-Based Configurations
Based on extensive performance tests involving over 16 million records during the Ramsons release, detailed in PERF-929, we have identified optimal configurations and strategies to enhance the performance of MARC migrations.
The default configuration for MARC migration took approximately 9.5 hours to complete, with 7 hours dedicated to data mapping and 2.5 hours to data saving.
The most efficient test used a CHUNK_FETCH_IDS_COUNT of 12,000 and a RECORDS_CHUNK_SIZE of 4,000, reducing the total migration duration to about 4 hours (3 hours 35 minutes for data mapping and 27 minutes for data saving).
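The improvement can be quantified from the durations quoted above. The following is a small worked calculation using only those figures; the helper `minutes` is introduced here for readability.

```python
def minutes(hours: int = 0, mins: int = 0) -> int:
    """Convert a duration given in hours and minutes to whole minutes."""
    return hours * 60 + mins


# Default configuration: 7 h mapping + 2.5 h saving = 9.5 h.
default_mapping = minutes(7)
default_saving = minutes(2, 30)
default_total = default_mapping + default_saving   # 570 minutes

# Tuned configuration: 3 h 35 min mapping + 27 min saving.
tuned_mapping = minutes(3, 35)
tuned_saving = minutes(0, 27)
tuned_total = tuned_mapping + tuned_saving         # 242 minutes

speedup = default_total / tuned_total
reduction = 1 - tuned_total / default_total

print(f"tuned total: {tuned_total // 60} h {tuned_total % 60} min")  # 4 h 2 min
print(f"overall speedup: {speedup:.2f}x, time saved: {reduction:.0%}")  # 2.36x, 58%
print(f"mapping speedup: {default_mapping / tuned_mapping:.2f}x")  # 1.95x
print(f"saving speedup: {default_saving / tuned_saving:.2f}x")     # 5.56x
```

By these figures the data saving phase benefits most from the larger chunk sizes, while data mapping roughly halves.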
Recommendations
Resource Allocation and Usage: This module is a utility designed for administrators to perform remapping tasks; it is not a standard module meant for continuous operation. Its primary function is to execute remapping during updates or upon request. Therefore, there is no need to limit its resource consumption. Instead, it is recommended to allocate the maximum amount of resources that the module can effectively utilize. Once the remapping process is complete and the module is no longer needed, it can be safely turned off.
Additional File Space: The path to the folder where files will be stored is configured through the environment variable LOCAL_FILE_STORAGE_PATH. Administrators should point it to a location with sufficient free space, because enough space must be reserved to hold the files for the duration of the migration process. The space used by the migration of Authority and Instance records is calculated as follows:
NUMBER_OF_AUTHORITY_RECORDS * 2200 bytes
NUMBER_OF_INSTANCE_RECORDS * 5000 bytes
where NUMBER_OF_AUTHORITY_RECORDS and NUMBER_OF_INSTANCE_RECORDS are the total numbers of authority and instance records, and 2200 and 5000 bytes are the average sizes of authority and instance records respectively. Since only one migration process can run at a time, the space to be reserved for the module is the maximum of the two values above:
max(NUMBER_OF_AUTHORITY_RECORDS * 2200, NUMBER_OF_INSTANCE_RECORDS * 5000) bytes
Optimal Chunk Sizes: Use CHUNK_FETCH_IDS_COUNT=12000 and RECORDS_CHUNK_SIZE=4000 to decrease migration time. Note that this configuration may cause mod-entities-links to use an additional 25% CPU.
Performance Optimization and Dependencies: Remapping operations are parallelized within a single instance of the module. Removing CPU limits and allocating 8 GB of RAM can significantly improve its performance. Since the module writes data through direct calls to mod-inventory-storage, it is important to increase the number of mod-inventory-storage and mod-entities-links instances to prevent bottlenecks. The optimal number of module instances depends on the resources allocated to mod-marc-migrations and should be determined through performance testing.
Data Handling: While data mapping runs, files are stored directly in the working mod-marc-migrations container and later moved to an S3 bucket. If no S3 bucket is provided, data mapping will fail. If the container fails during data mapping, all files will be lost, and the mapping process will hang indefinitely.
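The file space recommendation can be turned into a quick estimate. This sketch reads the description above as record count multiplied by average record size, taking the larger of the two results because only one migration runs at a time. The record counts in the example are hypothetical.

```python
# Average record sizes in bytes, as stated in the recommendation above.
AVG_AUTHORITY_RECORD_BYTES = 2200
AVG_INSTANCE_RECORD_BYTES = 5000


def required_storage_bytes(number_of_authority_records: int,
                           number_of_instance_records: int) -> int:
    """Estimate the space to reserve under LOCAL_FILE_STORAGE_PATH."""
    authority_bytes = number_of_authority_records * AVG_AUTHORITY_RECORD_BYTES
    instance_bytes = number_of_instance_records * AVG_INSTANCE_RECORD_BYTES
    # Only one migration runs at a time, so reserve for the larger one.
    return max(authority_bytes, instance_bytes)


# Hypothetical tenant: 2 million authorities, 10 million instances.
needed = required_storage_bytes(2_000_000, 10_000_000)
print(f"{needed / 1_000_000_000:.1f} GB")  # 50.0 GB
```

Because the sizes are averages, it is prudent to treat the result as a lower bound and leave some headroom.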
Troubleshooting and FAQs
Handling Migration Failures
If any step of the migration process fails, the system is designed to save files detailing the causes of failure to an S3 bucket. To identify and analyze these failures, follow these steps:
Locate Failure Files: Check the operation_chunk_step table in the database. The entity_error_chunk_file_name column contains the name of the file that holds the failed entities, while the error_chunk_file_name column contains the name of the file with the causes of the failures.
Analyze Failure Files: Retrieve and examine the specified files from the S3 bucket to understand what went wrong during the migration process.
Fix Issues: Based on the analysis, make the necessary corrections to the affected entities. This may involve data cleanup, format adjustments, or other specific changes needed to address the issues identified.
Restart Migration: Once the necessary fixes are made, restart the migration operation. Ensure that the system is in a stable state and that there are no interactions with instances or authority records during this time to prevent further complications.
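The first step above amounts to a simple query. The sketch below shows its shape against an in-memory SQLite stand-in so that it runs anywhere. The real table lives in the module's own database and has more columns, the file names are invented, and the assumption that chunks without failures hold NULL in these columns should be verified against the actual schema.

```python
import sqlite3

# Table and column names are those given in this guide.
FAILED_CHUNKS_QUERY = """
SELECT entity_error_chunk_file_name, error_chunk_file_name
FROM operation_chunk_step
WHERE error_chunk_file_name IS NOT NULL
"""


def failed_chunk_files(connection: sqlite3.Connection) -> list:
    """Return (failed entities file, failure causes file) pairs."""
    return connection.execute(FAILED_CHUNKS_QUERY).fetchall()


# Stand-in table with only the two columns named above (plus an id).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE operation_chunk_step ("
    "id INTEGER PRIMARY KEY, "
    "entity_error_chunk_file_name TEXT, "
    "error_chunk_file_name TEXT)"
)
conn.executemany(
    "INSERT INTO operation_chunk_step "
    "(entity_error_chunk_file_name, error_chunk_file_name) VALUES (?, ?)",
    [(None, None), ("entities-1", "errors-1")],  # hypothetical file names
)

rows = failed_chunk_files(conn)
print(rows)  # [('entities-1', 'errors-1')]
```

The returned names identify the objects to fetch from the S3 bucket in the analysis step.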
FAQs
Q: What should I do if a migration operation fails?
A: Follow the steps outlined in the 'Handling Migration Failures' section. Locate the failure files, analyze them to determine the cause of the failure, fix the identified issues, and then restart the migration operation.
Glossary
Migration Operation: The process of migrating database records, which may involve changes to the data format, restructuring, and cleaning. In the context of MARC Migration, the only currently supported operation type is "remapping."
Data Mapping: The phase in the migration process where source data is transformed according to the current mapping rules.
Data Saving: The phase in the migration process where the transformed data is saved. This step finalizes the migration by persisting the new data format in the system's storage.
Remapping: The process of reassigning data from one schema or structure to another, involving transformation or adaptation to new mapping rules.
Mapping Rules: Mapping rules determine how MARC data is converted to FOLIO's record formats.
Sequential Processing: Handling tasks one at a time in a specific order, without parallel or simultaneous execution. This is relevant in environments where tasks are dependent on the completion of previous tasks.
Chunk Size: In data processing, the number of records in a batch that is handled together in one operation. This term is significant in optimizing performance during data-intensive operations like migrations.
S3 (or compatible storage): Cloud object storage for large amounts of data. In the context of MARC Migration, it is used for storing data files during the migration process.
References and Resources
Change Log
2024-10-26: Initial release of the documentation.