DRAFT Overview: reliably process large MARC files

This functionality enables libraries to reliably process large MARC 21 files through Data Import. When enabled, the system automatically splits large files into smaller parts of a size that is optimal for processing.

Key functionality

  1. File upload process is more reliable

  2. New functionality to automatically slice large files into smaller, optimal parts

  3. Consolidated parent/child card to provide visibility in the “Running” portion of the UI, plus a log entry for each part of a sliced file

  4. Link to download the MARC file with the records for the sliced portion of the file.

  5. Configurable queue management and prioritization*

  6. Enabling parallel processing


Slicing functionality can be enabled by your hosting provider. Configuration for queue management and prioritization is also managed by your hosting provider, and all settings apply at the cluster level (meaning all institutions on a given cluster will have the same settings).

Background - Problem statement


I am a librarian who is trying to load a file (which can be of any reasonable size) into the system. For it to load successfully, I must manually break up the file into smaller files AND I must manually load each file. The process is painful and time consuming.  

In my previous system, I was able to load a single file of 100,000 records without the system breaking.

📇 Details of improvements

Improvements to file upload process

  1. File upload improvements are mostly seamless to the end user

  2. Behind-the-scenes improvements to uploading a file have made the initial stage of data import more reliable

  3. The previous implementation used local storage on EC2; files are now uploaded directly to S3-like buckets.

Auto slicing of large MARC 21 files

  • Large MARC 21 files are automatically sliced into smaller (optimal) parts

  • Files are split based on a setting for the maximum number of records per part (this configuration is handled by your hosting provider); a sketch of the slicing logic appears at the end of this section.

Example: A max record setting of 1,000 records is in place. A file with 800 records is input, so the file is not split. A file with 1,800 records is split into 2 parts (1,000 records and 800 records). A file with 3,001 records is split into 4 parts (1,000 and 1,000 and 1,000 and 1). 

  • When files are split, the parts (and the records within them) stay in sequential order based on the original file

    • the ordering of the parts is based on the record sequence in the initial file

    • the ordering of records in each part is based on the record sequence in the initial file

    • the file part names are in sequential order based on the initial file

    • the Job IDs increment based on the ordering of the initial file

Example: a file contains records A-Z in alphabetical order. With a max record setting of 5 records, the file is split into 6 parts. The first part contains the first 5 records (A, B, C, D, E), is named Import_1, and has a job ID of 1. The second part contains the next 5 records (F, G, H, I, J), is named Import_2, and has a job ID of 2. Parts 3-6 follow the same pattern.

  • Screenshots of UI when initiating a large import

Message indicating a large file may be split
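
The exact slicing implementation is internal to the Data Import backend, but as a rough illustration of the behavior described above (hypothetical function and field names, not the actual FOLIO code), the chunking rule works roughly like this:

    # Rough sketch of the slicing rule described above.
    # Hypothetical names; not the actual Data Import implementation.
    def slice_records(records, max_records_per_part):
        """Split a list of MARC records into sequential parts.

        Parts (and the records within them) preserve the original order,
        and part names (Import_1, Import_2, ...) follow the original file.
        """
        parts = []
        for start in range(0, len(records), max_records_per_part):
            part_number = len(parts) + 1
            parts.append({
                "name": f"Import_{part_number}.mrc",
                "records": records[start:start + max_records_per_part],
            })
        return parts

    # Example from above: 3,001 records with a max of 1,000 records per part
    parts = slice_records(list(range(3001)), 1000)
    print([len(p["records"]) for p in parts])  # [1000, 1000, 1000, 1]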

Visibility into parent/child jobs in the UI

Flows: processing a file in the Data Import app

  • When a file is split, the parts become child jobs that are submitted under an umbrella parent job.

    • The UI shows a consolidated card in the ‘Running’ area with an overall progress bar and percentage

    • Count of job parts remaining (e.g., 38 of 50) and count of job parts processed (e.g., 12 of 50)

    • Count of job parts by status, e.g., Completed: 10, Completed with errors: 0, Failed: 2 (a sketch of this roll-up follows this list)

  • When files are split, the part number is appended to the file name in the log (Import_1.mrc, Import_2.mrc, etc.)

  • A new column was added to the logs so you can easily see the job part for an entry.

    • Sliced files will display the part number out of the total, e.g., 12 of 50

    • Files that are not sliced will show 1 of 1. Note: this includes single record imports from inventory, EDIFACT files, or MARC files smaller than the configured max record value.
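
For illustration only, the counts on the consolidated parent card could be derived from the statuses of the individual job parts along these lines (hypothetical structure; not the actual FOLIO implementation):

    # Illustration only: rolling up child-part statuses into the numbers
    # shown on the consolidated parent card (hypothetical, not actual FOLIO code).
    from collections import Counter

    def summarize_parts(part_statuses, total_parts):
        """Summarize finished part statuses for the parent job's card."""
        counts = Counter(part_statuses)
        processed = sum(counts.values())
        return {
            "processed": f"{processed} of {total_parts}",
            "remaining": f"{total_parts - processed} of {total_parts}",
            "by_status": dict(counts),
            "percent_complete": round(100 * processed / total_parts),
        }

    # Example from above: 12 of 50 parts processed so far
    print(summarize_parts(["Completed"] * 10 + ["Failed"] * 2, total_parts=50))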

UI updates for auto-split files

Link to download the MARC file with the records for the sliced portion of the file

Links to download the source file are only available when the file-splitting feature is enabled.

  • When troubleshooting, it helps to have access to the file used during a Data Import job without asking the original user for a copy. When the system automatically splits a file, providing a way to download the constituent parts is necessary for troubleshooting, since the user who initiated the import won't have those sliced files.

  • For auto-split files, the log entry for each slice/part will contain a link to download the split MARC file containing all records in that part.

For example: a user uploads a 10,000-record MARC file and it's auto-sliced into 10 files of 1,000 records each.

  • This would result in 10 log entries, one for each part containing 1,000 records

  • The file linked in the log entry for part 1 would contain records 1-1,000

  • For all other files*, log entries contain a link to download the entire source file. *Links to download the source file will be available for small MARC files that aren’t split, or other valid file types, such as EDIFACT.

Cancelling an auto-split import

  • Cancelling a file that’s being processed:

    • Cancelling an import does not revert records that have already been created or updated (this is true for all imports from the Data Import app, regardless of whether they've been split).

    • When an auto-split file is cancelled, all child job parts will appear in the logs with the status “Stopped by user.”

    • Job parts that are actively being processed will end up in a mixed state where some records have been processed, and others have not.

    • Job parts that have not yet been processed can be stopped.

View of an auto-split file being cancelled

Flows for cancelling a job (previous flow vs auto-split file)

Configurable queue management and prioritization

  • All of the new configuration settings are handled by your hosting provider. These settings can be adjusted to be optimal for your institution and environment.

  • As an end user, you won't need to worry about how these details work; however, your overall DI experience will benefit.

  • All settings apply at the cluster level, meaning all institutions in a multi-tenant setting share the same settings, including:

    • Enabling/disabling the feature

    • Determining record count per split file

    • Fine-tuning of how jobs are prioritized

  • For more technical details: Detailed Release notes for Data Import Splitting

  • Small jobs will no longer get stuck waiting behind large jobs; small jobs are allowed to "jump" ahead in the queue.

  • No more waiting for the entire job to finish to see results in the log: job parts appear in the log as soon as they complete processing.

  • Previously, DI processed jobs strictly using FIFO (first in, first out). This was not, however, transparent to the end user. Libraries in a multi-tenant cluster sometimes had DI jobs affected by DI jobs from a different tenant in the same cluster: jobs were held up by larger jobs, but libraries had no visibility into why.

  • In a multi-tenant cluster, there will be a more even distribution of jobs from different tenants

Example: Library A starts a large file at 9am, Library B starts a large file at 9:01am, and Library C starts a small file at 9:05am. As soon as the large file from Library B is submitted, the queue will automatically distribute processing so job parts (slices of the larger files) from Library A and Library B are both processed. As soon as Library C submits the small file, it jumps to the top of the queue to be processed next. (A rough sketch of this kind of prioritization follows.)
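
The actual scoring and prioritization rules are tuned by your hosting provider and are not spelled out here. Purely as an illustration of the idea (hypothetical names and weights, not the actual FOLIO algorithm), a queue could rank waiting job parts by a score that favors small jobs, spreads work across tenants, and still ages older submissions forward:

    # Purely illustrative sketch of size- and fairness-aware queueing.
    # Hypothetical names and weights; not the actual FOLIO scoring algorithm.
    from dataclasses import dataclass

    @dataclass
    class JobPart:
        tenant: str
        job_size: int        # total records in the parent job
        submitted_at: float  # submission time in seconds

    def pick_next(waiting, in_flight_by_tenant, now):
        """Pick the next part to run: smaller jobs and less-busy tenants first,
        with a time term so large jobs are never starved indefinitely."""
        def score(part):
            return (
                part.job_size * 1.0                                # smaller jobs first
                + in_flight_by_tenant.get(part.tenant, 0) * 500.0  # spread across tenants
                - (now - part.submitted_at) * 10.0                 # age old jobs forward
            )
        return min(waiting, key=score)

    # Example: Library C's small file jumps ahead of Library A's large one.
    waiting = [
        JobPart("LibraryA", job_size=50_000, submitted_at=0),
        JobPart("LibraryC", job_size=200, submitted_at=300),
    ]
    print(pick_next(waiting, {"LibraryA": 3}, now=310).tenant)  # LibraryC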

Examples of different hosting configurations

Enabling parallel processing

Because all jobs (including those submitted from the UI) will go through S3 instead of local storage, we can now run several instances of DI in parallel as desired.
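
As a purely illustrative sketch of why shared storage enables this (hypothetical structure, not the actual Data Import architecture): once file parts live in shared S3-like storage rather than one node's local disk, any number of workers can pick up parts concurrently.

    # Illustration only: several workers processing parts held in shared storage.
    # Hypothetical structure; not the actual Data Import architecture.
    from concurrent.futures import ThreadPoolExecutor

    def process_part(part_key):
        # A real worker would fetch the part from shared storage (e.g., via an
        # S3 client), import its records, and write a log entry for that part.
        return f"processed {part_key}"

    part_keys = [f"Import_{n}.mrc" for n in range(1, 11)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        for result in pool.map(process_part, part_keys):
            print(result)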

📈 Performance

TODO: Include additional info on performance testing to date, along with file size recommendations

FAQs: (TODO, still a work in progress)

  • Q: Must I do anything to my data import settings and profiles to use this functionality? A: Nope! Your existing data import settings and profiles will continue to work with this functionality.

  • Q: What happens if you enable slicing then disable? A: The biggest change you’ll notice is that you will no longer have the ability to download the source file in the UI. Previously processed jobs will continue to display in the logs.

  • Q: What about single record import from inventory? 

  • Q: Why is my file being split into parts with X records? 

  • Q: I want my file to be split into…? 

  • Q: Why does the link to the initial file have X error? (Retention policy in AWS) 

  • Q: What about EDIFACT files? Are they also sliced into smaller parts? Do EDIFACT jobs also take priority over large MARC jobs? A: EDIFACT jobs (which are usually much smaller than MARC jobs) will not be sliced into smaller parts. EDIFACT jobs are scored and ranked the same way as MARC jobs, so they will take priority due to their smaller record count.

  • Q: Can you have queue management and parallel processing enabled but not data import record slicing? And vice versa? A: In order to use any of this functionality, your hosting provider needs to turn on the main feature. Your hosting provider can potentially adjust your settings so the record count per slice is larger than the largest file you plan to upload, meaning your files would be “sliced” into a single part. Not slicing large files will likely have a significant performance impact. Similarly, queue management can be configured in a variety of ways, including prioritizing files to mimic previous versions of Data Import (first in, first out).

  • Q: What about a file submitted via API?