Solution to slice large data import files into chunks

Summary

As our Data Import application cannot process very large files, one possible solution is to slice a large data import source file into smaller chunks (files) and run a separate Data Import Job for every chunk file.

MODSOURCE-630

To solve the problem of processing very large files in Data Import, the proposal is to implement two minor features in the Data Import module rather than create a separate utility tool.

Splitting files with a separate tool would not bring the expected reliability to the Data Import process, because the file upload step would still be part of the process.


The first idea is to let the Data Import app download the source file from the S3-like storage instead of acting as the upload server itself. The initial stage of Data Import will then look as follows:

  1. The user starts the Data Import job by uploading a source file to the S3-like storage via the Data Import UI and selecting the Job Profile to be used for Data Import.
  2. The Data Import application downloads the file from the S3-like storage to the local file system (see the sketch after this list).
  3. The Data Import application continues the usual source file processing once the file has been downloaded to the local file system of the mod-data-import module.
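
A minimal sketch of step 2, assuming the AWS SDK v2 S3 client is used (it also works against S3-compatible storage such as MinIO); the bucket name, object key, and local path are illustrative:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;

    class SourceFileDownloader {

        private final S3Client s3 = S3Client.create();   // honours the usual AWS/S3 environment configuration

        /** Downloads the uploaded source file to the local file system of mod-data-import. */
        Path download(String bucket, String key, Path localDir) throws IOException {
            Files.createDirectories(localDir);
            Path localCopy = localDir.resolve(Path.of(key).getFileName());
            s3.getObject(
                    GetObjectRequest.builder()
                            .bucket(bucket)   // e.g. a per-tenant bucket (illustrative)
                            .key(key)         // e.g. "uploads/source-file.mrc" (illustrative)
                            .build(),
                    localCopy);               // writes the object straight to localCopy
            return localCopy;
        }
    }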


This change will make the initial stage of the Data Import application more reliable and prevent a potential denial of service (DoS) attack in which a threat actor can fill up disk space. In addition, the risk of uncontrolled resource consumption in the case of multiple simultaneously running Data Import file uploads is also eliminated.

The existing approach, in which the user uploads a source file directly to the Data Import app, will be removed. This will make the mod-data-import module stateless and allow us to scale it horizontally (Stateless, Horizontal scaling, and High Availability), making it HA-compliant.


The second improvement is to implement the slicing of large data import files in the Data Import application as well.

When the user starts a Data Import Job providing a source file, and the file size is greater than the maximum allowed, the Data Import application splits the original file into a number of chunks using a predefined naming schema and starts a separate Data Import Job for every chunk file created. The chunk files should be kept in the S3-like storage as well. The logic for calculating the number of chunks should be configurable so that every deployment can be provided with reasonable values.
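
A minimal sketch of the slicing step, assuming binary MARC input (records terminated by 0x1D) and a simple "name_n.mrc" naming schema; the actual naming schema and the source of the chunk-size setting are still to be defined:

    import java.io.*;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    class SourceFileSlicer {

        /** Splits a binary MARC file into chunk files of at most maxRecords records each. */
        static List<Path> sliceMarcFile(Path source, Path outDir, int maxRecords, String baseName)
                throws IOException {
            final int RECORD_TERMINATOR = 0x1D;          // MARC21 record terminator
            List<Path> chunks = new ArrayList<>();
            try (InputStream in = new BufferedInputStream(Files.newInputStream(source))) {
                OutputStream out = null;
                int recordsInChunk = 0;
                int b;
                while ((b = in.read()) != -1) {
                    if (out == null) {                   // lazily open the next chunk file
                        Path chunk = outDir.resolve(baseName + "_" + (chunks.size() + 1) + ".mrc");
                        out = new BufferedOutputStream(Files.newOutputStream(chunk));
                        chunks.add(chunk);
                        recordsInChunk = 0;
                    }
                    out.write(b);
                    if (b == RECORD_TERMINATOR && ++recordsInChunk >= maxRecords) {
                        out.close();                     // chunk is full; the next record opens a new file
                        out = null;
                    }
                }
                if (out != null) {
                    out.close();
                }
            }
            return chunks;
        }
    }

Each chunk file would then be written back to the S3-like storage and a separate Data Import Job started for it.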


Requirements

Functional requirements

  1. The max chunk file size or the max number of source records in the chunk file must be configurable at the tenant level.
  2. Records would need to be chunked and named based on the sequential order of the records in the original file, e.g. records 1-1000 in chunk file_1, records 1001-2000 in chunk file_2, etc.
  3. The user can see the progress of the file uploading process.

Non-functional requirements

  1. The implementation must be decoupled from the mod-data-import main code base and simple enough that backporting it to previous releases costs at most half (in person-days) of the original development effort. TBD: define the list of releases for backporting.
    1. An alternative option is to develop in the Nolana/Orchid codebase and forward-port to the development branch.
  2. The usage of the S3-like storage must not be vendor-locked and must support different types of storage (e.g. AWS S3, MinIO); see the sketch below.
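
One way to keep the storage layer vendor-agnostic is to rely on the S3 API only and make the endpoint and credentials configurable, as in this sketch (endpoint, credential, and region values are illustrative); the WBS below also plans to integrate folio-s3-client, which is intended to provide this kind of abstraction:

    import java.net.URI;
    import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
    import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
    import software.amazon.awssdk.regions.Region;
    import software.amazon.awssdk.services.s3.S3Client;

    class S3ClientFactoryExample {

        /** MinIO (or any S3-compatible storage): point the same client at a custom endpoint. */
        static S3Client minioClient() {
            return S3Client.builder()
                    .endpointOverride(URI.create("http://minio.folio.svc:9000"))  // illustrative endpoint
                    .region(Region.US_EAST_1)                                     // required, arbitrary for MinIO
                    .credentialsProvider(StaticCredentialsProvider.create(
                            AwsBasicCredentials.create("minio-access-key", "minio-secret-key")))
                    .forcePathStyle(true)          // MinIO typically requires path-style addressing
                    .build();
        }

        /** Plain AWS S3: default endpoint and the standard credentials chain. */
        static S3Client awsClient() {
            return S3Client.builder().region(Region.US_EAST_1).build();
        }
    }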

Assumptions

  1. Garbage collection (removing already processed source files and chunk files) is out of the scope of the feature. It can be done by configuring appropriate retention policies on the S3-like storage (a retention-rule sketch follows this list).
  2. Every tenant will have its own dedicated S3-like storage area.
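
For example, a retention policy could be configured once per tenant bucket roughly as follows, assuming the AWS SDK v2 lifecycle API (which S3-compatible stores such as MinIO also implement); bucket name, prefix, and retention period are illustrative:

    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.*;

    class RetentionPolicyExample {

        /** Expires uploaded source files and chunk files automatically instead of deleting them in code. */
        static void configureRetention(S3Client s3) {
            s3.putBucketLifecycleConfiguration(PutBucketLifecycleConfigurationRequest.builder()
                    .bucket("folio-data-import-tenant")                       // illustrative per-tenant bucket
                    .lifecycleConfiguration(BucketLifecycleConfiguration.builder()
                            .rules(LifecycleRule.builder()
                                    .id("expire-data-import-files")
                                    .filter(LifecycleRuleFilter.builder().prefix("uploads/").build())
                                    .expiration(LifecycleExpiration.builder().days(7).build())
                                    .status(ExpirationStatus.ENABLED)
                                    .build())
                            .build())
                    .build());
        }
    }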

Implementation

The solution will be implemented as a part of the mod-data-import and ui-data-import modules.

High-level operation overview

Implementation notes

Direct uploading

Uploading to S3-like storage directly from a FOLIO UI application can be implemented using the following guide https://aws.amazon.com/blogs/compute/uploading-to-amazon-s3-directly-from-a-web-or-mobile-application/. The initial call to acquire the uploadURL must be done by the back-end mod-data-import module.
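
A minimal sketch of how mod-data-import could issue the uploadURL, assuming the AWS SDK v2 presigner; the bucket, object key, and expiry values are illustrative:

    import java.time.Duration;
    import java.util.UUID;
    import software.amazon.awssdk.services.s3.model.PutObjectRequest;
    import software.amazon.awssdk.services.s3.presigner.S3Presigner;
    import software.amazon.awssdk.services.s3.presigner.model.PresignedPutObjectRequest;
    import software.amazon.awssdk.services.s3.presigner.model.PutObjectPresignRequest;

    class UploadUrlProvider {

        private final S3Presigner presigner = S3Presigner.create();

        /** Returns a time-limited URL the UI can PUT the source file to. */
        String createUploadUrl(String bucket) {
            PresignedPutObjectRequest presigned = presigner.presignPutObject(
                    PutObjectPresignRequest.builder()
                            .signatureDuration(Duration.ofMinutes(30))       // how long the URL stays valid
                            .putObjectRequest(PutObjectRequest.builder()
                                    .bucket(bucket)                          // e.g. a per-tenant bucket
                                    .key("uploads/" + UUID.randomUUID() + ".mrc")
                                    .build())
                            .build());
            return presigned.url().toString();   // handed to ui-data-import, which uploads directly to storage
        }
    }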

The diagram below represents in detail the Direct upload flow.


An alternative solution could be to use an SFTP-enabled server from the AWS Transfer Family service, but it has the following drawbacks:

  • It is an extra service that should be set up and configured
  • This service will require an extra security configuration (it requires a separate identity provider to manage user access to the SFTP server)
  • The price for US-East1 is $0.30 per hour per endpoint (~$216 per month) + $0.04 per gigabyte (GB) transferred

Based on the above, Direct uploading is preferable to using a managed SFTP server.

Simultaneous launch of a large number of Data Import Jobs

To smooth the spike in resource consumption by the mod-data-import module when a large number of Data Import Jobs are started, it is necessary to establish a queue for jobs that cannot be started immediately due to insufficient resources, thereby preventing resource depletion.

The recommended approach is to create a DB table that stores job details and use it to organize the queue. This table must be established in a dedicated schema, distinct from tenant-specific schemas, thereby enabling job data for all tenants to be stored in a centralized location. This straightforward method allows for easy retrieval of jobs for each tenant, with the next Data Import Job selected based on either priority or complexity/size. Consequently, Data Import Jobs with a smaller number of records may be given higher priority, and be processed in between chunks of larger Data Import Jobs.  
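
A minimal sketch of how the queue could be polled from such a shared table, assuming a PostgreSQL-backed table in a dedicated schema; the schema, table, and column names are illustrative, not an existing mod-data-import structure:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.Optional;
    import java.util.UUID;

    class JobQueue {

        // "FOR UPDATE SKIP LOCKED" lets several workers poll the queue without picking the same job.
        private static final String NEXT_JOB_SQL = """
                SELECT id, tenant_id, chunk_key, record_count
                  FROM di_queue.job_queue
                 WHERE status = 'QUEUED'
                 ORDER BY priority DESC, record_count ASC, created_at ASC
                 LIMIT 1
                 FOR UPDATE SKIP LOCKED
                """;

        record QueuedJob(UUID id, String tenantId, String chunkKey, int recordCount) {}

        /** Picks the next job: higher priority first, then smaller jobs in between larger ones. */
        Optional<QueuedJob> pollNextJob(Connection conn) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(NEXT_JOB_SQL);
                 ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    return Optional.empty();
                }
                return Optional.of(new QueuedJob(
                        rs.getObject("id", UUID.class),
                        rs.getString("tenant_id"),
                        rs.getString("chunk_key"),
                        rs.getInt("record_count")));
            }
        }
    }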

Result aggregation

To streamline the aggregation of results from Data Import Jobs that process chunk files, it is essential to establish a connection with the primary Data Import job, which must be defined at the outset of the operation. This means that prior to initiating Data Import Jobs for chunk files, it is necessary to create the primary Data Import Job, to which the Data Import Jobs for chunk files will be linked. By doing so, we will be able to retrieve all the logs that pertain to the source file. It is worth noting that all the necessary data structures and relationships are already in place in the Data Import app (mod-source-record-manager).
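
A minimal sketch of the aggregation idea: every chunk job carries a reference to the primary job, and the primary job's progress is the sum over its chunks. The field names below are illustrative rather than the actual mod-source-record-manager schema.

    import java.util.List;

    class JobResultAggregator {

        /** Progress reported by one Data Import Job that processed a single chunk file. */
        record ChunkJobProgress(String chunkJobId, String parentJobId,
                                int processedRecords, int totalRecords, boolean failed) {}

        /** Aggregated view shown to the user for the primary (parent) Data Import Job. */
        record ParentJobSummary(String parentJobId, int processedRecords,
                                int totalRecords, boolean hasErrors) {}

        ParentJobSummary aggregate(String parentJobId, List<ChunkJobProgress> chunkJobs) {
            int processed = chunkJobs.stream().mapToInt(ChunkJobProgress::processedRecords).sum();
            int total = chunkJobs.stream().mapToInt(ChunkJobProgress::totalRecords).sum();
            boolean hasErrors = chunkJobs.stream().anyMatch(ChunkJobProgress::failed);
            return new ParentJobSummary(parentJobId, processed, total, hasErrors);
        }
    }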


Rough WBS


Task | Comment | Size
UI | |
Select DI Job Profile | | Small
Get UploadURL | | VS
Implement file Select dialog with filtering | | Medium
Upload a file to S3-like storage | | Small
Invoke Start Data Import Job | | Small
Navigate back to the Landing page | | VS
Disable old file upload functionality. Do not remove | Configuration parameter to switch between implementations | Medium
Backend | |
Integrate folio-s3-client | Add dependency, configuration | Small
Implement S3 interaction for uploadURL | | Medium
Add endpoint to accept Start Import Job | | Large
Implement logic to slice the file into chunks | | Large
Implement logic to start chunk processing using HTTP calls | Master DI Job must be created. All chunk jobs must be linked to the master DI Job | Medium
Implement DI queue management and processing | | XL
Consolidate all exceptions into the file | |
Implement cancellation of all Jobs in one go | | Small
Karate tests | |
E2E tests | |


Sizes

Size | Effort
Very Small (VS) | < 1 day
Small | < 3 days
Medium | < 5 days
Large | < 10 days
XL | < 15 days
XXL | < 30 days