Summary

As our Data Import application cannot process very large files, one possible solution is to slice a large data import source file into smaller chunks (files) and run a separate Data Import Job for each chunk file.

...

The first idea is to let the Data Import app download the source file from the S3-like storage instead of acting as an upload server that receives the file directly. With this change, the initial stage of Data Import will look as follows:

  1. The user uploads a source file to the S3-like storage that is available to the Data Import application.
    1. The user can list already uploaded files and select which one should be used for processing.
  2. The user starts the Data Import job using the Data Import UI, providing the source file's location in the S3-like storage and the Job Profile that must be used for Data Import.
  3. The Data Import application downloads the file from the S3-like storage to its local file system (see the sketch after this list).
  4. The Data Import application continues the usual source file processing once the file is downloaded to the local file system of the mod-data-import module.
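
A minimal sketch of the download step, assuming the S3-like storage is accessed through the AWS SDK for Java v2; the class name, bucket name, and directory parameter are illustrative assumptions and not part of the actual mod-data-import code:

    import java.nio.file.Path;
    import java.nio.file.Paths;

    import software.amazon.awssdk.core.sync.ResponseTransformer;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;

    public class SourceFileDownloader {

      private final S3Client s3Client;   // configured for the S3-like storage endpoint
      private final String bucketName;   // bucket used for Data Import uploads

      public SourceFileDownloader(S3Client s3Client, String bucketName) {
        this.s3Client = s3Client;
        this.bucketName = bucketName;
      }

      /**
       * Downloads the source file identified by its storage key into the
       * local file system of the mod-data-import module, so the usual
       * processing can continue on a local copy.
       */
      public Path download(String objectKey, String localDir) {
        Path target = Paths.get(localDir, Paths.get(objectKey).getFileName().toString());

        GetObjectRequest request = GetObjectRequest.builder()
            .bucket(bucketName)
            .key(objectKey)
            .build();

        // Streams the object directly to the target file; fails if the file already exists.
        s3Client.getObject(request, ResponseTransformer.toFile(target));
        return target;
      }
    }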

...

This change will make the initial stage of the Data Import application more reliable and prevent a potential denial of service (DoS) attack in which a threat actor could fill up disk space. In addition, it eliminates the risk of uncontrolled resource consumption when multiple Data Import file uploads run simultaneously. The existing approach, in which the user uploads a source file directly to the Data Import app, will be preserved for backward compatibility, but the maximum size of files that can be processed this way will be significantly reduced.

...

When the user starts a Data Import Job and provides a source file location, if the file size is greater than the maximum allowed, the Data Import application splits the original file into a number of chunks using a predefined naming scheme and starts a separate Data Import Job for every chunk file created. The chunk files are kept in the S3-like storage as well. The logic for calculating the number of chunks should be configurable, so that every deployment can set values that are reasonable for it (see the sketch below).

These changes also separate the source file uploading and processing operations: the user can perform the potentially long-lasting file upload beforehand, at any time that suits them, and the actual Data Import Job can be started at an appropriate time, for example at the end of the working day.
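
An illustration of how the chunk calculation and naming scheme could be made configurable. The size-based split, the constructor parameter, and the ".chunk-NNN" suffix are assumptions made for this sketch only; the real implementation may split on record boundaries and use a different naming scheme:

    import java.util.ArrayList;
    import java.util.List;

    /**
     * Illustrative calculation of chunk count and chunk file keys.
     * The limit would be read from the deployment's configuration.
     */
    public class ChunkPlanner {

      // Hypothetical configurable limit, e.g. taken from a system property or environment variable.
      private final long maxChunkSizeBytes;

      public ChunkPlanner(long maxChunkSizeBytes) {
        this.maxChunkSizeBytes = maxChunkSizeBytes;
      }

      /** Number of chunks needed to keep every chunk under the configured limit. */
      public int numberOfChunks(long sourceFileSizeBytes) {
        // Ceiling division: any remainder requires one more chunk.
        return (int) ((sourceFileSizeBytes + maxChunkSizeBytes - 1) / maxChunkSizeBytes);
      }

      /**
       * Predefined naming scheme: the original key plus a zero-padded chunk index,
       * e.g. "records.mrc.chunk-003". The chunk files stay in the S3-like storage.
       */
      public List<String> chunkKeys(String sourceKey, long sourceFileSizeBytes) {
        int chunks = numberOfChunks(sourceFileSizeBytes);
        List<String> keys = new ArrayList<>(chunks);
        for (int i = 1; i <= chunks; i++) {
          keys.add(String.format("%s.chunk-%03d", sourceKey, i));
        }
        return keys;
      }
    }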


Requirements

Functional requirements

...

The solution will be implemented as a part of the mod-data-import module.

High-level operation overview

...

[Diagram: high-level operation overview]

Implementation notes

Direct uploading (1, 2)

Uploading to Amazon S3-like storage directly from a FOLIO UI application can be implemented using the following guide: https://aws.amazon.com/blogs/compute/uploading-to-amazon-s3-directly-from-a-web-or-mobile-application/. The initial call to acquire the uploadURL must be made by the back-end mod-data-import module.
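
Following that guide, the back end only issues a short-lived presigned URL and the browser then uploads the file straight to the storage. A minimal sketch of how mod-data-import could produce such a URL, assuming the AWS SDK for Java v2 S3Presigner is used; the class name, bucket handling, and 30-minute expiration are illustrative assumptions:

    import java.net.URL;
    import java.time.Duration;

    import software.amazon.awssdk.services.s3.model.PutObjectRequest;
    import software.amazon.awssdk.services.s3.presigner.S3Presigner;
    import software.amazon.awssdk.services.s3.presigner.model.PresignedPutObjectRequest;
    import software.amazon.awssdk.services.s3.presigner.model.PutObjectPresignRequest;

    public class UploadUrlProvider {

      private final S3Presigner presigner;  // configured for the S3-like storage endpoint
      private final String bucketName;

      public UploadUrlProvider(S3Presigner presigner, String bucketName) {
        this.presigner = presigner;
        this.bucketName = bucketName;
      }

      /**
       * Returns a time-limited URL that the UI can PUT the source file to,
       * so the file contents never pass through the back-end module.
       */
      public URL createUploadUrl(String objectKey) {
        PutObjectRequest objectRequest = PutObjectRequest.builder()
            .bucket(bucketName)
            .key(objectKey)
            .build();

        PutObjectPresignRequest presignRequest = PutObjectPresignRequest.builder()
            .signatureDuration(Duration.ofMinutes(30)) // illustrative expiration
            .putObjectRequest(objectRequest)
            .build();

        PresignedPutObjectRequest presigned = presigner.presignPutObject(presignRequest);
        return presigned.url();
      }
    }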

The diagram below shows the Direct upload flow in detail.

[Diagram: Direct upload flow]

Simultaneous launch of a large number of Data Import Jobs (9)

To smooth the spike in resource consumption by the mod-data-import module when a large number of Data Import Jobs is started, it is necessary to organize a queue for jobs that do not yet have enough resources to run, thereby preventing resource exhaustion (see the sketch below).
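
A minimal sketch of such a queue, assuming a fixed concurrency limit taken from configuration; the class name and the use of a plain fixed-size thread pool are illustrative assumptions, not the module's actual implementation:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    /**
     * Illustrative job queue: only a fixed number of Data Import Jobs run at once,
     * the rest wait in a FIFO queue instead of exhausting module resources.
     */
    public class DataImportJobQueue {

      private final ExecutorService workers;

      public DataImportJobQueue(int maxConcurrentJobs) {
        // A fixed-size pool bounds resource consumption; excess jobs are queued internally.
        this.workers = Executors.newFixedThreadPool(maxConcurrentJobs);
      }

      /** Enqueues a job; it starts as soon as a worker slot is free. */
      public void submit(Runnable dataImportJob) {
        workers.submit(dataImportJob);
      }

      public void shutdown() {
        workers.shutdown();
      }
    }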