Bulk-edit file upload/download functionality approach

The mod-data-export-spring and mod-data-export-worker projects are well suited for the required tasks since they already contain a configured mechanism for saving files to an S3 bucket, as well as integration with the Spring Batch library, which allows large amounts of data to be processed quickly and conveniently. The library also provides very flexible job customization.

An import/export execution is a batch job built from the following components:

  • ItemReader – obtains the data from a particular resource
  • ItemProcessor – processes the data and contains the necessary business logic
  • ItemWriter – writes the data to the required destination. It can be S3 or some other module to which the data needs to be POSTed/PUT

We only need to configure our own job (the bulk edit job) once and define its execution steps, to which the reader, processor, and writer are bound.
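A minimal sketch of such a configuration, assuming Spring Batch 4.x builder factories (JobBuilderFactory/StepBuilderFactory); the job name, bean names, chunk size, and item types are illustrative only:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.launch.support.RunIdIncrementer;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class BulkEditJobConfig {

  @Bean
  public Job bulkEditJob(JobBuilderFactory jobs, Step bulkEditStep) {
    // One job definition; it is reused for every bulk edit JobCommand
    return jobs.get("BULK_EDIT_JOB")
        .incrementer(new RunIdIncrementer())
        .start(bulkEditStep)
        .build();
  }

  @Bean
  public Step bulkEditStep(StepBuilderFactory steps,
                           ItemReader<String> reader,
                           ItemProcessor<String, String> processor,
                           ItemWriter<String> writer) {
    // chunk(100): read and process 100 items, then hand them to the writer in one chunk
    return steps.get("bulkEditStep")
        .<String, String>chunk(100)
        .reader(reader)
        .processor(processor)
        .writer(writer)
        .build();
  }
}
```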

Main problem

For mod-bulk-edit, a file with a list of UUIDs has to be uploaded, but the JobCommand that is sent to mod-data-export-worker cannot carry the full list of UUIDs in its body, since the number of identifiers can be huge. The same problem applies to uploading a user-edited CSV file.

Solution

In mod-data-export-worker, we have to add endpoints for uploading the file needed for the BulkEditJob execution. When the JobCommand is submitted and the file is uploaded, the BulkEditJob is launched by a JobLauncher. The approach is similar to the one in mod-data-export, where a file definition is created first and a file is then uploaded for it.


1) POST /data-export-spring/jobs – create the JobCommand
2) POST /bulk-edit/{jobId}/upload – upload the file for the job (new endpoint)
3) GET /data-export-spring/jobs – get the job; when it has status COMPLETED, its files field is updated with the exported file URLs


         








New endpoints and related permissions

1) POST /bulk-edit/{jobId}/upload – upload a file for the job

Used for uploading the file required for the job to be launched.

Permission - bulk-edit.upload.post

2) GET /bulk-edit/{jobId}/preview – returns 10 records for preview on the UI

Permission - bulk-edit.preview.get



Bulk edit jobs flow

1) Edited CSV file upload job

In the first step, the JobCommand is posted to data-export-spring with the required metadata that defines the type of the job, etc.

In the second step, the CSV file edited by the user is uploaded by JobCommand id.

Upon receiving the file, the JobLauncher launches the related bulk edit job, which triggers the step execution.

Here we have to implement the following components: CsvItemReader, CsvItemProcessor, and JsonItemWriter.

The flow is simple:

  • CsvItemReader reads a number of lines and passes them to CsvItemProcessor.
  • CsvItemProcessor translates the CSV string records into JSON records and sends them to JsonItemWriter.
  • JsonItemWriter performs PUT requests to inventory-storage to update the records with the changes made by the user (a writer sketch follows this list).
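A minimal sketch of the writer side, assuming Spring Batch 4.x and Spring Cloud OpenFeign; the InventoryStorageClient interface, its path, and the Map-based payload are assumptions, not the module's real API:

```java
import java.util.List;
import java.util.Map;
import org.springframework.batch.item.ItemWriter;
import org.springframework.cloud.openfeign.FeignClient;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PutMapping;
import org.springframework.web.bind.annotation.RequestBody;

public class JsonItemWriter implements ItemWriter<Map<String, Object>> {

  private final InventoryStorageClient client;

  public JsonItemWriter(InventoryStorageClient client) {
    this.client = client;
  }

  @Override
  public void write(List<? extends Map<String, Object>> jsonRecords) {
    // One PUT per updated record; failures propagate and are handled by the step's skip/retry policies
    for (Map<String, Object> json : jsonRecords) {
      client.updateRecord((String) json.get("id"), json);
    }
  }

  // Hypothetical Feign client; the real path and payload type depend on the record type being edited
  @FeignClient(name = "inventory-storage")
  public interface InventoryStorageClient {
    @PutMapping("/item-storage/items/{id}")
    void updateRecord(@PathVariable("id") String id, @RequestBody Map<String, Object> record);
  }
}
```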

2) Upload a list of UUIDs and get the CSV file with inventory records


The initial steps are the same - create a JobCommand and upload a CSV file with UUIDs.

Upon receiving the file, the JobLauncher launches the related bulk edit job, which triggers the step execution.

Here we have to implement the following components: ItemReader, JsonToCsvItemProcessor, and CsvItemWriter.

The flow is simple:

  • ItemReader reads a chunk of UUIDs and passes them to JsonToCsvItemProcessor.
  • JsonToCsvItemProcessor requests records from inventory storage by UUIDs and translates the obtained JSON records into CSV strings that are passed to CsvItemWriter (a processor sketch follows this list).
  • CsvItemWriter writes the lines to a CSV file.
  • When the job gets status COMPLETED, the file is stored to the S3 bucket and the JobCommand is updated with the links to the stored files.
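A sketch of such a processor, assuming Spring Batch 4.x; the UserClient interface, the CSV columns, and the BulkEditException class shown here are illustrative stand-ins for the module's real types:

```java
import java.util.Map;
import org.springframework.batch.item.ItemProcessor;

public class JsonToCsvItemProcessor implements ItemProcessor<String, String> {

  private final UserClient userClient;

  public JsonToCsvItemProcessor(UserClient userClient) {
    this.userClient = userClient;
  }

  @Override
  public String process(String identifier) {
    // Fetch the JSON record from storage by UUID (or barcode, depending on IdentifierType)
    Map<String, Object> user = userClient.getUserById(identifier);
    if (user == null) {
      // Skippable exception referenced in the error-handling section below:
      // the job keeps running and the SkipListener logs the error
      throw new BulkEditException("No record found for identifier " + identifier);
    }
    // Translate the JSON record into one CSV line (columns are illustrative)
    return String.join(",",
        String.valueOf(user.get("id")),
        String.valueOf(user.get("barcode")),
        String.valueOf(user.get("username")));
  }

  // Hypothetical Feign-style client interface
  public interface UserClient {
    Map<String, Object> getUserById(String id);
  }

  // Assumed shape of the skippable exception type
  public static class BulkEditException extends RuntimeException {
    public BulkEditException(String message) {
      super(message);
    }
  }
}
```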

3) Upload a CQL query and get the CSV file with inventory records

The flow is similar to the UUIDs upload job, but in the case of CQL no file with UUIDs is posted: the CQL query can be put directly into the JobCommand body. When the JobCommand is posted with the CQL specified, the required bulk edit job can be launched immediately.

In the case of CQL, we have to download the list of matched record UUIDs, save it to a local file, and then reuse the same flow that processes a list of UUIDs.
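A rough sketch of resolving a CQL query into a local file of UUIDs, which the identifiers flow can then consume; the SearchClient interface, its paging parameters, and the response shape are purely hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

public class CqlToIdsFileResolver {

  private final SearchClient searchClient; // hypothetical client for the CQL search

  public CqlToIdsFileResolver(SearchClient searchClient) {
    this.searchClient = searchClient;
  }

  /** Pages through the records matched by the CQL query and appends their ids to a temp file. */
  public Path resolve(String cql) throws IOException {
    Path idsFile = Files.createTempFile("bulk-edit-ids", ".csv");
    int offset = 0;
    final int limit = 1000;
    List<String> ids;
    do {
      ids = searchClient.getIds(cql, offset, limit);
      Files.write(idsFile, ids, StandardOpenOption.APPEND);
      offset += limit;
    } while (ids.size() == limit);
    return idsFile;
  }

  public interface SearchClient {
    List<String> getIds(String cql, int offset, int limit);
  }
}
```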

Error handling and file validation

Spring Batch supplies components that allow handling specific exceptions or simply skipping the items that cause them. Let's consider them with an example.

1) Upload a CSV file with UUIDs

In this case, if some records cannot be found by their UUIDs, a specific BulkEditException is thrown. This exception type is configured as skippable for the steps, so the job will not fail. To handle it, a SkipListener should be implemented, containing all the necessary logic for logging/saving the exception.

By the end of the job, the user will be able to get both the constructed CSV file with entities and the saved error logs.
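A minimal SkipListener sketch, assuming Spring Batch 4.x; the error-log persistence and the fault-tolerance settings shown in the trailing comment are illustrative:

```java
import org.springframework.batch.core.SkipListener;

public class BulkEditSkipListener implements SkipListener<String, String> {

  @Override
  public void onSkipInRead(Throwable t) {
    // e.g. a malformed CSV line
    log(t);
  }

  @Override
  public void onSkipInProcess(String identifier, Throwable t) {
    // e.g. a record not found by its UUID/barcode -> BulkEditException
    log(t);
  }

  @Override
  public void onSkipInWrite(String csvLine, Throwable t) {
    log(t);
  }

  private void log(Throwable t) {
    // Persist to the error log that is later exposed to the user by job id (illustrative)
    System.err.println("Skipped item: " + t.getMessage());
  }
}

// Attaching it in the step configuration (Spring Batch 4.x style):
//   stepBuilderFactory.get("bulkEditStep")
//       .<String, String>chunk(100)
//       ...
//       .faultTolerant()
//       .skip(BulkEditException.class)  // the skippable exception from this section
//       .skipLimit(Integer.MAX_VALUE)   // illustrative: the identifiers flow tolerates skips
//       .listener(new BulkEditSkipListener())
//       .build();
```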

2) Upload user-edited CSV file

This flow differs because here the file should be validated first. For this purpose, a separate step can be implemented (a simple Tasklet step). This file validation step runs first, before the steps with the actual file-processing business logic are executed.

If the file contains a typo or was edited incorrectly, the job will get the FAIL status and an appropriate error log will be saved.

The flow is otherwise the same, but if errors occur while executing the "validate file" step, the job will fail and the SkipListener will process the error. The skipLimit value in the step configuration is set to 0.

When the error has been processed and the job finishes with status FAIL, the JobListener updates the JobCommand with status FAIL, and the user will be able to get the error logs by this job id to identify the errors.
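A sketch of such a validation step, assuming Spring Batch 4.x; the header check is only an example of a validation rule:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class FileValidationTasklet implements Tasklet {

  private final Path uploadedFile;
  private final String expectedHeader;

  public FileValidationTasklet(Path uploadedFile, String expectedHeader) {
    this.uploadedFile = uploadedFile;
    this.expectedHeader = expectedHeader;
  }

  @Override
  public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
    try (Stream<String> lines = Files.lines(uploadedFile)) {
      String header = lines.findFirst().orElse("");
      if (!expectedHeader.equals(header)) {
        // Throwing here fails the step and, with skipLimit set to 0, the whole job -> status FAIL
        throw new IllegalStateException("Uploaded CSV has an unexpected header: " + header);
      }
    }
    return RepeatStatus.FINISHED;
  }
}
```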

Important

The link to the uploaded file should also be put into the JobCommand entity, since the user should have a way to get the same file back in its non-edited state. If the file has expired, no file link will be added to the JobCommand.

Increasing performance

Partitioner

To speed up file processing, the work can be split into several parts, and each part can be processed asynchronously. This becomes possible with the Partitioner from the Spring Batch library. A Partitioner allows splitting a step (the master step) into a defined number of slave steps that get the same Reader, Processor, and Writer but work on their own part of the common data instead of processing all of it in a single thread. The one thing developers should take into account is that the reader and writer must support asynchronous processing.

There is a step called "master" (manager), whose execution is divided into several "slave" steps (workers). These slaves can take the place of the master, and the outcome will still be unchanged. Both master and slaves are instances of the same Step.
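A configuration sketch of such a master/worker step, assuming Spring Batch 4.x; the bean names, grid size, and the injected partitioner and worker step are illustrative and assumed to be defined elsewhere:

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.TaskExecutor;

@Configuration
public class PartitionedBulkEditConfig {

  @Bean
  public Step masterBulkEditStep(StepBuilderFactory steps,
                                 Partitioner csvPartitioner,
                                 Step slaveBulkEditStep,
                                 TaskExecutor taskExecutor) {
    return steps.get("masterBulkEditStep")
        .partitioner("slaveBulkEditStep", csvPartitioner) // decides how the input is split
        .step(slaveBulkEditStep)    // worker step: the same reader/processor/writer per partition
        .gridSize(10)               // number of worker executions
        .taskExecutor(taskExecutor) // runs the workers in parallel threads
        .build();
  }
}
```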

Aggregator

Spring Batch supplies a component that allows aggregating the results of the executed steps. This can be achieved by implementing a StepExecutionAggregator and attaching it to the partitioned step via the step builder's aggregator(StepExecutionAggregator aggregator) method.
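A minimal sketch, assuming the aggregator extends Spring Batch's DefaultStepExecutionAggregator and is attached with .aggregator(new CsvFileAggregator()) on the partitioned step builder; the file-merging logic is only outlined:

```java
import java.util.Collection;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.partition.support.DefaultStepExecutionAggregator;

public class CsvFileAggregator extends DefaultStepExecutionAggregator {

  @Override
  public void aggregate(StepExecution result, Collection<StepExecution> executions) {
    super.aggregate(result, executions); // keep the default status/count aggregation
    // ...merge the temporary CSV files produced by each worker execution into one file here
  }
}
```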

Example overview 

Let's assume that the user uploads a CSV file with 1 million UUIDs.

The file can be split into parts by the Partitioner. Let's assume that each slave step should process 100,000 records; there will therefore be 10 slave steps running, where each slave step processes its own 100,000 UUIDs and writes the result into a separate temporary file.

The first slave step will read lines 0 to 100,000, the second step lines 100,001 to 200,000, and so on. If the slave steps share a single reader over the same file, that reader must be thread-safe (for example, wrapped in a SynchronizedItemStreamReader) so that access to the file is synchronized. When all steps are completed, there will be 10 temporary files.

Each step writes its result to a separate file because we do not want to synchronize the writer: that would hurt performance and lose the asynchronous processing. We want steps that perform their processing without blocking each other while waiting for a writer lock.
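An illustrative Partitioner for this split; the ExecutionContext keys ("offset", "limit", "tempOutputFile") are assumptions that the worker step's reader and writer would have to honour:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class CsvLineRangePartitioner implements Partitioner {

  private final Path idsFile;

  public CsvLineRangePartitioner(Path idsFile) {
    this.idsFile = idsFile;
  }

  @Override
  public Map<String, ExecutionContext> partition(int gridSize) {
    long totalLines;
    try (Stream<String> lines = Files.lines(idsFile)) {
      totalLines = lines.count();
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
    long linesPerPartition = (totalLines + gridSize - 1) / gridSize;
    Map<String, ExecutionContext> partitions = new HashMap<>();
    for (int i = 0; i < gridSize; i++) {
      ExecutionContext context = new ExecutionContext();
      context.putLong("offset", i * linesPerPartition);           // first line for this worker
      context.putLong("limit", linesPerPartition);                 // how many lines it processes
      context.putString("tempOutputFile", idsFile + ".part" + i);  // each worker writes its own file
      partitions.put("partition" + i, context);
    }
    return partitions;
  }
}
```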

When the execution of the steps finishes, the Aggregator aggregates these files into a single CSV file, and inside the JobListener the file is uploaded to the S3 bucket using the MinIO integration. (The aggregation of temporary files into a CSV file is already implemented; we just have to reuse it.)


Replacing keys with values

In the CSV file that we provide to the user, external keys have to be replaced with the values of the entities these keys refer to. Therefore the reference data has to be obtained and used while the job is running.

Requesting this reference data every time a chunk of items is processed makes no sense. The best way to work with this data is to cache it once before running the job, as was done in mod-data-export.

For the caching mechanism, the Caffeine library, which is already integrated with Spring Boot, can be used. The good point is that the data can be cleaned automatically; it is only necessary to specify the expiration time.

The data will be cached within the Partitioner code and will be available through the ExecutionContext for each slave step.
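A minimal Caffeine sketch with time-based expiration; the reference-data client, the cached entity (patron group names), and the 30-minute expiration are assumptions:

```java
import java.time.Duration;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

public class ReferenceDataCache {

  private final LoadingCache<String, String> patronGroupNames;

  public ReferenceDataCache(ReferenceDataClient client) {
    this.patronGroupNames = Caffeine.newBuilder()
        .expireAfterWrite(Duration.ofMinutes(30)) // auto-clean after the expiration time
        .build(client::getPatronGroupNameById);   // loads the value on the first access
  }

  /** Replaces an external key (UUID) with its human-readable value for the CSV output. */
  public String patronGroupName(String id) {
    return patronGroupNames.get(id);
  }

  public interface ReferenceDataClient {
    String getPatronGroupNameById(String id); // hypothetical reference-data lookup
  }
}
```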

Record types and job types handling

Since bulk edit should work with different types of records, different strategies for mapping JSON to CSV and back will be used.

In order to know which particular job has to be run and which record type that job should handle, the related metadata should be put into the JobCommand entity.

JobCommand metadata

JobCommand ExportType
The following new values should be added:

1) BULK_EDIT_IDENTIFIERS – export type used for uploading a CSV with UUIDs or barcodes
2) BULK_EDIT_UPDATE – export type used for uploading an edited CSV file

JobCommand IdentifierType (new field)

1) ID
2) BARCODE

JobCommand RecordType (new field)

1) USER

JobCommand metadata validation 

The specified ExportType defines the required fields of the JobCommand metadata.

1) ExportType BULK_EDIT_IDENTIFIERS

Required metadata fields - IdentifierType and RecordType (when posting a CSV with identifiers, it is necessary to know what these identifiers are - barcodes or UUIDs - and to which type of records they relate)

2) ExportType BULK_EDIT_UPDATE

Required metadata fields - RecordType only (when posting a CSV with updated records, it is necessary to know what type of records is described in the file, in order to know which mapping should be used)
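A sketch of this validation; the enum and field names follow the values listed above, but the validator class itself and the exception it throws are illustrative:

```java
public class JobCommandValidator {

  public void validate(ExportType exportType, IdentifierType identifierType, RecordType recordType) {
    switch (exportType) {
      case BULK_EDIT_IDENTIFIERS:
        // identifiers flow needs to know both what the identifiers are and what they identify
        require(identifierType != null, "IdentifierType is required for BULK_EDIT_IDENTIFIERS");
        require(recordType != null, "RecordType is required for BULK_EDIT_IDENTIFIERS");
        break;
      case BULK_EDIT_UPDATE:
        // updated-records flow only needs the record type to pick the CSV<->JSON mapping
        require(recordType != null, "RecordType is required for BULK_EDIT_UPDATE");
        break;
      default:
        // other export types are validated elsewhere
        break;
    }
  }

  private void require(boolean condition, String message) {
    if (!condition) {
      throw new IllegalArgumentException(message);
    }
  }

  public enum ExportType { BULK_EDIT_IDENTIFIERS, BULK_EDIT_UPDATE /* , existing export types */ }
  public enum IdentifierType { ID, BARCODE }
  public enum RecordType { USER }
}
```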


Files retention/cleaning

The files are stored in S3 for a limited period of time. By default, the retention period is 14 days.

The related setting should be defined at the tenant level, and can therefore be stored in mod-configuration.

For cleaning the files, a scheduled job should be implemented. It is a simple job with a single Tasklet step that contains the business logic for removing expired files.

The job can be scheduled using the @Scheduled annotation with a cron expression; note that Spring cron expressions include a seconds field, so "every 2 hours" is expressed as "0 0 */2 * * *".
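A sketch of the clean-up scheduling, assuming @EnableScheduling is active; for brevity it is shown as a plain scheduled method rather than a batch job with a Tasklet step, and the FileStorage abstraction is hypothetical:

```java
import java.time.Duration;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class ExpiredFilesCleanUpJob {

  private final FileStorage fileStorage; // hypothetical abstraction over the S3/MinIO client

  public ExpiredFilesCleanUpJob(FileStorage fileStorage) {
    this.fileStorage = fileStorage;
  }

  // Spring cron expressions have a seconds field, so "every 2 hours" is "0 0 */2 * * *"
  @Scheduled(cron = "0 0 */2 * * *")
  public void removeExpiredFiles() {
    // 14 days is the default retention; the actual value would come from the tenant-level setting
    fileStorage.deleteFilesOlderThan(Duration.ofDays(14));
  }

  public interface FileStorage {
    void deleteFilesOlderThan(Duration retention);
  }
}
```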


AWS cost

Average line length: 2102 characters.

Records per file / files per month | File size (per file / per month) | Requests needed to upload (per file / per month) | AWS cost, $ per month
1 000 000 records per file, 1000 files per month | 6 Mb / 6 Gb | 600 / 600 000 | 3,14
1 000 000 records per file, 500 files per month | 6 Mb / 3 Gb | 600 / 300 000 | 1,57
1 000 000 records per file, 300 files per month | 6 Mb / 1,8 Gb | 600 / 180 000 | 0,94
2 000 000 records per file, 1000 files per month | 12 Mb / 12 Gb | 1200 / 1 200 000 | 6,28
2 000 000 records per file, 500 files per month | 12 Mb / 6 Gb | 1200 / 600 000 | 3,14
2 000 000 records per file, 300 files per month | 12 Mb / 3,6 Gb | 1200 / 360 000 | 1,88
3 000 000 records per file, 1000 files per month | 18 Mb / 18 Gb | 1800 / 1 800 000 | 9,41
3 000 000 records per file, 500 files per month | 18 Mb / 9 Gb | 1800 / 900 000 | 4,71
3 000 000 records per file, 300 files per month | 18 Mb / 5,4 Gb | 1800 / 540 000 | 2,82

Count of records

Example of fields that can be used on the UI side to display information about the count of records in bulk edit:

Field from Response | Description
Total | Count of initial input records
Processed | Count of processed records in bulk edit
Matched | Count of matched (updated) records in bulk edit
Success | Count of successfully processed records in bulk edit
Errors | Count of records that have errors in bulk edit

Question/assumption regarding the Matched field (Magda Zacharska): could you please clarify whether the user should see in 'matched records' the records that were updated in the last step of bulk edit?

Answer (Denis): the expected behavior was described in UIBULKED-46 and was supposed to be delivered in the scope of MODEXPW-69, but the current implementation differs from what was expected. Matched records are displayed after the user uploads the file with identifiers or runs the query. After the bulk edit is completed, the accordion should be named "Preview of records changed" as displayed on the following mockup: https://drive.google.com/file/d/1Uyr5_6zOLWnWEyukwJINBWFy2l8LjmMk/view?usp=sharing


Migration risks and possible mitigation

Risk: since we want to work with millions of records in the future, performance can deteriorate on a big data set.

Mitigation: Spring Batch supplies the AsyncItemProcessor and AsyncItemWriter, which allow the processing logic to run asynchronously instead of using a single thread for processing a chunk of the data. In addition, the Spring Batch Partitioner allows splitting the processing of a file across several threads. We only have to configure a ThreadPoolTaskExecutor for the processor and writer or for the partitioned step.
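A configuration sketch of the async decorators from the spring-batch-integration module, assuming Spring Batch 4.x; the delegate beans, item types, and pool sizes are illustrative:

```java
import org.springframework.batch.integration.async.AsyncItemProcessor;
import org.springframework.batch.integration.async.AsyncItemWriter;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class AsyncBulkEditConfig {

  @Bean
  public ThreadPoolTaskExecutor bulkEditTaskExecutor() {
    ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
    executor.setCorePoolSize(10);
    executor.setMaxPoolSize(10);
    executor.setThreadNamePrefix("bulk-edit-");
    return executor;
  }

  @Bean
  public AsyncItemProcessor<String, String> asyncItemProcessor(ItemProcessor<String, String> delegate,
                                                               ThreadPoolTaskExecutor bulkEditTaskExecutor) {
    AsyncItemProcessor<String, String> asyncProcessor = new AsyncItemProcessor<>();
    asyncProcessor.setDelegate(delegate);                 // the existing (synchronous) processor
    asyncProcessor.setTaskExecutor(bulkEditTaskExecutor); // processes items in parallel, returns Futures
    return asyncProcessor;
  }

  @Bean
  public AsyncItemWriter<String> asyncItemWriter(ItemWriter<String> delegate) {
    AsyncItemWriter<String> asyncWriter = new AsyncItemWriter<>();
    asyncWriter.setDelegate(delegate); // unwraps the Futures produced by the async processor
    return asyncWriter;
  }
}
```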