Batch Importer (Bib/Acq) (UXPROD-47)

[UXPROD-4337] Reliably process large files in DI by automatically splitting and processing source files Created: 06/Jun/23  Updated: 08/Feb/24  Resolved: 04/Dec/23

Status: Closed
Project: UX Product
Components: Batch Importer
Affects versions: None
Fix versions: None
Parent: Batch Importer (Bib/Acq)

Type: New Feature Priority: P1
Reporter: Kathleen Moore Assignee: Kathleen Moore
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Attachments: JPEG File DI - Large jobs - Current large job DI user flow in FOLIO.jpg     JPEG File DI - Large jobs - Detail_ Existing flow for initiating an import from DI app.jpg     JPEG File DI - Large jobs - Detail_ Existing flow showing progress of import from DI app.jpg     JPEG File DI - Large jobs - Large job DI user flow in other ILS.jpg    
Issue links:
Defines
defines UIDATIMP-1563 'Running' card shows incorrect totals Closed
is defined by PERF-565 SPIKE: Slice large data import files ... Closed
Relates
relates to MODDATAIMP-843 Spike: Investigate 'queued' status on... Open
relates to UIDATIMP-1489 FE splitting a file when error occurs Open
relates to MODDATAIMP-842 BE API to get chunk file from S3  Closed
relates to MODDATAIMP-852 BE How to configure the UI to support... Closed
relates to UIDATIMP-1464 Provide visibility into initial file ... Closed
relates to UIDATIMP-1466 FE Provide visibility into composite ... Closed
relates to UIDATIMP-1469 FE Cancel a running split job Closed
relates to UIDATIMP-1510 Display a link to download split file... Closed
relates to FAT-7307 Karate Test: Data Import Task Force —... Closed
relates to FOLS3CL-11 Create method to get presigned URLs f... Closed
relates to MODDATAIMP-846 Implement GET /uploadUrl method Closed
relates to MODDATAIMP-849 Knowing how to split files Closed
relates to MODDATAIMP-850 Spike: Understand BE use of presigned... Closed
relates to MODDATAIMP-853 Create a service to asynchronously sp... Closed
relates to MODDATAIMP-857 Make /uploadURL endpoint multipart aware Closed
relates to MODDATAIMP-860 Create a service to asynchronously sp... Closed
relates to MODDATAIMP-863 [SPIKE] Benchmark multipart uploads w... Closed
relates to MODDATAIMP-893 BE Cancel a running split job Closed
relates to MODSOURMAN-1062 Changes in mod-srm to allow submittin... Closed
relates to UIDATIMP-1463 Notify the user that larger files wil... Closed
relates to UIDATIMP-1467 Provide visibility into composite par... Closed
relates to UIDATIMP-1468 FE: Create functionality to split a f... Closed
relates to UIDATIMP-1472 Notify user when error occurs Closed
relates to UIDATIMP-1487 Uploading a file errors Closed
relates to UIDATIMP-1488 BE splitting a file when error occurs Closed
relates to MODDATAIMP-832 Create a DB table to keep Data Import... In Progress
relates to MODDATAIMP-820 Check if we have several data import ... Closed
relates to MODDATAIMP-829 Implement Large Data Import File Slic... Closed
relates to MODDATAIMP-830 Investigate how we access to public s... Closed
relates to MODDATAIMP-831 [BE] Simultaneous launch of a large n... Closed
relates to MODDATAIMP-833 Implement DI queue management and pro... Closed
relates to MODDATAIMP-834 [SPIKE FE and BE]Investigate using pa... Closed
relates to MODDATAIMP-835 Where configuration parameters are co... Closed
relates to MODDATAIMP-836 Implement new Data Import entry point... Closed
relates to MODDATAIMP-837 Determine Schema and algorithm requir... Closed
relates to MODDATAIMP-838 Spike - Implement S3 interaction for ... Closed
relates to MODDATAIMP-839 Explore UI (check current Data UI inf... Closed
relates to MODDATAIMP-861 Define rules to prioritize chunks in ... Closed
relates to MODDATAIMP-864 Create a service to implement priorit... Closed
relates to UIDATIMP-1460 Spike: Understand FE use of presigned... Closed
relates to UIDATIMP-1462 FE Retrieve information about method ... Closed
relates to UIDATIMP-1475 Use S3-like buckets to upload source ... Closed
relates to UIDATIMP-1476 Select Data Import Job Profile (UI) Closed
relates to UIDATIMP-1477 Implement Select source file Dialog w... Closed
relates to UIDATIMP-1478 Implement file uploading to S3-like s... Closed
relates to UIDATIMP-1479 [SPIKE]Implement S3 interaction for p... Closed
relates to UIDATIMP-1480 Implement new Data Import entry point... Closed
relates to UIDATIMP-1481 Initiates the Data Import Job (UI) Closed
relates to UIDATIMP-1482 [SPIKE]check the existing solution[BE] Closed
Requires
is required by MODSOURCE-627 SPIKE: Design restructuring of the DB... Open
Release: Poppy (R2 2023)
Epic Link: Batch Importer (Bib/Acq)
Development Team: Data Import Task Force
Report Functional Area(s):
Import and Export
PO Rank: 0

 Description   

Problem:

Loading large files* into the system isn’t reliable. The current process is inconvenient: it’s labor-intensive and time-consuming, and must be run off-hours to avoid negatively impacting the system. Data Import needs to reliably support loading and successfully processing large files.

*Large file = a file of any reasonable size, typically with 100,000+ records.

Background:

  • Even relatively small record loads (fewer than 1,000 records) may time out or take a long time to complete
  • FOLIO users must plan to work off-hours to complete import jobs; otherwise an import can bring other systems down
  • Libraries are unable to complete large cataloging projects
  • Libraries are reluctant to even attempt larger loads because they will have to deal with the "mess" afterward
  • Smaller jobs (jobs with a handful of records) get delayed by larger jobs, often by hours.

In the current system, to successfully load a large file, I must:

  • manually break the original file into several smaller files (~1,000–5,000 records per file)
  • wait until the end of the day to kick off an import for each of the smaller files (otherwise these imports can negatively impact the system)

The total elapsed time to load all records in a large file can be days to weeks.
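The manual chunking step described above is exactly what this feature automates. A minimal sketch of that step, assuming raw MARC21 input where each record ends with the standard record terminator byte `0x1D` (the function name and chunk size are illustrative, not from the implementation):

```python
# Illustrative sketch: split a binary MARC file into parts of at most
# `records_per_chunk` records. Assumes raw MARC21, where every record
# ends with the record terminator byte 0x1D.

MARC_RECORD_TERMINATOR = b"\x1d"

def split_marc_file(data: bytes, records_per_chunk: int = 1000) -> list[bytes]:
    """Split raw MARC data into chunks of at most records_per_chunk records."""
    # Splitting on the terminator yields one entry per record plus a
    # trailing empty entry, which is dropped; the terminator is restored
    # so each chunk is itself a valid MARC file.
    records = [r + MARC_RECORD_TERMINATOR
               for r in data.split(MARC_RECORD_TERMINATOR) if r]
    return [b"".join(records[i:i + records_per_chunk])
            for i in range(0, len(records), records_per_chunk)]
```

Because each chunk ends on a record boundary, every chunk can be submitted as an independent, well-formed import file.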

 

Out of scope

  • Fixing existing Data Import flaws
  • Major changes to the existing Data Import workflow

 

Use case(s)

  • Reliably complete large file imports with the following actions:
    • create
    • modify
    • update
  • Reliably complete large file imports with the following record types:
    • MARC Bibliographic records
    • MARC Authority records
    • MARC Holdings records
    • EDIFACT invoice records
  • When importing a single large file via Data Import:
    • I can view the status of the file being uploaded/split
    • I can view the status of a large job that is running
    • I can view the status of a large job that has completed in the logs
    • In the event errors occur while processing my large file, I can see information about the error(s)
    • While the large file is being processed, I can cancel the import (cancelling doesn’t undo whatever was done prior to the point of being stopped)
  • The following can be configured:
    • Feature flag on/off (at the cluster level)
    • Max file size
    • Number of active data import jobs
  • TODO: Identify critical profiles / files for us to test
  • TODO: UI mockups
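The configurable items above (feature flag, max file size, active-job count) could be read at startup roughly as follows. This is a hypothetical sketch: the environment variable names, defaults, and units are illustrative assumptions, not the actual module settings.

```python
# Hypothetical sketch of the cluster-level configuration the use cases
# call for. Variable names and defaults are assumptions for illustration.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class SplitConfig:
    splitting_enabled: bool   # feature flag, on/off at the cluster level
    max_file_size_mb: int     # largest source file accepted for upload
    max_active_jobs: int      # number of data import jobs running at once

def load_config(env=None) -> SplitConfig:
    env = os.environ if env is None else env
    return SplitConfig(
        splitting_enabled=env.get("SPLIT_FILES_ENABLED", "false").lower() == "true",
        max_file_size_mb=int(env.get("MAX_FILE_SIZE_MB", "4096")),
        max_active_jobs=int(env.get("MAX_ACTIVE_DI_JOBS", "1")),
    )
```

Keeping all three knobs in one place makes it straightforward to flip the feature off per cluster without redeploying.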

 

Proposed solutions/stories:

  • Direct uploading of large files: AWS S3 to upload source files with MARC records
  • File slicing logic: splitting/chunking of large files automatically
  • Process each chunk file independently
  • Implementation of Data Import queue management to allow processing small jobs in between larger chunks
  • Result aggregation: aggregate the results of the Data Import jobs that process the chunk files
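The queue-management idea above can be sketched as a priority queue in which chunks of a large composite job yield to small jobs, so a handful-of-records import no longer waits hours behind a big file. This is a minimal illustration of the concept, not the actual queue design:

```python
# Minimal sketch (an illustrative assumption, not the implemented design)
# of DI queue management: small jobs are allowed to run in between the
# chunks of a large split job instead of waiting for the whole job.
import heapq
import itertools

class DiQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # FIFO tie-breaker within a priority

    def submit(self, job_id: str, is_large_chunk: bool) -> None:
        # Small jobs get priority 0, chunks of large jobs priority 1,
        # so the smallest-priority (most urgent) item is popped first.
        priority = 1 if is_large_chunk else 0
        heapq.heappush(self._heap, (priority, next(self._seq), job_id))

    def next_job(self) -> str:
        return heapq.heappop(self._heap)[2]
```

Within each priority level the tie-breaker keeps ordering FIFO, so chunks of the same large job still complete in sequence.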

 

NFRs:

  • Establish performance baselines: the response times and throughput of all tested endpoint methods, pages, and operations must not degrade relative to an agreed-upon metric or baseline
  • Processing of large jobs is performant during the day
  • Other systems downstream are not negatively affected by large data import jobs
  • Additional requirements (questions and assumptions)
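The "no degradation versus an agreed baseline" NFR amounts to a simple tolerance check in performance tests; the 10% tolerance below is an illustrative assumption, not an agreed figure:

```python
# Illustrative check for the performance-baseline NFR: a measured response
# time passes if it is no more than `tolerance` worse than the baseline.
# The default 10% tolerance is an assumption for illustration.
def within_baseline(measured_ms: float, baseline_ms: float,
                    tolerance: float = 0.10) -> bool:
    """True if measured_ms does not exceed baseline_ms by more than tolerance."""
    return measured_ms <= baseline_ms * (1 + tolerance)
```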

 

Links to additional info

 

Questions

  • What is the maximum load?
  • What needs to be logged?
  • Are there specific requirements for error handling?

Generated at Fri Feb 09 00:39:07 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.