Data Import Issues and possible improvements (WIP)

Steps: proposed solution steps for each item
Priority: High, Med, Low
Complexity: S, M, L, XL, XXL


Each numbered item below corresponds to a table row with the columns: Problem definition, Business impact, Steps (Proposed Solution), Priority, Complexity, Existing Jira issues, Comments.
1. DI relies on internal identifiers for SRS records
Business impact:
  • DI does not support differentiating records by external identifiers (ISBN or barcode numbers).
  • The only criterion for deciding whether a MARC record already exists in SRS is the UUID stored in field 999 ff. If the incoming record has no 999 ff field, we treat it as new, save it, and assign a new UUID; if 999 ff is present, we increment the generation and save the record as the new, current version. The problem is that incoming records sometimes lack the 999 ff field even though they already exist in SRS and have corresponding Inventory instances linked (see the sketch after this item).
Steps (Proposed Solution):
  • Gather requirements: which fields of the incoming MARC Bib contain external identifiers, and whether MARC Holdings and MARC Authority need the same changes.
  • Design a new record-versioning mechanism based on external identifiers.
  • Consider performance implications.
  • Decide what to do with duplicates that already exist in SRS.
Priority: High
Complexity: XL
Existing Jira issues: MODSOURMAN-848, MODSOURCE-530, MODSOURMAN-898, MODDATAIMP-743
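
A minimal sketch of the matching rule described above plus the proposed external-identifier fallback. The types and method names (IncomingMarc, SrsLookup, MatchResult, findByIsbn) are illustrative assumptions, not real FOLIO APIs; only the decision logic mirrors the current behaviour and the proposal.

    // Sketch only: simplified types, not the actual mod-source-record-storage code.
    import java.util.Optional;
    import java.util.UUID;

    public class RecordMatchSketch {

        /** Simplified view of an incoming MARC record. */
        record IncomingMarc(Optional<UUID> field999ffId, Optional<String> isbn) {}

        /** Matching outcome: the SRS id to use, the generation to store, and whether it is new. */
        record MatchResult(UUID srsId, int generation, boolean isNew) {}

        interface SrsLookup {
            Optional<MatchResult> findById(UUID srsId);    // current lookup by 999 ff UUID
            Optional<MatchResult> findByIsbn(String isbn); // hypothetical external-id lookup
        }

        static MatchResult match(IncomingMarc marc, SrsLookup srs) {
            // Current rule: only the UUID in 999 ff decides between "new" and "update".
            if (marc.field999ffId().isPresent()) {
                return srs.findById(marc.field999ffId().get())
                    .map(e -> new MatchResult(e.srsId(), e.generation() + 1, false))
                    .orElseGet(() -> new MatchResult(UUID.randomUUID(), 0, true));
            }
            // Proposed fallback: try an external identifier (e.g. ISBN) before treating the
            // record as brand new, to avoid creating duplicates in SRS.
            if (marc.isbn().isPresent()) {
                Optional<MatchResult> byIsbn = srs.findByIsbn(marc.isbn().get());
                if (byIsbn.isPresent()) {
                    MatchResult e = byIsbn.get();
                    return new MatchResult(e.srsId(), e.generation() + 1, false);
                }
            }
            return new MatchResult(UUID.randomUUID(), 0, true);
        }
    }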


2. DI profile actions can implicitly trigger other actions; incoming records are sometimes disposable
Business impact:
  • There is no explicit action to save the SRS MARC record: it is implicit and happens for (almost) every incoming file, with a couple of exceptional cases added later as bug fixes. When this was designed, we assumed each incoming file contained new, valid records that should be saved before any other actions and serve as the single source of truth. In practice, some incoming records should indeed be saved in SRS and referenced by the entities derived from them. However, there are also many use cases (usually updates or creates on Holdings and/or Items) where the incoming MARC record is disposable and may contain only partial data; saving it leads either to lost data (when the original record is overridden) or to broken links to the corresponding Inventory entities (when the record is saved as a new one).
Steps (Proposed Solution):
  • Consider making Create/Update SRS MARC Bib explicit, as a separate step in the profile (see the sketch after this item).
  • Alternatively, add a checkbox when the profile is constructed specifying whether the MARC Bib should be saved or not.
Priority: Med
Existing Jira issues: MODSOURMAN-891, MODSOURMAN-907, MODSOURMAN-819, MODDATAIMP-744
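
A hypothetical sketch of the "explicit save" idea: the incoming MARC Bib is persisted to SRS only when the profile says so, instead of implicitly on every import. JobProfileSettings, saveIncomingMarcBib, and SrsClient are invented names for illustration; they do not exist in FOLIO today.

    // Hypothetical sketch of an explicit "Create/Update SRS MARC Bib" switch in the profile.
    public class ExplicitSrsSaveSketch {

        /** Invented flag standing in for a checkbox / explicit action in the job profile. */
        record JobProfileSettings(boolean saveIncomingMarcBib) {}

        /** Stand-in for the real SRS persistence call. */
        interface SrsClient {
            void saveMarcBib(String marcJson);
        }

        static void process(String incomingMarcJson, JobProfileSettings profile, SrsClient srs) {
            if (profile.saveIncomingMarcBib()) {
                // The profile explicitly asks for the incoming record to become the source of truth.
                srs.saveMarcBib(incomingMarcJson);
            }
            // Otherwise the incoming record is treated as disposable: it can still drive
            // Holdings/Item creates or updates, but is never persisted over the existing record.
            // ... continue with the remaining profile actions ...
        }
    }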


3. Performance in production environments differs from PTF results
Business impact:
  • Slower imports for big files, especially for updates.
Steps (Proposed Solution):
  • Gather information on configurations and the amount of resources allocated to DI modules, background activity, and other factors.
  • The difference might be caused by profile complexity or complex matching conditions. Get examples of profiles (and, if possible, files), run them on the PTF environment, and compare with the results measured for our base cases.
Priority: Med
Existing Jira issues: MODDATAIMP-504, MODDATAIMP-752, MODSOURCE-581, PERF-388, MODDATAIMP-749, MODDATAIMP-748, MODDATAIMP-747, MODSOURCE-565
Comments: Recommended Maximum File Sizes and Configuration

4. UI regression bugs
Business impact:
  • Incorrect names in Edit mode for mapping profiles.
  • Missing associated profiles on the editing screen.
  • Issues with shortcuts.
Steps (Proposed Solution):
  • Include test cases for editing profiles in the critical path.
Status: DONE
Existing Jira issues: UIDATIMP-1302, UIDATIMP-1296, UIDATIMP-1233, UIDATIMP-1300, FAT-3438, FAT-3437


5. Intermittent failures of Karate tests
Business impact:
  • Some issues could be caught and resolved earlier in the dev cycle.
Steps (Proposed Solution):
  • Kafka configuration adjusted.
  • Added a number of retries before test completion (see the sketch after this item).
  • Resolved issues related to reference data.
  • Continue investigating tests that fail with an incorrect job status.
Priority: Low
Existing Jira issues: FAT-3397, FAT-2302, FAT-3591
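
The "retries before test completion" approach, sketched in plain Java: poll the job status and only fail after N attempts instead of asserting immediately after the import is triggered. getJobStatus is a placeholder for whatever call the Karate feature actually makes.

    // Sketch of polling with retries before asserting the final job status.
    import java.util.function.Supplier;

    public class JobStatusRetrySketch {

        static boolean waitForStatus(Supplier<String> getJobStatus, String expected,
                                     int maxRetries, long delayMillis) throws InterruptedException {
            for (int attempt = 0; attempt < maxRetries; attempt++) {
                if (expected.equals(getJobStatus.get())) {
                    return true;            // job reached the expected status; the test can proceed
                }
                Thread.sleep(delayMillis);  // give asynchronous, Kafka-driven processing time to finish
            }
            return false;                   // still not in the expected status; fail the test
        }
    }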


6. Source code is missing when debugging Data Import modules
Business impact:
  • Decreased maintainability.
Steps (Proposed Solution):
  • Add sources to Data Import packages.
Priority: Low
Complexity: S
Existing Jira issues: MODDATAIMP-745


7. Reduce or remove the need for post-processing in data import flows
Business impact:
  • Increases the duration of a data import job, i.e. decreases performance.
  • Makes data import flows more complex and error-prone.
  • Blocks import job rollback.
Priority: Low
Existing Jira issues: MODDATAIMP-746
Comments: This point refers to the post-processing step when an Instance is created (see the linked issue). Post-processing during Order imports is a separate topic.
8. Remove the incomplete data import job monitoring process from mod-source-record-manager, or implement working monitoring if there is a business need. We are currently incurring the cost without the benefits.
Business impact:
  • Increases database traffic for every Kafka event, with downstream effects on data import performance, database utilization, and database storage.
Status: DONE
Existing Jira issues: MODSOURMAN-908


9. Update the job_execution_progress table in mod-source-record-manager without doing a SELECT FOR UPDATE followed by an UPDATE. The row is locked after SELECT FOR UPDATE, which causes contention between multiple SRM instances (see the sketch after this item).
Business impact:
  • Decreased data import job performance.
Status: DONE
Existing Jira issues: MODSOURMAN-846
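
A sketch of the change in approach, assuming plain JDBC and a simplified job_execution_progress schema (the real table and column names may differ). Instead of SELECT ... FOR UPDATE followed by a second write, a single UPDATE applies relative increments, so the row is locked only for the duration of one statement and SRM instances do not queue on it.

    // Sketch: replace SELECT FOR UPDATE + UPDATE with one atomic UPDATE using increments.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.UUID;

    public class ProgressUpdateSketch {

        static void incrementProgress(Connection conn, UUID jobExecutionId,
                                      int succeededDelta, int errorDelta) throws SQLException {
            String sql = "UPDATE job_execution_progress "
                       + "SET succeeded_count = succeeded_count + ?, "
                       + "    error_count = error_count + ? "
                       + "WHERE job_execution_id = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setInt(1, succeededDelta);
                ps.setInt(2, errorDelta);
                ps.setObject(3, jobExecutionId);  // the PostgreSQL driver maps java.util.UUID to a uuid column
                ps.executeUpdate();
            }
        }
    }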


10. Data Import Processing Core needs to be refactored. The refactoring should provide a clear and concise API that FOLIO developers in other module areas can use to hook into the data import system cleanly (see the sketch after this item). For example, Inventory mapping should live in mod-inventory instead of in data import processing core.
Priority: High
Complexity: XL
Existing Jira issues: MODDICORE-295
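
A purely illustrative sketch of what a cleaner hook API could look like: a module such as mod-inventory registers a handler for the entity types it owns and keeps its mapping rules locally, while the core only routes events and tracks progress. The interface below does not exist in data-import-processing-core; it only shows the shape of the proposal.

    // Illustrative only: a minimal handler contract a consuming module could implement.
    import java.util.Map;
    import java.util.concurrent.CompletableFuture;

    public interface DataImportHandler {

        /** Entity type this handler is responsible for, e.g. "INSTANCE" or "HOLDINGS". */
        String entityType();

        /** Maps the incoming payload (parsed MARC plus context) to the module's own entity and
         *  performs the create/update; the core only routes events and tracks progress. */
        CompletableFuture<Map<String, String>> handle(Map<String, String> payloadContext);
    }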


11. mod-data-import can only run as a single instance in a FOLIO cluster because of its interaction with file storage. This has caused responsibilities it might otherwise have had to be moved to mod-source-record-manager.
Business impact:
  • Limited availability for API endpoints served by mod-data-import.
Priority: High
Complexity: L
Existing Jira issues: ARCH-19, MODDATAIMP-392
Comments: Notes on scalability of mod-data-import
12. mod-source-record-manager has too many responsibilities
Business impact:
  • High resource requirements for SRM instances.
  • Over 200 threads even when idle, raising the potential for instability.
  • Higher cognitive load, since the module manages more than source records.
Priority: Med
Existing Jira issues: MODSOURMAN-851, MODDATAIMP-607


13. Generic backend error messages are returned to the user when data import fails. Data Import should use error codes and specific error messages for frequently occurring issues (see the sketch after this item).
Business impact:
  • Troubleshooting is harder for data import users as well as developers.
Priority: High
Complexity: L
Existing Jira issues: MODDATAIMP-922
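
A hedged sketch of what coded errors might look like; the codes and the JSON shape below are illustrative, not an existing Data Import contract.

    // Illustrative error codes plus a structured error body instead of a generic message.
    public class DataImportErrorSketch {

        enum ErrorCode {
            DI_DUPLICATE_RECORD,       // incoming record matches more than one existing record
            DI_INVALID_MARC,           // record failed MARC parsing or validation
            DI_OPTIMISTIC_LOCKING,     // underlying Inventory update hit a version conflict
            DI_PROFILE_MISCONFIGURED   // job profile is missing a required action or mapping
        }

        record DataImportError(ErrorCode code, String message, String recordIdentifier) {}

        static String toJson(DataImportError e) {
            // Returned to the UI so the user sees a specific, documented code, not a stack trace.
            return String.format(
                "{\"code\":\"%s\",\"message\":\"%s\",\"recordIdentifier\":\"%s\"}",
                e.code(), e.message(), e.recordIdentifier());
        }
    }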


14. Functional or performance issues sometimes occur in production environments and are not easily reproducible in lower environments. Having job profiles and import files from production that are executable in a lower environment would help greatly.
Steps (Proposed Solution):
  • Create a tool that allows import/export of data import profiles (see the sketch after this item).
Priority: Med
Existing Jira issues: MODDATAIMP-577
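
A minimal sketch of the export half of such a tool, assuming the existing data-import-profiles API (GET /data-import-profiles/jobProfiles) and standard Okapi headers; paging, error handling, and the import half are omitted, and the endpoint path should be verified against the target FOLIO release.

    // Sketch: dump job profiles from one environment so they can be replayed in another.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class ProfileExportSketch {
        public static void main(String[] args) throws Exception {
            String okapiUrl = args[0];  // e.g. https://folio-okapi.example.org
            String tenant   = args[1];
            String token    = args[2];

            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(okapiUrl + "/data-import-profiles/jobProfiles?limit=1000"))
                .header("x-okapi-tenant", tenant)
                .header("x-okapi-token", token)
                .GET()
                .build();

            HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

            // Save the raw JSON so profiles can be inspected and imported into a lower environment.
            Files.writeString(Path.of("job-profiles.json"), response.body());
            System.out.println("Exported job profiles: HTTP " + response.statusCode());
        }
    }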