Data Import Stabilization plan

Steps

Gather existing issues (@Vladimir Shalaev, @Kateryna Senchenko)
Create new features (@Vladimir Shalaev, @Kateryna Senchenko)
Provide feature dependencies (@Vladimir Shalaev, @Kateryna Senchenko)
Estimate (priorities + complexity) (@Vladimir Shalaev, @Kateryna Senchenko)
Remove duplicates (grooming with Ann-Marie)
Final priorities
Align to timeline, assign to the appropriate Jira Feature, and review Jira issue priorities (@Taisiya Trunova)

Categories

See: Assessment ratings

  1. Performance: di-performance

  2. Stability/Reliability: di-data-integrity (more tags to be added)

  3. Scalability

  4. Architecture

  5. Code quality

Priorities

High, Mid, Low

Complexity

S, M, L, XL, XXL

Table

Category

Problem definition

Business impact

Proposed solution

Priority (DEV)

Priority (PO)

Complexity

Existing Jira item(s)

Current feature(s)

Final feature(s)

1

Performance

Kafka producer closed after sending

Low performance of import

Create pool of active producers. Start pool on module launch, close on shutdown. Reuse connections.

Add max/min pool sizes.

High

 

L

https://folio-org.atlassian.net/browse/MODDATAIMP-499

https://folio-org.atlassian.net/browse/UXPROD-3135

https://folio-org.atlassian.net/browse/UXPROD-3191
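The producer pool proposed for item 1 can be sketched as follows. This is a minimal illustration, not the modules' actual implementation: `ProducerPool` and `FakeProducer` are hypothetical names, and `FakeProducer` stands in for a real Kafka producer so that the example runs without a broker.

```python
import queue
import threading
from typing import Callable, Optional


class FakeProducer:
    """Stand-in for a real Kafka producer; it only tracks whether it was closed."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class ProducerPool:
    """Bounded pool of long-lived producers (hypothetical helper)."""

    def __init__(self, factory: Callable[[], FakeProducer],
                 min_size: int, max_size: int) -> None:
        if not 0 <= min_size <= max_size or max_size < 1:
            raise ValueError("need 0 <= min_size <= max_size and max_size >= 1")
        self._factory = factory
        self._max_size = max_size
        self._idle: queue.LifoQueue = queue.LifoQueue()
        self._lock = threading.Lock()
        self._created = 0
        # Module launch: open min_size producers up front.
        for _ in range(min_size):
            self._idle.put(factory())
            self._created += 1

    def acquire(self, timeout: Optional[float] = None) -> FakeProducer:
        """Reuse an idle producer, grow up to max_size, otherwise wait."""
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            pass
        with self._lock:
            if self._created < self._max_size:
                self._created += 1
                return self._factory()
        # Pool is at max_size: block until a producer is released.
        # Raises queue.Empty if the timeout expires first.
        return self._idle.get(timeout=timeout)

    def release(self, producer: FakeProducer) -> None:
        """Return a producer to the pool instead of closing it."""
        self._idle.put(producer)

    def close(self) -> None:
        """Module shutdown: close every idle producer."""
        while True:
            try:
                producer = self._idle.get_nowait()
            except queue.Empty:
                return
            producer.close()
```

A caller would `acquire()` a producer, send, and `release()` it, so connections are reused across sends rather than reopened each time.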

2

 

WARN message when no handler found

none

Do not subscribe to messages you're not going to process

OR

Lower the log level for this type of message

Low

 

S

https://folio-org.atlassian.net/browse/MODSOURCE-340

https://folio-org.atlassian.net/browse/UXPROD-3135

https://folio-org.atlassian.net/browse/UXPROD-3191

3

Stability/Reliability

Race condition on start (Kafka consumers start working before DB is configured)

OR

Periodic DB shutdown after SRS restart. Jobs get stuck if their status cannot be updated in the DB (messages are ACKed even when they could not be processed)

Imports might get stuck on module restart

Need investigation / check

Investigate the issue with DB (possible OOM on PG server)

 

Mid

 

 

https://folio-org.atlassian.net/browse/MODSOURCE-339

https://folio-org.atlassian.net/browse/UXPROD-3135

https://folio-org.atlassian.net/browse/UXPROD-3193

4

Performance

Stability/Reliability

High CPU/Memory consumption on modules

Low performance of import. Higher costs for hosting

Significantly decrease size of payload:

  1. Remove immutable parts. Instead fetch them on demand and cache locally for reuse.

  2. Change the message handling mechanism (currently relies on item 1, the profile) (optional)

  3. Move archiving to Kafka instead of module level

High

 

XXL

https://folio-org.atlassian.net/browse/MODDATAIMP-439

https://folio-org.atlassian.net/browse/MODSOURMAN-519

https://folio-org.atlassian.net/browse/MODINV-405

https://folio-org.atlassian.net/browse/MODINV-408

https://folio-org.atlassian.net/browse/MODINV-460

https://folio-org.atlassian.net/browse/MODINVOICE-251

https://folio-org.atlassian.net/browse/MODINVOICE-252

https://folio-org.atlassian.net/browse/MODPUBSUB-167

https://folio-org.atlassian.net/browse/MODSOURCE-286

https://folio-org.atlassian.net/browse/MODSOURCE-290

https://folio-org.atlassian.net/browse/MODSOURMAN-463

https://folio-org.atlassian.net/browse/MODSOURMAN-464

https://folio-org.atlassian.net/browse/MODSOURMAN-465

https://folio-org.atlassian.net/browse/MODSOURMAN-466

https://folio-org.atlassian.net/browse/MODSOURMAN-468

https://folio-org.atlassian.net/browse/MODSOURMAN-469

https://folio-org.atlassian.net/browse/MODSOURMAN-474

https://folio-org.atlassian.net/browse/MODSOURMAN-519

 

https://folio-org.atlassian.net/browse/UXPROD-3135

https://folio-org.atlassian.net/browse/UXPROD-3193
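The first step of item 4 (drop immutable parts from the payload, fetch them on demand, cache locally) might look like the sketch below. It assumes the immutable part is a profile embedded in each event under a `profile` key with an `id` field. Those field names, `slim_payload`, `ProfileCache`, and the `fetch` callback are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict


def slim_payload(event: dict) -> dict:
    """Replace the embedded (immutable) profile with just its identifier."""
    slim = {key: value for key, value in event.items() if key != "profile"}
    slim["profileId"] = event["profile"]["id"]
    return slim


class ProfileCache:
    """Fetches a profile on first use and serves later requests locally."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self._fetch = fetch                 # e.g. an HTTP call to the owning module
        self._cache: Dict[str, dict] = {}
        self.misses = 0                     # number of remote fetches performed

    def get(self, profile_id: str) -> dict:
        if profile_id not in self._cache:
            self.misses += 1
            self._cache[profile_id] = self._fetch(profile_id)
        return self._cache[profile_id]
```

Because the cached data is immutable, the cache needs no invalidation. A production version would still want a size bound or eviction policy.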

5

Performance

Kafka cache resource consumption

Low performance of import. Higher costs of hosting.

Remove the Kafka cache. Modules that do not make persistent changes will occasionally make unnecessary calls when they read duplicates. This can be optimized further by adding a distributed in-memory cache (e.g. Hazelcast) (blocked by 6)

Mid

 

M

https://folio-org.atlassian.net/browse/MODINV-444

https://folio-org.atlassian.net/browse/MODINV-401

 

https://folio-org.atlassian.net/browse/UXPROD-3135

https://folio-org.atlassian.net/browse/UXPROD-3191

6

Stability/Reliability

Duplicates created upon import

Data inconsistency on import.

Make consumers idempotent. Add a pass-through identifier to de-duplicate messages.

High

 

XL

https://folio-org.atlassian.net/browse/MODDATAIMP-474

https://folio-org.atlassian.net/browse/MODDATAIMP-440

https://folio-org.atlassian.net/browse/MODDATAIMP-491

https://folio-org.atlassian.net/browse/MODDATAIMP-495

 

 

https://folio-org.atlassian.net/browse/UXPROD-3135

https://folio-org.atlassian.net/browse/UXPROD-3193
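The idempotent-consumer idea in item 6 can be illustrated with a small wrapper that skips messages whose pass-through identifier has already been handled. This is a sketch under assumptions: the `eventId` field name and the `IdempotentHandler` class are hypothetical, and the in-memory set is only for illustration (several module instances would need a shared, durable store).

```python
from typing import Callable, Set


class IdempotentHandler:
    """Wraps a message handler so each identifier is processed at most once."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self._handler = handler
        self._seen: Set[str] = set()

    def handle(self, message: dict) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        event_id = message["eventId"]
        if event_id in self._seen:
            return False
        self._handler(message)
        # Record the id only after success, so a failed attempt can be retried.
        self._seen.add(event_id)
        return True
```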

7

Stability/Reliability

Kafka consumers stop reading messages eventually, breaking job progress until module restart.

Imports eventually get stuck until module restart

Need investigation

High

 

?

https://folio-org.atlassian.net/browse/MODINV-417

https://folio-org.atlassian.net/browse/UXPROD-3135

https://folio-org.atlassian.net/browse/UXPROD-3193

8

Code quality

Test coverage is not high enough (Unit)

Higher number of bugs

Write more tests

Mid

 

S

https://folio-org.atlassian.net/browse/MODPUBSUB-168

 

https://folio-org.atlassian.net/browse/UXPROD-2697

https://folio-org.atlassian.net/browse/UXPROD-2697

9

Code quality

Test coverage is not high enough (Karate)

Higher number of bugs

Write more tests (define test cases)

Mid

 

L

https://folio-org.atlassian.net/browse/UXPROD-2697

https://folio-org.atlassian.net/browse/UXPROD-2697

https://folio-org.atlassian.net/browse/UXPROD-2697

10

Stability/Reliability

mod-data-import stores the input file in memory, which limits the size of uploaded files and risks OOM

Data import file size is limited

Split into chunks, store them in the database, and work with database/temp storage. Partially done (to be investigated)

Mid

 

L

https://folio-org.atlassian.net/browse/MODDATAIMP-390

https://folio-org.atlassian.net/browse/MODDATAIMP-392

https://folio-org.atlassian.net/browse/MODDATAIMP-465

https://folio-org.atlassian.net/browse/UXPROD-3135

https://folio-org.atlassian.net/browse/UXPROD-3193
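For item 10, one way to avoid holding the whole file in memory is to stream it and emit batches of complete records. The sketch below assumes binary MARC input, where each record ends with the record terminator byte `0x1D`. The function name, batch size, and block size are hypothetical, and persisting each batch to the database or temp storage is left to the caller.

```python
from typing import BinaryIO, Iterator, List

RECORD_TERMINATOR = b"\x1d"  # end-of-record byte in binary MARC


def iter_record_batches(stream: BinaryIO, batch_size: int,
                        block_size: int = 64 * 1024) -> Iterator[List[bytes]]:
    """Yield lists of up to batch_size complete records read from stream."""
    buffer = b""
    batch: List[bytes] = []
    while True:
        block = stream.read(block_size)
        if not block:
            break
        buffer += block
        # Everything before the last terminator is complete; keep the rest.
        *complete, buffer = buffer.split(RECORD_TERMINATOR)
        for record in complete:
            batch.append(record + RECORD_TERMINATOR)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if buffer.strip():
        batch.append(buffer)  # trailing data without a terminator
    if batch:
        yield batch
```

Memory use is bounded by one block plus one batch, regardless of the size of the uploaded file.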

11

Performance

Data import impacts other processes

Slower response of system during data import

Need investigation (possible solution - configure rate limiter)

Relates to number 4

 

 

 

https://folio-org.atlassian.net/browse/MODDATAIMP-517

https://folio-org.atlassian.net/browse/UXPROD-3135

https://folio-org.atlassian.net/browse/UXPROD-3191
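Item 11 mentions a rate limiter as a possible solution. A token bucket is one common way to build one, and the sketch below shows the idea. The `TokenBucket` class and its parameters are hypothetical, and whether rate limiting actually resolves the issue is still subject to the investigation noted above.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts of up to `capacity` and a sustained `rate` per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate
        self._capacity = capacity
        self._tokens = float(capacity)  # start with a full bucket
        self._clock = clock
        self._last = clock()

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Take tokens if available; return False when the caller should back off."""
        now = self._clock()
        # Refill according to the time elapsed since the last call.
        elapsed = now - self._last
        self._tokens = min(self._capacity, self._tokens + elapsed * self._rate)
        self._last = now
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False
```

An import worker would call `try_acquire()` before sending each record (or chunk) and pause when it returns `False`, leaving capacity for other processes.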

12

Performance

High resource consumption to get job(s) status/progress

Slow performance of import and landing page.

Add some kind of caching for progress tracking (database or in-memory)

Low

 

S

https://folio-org.atlassian.net/browse/MODSOURMAN-469

https://folio-org.atlassian.net/browse/UIDATIMP-918

https://folio-org.atlassian.net/browse/UXPROD-3135

https://folio-org.atlassian.net/browse/UXPROD-3191

13

Stability/Reliability

SRS can fail when processing message during import

Import can end up creating some instances but not creating holdings/items for some MARC records

Generate "INSTANCE CREATED" from mod-inventory. Consume in SRS to update HRID in BIB and in INVENTORY to continue processing.

 

Remove unnecessary topics (* ready for post processing and hrid set)

Mid

 

L

https://folio-org.atlassian.net/browse/MODDATAIMP-500

https://folio-org.atlassian.net/browse/UXPROD-3135

https://folio-org.atlassian.net/browse/UXPROD-3193

14

Stability/Reliability

On an infrastructure issue (such as the DB being unavailable, a module being restarted, or a network failure), we send DI_ERROR instead of retrying

Records that can potentially be processed during import are not processed if we have temporary infrastructure issues (DB down, network connectivity loss, etc)

Do not ACK messages in Kafka when the failure is an infrastructure error/exception rather than a business-logic error. Split failed processing results into 2 categories:

  1. IO errors - do not ACK; retry until fixed

  2. Business logic errors - send DI_ERROR and ACK the current message

Mid

 

 

https://folio-org.atlassian.net/browse/MODDATAIMP-501

https://folio-org.atlassian.net/browse/UXPROD-3135

https://folio-org.atlassian.net/browse/UXPROD-3193
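The two failure categories in item 14 can be sketched as a small dispatch function. The exception classes, `consume`, and the `ack` / `publish_di_error` callbacks are hypothetical names used for illustration. In a real consumer, "not ACKing" would mean not committing the offset so the message is read again.

```python
from typing import Callable


class InfrastructureError(Exception):
    """DB unavailable, network failure, module restarting, and similar."""


class BusinessLogicError(Exception):
    """The record itself cannot be processed (invalid data, failed mapping)."""


def consume(message: str,
            handler: Callable[[str], None],
            ack: Callable[[str], None],
            publish_di_error: Callable[[str, str], None]) -> bool:
    """Process one message; return True if it was ACKed, False if it must be retried."""
    try:
        handler(message)
    except InfrastructureError:
        # Category 1: do not ACK, so the message is retried once the issue is fixed.
        return False
    except BusinessLogicError as error:
        # Category 2: report DI_ERROR, then ACK so the message is not reprocessed.
        publish_di_error(message, str(error))
    ack(message)
    return True
```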

15

 

Consumer gets disconnected from Kafka cluster

Jobs get stuck until module restart

Need investigation

Mid

 

 

https://folio-org.atlassian.net/browse/MODINV-417

https://folio-org.atlassian.net/browse/UXPROD-3135

https://folio-org.atlassian.net/browse/UXPROD-3193

16

 

Status messages for the progress bar are not de-duplicated

Progress bar might display incorrect progress

De-duplicate status messages per-record while tracking progress

Mid

 

L (depends on 12)

https://folio-org.atlassian.net/browse/MODSOURMAN-522

https://folio-org.atlassian.net/browse/UXPROD-3135

https://folio-org.atlassian.net/browse/UXPROD-3193
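The per-record de-duplication proposed for item 16 can be sketched as a tracker that counts each record at most once, however many status messages arrive for it. `ProgressTracker`, the record identifiers, and the status strings are hypothetical. Keeping the counters in memory also hints at the caching idea from item 12, though a real implementation would have to decide between database and in-memory storage.

```python
from typing import Dict


class ProgressTracker:
    """Tracks job progress, ignoring repeated status messages for a record."""

    def __init__(self, total_records: int) -> None:
        if total_records <= 0:
            raise ValueError("total_records must be positive")
        self._total = total_records
        self._status_by_record: Dict[str, str] = {}

    def on_status(self, record_id: str, status: str) -> bool:
        """Register a status message; return False if the record was already counted."""
        if record_id in self._status_by_record:
            return False
        self._status_by_record[record_id] = status
        return True

    @property
    def processed(self) -> int:
        return len(self._status_by_record)

    def percent(self) -> float:
        return 100.0 * self.processed / self._total
```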

Issues to potentially remove from scope

https://folio-org.atlassian.net/browse/MODDATAIMP-410

https://folio-org.atlassian.net/browse/MODDATAIMP-430

https://folio-org.atlassian.net/browse/MODDATAIMP-444

https://folio-org.atlassian.net/browse/MODSOURCE-300

https://folio-org.atlassian.net/browse/MODSOURMAN-481

https://folio-org.atlassian.net/browse/MODSOURMAN-521

Links

Data Import Observations for Improvements