Batch Importer (Bib/Acq) (UXPROD-47)

[UXPROD-2659] NFR: Refactor data-import flow to increase reliability Created: 20/Jul/20  Updated: 28/Apr/21  Resolved: 28/Apr/21

Status: Closed
Project: UX Product
Components: None
Affects versions: None
Fix versions: R1 2021
Parent: Batch Importer (Bib/Acq)

Type: New Feature Priority: P2
Reporter: Kateryna Senchenko Assignee: Ann-Marie Breaux (Inactive)
Resolution: Done Votes: 0
Labels: data-import, epam-folijet, performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Issue links:
Defines
defines UXPROD-47 Batch Importer (Bib/Acq) Analysis Complete
is defined by MODPUBSUB-114 Data Import stops when trying to load... Closed
is defined by MODDATAIMP-315 Use Kafka for data-import file proces... Closed
is defined by MODDICORE-82 Change transport layer implementation... Closed
is defined by MODINV-326 Refactor data-import handler to consu... Closed
is defined by MODINV-331 Upgrade to Vertx v3.9.4 (CVE-2019-17640) Closed
is defined by MODPUBSUB-118 Create sub-project in mod-pubsub for ... Closed
is defined by MODPUBSUB-120 SPIKE: Describe approach for data-imp... Closed
is defined by MODSOURCE-173 Refactor inventory-instance handler t... Closed
is defined by MODSOURCE-230 Deploy new Kafka approach to the Ranc... Closed
is defined by MODSOURMAN-336 Refactor created-inventory-instance h... Closed
is defined by MODSOURMAN-337 Refactor processing-result handler to... Closed
is defined by MODSOURMAN-338 Change chunk processing to use Kafka Closed
is defined by MODINV-373 Ensure exactly once processing for in... Closed
is defined by MODPUBSUB-136 Memory Leaks: HttpClients Closed
is defined by MODSOURCE-177 Change SRM-SRS interaction to use Kafka Closed
is defined by MODSOURCE-235 Ensure exactly once processing for SR... Closed
is defined by MODSOURMAN-400 Ensure exactly once processing for da... Closed
Gantt End to Start
has to be done after MODPUBSUB-122 Create a PoC with direct Kafka integr... Closed
Release: Q3 2020
Epic Link: Batch Importer (Bib/Acq)
Front End Estimate: Very Small (VS) < 1day
Front-End Confidence factor: Medium
Back End Estimate: XXXL: 30-45 days
Back End Estimator: Oleksii Kuzminov
Development Team: Folijet
PO Rank: 97
Rank: Chicago (MVP Sum 2020): R2
Rank: Cornell (Full Sum 2021): R2
Rank: 5Colleges (Full Jul 2021): R2
Rank: FLO (MVP Sum 2020): R1
Rank: GBV (MVP Sum 2020): R2
Rank: MO State (MVP June 2020): R1
Rank: TAMU (MVP Jan 2021): R2
Rank: U of AL (MVP Oct 2020): R1

 Description   

Steps

  • Inventory
    • Increment Vert.x version to 3.8.4+ in mod-inventory to support vertx-kafka-client
    • Check and fix marshalling/unmarshalling for JSON MARC
    • Create consumers for each eventType and subscribe di-processing-core
    • Add support for exactly-once delivery for each consumer
  • PubSub
  • Data-Import
    • Change mod-data-import file processing to the Kafka approach (can be moved from PoC) https://github.com/folio-org/mod-data-import/pull/130
    • Create ProducerManager
    • Add support for exactly-once delivery for each chunk (a unique UUID or hash can be added for each chunk). Add a common schema with eventId and extend all Kafka-created entities with this id. Initially, add a stub interface method isProcessed
  • Source-Manager
    • Change chunk processing to the Kafka approach (can be moved from PoC) https://github.com/folio-org/mod-source-record-manager/pull/315
    • Add support for exactly-once delivery for each chunk. JobExecutionSourceChunk can be reused. The UUID will be received from mod-data-import. On constraint violations, skip chunk processing and add logs.
    • Receive answers from SRS, start processing in StoredMarcChunkConsumersVerticle, and add exactly-once delivery for each chunk.
    • Create consumers for DI_COMPLETED and DI_ERROR and finish data-import (can be moved from PoC)
    • Move "secret button" functionality to the Kafka approach (interactions between SRM-SRS)
  • Source-Storage
    • Add consumers for the initial records load (before processing) and save chunks in batches (can be moved from PoC) https://github.com/folio-org/mod-source-record-storage/pull/214
    • Add support for exactly-once delivery for each chunk with records. Add a new entity to track chunk duplications. On constraint violations, skip chunk processing and add logs.
    • Add consumers to process created/updated entities and fill the 999 and 001 fields (can be moved from PoC)
    • Add support for exactly-once delivery for each consumer
  • Processing-Core
    • Change the transport implementation to the direct Kafka approach and reuse the new sub-module library from mod-pubsub.
    • Check the Vert.x version and update if needed.
    • Request producer from pub-sub-utils?
  • Error handling for consumers
    Taras Spashchenko will create the pubsub components and the error handling
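The exactly-once handling the steps above call for (a unique eventId/UUID or hash per chunk, an isProcessed check, and skip-plus-log on duplicates) can be sketched roughly as follows. The class and method names here are illustrative, not actual FOLIO module code; a real implementation would back the processed-ID set with a database table (as in the "new entity to track chunk duplications" step) rather than an in-memory set:

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Illustrative sketch of exactly-once chunk processing for a Kafka consumer.
 * Duplicate deliveries of the same eventId are detected and skipped.
 */
class DeduplicatingChunkHandler {

    // In a real module this would be a DB-backed deduplication entity,
    // so the check survives restarts and works across instances.
    private final Set<String> processedEventIds = new HashSet<>();

    /** Stub interface method described in the steps above. */
    boolean isProcessed(String eventId) {
        return processedEventIds.contains(eventId);
    }

    /**
     * Processes a chunk exactly once.
     * Returns true if the chunk was processed, false if it was a duplicate.
     */
    boolean handle(String eventId, Runnable chunkProcessing) {
        if (isProcessed(eventId)) {
            // On a duplicate (constraint violation in the DB-backed variant):
            // skip processing and log, per the steps above.
            System.out.println("Duplicate event " + eventId + ", skipping");
            return false;
        }
        chunkProcessing.run();
        processedEventIds.add(eventId);
        return true;
    }
}
```

The same pattern applies in each consuming module (SRM, SRS, Inventory): the producer attaches the eventId to every Kafka payload via the common schema, and each consumer checks it before processing.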

Change the SRM DB approach. For now, it is a performance bottleneck; move this to R2 and create another feature for it (it should be smaller than this feature, similar to the SRS work, plus it will need migration scripts).

Yellow = partly done
Green = done

Notes on maximum file size from Data Import Subgroup Sept 2020

  • For the PubSub/Kafka reconfig, max file size should be 500K records
  • But if we need interim, OK to use 100K, so long as there’s a clear understanding of when we’ll be able to increase to 500K.
  • Librarians are sending A-M a couple of the large files – 300K records for a large eBook collection, 1.4M records that all had to be updated with URL notes when the library closed for COVID
    Ann-Marie Breaux can you provide example files with 300-500k records, put on Google drive and add links in description?


 Comments   
Comment by Oleksii Kuzminov [ 05/Aug/20 ]

Kateryna Senchenko Ann-Marie Breaux I changed status to draft. After the final PoC and approvals, we can adapt this umbrella and stories to the new requirements.

Comment by Ann-Marie Breaux (Inactive) [ 05/Aug/20 ]

Sounds good Oleksii Kuzminov Thank you!

Comment by Marc Johnson [ 12/Aug/20 ]

Change SRM DB approach. For now, it is a bottleneck for performance

What aspect of the source record storage database approach is a bottleneck?

Comment by Ann-Marie Breaux (Inactive) [ 15/Sep/20 ]

Hi Oleksii Kuzminov I changed UXPROD-2659 Closed to this UXPROD feature and set the fix version to R1 2021. Since it's a feature now, it needs backend/Front end T-shirt sizes. I added front end estimate of Very Small just so that it would show (I know it's probably 0 for front end). Could you or Kateryna Senchenko add a T-shirt size for the backend estimate?

Thank you!

Comment by Marc Johnson [ 15/Sep/20 ]

Ann-Marie Breaux Does that mean that the approach outlined in this issue has been agreed and development will start on it?

Comment by Ann-Marie Breaux (Inactive) [ 15/Sep/20 ]

Taras Spashchenko and VBar Are you comfortable with the path forward on this, or should we seek review/approval from the broader FOLIO tech community?

Comment by Taras Spashchenko [ 19/Sep/20 ]

the steps are Ok. and we can proceed with detailed stories and implementation.

Comment by Marc Johnson [ 21/Sep/20 ]

Taras Spashchenko

the steps are Ok. and we can proceed with detailed stories and implementation.

Does that mean that folks like the Technical Leads / Technical Council will not have an opportunity to provide feedback on this change?

Comment by Taras Spashchenko [ 21/Sep/20 ]

Marc Johnson, this is the internals of Data-Import; it is not a platform-wide change, so I am not sure it makes sense to bring it to the TLs or TC, and to be honest, that would take a lot of time. Based on the completed PoC, it will bring the required reliability with quite good performance, and as far as I know, we do not have a real alternative that could be implemented with reasonable effort to achieve the same results.

Comment by Marc Johnson [ 21/Sep/20 ]

Taras Spashchenko

this is the internals of Data-Import, it is not a platform-wide change

There is every chance that I am misunderstanding the scope of this work. Is this the work that changes the integration between the various modules involved in data import from using the HTTP API provided by mod-pubsub to using Kafka directly?

Comment by Taras Spashchenko [ 21/Sep/20 ]

Yes, you are right, the solution for Data import is to change the interaction between components from Http to direct Kafka connections. But it is not a substitution for Http interaction nor Pubsub that is proposed for the whole platform.

Comment by Marc Johnson [ 21/Sep/20 ]

Taras Spashchenko

Yes, you are right, the solution for Data import is to change the interaction between components from Http to direct Kafka connections. But it is not a substitution for Http interaction nor Pubsub that is proposed for the whole platform.

Thank you for confirming the scope of this work. I had thought this design was intended to be shared with a broader audience for feedback.

I imagine I might have a different sense of what is considered a significant technical decision. I think a side-effect of this work is that Kafka moves from being a design decision of mod-pubsub to a platform-level capability that modules can use (much like how introducing mod-pubsub for the first generation of data import made it available to other modules). Even if we consider the changes to data import itself not to be a significant design decision, I think this change in the emphasis and visibility of Kafka is a significant architectural change.

I don't know if we want to explore that topic on this issue.

cc: Jakub Skoczen Craig McNally VBar

Comment by Ann-Marie Breaux (Inactive) [ 28/Sep/20 ]

Kateryna Senchenko Oleksii Kuzminov Taras Spashchenko The Capacity Planning Team is already starting to plan for R1 2021. The sooner we can get these draft stories changed to open, add any other necessary stories, and have a t-shirt size for backend, the better. Please let me know if there's anything I can do to help. Thank you!

Comment by Ian Walls [ 29/Sep/20 ]

I would agree with Marc Johnson on this... direct access to Kafka, instead of via mod-pubsub, makes it a core piece of the platform. I'd be in favor of it; I think having messaging built in "close to the ground" would have lots of utility for the project and make extensibility much easier. But it is a significant choice to make.

Comment by Ann-Marie Breaux (Inactive) [ 03/Dec/20 ]

Hi Oleksii Kuzminov At a check-in today, EBSCO was emphasizing that once this work is done, we should get with the Performance Task Force to check a couple standard scenarios, especially with regards to very large files being imported. Should we include a couple tasks in this feature to account for this?

Generated at Fri Feb 09 00:25:47 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.