Kafka data exchange for data-import (DRAFT)

MODPUBSUB-117

This document describes a solution for data flow between modules that uses the messaging system (Kafka) directly, eliminating unnecessary HTTP calls. The mod-pubsub module can still be used as before (receiving and distributing events via HTTP), but its Client library can be extended to allow modules to publish and consume messages directly from Kafka. The contract between mod-pubsub and other modules can remain as is: the MessagingDescriptor contains the list of published and consumed events, and registration of EventDescriptors should trigger the creation of the corresponding topics in Kafka by mod-pubsub.
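To make the registration contract concrete, below is a minimal sketch of how mod-pubsub could create one Kafka topic per registered event type using the standard Kafka AdminClient. The class name, partition count, and replication factor are illustrative assumptions, not the actual implementation.

```java
import java.util.List;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

/** Hypothetical helper: one Kafka topic per registered event type. */
public class EventTopicInitializer {

  public static void createTopics(List<String> eventTypes, String bootstrapServers)
      throws ExecutionException, InterruptedException {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);

    try (AdminClient admin = AdminClient.create(props)) {
      // Partition count and replication factor are illustrative defaults
      List<NewTopic> topics = eventTypes.stream()
          .map(eventType -> new NewTopic(eventType, 3, (short) 1))
          .collect(Collectors.toList());
      // A real implementation would also tolerate already-existing topics
      // (TopicExistsException) so that re-registration stays idempotent
      admin.createTopics(topics).all().get();
    }
  }
}
```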

The data-import process consists of various actions on multiple entities that take place across several modules. Interaction between these modules relies heavily on HTTP, which places a lot of pressure on the FOLIO infrastructure. In essence, the current data-import flow via mod-pubsub looks like this:

(1), (2), (3) are direct HTTP calls between modules, while all other interaction between modules happens via mod-pubsub, though still over HTTP. Modules register in mod-pubsub as Publishers and/or Subscribers; mod-pubsub creates the necessary topics and works as a wrapper over Kafka. It provides an API for publishing events, writes data to Kafka, receives data from Kafka, and distributes the events to all subscribed modules by calling the specified callback endpoints.
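As a sketch of the current HTTP-based contract, a publishing module POSTs each event to mod-pubsub, which then fans it out to subscriber callbacks. The endpoint path and payload handling below are simplified assumptions for illustration, not the exact mod-pubsub API.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Simplified sketch of publishing an event to mod-pubsub over HTTP. */
public class HttpEventPublisher {

  private final HttpClient client = HttpClient.newHttpClient();

  public int publish(String okapiUrl, String tenant, String token, String eventJson)
      throws Exception {
    HttpRequest request = HttpRequest.newBuilder()
        // "/pubsub/publish" is an assumed path for illustration
        .uri(URI.create(okapiUrl + "/pubsub/publish"))
        .header("Content-Type", "application/json")
        .header("X-Okapi-Tenant", tenant)
        .header("X-Okapi-Token", token)
        .POST(HttpRequest.BodyPublishers.ofString(eventJson))
        .build();
    // Every hop like this is a synchronous HTTP call that can fail outright
    return client.send(request, HttpResponse.BodyHandlers.ofString()).statusCode();
  }
}
```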

The data-import flow is quite complex: processing data with even the most trivial JobProfile (save MARC Bibliographic records, map to Instance, match by Instance fields, create Holdings, Items, etc.) results in multiple events being published for a single record. The following are the data-import events sent and received by the modules involved in the data-import process so far (the list will be extended; the up-to-date table can be found via the link):

The fact that data-import event-driven interaction happens over HTTP places high requirements on the modules in terms of memory and CPU, and demands 100% availability of the FOLIO infrastructure. The latter is difficult to achieve, since no "retry" or "circuit breaker" policy is implemented in FOLIO. Hence, infrastructure-related errors occur frequently and break the whole data-import flow.
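To make the missing policy concrete, this is roughly the kind of bounded retry with backoff that the inter-module HTTP calls currently lack; the class and parameters are purely illustrative, not an existing FOLIO utility.

```java
import java.util.concurrent.Callable;

/** Illustrative bounded retry with exponential backoff (not present in FOLIO today). */
public class Retry {

  public static <T> T withRetry(Callable<T> call, int maxAttempts, long initialDelayMs)
      throws Exception {
    long delay = initialDelayMs;
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return call.call();
      } catch (Exception e) {
        last = e;                      // e.g. connection refused, timeout, 5xx
        if (attempt < maxAttempts) {
          Thread.sleep(delay);
          delay *= 2;                  // exponential backoff between attempts
        }
      }
    }
    throw last;                        // give up after the final attempt
  }
}
```

Without such a wrapper, a single transient failure on any hop aborts the whole chain of events for a record.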

The simplest and quickest solution to the outlined problem is to allow the modules involved in the data-import process to exchange data directly via the messaging system (Kafka). The mod-pubsub Client library can provide methods to write events to Kafka and to create a Consumer with a custom handler for each event type, while registration and topic creation remain a mod-pubsub responsibility. This would simplify the integration and preserve the existing contract regarding EventDescriptors.
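A minimal sketch of what these extended Client library methods could look like, assuming String-serialized event payloads and a topic named after each event type; the class and method names are hypothetical.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.function.Consumer;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

/** Hypothetical direct-Kafka extension of the mod-pubsub Client library. */
public class DirectKafkaClient {

  private final String bootstrapServers;
  private final KafkaProducer<String, String> producer;

  public DirectKafkaClient(String bootstrapServers) {
    this.bootstrapServers = bootstrapServers;
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    this.producer = new KafkaProducer<>(props);
  }

  /** Writes an event directly to the topic named after its event type. */
  public void sendEvent(String eventType, String payload) {
    producer.send(new ProducerRecord<>(eventType, payload));
  }

  /** Creates a consumer with a custom handler for a single event type. */
  public void startConsumer(String eventType, String groupId, Consumer<String> handler) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    Thread poller = new Thread(() -> {
      try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
        consumer.subscribe(Collections.singletonList(eventType));
        while (!Thread.currentThread().isInterrupted()) {
          for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
            handler.accept(record.value());
          }
        }
      }
    });
    poller.setDaemon(true);
    poller.start();
  }
}
```

Modules would use only these two methods on the data path; registration with mod-pubsub, and therefore topic creation, would stay exactly as it is today.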

HTTP calls (1) and (2) from the previous diagram can also be reworked to use Kafka instead: the DI_INITIAL_RECORDS_LOADED and DI_INITIAL_RECORD_PARSED events can be published, respectively. Reworking call (1) would also make it possible to discard the blocking queue implementation in mod-data-import, which has its own flaws and prevents importing multiple files in parallel. The blocking queue was introduced to control the flow of data-import and prevent flooding mod-source-record-manager and all the subsequent modules along the way, which resulted in OOM errors.
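Kafka's pull-based consumption provides the back-pressure that the blocking queue approximates today: a consumer only fetches as many records per poll as it is configured to handle. A hedged sketch of the relevant consumer settings for, e.g., a DI_INITIAL_RECORDS_LOADED consumer (the values are assumptions to be tuned per deployment):

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;

/** Illustrative flow-control settings; values are assumptions, not recommendations. */
public class FlowControlConfig {

  public static Properties consumerProps() {
    Properties props = new Properties();
    props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50);          // cap records fetched per poll
    props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300_000); // tolerate slow batch processing
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);     // commit offsets only after handling
    return props;
  }
}
```

Unprocessed records simply stay in the topic instead of piling up in a module's memory, which is what led to the OOM errors.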

The proposed flow using Kafka would look like the following:

Interaction between mod-inventory and mod-inventory-storage can remain as is, but the event-driven processing would be more stable and efficient if it happened via Kafka.