Switching from PubSub to Direct Kafka approach in Circulation

Introduction

This page is intended to analyze the current experience of using the PubSub mechanism, in particular in the Circulation application, and to study the feasibility of moving to the Direct Kafka approach.

Link to Jira: UXPROD-3764

Additional information on the topic

Requirements for the mechanism of interaction between modules

The requirements below are based on current understanding and requirements from specific teams and modules using PubSub.

Reliability

Message delivery must be guaranteed. Message loss is unacceptable.

Performance

The Vega team has not had any issues with PubSub performance because the team mainly uses it for events triggered manually by users (check-out, manual fee/fine charge, etc.). However, performance could become an issue for libraries that use fixed schedules and have thousands of loans "ageing to lost" or being charged a fine at the same time.

PubSub performance was an issue for Data Import. However, the requirements do not seem to have been explicitly documented anywhere. According to the development team, "the requirements were once voiced a long time ago", but no supporting documents have been found so far. As for current performance, the development team checks it each development cycle before releasing, on a perf Rancher environment with no background activity (for Morning Glory, the results are collected here: Folijet - Morning Glory Snapshot Performance testing); in addition, the PTF team measures the performance of each release on their environments with background activity included (see the report Data Import Test report (Lotus)).

Retention policy

No specific retention requirements have been identified.

Payload size

Large payloads are not expected. For Circulation flows (Vega team) these are small JSON structures (less than 1 KB); for Folijet it is the same. For Firebird (Remote Storage), each message usually carries information on one item, though in some cases there can be a batch of several items. There are no measurements from production systems, but according to the development team, message size can be up to 10 KB (for one item) or up to 100 KB (for several items in a message).

Therefore, one can assume that the payload size is expected to be up to 100 KB.

The existing scheme of modules interaction through PubSub

The figure below shows the participants in the transmission of a single event between two modules, Publisher and Subscriber, using the PubSub approach.

Benefits

This is the list of benefits PubSub approach provides:

  • module decoupling - interaction and message transmission use standard Okapi calls
  • the potential to replace the underlying transport mechanism (Kafka) with something else without having to refactor the client modules (i.e. the modules that use PubSub)
  • ability to use FOLIO permissions to control access

Known limitations and issues

The description of known issues is based on production experience with PubSub in mod-circulation, mod-feesfines, and mod-patron-blocks, as well as on the results of mod-pubsub performance testing (it should be noted that this testing was conducted some time ago and no more recent data appears to be available).

The most common PubSub issues the Vega team has faced:

  • missing Okapi permissions during calls from/to PubSub
  • issues with the special user PubSub creates and uses for its purposes (missing user, missing credentials, missing permissions, etc.)
  • missing modules' publisher/subscriber registration in PubSub
  • after failing to deliver an event (for any reason, including the consumer's fault), PubSub just keeps delivering other events from the same topic, and modules keep consuming them, which ruins data consistency. Such issues very often go unnoticed for months, after which it can be hard to reproduce or find the reason for the initial delivery failures. On top of that, additional work is required to create syncing mechanisms to fix the data consistency issues.

PubSub issues are notoriously time-consuming and hard to investigate, mostly because they are usually invisible to the end user. When an event or a series of events cannot reach the intended subscriber, libraries rarely notice this immediately, but rather when data inconsistencies caused by undelivered events manifest themselves elsewhere. Consider the following real-life scenario:

  • maximum number of loans per user is limited to 10 by automated patron blocks configuration
  • a user has no open loans at the moment
  • user checks out an item, but ITEM_CHECKED_OUT event does NOT reach mod-patron-blocks (which keeps count of loans for every user)
  • over the next few months user checks out 10 more items, each time a corresponding event reaches mod-patron-blocks successfully
  • library notices that the user has 11 open loans, while the limit is 10
  • library reports a bug in mod-patron-blocks - the most likely culprit from the user's perspective
  • during investigation a developer discovers that the block was not imposed because of a failed event delivery which took place months ago

One more real-life example: sometimes the system blocks a user due to the activation of patron block restrictions. In such cases, FSE receives support tickets in Jira. At the moment, there are several dozen similar tickets from different libraries on this topic. According to the FSE team, in the vast majority of cases the problem is solved by a re-synchronization between mod-circulation and mod-patron-blocks, performed by running a special Jenkins job.

Assumption: based on the information received from the FSE and Vega teams, their experience analyzing the issues described, and the available statistics, it is reasonable to assume that these tickets are caused by the PubSub issues described above.

Consequences of the Push mechanism during Data Import

The existing PubSub is a Push mechanism. Source Record Manager would place large numbers of messages (one per record) into the queue during a large import job. Mod-pubsub would then push these into the callback function provided by mod-inventory. There was no way for mod-inventory to say "enough already": it would get overloaded and crash. This was discussed with Folijet previously, and no viable solution was found.
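For contrast, below is a minimal sketch (plain Apache Kafka client, Java) of pull-based consumption with a bounded batch size, which is the essence of the backpressure that a push callback cannot provide. The broker address, topic name, group id, and batch size are placeholders for illustration, not the actual Data Import configuration.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PullBasedConsumerSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder broker address
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "mod-inventory-consumers"); // hypothetical group id
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    // The consumer never receives more than this many records per poll,
    // so it cannot be flooded the way a push callback can
    props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 50);
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(List.of("data-import-events")); // hypothetical topic name
      while (true) {
        // The module pulls the next batch only when it is ready to process it
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
          process(record.value());
        }
        consumer.commitSync(); // acknowledge the batch only after it is fully processed
      }
    }
  }

  private static void process(String payload) {
    // business logic would go here
  }
}
```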

The proposed scheme of modules interaction through Direct Kafka

In the case of the Direct Kafka approach, Okapi and PubSub are no longer required; modules A and B interact directly with Kafka:

Requirements Addressing

Below the key benefits are listed:

  • Guaranteed delivery provided by Kafka allows addressing reliability concern
  • Improved data consistency since Kafka does not deliver newer messages until older ones are acknowledged
  • Better performance by eliminating the overhead of multiple HTTP calls per event dispatch (see the publisher sketch after this list)
  • Good HA, since every new Event Consumer instance connects to Kafka within a consumer group, so the load is distributed evenly
  • Improved manageability because of easier investigation, less data inconsistency, and a fail-fast approach
  • a Pull mechanism, as already implemented by Data Import with Direct Kafka - the consumer code lives in mod-inventory and pulls messages from Kafka when it has capacity
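As a rough illustration of the direct publishing path, here is a hedged sketch of how a module such as mod-circulation could publish an ITEM_CHECKED_OUT event straight to Kafka instead of POSTing it to mod-pubsub through Okapi. The broker address, topic name, key, and payload shape are assumptions made for the sketch, not the actual event schema.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DirectKafkaPublisherSketch {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder broker address
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    // acks=all plus idempotence gives the delivery guarantees the requirements ask for
    props.put(ProducerConfig.ACKS_CONFIG, "all");
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Hypothetical topic name and payload; keying by user id keeps one user's events ordered
      String payload = "{\"eventType\":\"ITEM_CHECKED_OUT\",\"userId\":\"<uuid>\",\"loanId\":\"<uuid>\"}";
      ProducerRecord<String, String> record =
          new ProducerRecord<>("circulation.item-checked-out", "<user uuid>", payload);
      // Waiting on the returned future surfaces delivery failures immediately (fail fast)
      producer.send(record).get();
    }
  }
}
```

A single broker call replaces the chain of HTTP requests (module to Okapi to mod-pubsub and back to the subscribing module) that PubSub needs for every event.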

Limitations, Risks and Assumptions

  • Configuration (including Kafka, topics, consumer groups, authorization) is more complicated than with PubSub
  • While Kafka supports exactly-once delivery, the at-least-once implementation is simpler and more manageable. In turn, at-least-once means that the Event Consumer must be prepared to handle potential duplicate events (see the sketch after this list)
  • All modules involved will have a Kafka client and "know" that Kafka is being used as the transport mechanism. As a result, if it becomes necessary to move to another transport mechanism in the indefinite future, changes will be required in all the modules involved. This risk can be partially mitigated by placing all the logic required to work through Direct Kafka in a separate library with designated interfaces. In this case, the logic of interaction through Direct Kafka will, in a sense, still be hidden from the business logic of the modules involved. Note: there is folio-kafka-wrapper which provides some useful functionality; for Spring-based modules it should be much easier
  • At the moment there is no implemented approach to address security concerns (including authorization) for Kafka - some general solution will need to be followed once it is available
  • There are concerns about the 'informal' way of managing dependencies on specific versions of events given that any new subscriber could subscribe to any given published event - we don't know who all the subscribers are.
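To make the at-least-once point concrete, here is a purely hypothetical sketch of duplicate handling on the consumer side. It deduplicates in memory for brevity; a real module would persist processed event ids or rely on idempotent upserts.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentEventHandlerSketch {
  // In-memory set of processed ids; a real module would keep this in its database
  private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();

  public void handle(String eventId, String payload) {
    // add() returns false if the id was already present, i.e. a duplicate delivery
    if (!processedEventIds.add(eventId)) {
      return; // already applied, safe to skip
    }
    applyBusinessLogic(payload);
  }

  private void applyBusinessLogic(String payload) {
    // e.g. increment the open-loans counter kept by mod-patron-blocks
  }
}
```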

Modules affected

Below is the list of modules participating in Circulation where refactoring will be required:

Module name        | Owning team | Is it a Producer or Consumer of events?
mod-circulation    | Vega        | Producer (in a number of flows), Consumer (for events from mod-feesfines)
mod-feesfines      | Vega        | Producer
mod-patron-blocks  | Vega        | Consumer
mod-audit          | Firebird    | Consumer
mod-remote-storage | Firebird    | Consumer (to be confirmed)

Wouldn't it make sense to tune the PubSub instead of switching to Direct Kafka?

A legitimate question is whether it is possible to refine and expand the capabilities of the PubSub in order to address the problems listed before, and whether this will be more (or less) efficient than switching to the Direct Kafka approach.

Most likely, such an improvement of the PubSub is possible. Although detailed elaboration has not been carried out, it can be assumed that the following will be required: persistent storage of events, reliable management of this storage, tracking of delivered/undelivered status, processing of delivery confirmations, and a number of other features.

In fact, this means a lot of the same functionality that Kafka already provides. At the same time, PubSub is not a key product for FOLIO, but only a transport mechanism.

Therefore, it seems more appropriate and efficient to use existing solutions from the market (in this case, Apache Kafka), and focus FOLIO development efforts on business value.

Time and effort estimates

Required efforts can be divided into two groups:

  • switching to folio-kafka-wrapper, reusing its capabilities, and independently implementing missing functionality (in terms of creating and configuring topics, for example - see the sketch after this list); a small spike story will help to better understand the size of this group
  • transferring all modules to the event approach - a simpler and more understandable activity, because the essence of the process is the same everywhere.
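To give a feel for the "creating and configuring topics" part of the first group, below is a hedged sketch using the plain Kafka AdminClient. The topic name, partition count, and replication factor are placeholders; the actual naming convention and settings would be decided during the spike (and folio-kafka-wrapper may already cover part of this).

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicSetupSketch {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder broker address

    try (AdminClient admin = AdminClient.create(props)) {
      // Hypothetical "<env>.<tenant>.<domain>.<event>" naming; 10 partitions, replication factor 3
      NewTopic topic = new NewTopic("folio.diku.circulation.item-checked-out", 10, (short) 3);
      admin.createTopics(List.of(topic)).all().get(); // blocks until the topic is created
    }
  }
}
```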

Rough T-shirt estimate: L to XXL.

Spike scope

  • How to configure and connect to Kafka broker and topic?
  • How to create Kafka topics?
  • How many Kafka topics should be used - one topic per each event type, or one for all types?
  • ..