MODDATAIMP-499 SPIKE: Use active Kafka producer from the pool for sending messages

Recommendations for using Kafka producers

In KafkaProducer documentation, it is written that the producer is thread safe, and sharing a single producer instance across threads will generally be faster than having multiple instances.

It's also written that producers should be closed (see documentation  * Note: after creating a {@code KafkaProducer} you must always {@link #close()} it to avoid resource leaks.

In our Data Import application, in some modules producers are closing after usage and recreating again, such as mod-source-record-manager/mod-source-record-storage link to code

and in some modules not, such as mod-data-import link to code. This can be a reason for memory leak described in this ticket MODDATAIMP-465 Fix memory leaks after import. 

Thoughts regarding implementing pool of producers

The initial idea was to implement a pool of Kafka producers something like a commonly-used pool of database connections, that should be configurable, allowing to specify min/max size.

Producer's method close() should be overridden to not physically close connection, just return it to pool. Vertx already provide some sort of reusing producers,

but we could not find logic that returns back the producer after close() invocation, specifying min/max size and other setting applicable for pools implementation.

So 2 options are applicable:

  • Implement own pool of Kafka producers. In this way producers should start at application startup and physically close at application end, they should be reusable as every pool implementation.

The disadvantages of this approach are obvious. It's time consuming and it's hard to achieve the same quality as some standard implementation at the market that are all very carefully tested in many production applications.

  • Use Spring Kafka implementation. This approach can abstract us from implementation details and is easy to use. By default, it's tuned with the most appropriate settings, and at the same time is configurable, and we can tune it for our needs. It will simplify developer's work and allow them to concentrate more on business requirements, rather than supporting our own pool of connection. Also our plan is to migrate our modules to Spring one by one. and using Spring Kafka can be good starting point. Moreover our modules are already using Spring context for dependencies injections that will simplify introducing of Spring Kafka.

POC for introducing Spring Kafka

For this POC, producers from 2 modules were overwritten to use Spring Kafka: mod-data-import and mod-source-record-manager. Producers from other modules and all consumers remain with Vertx implementation.

I imported couple of files and imports were successful with such configuration.

PR for mod-source-record-manager: https://github.com/folio-org/mod-source-record-manager/pull/492

PR for mod-data-import: https://github.com/folio-org/mod-data-import/pull/194

The module mod-data-import is harder and more risky to migrate, because it operates with read/write streams and piping between them in order to not store big chunks of data in memory, but send to Kafka step by step.

These write/read streams are from Vertx implementation, and are very coupled to Vertx KafkaProducer. Spring Kafka Template could not implement all contract methods of Write stream, so we will need to more deeply test this functionality.

Next steps

Currently some our modules close Kafka producers, some not, that can trigger different memory leaks. 

The new story has been created to address this: MODDATAIMP-557 - Getting issue details... STATUS and pulled to 124 sprint.

And regarding Vertx Kafka documentation: When you are done with the producer, just close it, when all shared producers are closed, the resources will be released for you,

so based on these words, when there are active threads, Vertx keeps producer open and does not invoke physical close() method that is desired for us(some sort of connection pooling).

In some future we need conversation when to migrate our modules to Kafka streams or using Spring Kafka Template, that described in this spike below.