
In the KafkaProducer documentation, it is written that the producer is thread safe, and that sharing a single producer instance across threads will generally be faster than having multiple instances.

It's also written that producers should be closed (see the documentation: "Note: after creating a KafkaProducer you must always close() it to avoid resource leaks").
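To make the contract concrete: since KafkaProducer implements Closeable, try-with-resources guarantees close() runs even on an exception. The sketch below uses a hypothetical FakeProducer stand-in (plain Java, no Kafka dependency) purely to illustrate the pattern; with a real org.apache.kafka.clients.producer.KafkaProducer the try-with-resources shape is the same.

```java
public class ProducerCloseDemo {
    // Hypothetical stand-in for KafkaProducer, for illustration only.
    static class FakeProducer implements AutoCloseable {
        boolean closed = false;

        void send(String record) {
            if (closed) throw new IllegalStateException("producer already closed");
            // a real KafkaProducer would enqueue the record for an async send here
        }

        @Override
        public void close() { closed = true; }
    }

    public static void main(String[] args) {
        FakeProducer producer = new FakeProducer();
        // try-with-resources: close() is invoked even if send() throws,
        // which is exactly the "always close() it" contract from the javadoc
        try (producer) {
            producer.send("record-1");
        }
        System.out.println("closed after try block: " + producer.closed);
        // prints: closed after try block: true
    }
}
```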

In our Data Import application, in some modules producers are closed after usage and recreated again, such as mod-source-record-manager/mod-source-record-storage (link to code),

and in some modules they are not, such as mod-data-import (link to code). This can be a reason for the memory leak described in the ticket MODDATAIMP-465 Fix memory leaks after import.

Thoughts regarding implementing a pool of producers

The initial idea was to implement a pool of Kafka producers, something like a commonly used pool of database connections, that should be configurable, allowing min/max size to be specified.

The producer's close() method should be overridden to not physically close the connection, but just return it to the pool. Vertx already provides some sort of producer reuse,

but we could not find logic that returns the producer to a pool after close() is invoked, or options to specify min/max size and other settings applicable to pool implementations.
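To make the pooling idea concrete, here is a minimal, hypothetical sketch in plain Java (no Kafka dependency): close() on the pooled wrapper returns the delegate to the pool instead of physically closing it, and the pool itself physically closes everything at shutdown. The names (ProducerPool, PooledProducer) and the fixed-size strategy are illustrative assumptions; a real implementation would hold KafkaProducer instances and add min/max sizing, timeouts, and health checks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

// Hypothetical sketch of a fixed-size producer pool.
// T stands in for a real KafkaProducer; the factory creates instances eagerly.
public class ProducerPool<T extends AutoCloseable> implements AutoCloseable {
    private final BlockingQueue<T> idle;
    private final List<T> all = new ArrayList<>();

    public ProducerPool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            T producer = factory.get();
            all.add(producer);
            idle.add(producer);
        }
    }

    // Borrow a producer; blocks until one is available.
    public PooledProducer<T> borrow() {
        try {
            return new PooledProducer<>(idle.take(), idle);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a producer", e);
        }
    }

    // Physically close every producer at application shutdown.
    @Override
    public void close() {
        for (T producer : all) {
            try { producer.close(); } catch (Exception e) { /* log and continue in a real impl */ }
        }
    }

    // Wrapper whose close() returns the delegate to the pool
    // instead of physically closing it.
    public static class PooledProducer<T> implements AutoCloseable {
        private final T delegate;
        private final BlockingQueue<T> home;

        PooledProducer(T delegate, BlockingQueue<T> home) {
            this.delegate = delegate;
            this.home = home;
        }

        public T get() { return delegate; }

        @Override
        public void close() { home.add(delegate); }
    }
}
```

Caller code could then borrow with try-with-resources, so close() just returns the producer to the pool rather than destroying it.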

So two options are applicable:

  • Implement our own pool of Kafka producers. In this case producers should start at application startup and be physically closed at application shutdown; in between they should be reusable, as in every pool implementation.

The disadvantages of this approach are obvious: it is time consuming, and it is hard to achieve the same quality as the standard implementations on the market, which are all carefully tested in many production applications.

  • Use the Spring Kafka implementation. This approach abstracts us from implementation details and is easy to use. By default it's tuned with the most appropriate settings, and at the same time it is configurable, so we can tune it for our needs. It will simplify developers' work and allow them to concentrate on business requirements rather than on supporting our own pool of connections. Also, our plan is to migrate our modules to Spring one by one, and using Spring Kafka can be a good starting point. Moreover, our modules already use the Spring context for dependency injection, which will simplify introducing Spring Kafka.
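As an illustration of that configurability: as far as we can tell from its documentation, Spring Kafka's DefaultKafkaProducerFactory shares a single physical producer by default and makes close() on the handed-out producer a no-op until the factory itself is destroyed, which is close to the pooling behavior discussed above. If a module were a Spring Boot application, producer settings could also be tuned declaratively; the properties below are illustrative assumptions (our modules use a plain Spring context today, so the exact wiring would differ), not recommendations:

```properties
spring.kafka.bootstrap-servers=localhost:9092
spring.kafka.producer.key-serializer=org.apache.kafka.common.serialization.StringSerializer
spring.kafka.producer.value-serializer=org.apache.kafka.common.serialization.StringSerializer
# native producer tuning can be passed through as well (illustrative values)
spring.kafka.producer.properties.linger.ms=5
spring.kafka.producer.properties.batch.size=16384
```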

POC for introducing Spring Kafka

For this POC, producers in 2 modules were rewritten to use Spring Kafka: mod-data-import and mod-source-record-manager. Producers in other modules, and all consumers, remain with the Vertx implementation.

I imported a couple of files, and the imports were successful with this configuration.


PR for mod-data-import: https://github.com/folio-org/mod-data-import/pull/194

The module mod-data-import is harder and riskier to migrate, because it operates with read/write streams and piping between them in order not to store big chunks of data in memory, but to send them to Kafka step by step.

These write/read streams come from the Vertx implementation and are tightly coupled to the Vertx KafkaProducer. The Spring KafkaTemplate could not implement all contract methods of WriteStream, so we will need to test this functionality more deeply.

For my testing I uploaded a 30K file and connected a profiler to compare the memory graphs of mod-data-import for the old Vertx implementation and the new Spring template.

Pic. 1 - Vertx implementation (image removed)

Pic. 2 - Spring implementation (image removed)

From these pictures of memory utilization we can see peak loads when files are divided into chunks and sent to Kafka, and after that how the garbage collector works when mod-data-import is under no load.

Next steps

Currently some of our modules close Kafka producers and some do not, which can trigger various memory leaks.

A new story has been created to address this: MODDATAIMP-557 (System JIRA), pulled into sprint 124.

Regarding the Vertx Kafka documentation: "When you are done with the producer, just close it; when all shared producers are closed, the resources will be released for you."

Based on these words, while there are active threads, Vertx keeps the producer open and does not invoke the physical close() method, which is desirable for us (some sort of connection pooling).

At some point we will need a conversation about when to migrate our modules to Kafka Streams or to the Spring Kafka Template, as described in this spike.