Recommendations for using Kafka producers
The KafkaProducer documentation states that the producer is thread safe and that sharing a single producer instance across threads will generally be faster than having multiple instances.
The same documentation also states that producers must be closed: "Note: after creating a {@code KafkaProducer} you must always {@link #close()} it to avoid resource leaks."
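As a minimal sketch of both points (the broker address and topic name below are assumptions), a single producer can be shared by all threads and closed exactly once, for example with try-with-resources:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SharedProducerExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // One KafkaProducer shared by all threads; the class is thread safe.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
        } // try-with-resources guarantees close() and so avoids the resource leak
    }
}
```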
In our application, some modules close producers after use and recreate them again, such as mod-source-record-manager/mod-source-record-storage (link to code),
while other modules do not, such as mod-data-import (link to code), which can be a cause of the memory leak described in the ticket MODDATAIMP-465 Fix memory leaks after import.
Thoughts on implementing a pool of producers
The initial idea was to implement a pool of Kafka producers, similar to a commonly used pool of database connections, that would be configurable and allow specifying a min/max size.
The producer's close() method would be overridden so that it does not physically close the connection but instead returns the producer to the pool. Vert.x already provides some reuse of producers,
but we could not find logic that returns a producer to the pool after close() is invoked, or that allows specifying a min/max size and other settings typical of pool implementations.
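A rough sketch of that idea follows; this is a hypothetical design, not an existing API, and all names are illustrative:

```java
import java.util.Properties;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;

// Hypothetical sketch of the pooling idea: releasing a producer returns it to
// the pool instead of closing the underlying connection.
public class ProducerPool {

    private final BlockingQueue<Producer<String, String>> pool;

    public ProducerPool(int minSize, int maxSize, Properties props) {
        this.pool = new LinkedBlockingQueue<>(maxSize);
        for (int i = 0; i < minSize; i++) {
            pool.offer(new KafkaProducer<>(props)); // pre-create producers at startup
        }
    }

    public Producer<String, String> borrow() throws InterruptedException {
        return pool.take(); // blocks until a producer is available
    }

    // "close()" in the pooled sense: hand the producer back rather than closing it
    public void release(Producer<String, String> producer) {
        pool.offer(producer);
    }

    // physically close all producers at application shutdown
    public void shutdown() {
        pool.forEach(Producer::close);
    }
}
```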
So two options are applicable:
- Implement our own pool of Kafka producers. In this approach producers would be created at application startup and physically closed at application shutdown, and they would be reusable, as in every pool implementation.
The disadvantages of this approach are obvious: it is time-consuming, and it is hard to achieve the same quality as the standard implementations on the market, which have been carefully tested in many
production applications.
- Use the Spring Kafka implementation. This approach abstracts us from implementation details and is easy to use. By default it is tuned with the most appropriate settings, and at the same time it is configurable, so we can tune it for our needs. It will simplify developers' work, letting them concentrate on business requirements instead of supporting their own pool of connections. Also, our plan is to migrate our modules to Spring one by one, and using Spring Kafka can be a good starting point; moreover, our modules already use a Spring context for dependency injection, which will simplify introducing Spring Kafka. A minimal configuration sketch follows this list.
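The sketch below shows what such a configuration might look like; the broker address is an assumption, and serializers would differ per module:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.DefaultKafkaProducerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.core.ProducerFactory;

// Minimal sketch; the broker address is an assumption.
@Configuration
public class KafkaProducerConfig {

    @Bean
    public ProducerFactory<String, String> producerFactory() {
        // By default DefaultKafkaProducerFactory creates one shared, thread-safe
        // producer and physically closes it when the application context shuts
        // down, which is exactly the lifecycle we otherwise manage by hand.
        return new DefaultKafkaProducerFactory<>(Map.of(
            ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092",
            ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class,
            ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class
        ));
    }

    @Bean
    public KafkaTemplate<String, String> kafkaTemplate() {
        return new KafkaTemplate<>(producerFactory());
    }
}
```

A migrated producer then only injects the KafkaTemplate bean and calls kafkaTemplate.send(topic, key, value); it never creates or closes a producer itself.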
POC for introducing Spring Kafka
For this POC, producers in two modules, mod-data-import and mod-source-record-manager, were rewritten to use Spring Kafka; producers in the other modules and all consumers keep the Vert.x implementation.
I imported a couple of files, and the imports were successful with this configuration.
PR for mod-source-record-manager: https://github.com/folio-org/mod-source-record-manager/pull/492
PR for mod-data-import: https://github.com/folio-org/mod-data-import/pull/194
The mod-data-import module is harder to migrate because it operates with read/write streams and pipes between them in order to avoid storing big chunks of data in memory, sending them to Kafka step by step.
These read/write streams from the Vert.x implementation are tightly coupled to the Vert.x KafkaProducer; the Spring KafkaTemplate could not implement all contract methods of WriteStream, so this functionality needs deeper testing.
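To make the gap concrete, here is a rough sketch of the adapter idea only, not the full io.vertx.core.streams.WriteStream contract; it assumes a Spring Kafka 2.x version where send() returns a ListenableFuture, and all names are illustrative:

```java
import io.vertx.core.Handler;
import java.util.concurrent.atomic.AtomicInteger;
import org.springframework.kafka.core.KafkaTemplate;

// Illustrative sketch, not a full Vert.x WriteStream implementation:
// in-flight sends are counted to emulate writeQueueFull()/drainHandler() back-pressure.
public class KafkaTemplateWriteStream {

    private final KafkaTemplate<String, String> template;
    private final int maxInFlight; // analogue of setWriteQueueMaxSize()
    private final AtomicInteger inFlight = new AtomicInteger();
    private volatile Handler<Void> drainHandler;

    public KafkaTemplateWriteStream(KafkaTemplate<String, String> template, int maxInFlight) {
        this.template = template;
        this.maxInFlight = maxInFlight;
    }

    public void write(String topic, String chunk) {
        inFlight.incrementAndGet();
        // Spring Kafka 2.x: completable() bridges ListenableFuture to CompletableFuture
        template.send(topic, chunk).completable().whenComplete((result, error) -> {
            // once the queue half-empties, tell the paused read stream to resume
            if (inFlight.decrementAndGet() <= maxInFlight / 2 && drainHandler != null) {
                drainHandler.handle(null);
            }
        });
    }

    public boolean writeQueueFull() {
        return inFlight.get() >= maxInFlight; // the piped read stream should pause here
    }

    public void drainHandler(Handler<Void> handler) {
        this.drainHandler = handler;
    }
}
```

In the real Vert.x piping, writeQueueFull() is what causes the pump to pause the read stream, so this back-pressure behaviour is exactly the part that needs the deeper testing mentioned above.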
For my testing I uploaded a 30K file and connected a profiler to compare the memory graphs of mod-data-import for the old Vert.x implementation and the new Spring KafkaTemplate.
Pic. 1 - Vert.x implementation
Pic. 2 - Spring implementation
From these pictures of memory utilization we can see peak loads while files are being divided into chunks and sent to Kafka, and after that how the garbage collector behaves when mod-data-import is not under load.