(DRAFT) MODDATAIMP-495 SPIKE: Analysis of implementing idempotence by tracking processed records at the persistence level
This page describes an approach to deduplicating messages that have already been processed by a module during the data-import flow.
Problem statement
The data-import process is based on data exchange between modules, performed through Kafka. In such a distributed publish-subscribe messaging system, duplicate messages can appear and be re-processed by consumers due to producer failures, Kafka cluster failures, network errors, or consumer processing failures.
Support for exactly-once message processing can be divided into two parts:
- avoiding duplicates during message production;
- avoiding duplicates during message consumption.
Avoiding duplicates during message production can be ensured by built-in Kafka functionality (the idempotent producer). Avoiding duplicates during message consumption cannot be achieved by consumer configuration alone and requires cooperation on the consumer side.
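For context, producer-side deduplication is a matter of configuration only. A minimal sketch (class and method names are illustrative):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public final class IdempotentProducerFactory {

  /** Builds a producer whose retried sends are de-duplicated by the broker. */
  public static KafkaProducer<String, String> create(String bootstrapServers) {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    // Core setting: broker-side de-duplication of producer retries
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
    // Idempotence requires acks=all
    props.put(ProducerConfig.ACKS_CONFIG, "all");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    return new KafkaProducer<>(props);
  }
}
```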
Solution overview
The idea is to store information about processed Kafka messages in a database so that modules can deduplicate them on receipt. To make this possible, each Kafka message's headers must be populated with a unique message identifier (e.g. a UUID).
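For illustration, the identifier could be attached as a record header on the producer side; the header name `messageId` is an assumption here, not an agreed convention:

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public final class MessageIdHeader {

  /** Sends a record with a unique "messageId" header (hypothetical header name). */
  public static void sendWithMessageId(KafkaProducer<String, String> producer,
                                       String topic, String key, String payload) {
    ProducerRecord<String, String> record = new ProducerRecord<>(topic, key, payload);
    record.headers().add("messageId",
        UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8));
    producer.send(record);
  }
}
```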
When a module consumes a message from Kafka, it first tries to save the message identifier and topic to a database table.
If the message is received for the first time, its unique identifier and topic name are saved to the database and message processing starts.
If the message is received again (a duplicate), the module detects via the message id that it has already been processed, and skips processing.
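A possible shape of the consumer-side check, assuming a PostgreSQL table like the one described in the next section and plain JDBC for brevity (the table name `processed_messages` and class names are assumptions):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.UUID;

public final class ProcessedMessageTracker {

  private final Connection connection;

  public ProcessedMessageTracker(Connection connection) {
    this.connection = connection;
  }

  /**
   * Tries to record the message as processed. Returns true if the id was
   * seen for the first time, false if the message is a duplicate.
   */
  public boolean markProcessed(UUID messageId, String topic) throws SQLException {
    // ON CONFLICT DO NOTHING makes the insert an atomic "first writer wins" check
    String sql = "INSERT INTO processed_messages (id, topic, processing_date) "
        + "VALUES (?, ?, now()) ON CONFLICT (id) DO NOTHING";
    try (PreparedStatement stmt = connection.prepareStatement(sql)) {
      stmt.setObject(1, messageId);
      stmt.setString(2, topic);
      return stmt.executeUpdate() == 1; // 0 rows inserted => duplicate
    }
  }
}
```

Ideally, the insert into this table would happen in the same database transaction as the message's side effects, so that a processing failure does not leave the message marked as processed.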
Table structure to track processed messages
| Column | Type | Constraints |
|---|---|---|
| id | uuid | PK |
| topic | text | |
| processing_date | timestamp | |
Currently, all modules involved in the data-import flow use the folio-kafka-wrapper library for consuming messages from Kafka. Therefore, the database-backed deduplication logic can be extracted into this library, as sketched below.
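One possible extraction point is a decorating handler that performs the deduplication check before delegating to the real handler. The sketch below assumes folio-kafka-wrapper's `AsyncRecordHandler` interface and reuses the hypothetical `ProcessedMessageTracker` from the previous sketch; the blocking JDBC call is kept only to keep the example short:

```java
import io.vertx.core.Future;
import io.vertx.kafka.client.consumer.KafkaConsumerRecord;
import java.util.UUID;
import org.folio.kafka.AsyncRecordHandler;

public class DeduplicatingRecordHandler implements AsyncRecordHandler<String, String> {

  private final AsyncRecordHandler<String, String> delegate;
  private final ProcessedMessageTracker tracker; // hypothetical tracker from the sketch above

  public DeduplicatingRecordHandler(AsyncRecordHandler<String, String> delegate,
                                    ProcessedMessageTracker tracker) {
    this.delegate = delegate;
    this.tracker = tracker;
  }

  @Override
  public Future<String> handle(KafkaConsumerRecord<String, String> record) {
    String messageId = record.headers().stream()
        .filter(h -> "messageId".equals(h.key())) // assumed header name
        .map(h -> h.value().toString())
        .findFirst()
        .orElse(null);
    if (messageId == null) {
      return delegate.handle(record); // no id header: process as before
    }
    try {
      if (!tracker.markProcessed(UUID.fromString(messageId), record.topic())) {
        // Duplicate: acknowledge without re-processing
        return Future.succeededFuture(record.key());
      }
    } catch (Exception e) {
      return Future.failedFuture(e);
    }
    return delegate.handle(record);
  }
}
```

In a real implementation the database call would go through the module's non-blocking persistence layer rather than blocking the event loop with JDBC.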
The following modules participate in the data-import process and consume data from Kafka:
- mod-source-record-manager
- mod-source-record-storage
- mod-inventory
- mod-invoice
The described solution requires a module database. Since one module (mod-inventory) does not use a database, the following simplified approach is currently proposed:
- implement idempotent behavior in CreateInstanceEventHandler so that, when instance creation is triggered by a duplicated event, it still sends a message about successful event handling (task MODINV-547 was created);
- provide the same eventId for all events related to a given SRS record instead of generating a new one for each event publication (this could partially be done within the scope of MODDATAIMP-491); this will allow the progress handler to distinguish already-processed messages.