ARCH-62 - INN-Reach: Record Contribution Enhancements​

Submitted

Approved: Steve Ellis

Status: ACCEPTED

Impact: LOW

Initial contribution job

Context

The initial contribution job has the following issues:

  1. Counting messages to detect job completion makes the code more complex than it needs to be and doesn't work well: there is no reliable way to know when the job is done, the sum of processed records cannot be validated, and sometimes Kafka offsets are still greater than 0 after the job is over
  2. The D2IR API has a "suspended mode", meaning it is unavailable for 2-3 hours a day
  3. The D2IR API doesn't support batch record processing and is unstable in terms of sustained TPS
  4. The verification step slows the code down and produces 404s: items can be contributed only after we validate that their instance has been contributed
  5. The use of two retry mechanisms is awkward and we need to choose one. Right now the Kafka mechanism handles the suspension, which is very important, so it is the natural choice; verification was the reason for keeping both, but that is no longer a strong argument

NFR:

  1. scalable
  2. robust enough not to lose data
  3. transparent to an administrator

Stakeholders

Baseline architecture

  1. The Kafka retry mechanism is used
  2. Consumer concurrency is used
  3. A Spring RetryTemplate is used for communication with the D2IR API
  4. The microservice keeps job progress state in a ConcurrentHashMap, which breaks the stateless microservice architecture approach

Sequence diagram:

baseline.puml

Solution Options

Fine-tuning existing implementation

Approach:

  • Determine the end of job execution by querying the Kafka offset lag for the topic using the admin client (see the sketch after this list)
  • Consume messages in two big loops to address the verification issue: the first pass contributes instances, the second contributes items
  • Remove one of the retry mechanisms: either the Kafka retry or the Spring RetryTemplate
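
A minimal sketch of the lag check, assuming a hypothetical consumer group name and bootstrap servers: the job is treated as finished once the total lag between the group's committed offsets and the latest topic offsets reaches zero.

ContributionLagChecker.java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

// Hypothetical helper; group id and bootstrap servers are illustrative assumptions.
public class ContributionLagChecker {

  public static boolean isJobFinished(String bootstrapServers, String groupId) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);

    try (AdminClient admin = AdminClient.create(props)) {
      // Offsets committed by the contribution consumer group, per partition
      Map<TopicPartition, OffsetAndMetadata> committed = admin
          .listConsumerGroupOffsets(groupId)
          .partitionsToOffsetAndMetadata()
          .get();

      // Latest (end) offsets for the same partitions
      Map<TopicPartition, OffsetSpec> request = committed.keySet().stream()
          .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
      Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
          admin.listOffsets(request).all().get();

      // Total lag across all partitions; zero means every message has been consumed
      long totalLag = committed.entrySet().stream()
          .mapToLong(e -> endOffsets.get(e.getKey()).offset() - e.getValue().offset())
          .sum();
      return totalLag == 0;
    }
  }
}

Note that this lag check is exactly the part that requires privileged (admin) access to Kafka, which is one of the cons listed below.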

Pros:

  1. No architectural changes

Cons:

  1. Might require significant effort
  2. State exists in the microservice, which is against the architectural approach.
  3. Usage of the Kafka admin client requires privileged access to Kafka, which raises security concerns; there are also practical issues:
    1. A consumer group can have many partitions, and the admin client has to collect and compute data from all of them
    2. The application is not aware of the number of partitions
    3. The ConcurrentMessageListenerContainer does not fit well with an AdminClient-based implementation
    4. A listener for ListenerContainerIdleEvent was also tried, but it does not work with multiple partitions

Migration to Spring Integration Framework

Pros:

  1. Built-in retry mechanism
  2. Built-in circuit breaker pattern
  3. Built-in rate limiter

Cons:

  1. Requires a full redesign and re-implementation of the module

Implementation of outbox pattern

Approach:

We need to split the job into two parts: consuming records from Kafka into an outbox table, and periodic processing of the records in the outbox. The latter can run on a timer, e.g. once a second the process obtains X records from the outbox table and contributes only those. A status must be saved for each record in the outbox table; a minimal enum sketch follows the list. The status can be one of the following:

  • ready: newly added events that are ready to be processed
  • rejected: events that are not eligible for contribution
  • in progress: the instance has been contributed and its items are ready to be contributed
  • processed: the instance and its items have been contributed
  • retry: the event was rejected by the D2IR API because the API was unavailable
  • failed: the event failed to be contributed after the maximum number of retry attempts
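
A minimal sketch of these statuses as a Java enum; the type name is an illustrative assumption, not the module's actual implementation.

ContributionStatus.java
// Hypothetical enum mirroring the status list above
public enum ContributionStatus {
  READY,        // newly added event, ready to be processed
  REJECTED,     // event not eligible for contribution
  IN_PROGRESS,  // instance contributed, items ready to be contributed
  PROCESSED,    // instance and its items contributed
  RETRY,        // rejected by the D2IR API because the API was unavailable
  FAILED        // contribution failed after the maximum number of retry attempts
}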

Two "modes" are needed for the retry logic: one for the "suspension" of the D2IR API and one for "verification". Any other failing event is simply logged to the error table, so we never retry indefinitely and there is no possibility of a poison message. This behavior can be achieved by setting the retry status on the record in the outbox table.
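
A hedged sketch of the status handling after a failed contribution attempt; OutboxRecord, D2irException, ErrorLogService and OutboxRepository are assumed types, not existing module classes.

ContributionFailureHandler.java
// Hypothetical sketch: only D2IR "suspension" responses lead to a retry, everything
// else is written to the error table so no record is retried indefinitely.
public class ContributionFailureHandler {

  private final ErrorLogService errorLogService;
  private final OutboxRepository outboxRepository;
  private final int maxRetryAttempts;

  public ContributionFailureHandler(ErrorLogService errorLogService,
                                    OutboxRepository outboxRepository,
                                    int maxRetryAttempts) {
    this.errorLogService = errorLogService;
    this.outboxRepository = outboxRepository;
    this.maxRetryAttempts = maxRetryAttempts;
  }

  public void handleFailure(OutboxRecord record, D2irException e) {
    if (e.isSuspension() && record.getRetryAttempts() < maxRetryAttempts) {
      // D2IR is in "suspended mode": keep the record for a later scheduler run
      record.setStatus(ContributionStatus.RETRY);
      record.incrementRetryAttempts();
    } else {
      // Any other error (or too many attempts) goes to the error table and is not retried
      record.setStatus(ContributionStatus.FAILED);
      errorLogService.log(record, e);
    }
    outboxRepository.save(record);
  }
}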

To implement concurrent contribution we can utilize the Spring scheduler's built-in thread pool with the @Async annotation on the contribution method (see the Spring documentation). The recommended approach is to minimize the interval for processing records and limit the number of records fetched per run. To enable concurrent contribution we need to add the @EnableAsync configuration annotation and add the pool size to application.yml (a Java sketch follows the snippet):

application.yml
spring:
  task:
    scheduling:
      pool:
        size: 20
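
A minimal sketch of the poller and the @Async contribution call, assuming hypothetical OutboxRepository, ContributionService and OutboxRecord types; the @Async method lives in a separate bean so that the call goes through the Spring proxy.

OutboxContributionScheduler.java
import java.util.List;

import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.Async;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

// @EnableAsync turns on @Async processing, @EnableScheduling turns on the poller;
// the pool size is configured in application.yml as shown above.
@Configuration
@EnableAsync
@EnableScheduling
class ContributionSchedulingConfig {
}

@Service
class OutboxPoller {

  private final OutboxRepository outboxRepository;  // assumed repository for the outbox table
  private final AsyncContributor contributor;

  OutboxPoller(OutboxRepository outboxRepository, AsyncContributor contributor) {
    this.outboxRepository = outboxRepository;
    this.contributor = contributor;
  }

  // Short interval, small chunk: fetch a limited number of ready records each run
  @Scheduled(fixedDelay = 1000)
  public void pollOutbox() {
    List<OutboxRecord> chunk = outboxRepository.findByStatus(ContributionStatus.READY, 50);
    chunk.forEach(contributor::contribute);
  }
}

@Service
class AsyncContributor {

  private final ContributionService contributionService;  // assumed D2IR client wrapper

  AsyncContributor(ContributionService contributionService) {
    this.contributionService = contributionService;
  }

  // Runs asynchronously so several records can be contributed concurrently
  @Async
  public void contribute(OutboxRecord record) {
    contributionService.contribute(record);
  }
}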


To address the verification issue it is reasonable to split the ready status into two:

  1. ready_for_instances - the record in the outbox table is ready for instance contribution
  2. ready_for_items - the record in the outbox table is ready for items contribution

The statuses above can be processed by separate schedulers (a sketch of the two pollers follows).
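
A minimal sketch of the two pollers, reusing the assumed types from the previous sketch and assuming the ready status has been replaced by READY_FOR_INSTANCES and READY_FOR_ITEMS in the enum; contributeInstance and contributeItem are assumed methods.

SplitStatusPollers.java
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

// Hypothetical sketch: two independent pollers so instance contribution and item
// contribution are driven by separate schedules and do not block each other.
@Service
class SplitStatusPollers {

  private final OutboxRepository outboxRepository;  // assumed repository
  private final AsyncContributor contributor;       // assumed async contributor (see above)

  SplitStatusPollers(OutboxRepository outboxRepository, AsyncContributor contributor) {
    this.outboxRepository = outboxRepository;
    this.contributor = contributor;
  }

  @Scheduled(fixedDelay = 1000)
  public void pollReadyForInstances() {
    outboxRepository.findByStatus(ContributionStatus.READY_FOR_INSTANCES, 50)
        .forEach(contributor::contributeInstance);
  }

  @Scheduled(fixedDelay = 1000)
  public void pollReadyForItems() {
    outboxRepository.findByStatus(ContributionStatus.READY_FOR_ITEMS, 50)
        .forEach(contributor::contributeItem);
  }
}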

Overall pros and cons of outbox solution:

Pros:

  1. Scalable
  2. Would survive a replica restart
  3. Minor change from an architectural perspective
  4. Doesn't rely on the Kafka retry mechanism; instead, retry logic is achieved by persisting the status in the database
  5. Easy to monitor and track progress

Cons:

  1. Requires a new table in the PostgreSQL database
  2. Requires proper testing to tune the polling period and chunk size

Sequence diagram:

target-outbox.puml

Questions

Estimation

  1. 1-2 sprints for a senior developer
  2. Requires extended testing

Decision