ARCH-62 - INN-Reach: Record Contribution Enhancements
Submitted | Apr 28, 2023 |
|---|---|
Approved | @Steve Ellis |
Status | ACCEPTED |
Impact | low |
Initial contribution job
Context
Initial contribution have issues:
Counting messages (makes code more complex than it needs to be and doesn’t work well). No understanding when the job is done. Cannot validate that sum of records processed is ok, sometimes Kafka offsets are greater then 0 when job is over
D2IR API has "suspended mode" meaning that 2-3 hours a day it is not available
D2IR API doesn't support batch record processing and is unstable from perspective of TPS
The verification step (slows code down, produces 404s). Items can be contributed only after we validate that instance is contributed
The use of two retry mechanisms is ugly. We need to choose one and right now the kafka mechanism is handling the suspension which is very important, so why not choose that? Verification was why but I don’t think that’s a good reason anymore.
NFR:
scalable
robust enough in order not to loose data
transparent to an administrator
Stakeholders
@Steve Ellis
@Gurleen Kaur1
@Kalibek Turgumbayev
Baseline architecture
Kafka retries mechanism is used
Consumer concurrency is used
Retry template is used for communication with D2IR API
Microservice persists state in
ConcurrentHashMapto track progress of job implementation, which breaks microservice architecture approach
Sequence diagram:
Solution Options
Fine-tuning existing implementation
Approach:
Determine the job execution end by querying kafka offset lag for the topic using admin-client
Consume message in two big loops to address verification issue: first time to contribute instances, second for items
Remove one of the retry mechanisms: either kafka-retry or sprint
retryTemplate
Pros:
No architectural changes
Cons:
Might require significant effort
State exists in the microservice which is against the architectural approach.
Usage of kafka admin client requires privileged access to kafka which raises security concern:
Too many partitions of a consumer group and admin client has to collect and compute the data from all the partitions
Application is not aware of the number of partitions
The ConcurrentMessageListenerContainer is not a fit with AdminClient implementation.
Also tried using ListenerContainerIdleEvent's Listener but it does not work with multi partitions.
Migration to Spring Integration Framework
Pros:
Retry mechanism
Circuit breaker pattern
Rate limiter
Cons:
Requires full redesign and re-implementation of module
Implementation of outbox pattern
Approach:
We need to split the job into two parts: consuming records from kafka to outbox, periodical processing of records in outbox. The latter can be done on timer, e.g.: once a second process can obtain X records from outbox table and contribute only those. For each record in outbox table it is required to save status. The status can be one of the following:
ready: newly added events that are ready to be processedrejected: events that are not eligible for contributionin progress: instance is contributed and items are ready to be contributedprocessed: instance and its items are contributedretry: event was rejected by D2IR api due to unavailability of the latterfailed: event that failed to be contributed after max retry attempts
There are two “modes” needed regarding retry logic. One is “suspension” from d2ir, and one is “verification”. Any other event we just log to the error table, so we don’t retry indefinitely. So no poison message possibility. Mentioned behavior can be achieved by setting status retry in outbox table.
To implement concurrent contribution we can utilize spring scheduler 's built-in thread pool with @Async annotation on contribution method execution (see docs). The recommended approach is to minimize interval for processing records and limit amount of records fetched for processing. To enable concurrent contribution we need to add @EnableAsync configuration annotation and add pool-size to application.yml :
application.yml
spring:
task:
scheduling:
pool:
size: 20
To address verification issue it is reasonable to split ready status into two:
ready_for_instances- the record in outbox table is ready for instance contributionready_for_items- the record in outbox table is ready for items contribution
The statuses above can be process by separate schedulers.
Overall pros and cons of outbox solution:
Pros:
Scalable
Would survive replica restart
Minor change from perspective of architecture
Doesn't rely on kafka retry mechanism, instead retry logic is reached by persisting the status in database
Easy to monitor and track progress
Cons:
Requires implementation of new table in PostgreSQL database
Requires proper testing to configure period and chunk size
Sequence diagram:
Questions
Estimation
1-2 Sprints for senior developer
Requires extended testing