ARCH-62 - INN-Reach: Record Contribution Enhancements
Initial contribution job
Context
The initial contribution job has the following issues:
- Counting messages makes the code more complex than it needs to be and does not work reliably: there is no way to know when the job is done, and the sum of processed records cannot be validated; sometimes Kafka offsets are still greater than 0 when the job is over
- The D2IR API has a "suspended mode", meaning it is unavailable for 2-3 hours a day
- The D2IR API doesn't support batch record processing and is unstable in terms of TPS
- The verification step slows processing down and produces 404s: items can be contributed only after we validate that the instance has been contributed
- Two retry mechanisms are used in parallel, which is hard to maintain. We need to choose one, and right now the Kafka mechanism handles the suspension window, which is essential, so it is the natural choice. Verification was the reason for the second mechanism, but that is no longer a good argument.
Non-functional requirements (NFR):
- Scalable
- Robust enough not to lose data
- Transparent to an administrator
Stakeholders
Baseline architecture
- The Kafka retry mechanism is used
- Consumer concurrency is used
- A Spring retryTemplate is used for communication with the D2IR API
- The microservice persists state in a ConcurrentHashMap to track job progress, which breaks the microservice architecture approach
Sequence diagram:
Solution Options
Fine-tuning existing implementation
Approach:
- Determine the end of job execution by querying the Kafka offset lag for the topic using the admin client (see the sketch after this list)
- Consume messages in two big passes to address the verification issue: the first pass contributes instances, the second contributes items
- Remove one of the retry mechanisms: either Kafka retry or the Spring retryTemplate
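A minimal sketch of how the lag check could look, assuming the plain Kafka AdminClient API; the group id, bootstrap servers and the "job is done when total lag reaches zero" rule are placeholders for illustration, not the module's actual implementation:

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ContributionLagChecker {

  public static long totalLag(String bootstrapServers, String groupId) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);

    try (AdminClient admin = AdminClient.create(props)) {
      // Committed offsets for every partition the consumer group reads
      Map<TopicPartition, OffsetAndMetadata> committed =
          admin.listConsumerGroupOffsets(groupId).partitionsToOffsetAndMetadata().get();

      // Latest (end) offsets for the same partitions
      Map<TopicPartition, OffsetSpec> request = committed.keySet().stream()
          .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
      Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
          admin.listOffsets(request).all().get();

      // Lag per partition = end offset - committed offset; the job would be considered
      // finished only when the sum over all partitions reaches zero
      return committed.entrySet().stream()
          .mapToLong(e -> latest.get(e.getKey()).offset() - e.getValue().offset())
          .sum();
    }
  }
}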
Pros:
- No architectural changes
Cons:
- Might require significant effort
- State exists in the microservice, which is against the architectural approach.
- Usage of the Kafka admin client requires privileged access to Kafka, which raises a security concern:
- A consumer group may have many partitions, and the admin client has to collect and aggregate the lag data from all of them
- The application is not aware of the number of partitions
- ConcurrentMessageListenerContainer does not fit well with an AdminClient-based implementation.
- Listening for ListenerContainerIdleEvent was also tried, but it does not work with multiple partitions.
Migration to Spring Integration Framework
Pros:
- Retry mechanism
- Circuit breaker pattern
- Rate limiter
Cons:
- Requires a full redesign and re-implementation of the module
Implementation of outbox pattern
Approach:
We need to split the job into two parts: consuming records from Kafka into an outbox table, and periodic processing of records in the outbox. The latter can be done on a timer, e.g. once a second the process obtains X records from the outbox table and contributes only those. A status must be saved for each record in the outbox table. The status can be one of the following:
- ready: newly added events that are ready to be processed
- rejected: events that are not eligible for contribution
- in progress: instance is contributed and items are ready to be contributed
- processed: instance and its items are contributed
- retry: event was rejected by the D2IR API due to unavailability of the latter
- failed: event that failed to be contributed after max retry attempts
There are two "modes" that need retry logic: the D2IR "suspension" window and "verification". Any other failing event is simply logged to the error table, so we do not retry indefinitely and there is no possibility of a poison message. This behavior can be achieved by setting the retry status in the outbox table.
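A minimal sketch of how the statuses could be represented, assuming a simple enum persisted with each outbox record; the names are illustrative, not a final schema:

public enum OutboxStatus {
  READY,        // newly added event, ready to be processed
  REJECTED,     // event is not eligible for contribution
  IN_PROGRESS,  // instance is contributed, items are ready to be contributed
  PROCESSED,    // instance and its items are contributed
  RETRY,        // event was rejected by the D2IR API due to its unavailability
  FAILED        // contribution failed after max retry attempts
}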
To implement concurrent contribution we can utilize the Spring scheduler's built-in thread pool together with the @Async annotation on the contribution method (see docs). The recommended approach is to keep the processing interval small and limit the number of records fetched per run. To enable concurrent contribution we need to add the @EnableAsync configuration annotation and set the pool size in application.yml:
spring:
  task:
    scheduling:
      pool:
        size: 20
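A minimal configuration sketch to go with the property above; the class name is a placeholder (@EnableScheduling is also needed for @Scheduled methods to run):

import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.annotation.EnableScheduling;

@Configuration
@EnableAsync
@EnableScheduling
public class ContributionSchedulingConfig {
  // the scheduler thread pool size is taken from spring.task.scheduling.pool.size in application.yml
}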
To address the verification issue it is reasonable to split the ready status into two:
- ready_for_instances - the record in the outbox table is ready for instance contribution
- ready_for_items - the record in the outbox table is ready for items contribution
The statuses above can be processed by separate schedulers, as shown in the sketch below.
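A sketch of what the separate schedulers could look like, assuming the OutboxStatus values above with the ready status split in two; OutboxRepository, ContributionService, the chunk size and the one-second delay are hypothetical placeholders rather than the module's real API:

import java.util.List;
import java.util.UUID;

import org.springframework.scheduling.annotation.Async;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

@Service
public class OutboxScheduler {

  private static final int CHUNK_SIZE = 50; // tuned together with the polling period

  private final OutboxRepository outbox;
  private final AsyncContributor contributor;

  public OutboxScheduler(OutboxRepository outbox, AsyncContributor contributor) {
    this.outbox = outbox;
    this.contributor = contributor;
  }

  // Scheduler 1: records whose instance has not been contributed yet
  @Scheduled(fixedDelay = 1000)
  public void processInstances() {
    outbox.findIdsByStatus(OutboxStatus.READY_FOR_INSTANCES, CHUNK_SIZE)
        .forEach(contributor::contributeInstance);
  }

  // Scheduler 2: records whose instance is contributed and items are pending
  @Scheduled(fixedDelay = 1000)
  public void processItems() {
    outbox.findIdsByStatus(OutboxStatus.READY_FOR_ITEMS, CHUNK_SIZE)
        .forEach(contributor::contributeItems);
  }
}

// @Async sits on a separate bean so the Spring proxy is applied
// (self-invocation from the scheduler bean would bypass it)
@Service
class AsyncContributor {

  private final OutboxRepository outbox;
  private final ContributionService d2ir;

  AsyncContributor(OutboxRepository outbox, ContributionService d2ir) {
    this.outbox = outbox;
    this.d2ir = d2ir;
  }

  @Async
  public void contributeInstance(UUID recordId) {
    d2ir.contributeInstance(recordId);
    outbox.updateStatus(recordId, OutboxStatus.READY_FOR_ITEMS);
  }

  @Async
  public void contributeItems(UUID recordId) {
    d2ir.contributeItems(recordId);
    outbox.updateStatus(recordId, OutboxStatus.PROCESSED);
  }
}

// Hypothetical collaborators, declared only to keep the sketch self-contained
interface OutboxRepository {
  List<UUID> findIdsByStatus(OutboxStatus status, int limit);
  void updateStatus(UUID recordId, OutboxStatus status);
}

interface ContributionService {
  void contributeInstance(UUID recordId);
  void contributeItems(UUID recordId);
}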
Overall pros and cons of the outbox solution:
Pros:
- Scalable
- Would survive replica restart
- Minor change from perspective of architecture
- Doesn't rely on the Kafka retry mechanism; instead, retry logic is achieved by persisting the status in the database
- Easy to monitor and track progress
Cons:
- Requires a new table in the PostgreSQL database
- Requires proper testing to tune the polling period and chunk size
Sequence diagram:
Questions
Estimation
- 1-2 sprints for a senior developer
- Requires extended testing