ARCH-62 - INN-Reach: Record Contribution Enhancements​

Submitted

Approved: Steve Ellis

Status: ACCEPTED

Impact: LOW

Initial contribution job

Context

The initial contribution job has the following issues:

  1. Counting messages to detect job completion makes the code more complex than it needs to be and doesn't work well: there is no reliable way to know when the job is done, the sum of processed records cannot be validated, and sometimes Kafka offsets are still greater than 0 after the job is over
  2. The D2IR API has a "suspended mode", meaning it is unavailable for 2-3 hours a day
  3. The D2IR API doesn't support batch record processing and is unstable in terms of sustained TPS
  4. The verification step slows the code down and produces 404s: items can be contributed only after we validate that their instance has been contributed
  5. The use of two retry mechanisms is awkward and we need to choose one. Right now the Kafka mechanism handles the suspension, which is very important, so it is the natural choice; verification was the reason for keeping both, but that is no longer a strong argument

NFR:

  1. scalable
  2. robust enough not to lose data
  3. transparent to an administrator

Stakeholders

Baseline architecture

  1. The Kafka retry mechanism is used
  2. Consumer concurrency is used
  3. A Spring RetryTemplate is used for communication with the D2IR API
  4. The microservice keeps job progress state in a ConcurrentHashMap, which breaks the stateless microservice architecture approach

Sequence diagram:

baseline.puml

Solution Options

Fine-tuning existing implementation

Approach:

  • Determine the end of job execution by querying the Kafka offset lag for the topic using the admin client (see the sketch after this list)
  • Consume messages in two big loops to address the verification issue: the first pass contributes instances, the second contributes items
  • Remove one of the retry mechanisms: either the Kafka retry or the Spring RetryTemplate
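
A minimal sketch of the lag check, assuming a hypothetical consumer group name and bootstrap servers: the job is treated as finished once the total lag between the group's committed offsets and the latest topic offsets reaches zero.

ContributionLagChecker.java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

// Hypothetical helper; group id and bootstrap servers are illustrative assumptions.
public class ContributionLagChecker {

  public static boolean isJobFinished(String bootstrapServers, String groupId) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);

    try (AdminClient admin = AdminClient.create(props)) {
      // Offsets committed by the contribution consumer group, per partition
      Map<TopicPartition, OffsetAndMetadata> committed = admin
          .listConsumerGroupOffsets(groupId)
          .partitionsToOffsetAndMetadata()
          .get();

      // Latest (end) offsets for the same partitions
      Map<TopicPartition, OffsetSpec> request = committed.keySet().stream()
          .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
      Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
          admin.listOffsets(request).all().get();

      // Total lag across all partitions; zero means every message has been consumed
      long totalLag = committed.entrySet().stream()
          .mapToLong(e -> endOffsets.get(e.getKey()).offset() - e.getValue().offset())
          .sum();
      return totalLag == 0;
    }
  }
}

Note that this lag check is exactly the part that requires privileged (admin) access to Kafka, which is one of the cons listed below.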

Pros:

  1. No architectural changes

Cons:

  1. Might require significant effort
  2. State exists in the microservice, which is against the architectural approach.
  3. Usage of the Kafka admin client requires privileged access to Kafka, which raises security concerns; there are also practical issues:
    1. A consumer group can have many partitions, and the admin client has to collect and compute data from all of them
    2. The application is not aware of the number of partitions
    3. The ConcurrentMessageListenerContainer does not fit well with an AdminClient-based implementation
    4. A listener for ListenerContainerIdleEvent was also tried, but it does not work with multiple partitions

Migration to Spring Integration Framework

Pros:

  1. Built-in retry mechanism
  2. Built-in circuit breaker pattern
  3. Built-in rate limiter

Cons:

  1. Requires a full redesign and re-implementation of the module

Implementation of outbox pattern

Approach:

We need to split the job into two parts: consuming records from Kafka into an outbox table, and periodic processing of the records in the outbox. The latter can run on a timer, e.g. once a second the process obtains X records from the outbox table and contributes only those. A status must be saved for each record in the outbox table; a minimal enum sketch follows the list. The status can be one of the following:

  • ready: newly added events that are ready to be processed
  • rejected: events that are not eligible for contribution
  • in progress: the instance has been contributed and its items are ready to be contributed
  • processed: the instance and its items have been contributed
  • retry: the event was rejected by the D2IR API because the API was unavailable
  • failed: the event failed to be contributed after the maximum number of retry attempts
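
A minimal sketch of these statuses as a Java enum; the type name is an illustrative assumption, not the module's actual implementation.

ContributionStatus.java
// Hypothetical enum mirroring the status list above
public enum ContributionStatus {
  READY,        // newly added event, ready to be processed
  REJECTED,     // event not eligible for contribution
  IN_PROGRESS,  // instance contributed, items ready to be contributed
  PROCESSED,    // instance and its items contributed
  RETRY,        // rejected by the D2IR API because the API was unavailable
  FAILED        // contribution failed after the maximum number of retry attempts
}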

Two "modes" are needed for the retry logic: one for the "suspension" of the D2IR API and one for "verification". Any other failing event is simply logged to the error table, so we never retry indefinitely and there is no possibility of a poison message. This behavior can be achieved by setting the retry status on the record in the outbox table.
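
A hedged sketch of the status handling after a failed contribution attempt; OutboxRecord, D2irException, ErrorLogService and OutboxRepository are assumed types, not existing module classes.

ContributionFailureHandler.java
// Hypothetical sketch: only D2IR "suspension" responses lead to a retry, everything
// else is written to the error table so no record is retried indefinitely.
public class ContributionFailureHandler {

  private final ErrorLogService errorLogService;
  private final OutboxRepository outboxRepository;
  private final int maxRetryAttempts;

  public ContributionFailureHandler(ErrorLogService errorLogService,
                                    OutboxRepository outboxRepository,
                                    int maxRetryAttempts) {
    this.errorLogService = errorLogService;
    this.outboxRepository = outboxRepository;
    this.maxRetryAttempts = maxRetryAttempts;
  }

  public void handleFailure(OutboxRecord record, D2irException e) {
    if (e.isSuspension() && record.getRetryAttempts() < maxRetryAttempts) {
      // D2IR is in "suspended mode": keep the record for a later scheduler run
      record.setStatus(ContributionStatus.RETRY);
      record.incrementRetryAttempts();
    } else {
      // Any other error (or too many attempts) goes to the error table and is not retried
      record.setStatus(ContributionStatus.FAILED);
      errorLogService.log(record, e);
    }
    outboxRepository.save(record);
  }
}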

To implement concurrent contribution we can utilize the Spring scheduler's built-in thread pool with the @Async annotation on the contribution method (see the Spring documentation). The recommended approach is to minimize the interval for processing records and limit the number of records fetched per run. To enable concurrent contribution we need to add the @EnableAsync configuration annotation and add the pool size to application.yml (a Java sketch follows the snippet):

application.yml
spring:
  task:
    scheduling:
      pool:
        size: 20
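
A minimal sketch of the poller and the @Async contribution call, assuming hypothetical OutboxRepository, ContributionService and OutboxRecord types; the @Async method lives in a separate bean so that the call goes through the Spring proxy.

OutboxContributionScheduler.java
import java.util.List;

import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.Async;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.annotation.EnableScheduling;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

// @EnableAsync turns on @Async processing, @EnableScheduling turns on the poller;
// the pool size is configured in application.yml as shown above.
@Configuration
@EnableAsync
@EnableScheduling
class ContributionSchedulingConfig {
}

@Service
class OutboxPoller {

  private final OutboxRepository outboxRepository;  // assumed repository for the outbox table
  private final AsyncContributor contributor;

  OutboxPoller(OutboxRepository outboxRepository, AsyncContributor contributor) {
    this.outboxRepository = outboxRepository;
    this.contributor = contributor;
  }

  // Short interval, small chunk: fetch a limited number of ready records each run
  @Scheduled(fixedDelay = 1000)
  public void pollOutbox() {
    List<OutboxRecord> chunk = outboxRepository.findByStatus(ContributionStatus.READY, 50);
    chunk.forEach(contributor::contribute);
  }
}

@Service
class AsyncContributor {

  private final ContributionService contributionService;  // assumed D2IR client wrapper

  AsyncContributor(ContributionService contributionService) {
    this.contributionService = contributionService;
  }

  // Runs asynchronously so several records can be contributed concurrently
  @Async
  public void contribute(OutboxRecord record) {
    contributionService.contribute(record);
  }
}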


To address the verification issue it is reasonable to split the ready status into two:

  1. ready_for_instances - the record in the outbox table is ready for instance contribution
  2. ready_for_items - the record in the outbox table is ready for items contribution

The statuses above can be processed by separate schedulers (a sketch of the two pollers follows).
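
A minimal sketch of the two pollers, reusing the assumed types from the previous sketch and assuming the ready status has been replaced by READY_FOR_INSTANCES and READY_FOR_ITEMS in the enum; contributeInstance and contributeItem are assumed methods.

SplitStatusPollers.java
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;

// Hypothetical sketch: two independent pollers so instance contribution and item
// contribution are driven by separate schedules and do not block each other.
@Service
class SplitStatusPollers {

  private final OutboxRepository outboxRepository;  // assumed repository
  private final AsyncContributor contributor;       // assumed async contributor (see above)

  SplitStatusPollers(OutboxRepository outboxRepository, AsyncContributor contributor) {
    this.outboxRepository = outboxRepository;
    this.contributor = contributor;
  }

  @Scheduled(fixedDelay = 1000)
  public void pollReadyForInstances() {
    outboxRepository.findByStatus(ContributionStatus.READY_FOR_INSTANCES, 50)
        .forEach(contributor::contributeInstance);
  }

  @Scheduled(fixedDelay = 1000)
  public void pollReadyForItems() {
    outboxRepository.findByStatus(ContributionStatus.READY_FOR_ITEMS, 50)
        .forEach(contributor::contributeItem);
  }
}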

Overall pros and cons of outbox solution:

Pros:

  1. Scalable
  2. Would survive a replica restart
  3. Minor change from an architectural perspective
  4. Doesn't rely on the Kafka retry mechanism; instead, retry logic is achieved by persisting the status in the database
  5. Easy to monitor and track progress

Cons:

  1. Requires a new table in the PostgreSQL database
  2. Requires proper testing to tune the polling period and chunk size

Sequence diagram:

target-outbox.puml

Questions

Estimation

  1. 1-2 sprints for a senior developer
  2. Requires extended testing

Decision