ARCH-14: Investigate opportunity to not take extended time for OCLC single record import while large file is importing

Reference JIRA issue: ARCH-14

The problem

In the current implementation, all Data Import (DI) tasks/messages are queued in Kafka topics and processed by the modules at maximum possible speed. This fully utilizes the system's resources.

On the other hand, all other processes suffer from resource starvation during imports, and overall system latency increases. In addition, manual operations (such as OCLC single record imports) use the same DI mechanisms, so they have to wait for all previously queued DI jobs to complete before they can be processed.

Proposed solution

Flow control

Job priority alone will not solve the issue, since the underlying Kafka topics can still be filled up with messages. Therefore, a flow control mechanism needs to be implemented.

The implementation is done in mod-source-record-manager:

  1. Introduce two configuration parameters (names are subject to change):
    1. di.flow.max_simultaneous_records – defines how many records the system may process simultaneously. The default value is proposed to be 10 × (Kafka batch size).
    2. di.flow.records_threshold – defines the in-flight record count at which new records are pushed to the pipeline. The default value is proposed to be 5 × (Kafka batch size).
    3. Since Kafka works with batches, both parameters should be evenly divisible by the Kafka batch size; a validation could be added to enforce this rule.
  2. Implement a flow control mechanism based on the parameters above (see the sketch after this list):
    1. mod-srm counts the total number of messages it has passed to the DI pipeline. This count must not exceed di.flow.max_simultaneous_records at any time.
    2. When a DI_COMPLETE or DI_ERROR event is received, the record counter is decremented.
    3. When the counter drops to di.flow.records_threshold or below, more records are pushed to the pipeline (from the priority queue). The number of records pushed must still respect di.flow.max_simultaneous_records.
    4. The counters are kept per tenant (since the topics are also per-tenant).
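
A minimal sketch of the per-tenant counter described above, assuming a Kafka batch size of 10 (so the proposed defaults become 100 and 50). Class and method names here are hypothetical, not the actual mod-srm API:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the flow control counter; manual operations bypass
// this counter entirely and are pushed straight to the pipeline.
public class DataImportFlowControl {

  private final int maxSimultaneousRecords; // di.flow.max_simultaneous_records
  private final int recordsThreshold;       // di.flow.records_threshold
  private final Map<String, AtomicInteger> inFlightByTenant = new ConcurrentHashMap<>();

  public DataImportFlowControl(int maxSimultaneousRecords, int recordsThreshold) {
    this.maxSimultaneousRecords = maxSimultaneousRecords;
    this.recordsThreshold = recordsThreshold;
  }

  // How many more records may be pushed without exceeding the limit.
  public int permitsAvailable(String tenantId) {
    return Math.max(0, maxSimultaneousRecords - counter(tenantId).get());
  }

  // Called after a chunk of records has been sent to the DI pipeline.
  public void onRecordsPushed(String tenantId, int count) {
    counter(tenantId).addAndGet(count);
  }

  // Called for each DI_COMPLETE or DI_ERROR event. Returns true when the
  // counter has dropped to the threshold, i.e. the next chunk should be
  // taken from the priority queue.
  public boolean onRecordFinished(String tenantId) {
    return counter(tenantId).decrementAndGet() <= recordsThreshold;
  }

  private AtomicInteger counter(String tenantId) {
    return inFlightByTenant.computeIfAbsent(tenantId, t -> new AtomicInteger());
  }
}

With the assumed batch size, new DataImportFlowControl(100, 50) matches the proposed defaults: a caller checks permitsAvailable before pushing a chunk, records the push with onRecordsPushed, and calls onRecordFinished for every DI_COMPLETE/DI_ERROR event.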

Manual operations are pushed to the pipeline directly and are not subject to the limits above.

This solution still does not allow manual operations to be processed "instantly", but it decreases the wait time from 5,000–25,000 records to a maximum of di.flow.max_simultaneous_records records.

Given an average of 5 minutes per 5,000-record data import on the PTF environment, the average wait time with the default parameters would be 300 s / 5,000 records × 100 records in the pipeline ≈ 6 seconds (100 being the default di.flow.max_simultaneous_records with a Kafka batch size of 10).

If the expected latency is still too high, there is an alternate solution to consider:

Alternate Solution: Prioritized Kafka topics

Instead of creating a single topic per command, the system creates a group of similar topics, each with a defined priority.

Records are placed into a topic according to their priority, and the consuming modules process the topics in priority order.
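
For illustration only, a minimal sketch of priority-aware consumption with plain Kafka consumer clients. The priority-suffixed topic names and the polling strategy are assumptions; the real DI topics and the FOLIO Kafka wrapper differ:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PriorityTopicConsumer {

  public static void main(String[] args) {
    // Hypothetical topic names; real DI topics follow the env/tenant naming scheme.
    try (KafkaConsumer<String, String> high = consumer("di-high", "DI_RAW_RECORDS_CHUNK_READ.high");
         KafkaConsumer<String, String> normal = consumer("di-normal", "DI_RAW_RECORDS_CHUNK_READ.normal")) {
      while (true) {
        // Drain the high-priority topic first; take records from the normal
        // topic only when no prioritized records are waiting.
        ConsumerRecords<String, String> urgent = high.poll(Duration.ofMillis(100));
        if (!urgent.isEmpty()) {
          urgent.forEach(r -> System.out.println("high: " + r.value()));
          continue;
        }
        normal.poll(Duration.ofMillis(100))
            .forEach(r -> System.out.println("normal: " + r.value()));
      }
    }
  }

  private static KafkaConsumer<String, String> consumer(String groupId, String topic) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", groupId);
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(List.of(topic));
    return consumer;
  }
}

Even this toy version hints at the cost: more topics and consumer groups to manage, and sustained high-priority traffic can starve the lower-priority topics.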

This solution would provide the minimum possible latency for prioritized commands, but it is also much more complex, time-consuming, and risky for all of Data Import, both to implement and to maintain.


List of implementation stories for the Proposed Solution:

MODSOURMAN-661

MODSOURMAN-662

KAFKAWRAP-18

MODSOURMAN-663