DR-000029 - Data consistency and message driven approach

Submitted Date	30 Jun 2021
Approved Date
Status	DRAFT
Impact	MEDIUM

Overrides/Supersedes

This decision was migrated from the Tech Leads Decision Log as part of a consolidation process. The original decision record can be found here.

RFC

N/A

Stakeholders

Front-end and back-end developers who encounter data consistency issues

Contributors

Raman Auramau

Approvers

Background/Context

This solution design document provides details, alternatives, and a decision for the FOLIO cross-module data consistency problem.

Data consistency refers to whether copies of the same data kept in different places match. As a distributed, microservices-based platform, FOLIO consists of separate modules, each with its own data schema and storage. FOLIO follows the principle that each module’s persistent data is private to that module and accessible only via its API. A module’s transactions involve only its own database.

This approach has some drawbacks, among them:

Implementing business transactions that span multiple services is not straightforward.
Implementing queries that join data that now lives in multiple databases is challenging.

ARCH-5 - Spike : Data consistency approach for Folio OPEN

Based on the currently known FOLIO data consistency issues, FOLIO has difficulties caused by both of the shortcomings mentioned above. Those difficulties can be divided into the following groups:

PRIORITY 1 - eventual consistency for redundant / duplicated data: some data is duplicated in 2 storages and must be kept synchronized,
PRIORITY 2 - consistency for dangling / lost references: an item is deleted from one module, leaving lost references to it in other modules, a problem captured in the PR discussion related to UITEN-128,
PRIORITY 3 - data consistency during distributed business operations: data in several separate storages must be modified (mod-finance-storage, mod-invoice-storage, mod-orders-storage),
update collisions ... (check with Jacub?)

Eventual consistency for duplicated data

Brief context: a source module owns an entity, and a particular field of that entity is duplicated into one or more entities of another module (e.g., for search, filtering, or sorting). If the original field value in the source module changes, the change must be replicated everywhere.

Identified cases:

The pair of RefNumber and RefType should be kept consistent between the POL and the invoice line. MODORDERS-421 - Spike : User should be able to edit Pair of "refNumber" and "refNumberType" in the POL and see that update on the Invoice line OPEN
1. mod-orders → mod-invoice
2. 1-to-1 relation - one pair of refNumber/refNumberType to one invoice record (not sure, need to confirm)
VendorCode should be kept consistent between the Organization record and purchaseOrder.vendorCode. MODORDERS-398 - Data consistency needed : Update "vendorCode" in related purchase orders BLOCKED
1. mod-organizations → mod-orders
2. 1-to-many relation - one vendor code can be used in many orders
FundCode should be kept consistent between the Fund record and pol.fundDistribution.code.
1. mod-finance → mod-orders
2. 1-to-many relation - one fund code can be used in many orders

Outstanding questions

Status	Item	Details, comments, decisions
	Raman Auramau How many rows of data can be affected?	Raman Auramau The assumption is up to tens of thousands (e.g., changing a vendor or fund code can affect thousands of orders or more).
	Raman Auramau Are there any specific performance requirements (i.e., how fast must data be synchronized)?
	Raman Auramau What is the expected behavior if synchronization fails? Options: rollback, continue from the same record, retry (1..N times), report an error.
	Raman Auramau What is the allowable lag between changing a value in the module-source and updating it in the module-recipient?

Assumptions

N/A

Constraints

N/A

Rationale

Evaluated Alternatives

Alternative	Reasoning
API Composition is a pattern for microservice-based platforms for implementing queries that span services. In this approach the application, rather than the database, performs the data join. For example, a service (or the API gateway) could retrieve a customer and their orders by first retrieving the customer from the customer service and then querying the order service to return the customer’s most recent orders.	Redundant data is currently used for both visualization and filtering / search. The API Composition approach would likely add internal API calls, which might hurt overall performance and efficiency. Some rethinking of the workflow would also be required.
Command Query Responsibility Segregation (CQRS) - maintain one or more materialized views that contain data from multiple services. The views are kept by services that subscribe to events that each service publishes when it updates its data. For example, the online store could implement a query that finds customers in a particular region and their recent orders by maintaining a view that joins customers and orders. The view is updated by a service that subscribes to customer and order events.	The view mechanism requires implementation effort and resources to keep the views up to date, while the currently known use cases are fairly straightforward. This approach therefore seems too complex for this particular case.
Domain-event pattern for change notifications, and data normalization to simplify data synchronization - implement a notification channel for simple but guaranteed delivery of changes, and improve data normalization to achieve 1-to-1 updates.	Allows FOLIO to keep its current approach of limited data redundancy while solving the eventual consistency issue with minimal effort.

The solution consists of 2 parts:

implementing a robust notification channel with guaranteed delivery to transfer change events from modules-sources to modules-recipients, and
reacting to each such event with guaranteed field updates.

Notification channel

There should be a general notification delivery channel for transferring domain data events from modules-sources to modules-recipients. The channel must provide guaranteed delivery with at-least-once semantics, as well as data persistence. The direct Kafka approach is recommended in this solution since it provides these characteristics.

So, a module-source should be able to track changes in data it owns, and publish such changes as a domain event to a specified Kafka topic. In turn, a module-recipient should be able to connect to a specified Kafka topic, consume domain events and handle them appropriately.
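Because at-least-once delivery means the same event can arrive more than once, a module-recipient's handler should tolerate redelivery. The following is a minimal sketch of one way to do that, not a FOLIO implementation: the class `VendorCodeProjection` and the event keys `eventId`, `recordId` and `payload` are hypothetical names chosen for illustration, and the event is a plain dict rather than a real Kafka record.

```python
from dataclasses import dataclass, field


@dataclass
class VendorCodeProjection:
    """Hypothetical recipient-side state: vendor id -> duplicated vendor code."""

    codes: dict = field(default_factory=dict)
    seen_event_ids: set = field(default_factory=set)

    def handle(self, event: dict) -> bool:
        """Apply an event once; return False when it is a redelivery."""
        if event["eventId"] in self.seen_event_ids:
            return False  # duplicate delivery, nothing to do
        self.codes[event["recordId"]] = event["payload"]["code"]
        self.seen_event_ids.add(event["eventId"])
        return True


projection = VendorCodeProjection()
event = {"eventId": "e1", "recordId": "v1", "payload": {"code": "AMAZ"}}
first = projection.handle(event)
second = projection.handle(event)  # simulated redelivery of the same event
```

In a real module the set of processed event identifiers would need to be persisted together with the data update, otherwise a restart would lose it.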

Since, per the identified use cases, several modules can act as sources (or recipients), it makes sense to implement this logic as a small library with as much common code as possible, and to reuse it in each current or future module to reduce effort and speed up implementation.

Many modules-sources publish their domain events to a Kafka topic (the FOLIO Domain Event Bus), specifying the source, the event type (created, updated, moved, deleted, etc.), the unique identifier of the changed record, and other details if needed. Modules-recipients connect to that topic with different consumer groups and process the events.
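The event contents listed above could be carried in a small envelope like the one sketched here. This is an illustration under assumptions: the class `DomainEvent`, its field names, the extra `event_id` field and the JSON encoding are hypothetical and are not defined by this decision record.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from enum import Enum


class EventType(str, Enum):
    CREATED = "created"
    UPDATED = "updated"
    MOVED = "moved"
    DELETED = "deleted"


@dataclass(frozen=True)
class DomainEvent:
    source: str            # publishing module, e.g. "mod-organizations"
    event_type: EventType  # what happened to the record
    record_id: str         # unique identifier of the changed record
    details: dict = field(default_factory=dict)  # other details if needed
    # Hypothetical addition: lets recipients detect redelivered events.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_json(self) -> str:
        body = asdict(self)
        body["event_type"] = self.event_type.value
        return json.dumps(body, sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "DomainEvent":
        body = json.loads(raw)
        body["event_type"] = EventType(body["event_type"])
        return cls(**body)


event = DomainEvent("mod-organizations", EventType.UPDATED, "v1", {"field": "code"})
restored = DomainEvent.from_json(event.to_json())
```

The serialized string is what a module-source would publish as the message value; the record identifier is a natural candidate for the message key so that events for one record stay ordered within a partition.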

Note that in Kafka there is no explicit limit on the number of consumer groups that can be instantiated for a particular topic. However, the more consumer groups there are, the greater the impact on network utilization (http://kafka.apache.org/090/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html).

Option 1 - Data normalization

RA 27 Apr 2021 - Is it possible to use this approach with a JSON-B structure and to sort by vendor_code located in another table? How might this affect query performance?

17 May 2021 Raman A

A quick POC demonstrated that it is not possible to use the SQL sort operator under these conditions with JSON-B.

So, this proposal is only feasible if all modules-recipients migrate from JSON-B to a standard relational structure.

It is worth noting that, according to performance analysis conducted by Taras Spashchenko (not sure whether it was formally documented), the relational structure demonstrates much better performance than the JSON-B one.

With that, migration to a standard relational structure looks efficient, though it requires some refactoring and testing effort.

At the moment, records in modules-recipients contain redundant data explicitly, i.e. an order can contain an explicit vendor code or fund code. This approach allows a 1-to-many relation in which one and the same code in the module-source is used in many records in the module-recipient. Therefore, a change to one value requires updates to many records, which raises concerns such as update consistency and synchronization lag.

The solution is to slightly improve data normalization, transform relations to 1-to-1 pattern, and consequently significantly improve overall experience.

To achieve that, it is proposed to move explicit redundant data out of the main data of modules-recipients into new, separate tables-vocabularies, and to relate them via unique identifiers, as shown in the picture below:

Data processing (search, filtering, etc.) should then be implemented in the DB engine via join operations. Since a table-vocabulary is expected to have a small number of records, it can be easily indexed, so the join should not negatively affect query performance.

This approach isolates external data (which can potentially change at any point in time) in a separate table-vocabulary. In turn, when a change occurs:

only the table-vocabulary has to be updated, without any impact on other data,
only one record has to be updated rather than dozens or thousands,
database usage (such as disk space or indexing) can potentially improve as well.

In cases with 1-to-1 relations the described additional normalization is likely not required.
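The normalization described in this option can be sketched with two plain relational tables. The example below is illustrative only: it uses SQLite from the Python standard library as a stand-in for the module's real database, and the table names `vendor_vocabulary` and `purchase_order`, their columns, and the sample rows are all hypothetical.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create an in-memory database with a table-vocabulary and orders."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE vendor_vocabulary (
            vendor_id   TEXT PRIMARY KEY,
            vendor_code TEXT NOT NULL
        );
        CREATE TABLE purchase_order (
            id        TEXT PRIMARY KEY,
            vendor_id TEXT NOT NULL REFERENCES vendor_vocabulary(vendor_id)
        );
        CREATE INDEX idx_vendor_code ON vendor_vocabulary(vendor_code);
        """
    )
    conn.executemany(
        "INSERT INTO vendor_vocabulary VALUES (?, ?)",
        [("v1", "ZETA"), ("v2", "ALPHA")],
    )
    conn.executemany(
        "INSERT INTO purchase_order VALUES (?, ?)",
        [("po-1", "v1"), ("po-2", "v1"), ("po-3", "v2")],
    )
    return conn


def orders_sorted_by_vendor_code(conn: sqlite3.Connection) -> list:
    """Sort orders by a code that lives only in the table-vocabulary."""
    return conn.execute(
        """
        SELECT po.id, v.vendor_code
        FROM purchase_order po
        JOIN vendor_vocabulary v ON v.vendor_id = po.vendor_id
        ORDER BY v.vendor_code, po.id
        """
    ).fetchall()


def on_vendor_code_changed(conn: sqlite3.Connection, vendor_id: str, new_code: str) -> int:
    """React to a change event; return the number of rows touched."""
    cur = conn.execute(
        "UPDATE vendor_vocabulary SET vendor_code = ? WHERE vendor_id = ?",
        (new_code, vendor_id),
    )
    return cur.rowcount


conn = build_db()
before = orders_sorted_by_vendor_code(conn)
rows_touched = on_vendor_code_changed(conn, "v1", "AAA")
after = orders_sorted_by_vendor_code(conn)
```

Two orders reference vendor `v1`, yet the change event touches a single vocabulary row, and the sort order of the join result reflects the new code immediately.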

Work breakdown draft structure

Design data model for improved normalization
Implement a migration script (either SQL or Liquibase)
Test migration script
Update existing business logic to enable new data model (most likely - update SQL statements to support join)
Test updated business logic
Add support of domain event and Kafka client in module-source
Add support of domain event and Kafka client in module-recipient
Implement logic to handle domain events and update the data model

Option 2 - De-normalized data processing

It is still possible to handle all the changes even without explicit normalization. For that, it is proposed to split the reaction to a change event into 2 phases. Phase 1 is to receive an event, query the data storage to retrieve the full list of items to be updated, and push information about every item (at least its ID) as a separate message to a Kafka topic. Phase 2 is to receive all such messages one by one (by a single processor, or by a pool of them for scalability and performance) and make the required update.
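The two phases can be sketched as follows. This is a simplified illustration under assumptions: a `collections.deque` stands in for the intermediate Kafka topic, a dict stands in for the recipient's storage, and the function names and message keys (`recordId`, `newCode`, `orderId`) are hypothetical.

```python
from collections import deque


def phase_one(event: dict, orders: dict, work_topic: deque) -> int:
    """Fan out: publish one message per order affected by the change event."""
    published = 0
    for order_id, order in orders.items():
        if order["vendorId"] == event["recordId"]:
            work_topic.append({"orderId": order_id, "newCode": event["newCode"]})
            published += 1
    return published


def phase_two(orders: dict, work_topic: deque) -> int:
    """Consume the per-order messages one by one and apply each update."""
    updated = 0
    while work_topic:
        message = work_topic.popleft()
        order = orders.get(message["orderId"])
        if order is not None:  # the order may have been deleted meanwhile
            order["vendorCode"] = message["newCode"]
            updated += 1
    return updated


orders = {
    "po-1": {"vendorId": "v1", "vendorCode": "OLD"},
    "po-2": {"vendorId": "v2", "vendorCode": "OTHER"},
    "po-3": {"vendorId": "v1", "vendorCode": "OLD"},
}
work_topic = deque()
published = phase_one({"recordId": "v1", "newCode": "NEW"}, orders, work_topic)
updated = phase_two(orders, work_topic)
```

Splitting the work into one message per item means a failure in the middle affects only the message being processed, and phase 2 can be scaled by adding processors.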

Consistency for Dangling References

Deletion of core-module records may leave dangling references from non-core modules

UICR-125 - Data corruption. When holding/item data are moved in Inventory, then the item in Courses is not updated accordingly OPEN
MODORDERS-642 - Data corruption. When holding/item data are moved in Inventory, then the connected Order lines are not updated accordingly IN REFINEMENT
UIREQ-589 - [BE] Data corruption. When holding/item data are moved in Inventory, then the connected Request is not updated accordingly OPEN
UIU-2082 - [BE] Data corruption. When holdings/item data are moved in Inventory, then the connected Fee/Fine is not updated accordingly CLOSED

This case is similar in some respects to the case of eventual consistency for duplicated data: a robust notification channel is required to notify modules-recipients about changes that happened in modules-sources.

Consistency for Distributed Business Operations

Jira tickets: ARCH-5 - Spike : Data consistency approach for Folio OPEN

FOLIO has applied the Database per Service pattern: each service has its own database. Some business transactions, however, span multiple services, so a mechanism is needed to implement transactions that span services. Options:

Two-Phase Commit (2PC) or Distributed Transactions
Saga: another generic pattern for addressing these needs. A saga is a sequence of local transactions. Each local transaction updates the database and publishes a message or event to trigger the next local transaction in the saga. If a local transaction fails because it violates a business rule, the saga executes a series of compensating transactions that undo the changes made by the preceding local transactions.
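The saga mechanics can be illustrated with a small in-process sketch. It is not a proposed design: the steps run as plain function calls instead of being triggered by messages, a single dict stands in for the separate module storages, and all step names and state keys are hypothetical.

```python
from typing import Callable, List, Tuple

# A step is (name, local transaction, compensating transaction).
Step = Tuple[str, Callable[[dict], None], Callable[[dict], None]]


def run_saga(steps: List[Step], state: dict) -> bool:
    """Run steps in order; on failure undo the completed ones in reverse."""
    completed = []
    for name, action, compensate in steps:
        try:
            action(state)
        except Exception:
            for _, undo in reversed(completed):
                undo(state)
            return False
        completed.append((name, compensate))
    return True


def encumber_funds(state: dict) -> None:
    if state["budget"] < state["amount"]:
        raise ValueError("insufficient funds")
    state["budget"] -= state["amount"]


def release_funds(state: dict) -> None:
    state["budget"] += state["amount"]


def open_order(state: dict) -> None:
    state["order"] = "OPEN"


def reset_order(state: dict) -> None:
    state["order"] = "PENDING"


def approve_invoice(state: dict) -> None:
    if state["invoice_total"] != state["amount"]:
        raise ValueError("invoice total does not match the order amount")
    state["invoice"] = "APPROVED"


def reopen_invoice(state: dict) -> None:
    state["invoice"] = "OPEN"


STEPS: List[Step] = [
    ("finance", encumber_funds, release_funds),
    ("orders", open_order, reset_order),
    ("invoice", approve_invoice, reopen_invoice),
]

failing = {"budget": 100, "amount": 40, "invoice_total": 50,
           "order": "PENDING", "invoice": "OPEN"}
failed_ok = run_saga(STEPS, failing)

passing = {"budget": 100, "amount": 40, "invoice_total": 40,
           "order": "PENDING", "invoice": "OPEN"}
passed_ok = run_saga(STEPS, passing)
```

In the failing run the third step violates a business rule, so the order and the funds are restored by the compensating transactions; the step that failed is not compensated because it never completed.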

Decision

Implications

Pros
- N/A
Cons
- N/A

Other Related Resources

References

On distributed updates and eventual consistency

2021-05-12 Meeting notes - there was a lot of discussion on related issues

Meetings notes

19 May 2021 Agenda

Review current context of FOLIO Data Consistency
Review identified groups of Data Consistency issues - are there any to be added?
- added a new one for update collisions
What are the priorities?
Review a proposed solution for Eventual consistency for redundant data
Identify next steps

30 Jun 2021 Tech Leads meeting

Communicate the proposed solution for Eventual consistency for redundant data on Tech Leads meeting
Next steps from my end - no objections from Tech Leads to go ahead
- I do propose to consider a FOLIO Domain Event Bus on top of Kafka as a potential pattern (without strongly committing to this right now)
- Need to work through this in full detail during implementation for a particular case (Thunderjet team has some examples)
- After that - document and review the design and implementation, and present on Tech Leads meeting as a detailed design
Action items:
- Sync with Charlotte W regarding the cross apps consistency
- Chat with Mikhail F and Vladimir S regarding Kafka usage including topic naming conventions etc.
- Meet with Thunderjet team and Dennis B to agree on capacity and planning
- Raman to continue design work and keep in mind Jacub's proposal (see in comments)

06 Jul 2021 Thunderjet grooming session

Raman shared current status and suggested plan; Thunderjet is ok to go ahead
UIREC-135 - Update create, edit and receive piece forms with additional fields CLOSED - this is a top priority issue with data consistency for duplicated data which is necessary for current Thunderjet feature completion; agree to start with this issue
Raman to work with Andrei Makaranka on details