Data Import Revision
This document describes changes to Data Import intended to provide a reliable and performant system for ingesting MARC records and similar data into FOLIO.
Goals
Journey Simplification: The current flow involves back-and-forth communication between actors in the system, with recurring expensive operations. The number of actors also makes troubleshooting harder, since the journey must be traced through multiple modules. Simplifying the journey will create an avenue for complex functionality to be developed within a simple paradigm.
Cross-Domain Flexibility: There are scenarios where an action in one domain, e.g. orders, depends on another domain, e.g. inventory, to complete a unit of work. Making that happen today requires complicated “post processing” and generally leaves room for only one dependent action, i.e. an instance can be created for the order, but inventory cannot cleanly interact with yet another domain within the purview of the original create-order request.
Incoming Record Journey Logging: Show the journey of an incoming record in the journal.
Job Execution Status: A job execution can break at any actor, but its status might not be communicated appropriately system-wide. Provide structures that allow reliable notification of job execution status.
Error Codes: Introduce error codes so that user-friendly messages are shown to the user instead of raw exceptions.
Job Profile Validation: Increase validations performed on job profiles before persistence.
Multiple Results from Match Profile: Allow multiple results to be returned by a Match Profile in a Job Profile.
All design items depend to some degree on Journey Simplification, so they are written assuming that the simplification is already in place.
Design
Journey Simplification
The image above shows a system where a “Data Import Processor” is initialized in all the actors of the Data Import system, with specific configuration that allows each to receive events only from specific topics. For example, the Processor in mod-inventory would listen to DI2_INVENTORY but not DI2_ORDERS. This approach inverts the dependency between Data Import and other FOLIO domains like orders and invoices. Rather than Data Import needing intimate knowledge of Inventory, Inventory is given an API to understand the details of Data Import. All that is left on the Data Import side is to create a “runway” for the new FOLIO domain to be supported. Examples of runways are DI2_INVENTORY and DI2_ORDERS; in the future, perhaps DI2_CIRCULATION.
Here is an example of the flow when creating instances with the revised Data Import.
Here is another example of a flow when performing matching and sub-matching with the revised Data Import.
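As an illustration of the topic routing, here is a minimal sketch of how the Processor embedded in mod-inventory might subscribe only to its own runway. The broker address, consumer group name, and use of the plain Kafka client are assumptions for the sketch; the actual Processor would wrap this setup.

    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class InventoryProcessorBootstrap {

      public static KafkaConsumer<String, String> createConsumer() {
        Properties config = new Properties();
        config.put("bootstrap.servers", "kafka:9092");        // assumed broker address
        config.put("group.id", "di2-inventory-processor");    // hypothetical consumer group
        config.put("key.deserializer", StringDeserializer.class.getName());
        config.put("value.deserializer", StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(config);

        // The inventory Processor listens only on its own runway plus the shared reply
        // topic; DI2_ORDERS, DI2_INVOICES, etc. are never subscribed to here.
        consumer.subscribe(List.of("DI2_INVENTORY", "DI2_REPLY"));
        return consumer;
      }
    }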
Protocol
A new event will be used for all communication in the revised flow: DataImportProcessorEvent. It is compatible with the event used in the current flow (Event/DataImportEventPayload).
DataImportProcessorEvent:
  id: String
  type: ProcessorType
  action: String
  payload: Object
  context: DataImportProcessorContext

ProcessorType:
  - NONE
  - INVENTORY
  - INVOICES
  - ORDERS
  - LOGS

DataImportProcessorContext:
  jobExecutionId: UUID
  chunkId: UUID
  recordId: UUID
  userId: UUID
  jobProfileSnapshotId: UUID
  currentNode: ProfileSnapshotWrapper
  jobProfileSnapshot: ProfileSnapshotWrapper
  tenantId: String
  token: String
  okapiUrl: String
  baggage:
    type: HashMap
    key: String
    value: Object
  errors:
    type: List
    items: Object
  eventsChain:
    type: List
    items: String
ProfileSnapshotWrapper is an existing class in the current flow that is reused in the revised flow, hence it is not enumerated here.
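For illustration only, a sketch of building and serializing such an event using Jackson (a library already common in FOLIO modules); the action name, identifiers, and omission of some context fields are assumptions made for the example, not part of the protocol definition:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.node.ObjectNode;

    public class ExampleProcessorEvent {

      public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // Illustrative context; other fields from the schema above are omitted for brevity.
        ObjectNode context = mapper.createObjectNode()
            .put("jobExecutionId", "1f0a7e52-6c3d-4b8f-a9e1-2d5c8b7f4a90")
            .put("recordId", "5a1c3e7f-9b2d-4c6a-8f0e-3d7b5a9c1e40")
            .put("tenantId", "diku")
            .put("okapiUrl", "http://okapi:9130");

        // Illustrative event targeted at the inventory Processor on DI2_INVENTORY.
        ObjectNode event = mapper.createObjectNode()
            .put("id", "c3f2d9a0-5b1e-4f6a-9d2c-8e7b4a1f0c35")
            .put("type", "INVENTORY")
            .put("action", "CREATE_INSTANCE");   // hypothetical action name
        event.set("context", context);

        System.out.println(mapper.writerWithDefaultPrettyPrinter().writeValueAsString(event));
      }
    }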
Benefits
Better Performance: There will be fewer HTTP calls, fewer Kafka messages, and less thrashing of SRS records.
With the POC code deployed to a local development machine, 10,000 instances are created in 5 minutes by the current flow and in 3 minutes by the revised flow. The difference should be wider in a production-like system. The current flow persists a CREATE and an UPDATE of an instance together with a CREATE and an UPDATE of a MARC record in SRS; the revised flow persists a single CREATE of an instance with a single CREATE of a MARC record in SRS. The revised flow also does not produce the intermediate Kafka messages used by the current flow.
Easier Troubleshooting: Interactions will generally be focused on specific modules rather than scattered across the system. This lessens the need for a more robust and costly observability practice.
Cleaner & Straightforward API For Data Import Implementers: The revised flow involves spinning up a Processor and assigning handlers for messages. Most of the setup work is executed by the Processor, leaving Data Import implementers free to focus on their domain (see the sketch after this list).
Foundation For More Flexible Job Profiles: As mentioned earlier, this design lays the groundwork for other functionality to be implemented more easily and with less complexity than in the current flow.
Enhanced Matching Functionality: Because a Processor will be responsible for activities dedicated to a FOLIO domain like inventory, matches and sub-matches are localized to the Processor. The current flow involves communicating with SRS for MARC matching and with Inventory for Instance matching. This localization allows easier implementation of Match Profiles that return multiple results, and better performance, since Kafka messages will not be sent between match and sub-match steps.
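A sketch of what the implementer-facing API could look like, referenced from the “Cleaner & Straightforward API” item above. DataImportProcessor, its builder, and the handler signature are hypothetical names used only to illustrate the division of labour; they are not an existing API.

    // Hypothetical implementer-facing API: mod-inventory registers one handler per action
    // and lets the Processor own Kafka consumption, error publishing and replies.
    DataImportProcessor processor = DataImportProcessor.builder()
        .type(ProcessorType.INVENTORY)                 // domain this Processor serves
        .topic("DI2_INVENTORY")                        // the "runway" it listens on
        .handler("CREATE_INSTANCE", event -> instanceService.create(event))
        .handler("UPDATE_HOLDINGS", event -> holdingsService.update(event))
        .build();

    processor.start();                                 // begin consuming and dispatching events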
A feature toggle in SRM will decide which flow is used to process incoming records.
Cross-Domain Flexibility
We want Data Import to allow library staff to express their job profiles in an intuitive way, with Data Import able to support such flows. Currently, there are hidden gotchas depending on the mix of Match and Action profiles, and developers have been forced to hard-code known paths raised as bugs by the community. As a result, there are long-standing bugs that are hard to resolve.
A special header containing a request identifier will be added to the Kafka message, and a corresponding property will be set in the processor context, before the request is sent to the appropriate processor Kafka topic. When a reply arrives via DI2_REPLY, processors will scan the headers to see whether the reply is meant for a request waiting at that processor instance. This allows efficient processing of replies by avoiding deserialization of Kafka messages that are not pertinent to the Data Import processor instance.
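A minimal sketch of that header check, using the standard Kafka consumer API; the header name and the in-memory registry of pending request ids are assumptions for the sketch:

    import java.nio.charset.StandardCharsets;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.header.Header;

    public class ReplyFilter {

      // Request ids this processor instance is currently waiting on (hypothetical registry).
      private final Set<String> pendingRequestIds = ConcurrentHashMap.newKeySet();

      /** Returns true only if the reply belongs to a request awaited by this instance. */
      public boolean isRelevant(ConsumerRecord<String, byte[]> reply) {
        Header header = reply.headers().lastHeader("di-request-id"); // assumed header name
        if (header == null) {
          return false;
        }
        String requestId = new String(header.value(), StandardCharsets.UTF_8);
        // The header check happens before any deserialization of the message value.
        return pendingRequestIds.contains(requestId);
      }
    }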
Data Import Context Keys
Since there is a higher chance of cross-talk between Data Import processors, the keys of the context should be typed and documented to reduce collisions between processor business logic and to express intent explicitly. Today, the context is a bag of key-value pairs with String keys; a processor could unknowingly overwrite a context item if the same key is used elsewhere.
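One possible way to express typed, documented context keys; the class and key names below are illustrative only:

    /**
     * Illustrative typed context keys. Each key documents its owner domain and value type,
     * so collisions between processors are caught at compile time instead of at runtime.
     */
    public final class ContextKey<T> {

      public static final ContextKey<String> MATCHED_INSTANCE_ID =
          new ContextKey<>("inventory.matchedInstanceId", String.class);
      public static final ContextKey<String> CREATED_ORDER_ID =
          new ContextKey<>("orders.createdOrderId", String.class);

      private final String name;
      private final Class<T> type;

      private ContextKey(String name, Class<T> type) {
        this.name = name;
        this.type = type;
      }

      public String name() {
        return name;
      }

      public Class<T> type() {
        return type;
      }
    }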
Incoming Record Journey Logging
It can be difficult to dig through logs while troubleshooting because there are too many log lines and no easy way to extract only the information needed. For Data Import modules, we should ensure that the following parameters are present in the logs:
Okapi request id
Job execution id
Record id
The following actions need to be performed to realize this goal:
Ensure that Okapi headers are propagated when performing module-to-module calls.
Ensure that Okapi headers are propagated via Kafka messages.
Ensure that the FOLIO logging context, which should contain the Okapi request id and other identifiers, is initialized after receiving input from HTTP (Okapi) or Kafka.
Modify log message patterns to conditionally include the job execution id and record id. The request id should already be present with the standard FOLIO log pattern.
Log4j2 Log Pattern
The log pattern of a Data Import module can include %replace{$${FolioLoggingContext:di_jobexecutionid}}{^(?!$)(.*)}{jobExecutionId=$1}. This will add jobExecutionId=<value> to the logs only if di_jobexecutionid is present in the log context.
Job execution id can be added to the log context like so:
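A minimal sketch, assuming the FolioLoggingContext helper from okapi-common (the same lookup referenced by the log pattern above); the wrapper class is illustrative:

    // Assumes org.folio.okapi.common.logging.FolioLoggingContext is on the classpath;
    // the key must match the name used by the Log4j2 lookup in the pattern above.
    import org.folio.okapi.common.logging.FolioLoggingContext;

    public class DataImportLogging {

      /** Stores the job execution id under the key used by the log pattern. */
      public static void putJobExecutionId(String jobExecutionId) {
        FolioLoggingContext.put("di_jobexecutionid", jobExecutionId);
      }
    }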
The same approach can be used to enrich logs with the record id as well.
Job Execution Status
It is difficult to control job executions because the executors of Data Import are distributed; cancelling a job at one executor does not mean other executors will follow suit. There are two major problem areas observed in FOLIO that this design accounts for:
Cancelled Jobs
A new Kafka topic will be introduced: DI2_JOB. SRM will post job updates on this topic for other Data Import processors to consume. Updates will be published only on key state changes, not on changes such as incrementing job progress. When a job is cancelled, SRM will publish the latest snapshot of the job, and each Data Import processor can store a record of the cancelled job in memory. When an event belonging to a cancelled job is encountered, the Data Import processor will skip the event.
Additionally, when a job execution is encountered and its status is not stored in memory, the processor will query SRM for the latest status of the job.
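A minimal sketch of that skip logic; the in-memory cache and the SRM client interface are both hypothetical:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class JobStatusCache {

      /** Last known status per job execution id, fed by DI2_JOB and by on-demand SRM lookups. */
      private final Map<String, String> statusByJobExecutionId = new ConcurrentHashMap<>();

      private final SrmClient srmClient;  // hypothetical HTTP client for mod-source-record-manager

      public JobStatusCache(SrmClient srmClient) {
        this.srmClient = srmClient;
      }

      /** Called from the DI2_JOB consumer whenever SRM publishes a job snapshot. */
      public void onJobUpdate(String jobExecutionId, String status) {
        statusByJobExecutionId.put(jobExecutionId, status);
      }

      /** Returns true if the event should be skipped because its job was cancelled. */
      public boolean shouldSkip(String jobExecutionId) {
        String status = statusByJobExecutionId.computeIfAbsent(
            jobExecutionId, srmClient::getJobExecutionStatus);  // fall back to querying SRM
        return "CANCELLED".equals(status);
      }

      /** Hypothetical SRM client interface. */
      public interface SrmClient {
        String getJobExecutionStatus(String jobExecutionId);
      }
    }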
Stuck Jobs
Jobs that are not performing any work are detected by the absence of events on DI2_COMPLETED and DI2_ERROR within a time frame. This will be managed by SRM. For example, if no message has been received for a job execution within 5 minutes, the job execution should be marked as “Stalled”. This will inform Data Import users that their import is in an inconsistent state.
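A sketch of how SRM could track this, assuming it keeps a last-activity timestamp per job execution refreshed by its DI2_COMPLETED and DI2_ERROR consumers; the repository interface and status name follow the example above and are otherwise hypothetical:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class StalledJobDetector {

      private static final Duration STALL_THRESHOLD = Duration.ofMinutes(5);

      /** Last time a DI2_COMPLETED or DI2_ERROR event was seen for each running job. */
      private final Map<String, Instant> lastActivity = new ConcurrentHashMap<>();

      /** Called by the DI2_COMPLETED / DI2_ERROR consumers. */
      public void recordActivity(String jobExecutionId) {
        lastActivity.put(jobExecutionId, Instant.now());
      }

      /** Run periodically to flag jobs with no recent activity. */
      public void markStalledJobs(JobExecutionRepository repository) {
        Instant cutoff = Instant.now().minus(STALL_THRESHOLD);
        lastActivity.forEach((jobExecutionId, seenAt) -> {
          if (seenAt.isBefore(cutoff)) {
            repository.updateStatus(jobExecutionId, "STALLED");  // hypothetical repository call
          }
        });
      }

      /** Hypothetical persistence interface in SRM. */
      public interface JobExecutionRepository {
        void updateStatus(String jobExecutionId, String status);
      }
    }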
Error Codes
Each Data Import Processor will have its own dictionary of error codes and user-friendly messages. Payloads on DI2_ERROR will include objects, instead of plain strings, that contain:
Code e.g. INVENTORY-001
Message e.g. “Instance could not be found”
Cause e.g. MatchingException or DuplicateRecordException
Codes and messages will be stored in properties files (or similar) at the Data Import Processor. SRM will record the error objects it is given. Errors that do not follow the standard will be encapsulated in a standard error object such as “Internal Server Error”. Over time, errors that are not yet covered by an error code will be caught either through internal testing or community contributions.
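A sketch of what such an error object and its lookup could look like at a processor; the record name, properties file name, and code format are illustrative only:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Properties;

    /** Illustrative error object published to DI2_ERROR instead of a plain string. */
    public record DataImportError(String code, String message, String cause) {

      /** Resolves the user-friendly message for a code from a properties file bundled with the processor. */
      public static DataImportError of(String code, Throwable cause) {
        Properties messages = new Properties();
        try (InputStream in = DataImportError.class.getResourceAsStream("/error-codes.properties")) {
          if (in != null) {
            messages.load(in);
          }
        } catch (IOException e) {
          // Fall through; unknown codes are wrapped as a generic error by SRM anyway.
        }
        String message = messages.getProperty(code, "Internal Server Error");
        return new DataImportError(code, message, cause.getClass().getSimpleName());
      }
    }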
Job Profile Validation
Validation of job profiles currently occurs only in the frontend, and only to some extent. Validation also needs to be performed in the backend, and it needs to be more stringent.
Validation will be delegated to processors and will mostly not occur at mod-data-import-converter-storage (MDICS). MDICS will validate the general structure of the profile; other FOLIO domain-specific validations will occur at the Data Import Processor. This means that MDICS should not need intimate knowledge about inventory.
For example, mod-inventory will know that “Create Instance” and “Update Holdings” are within its domain, but not actions like “Create Order”.
MDICS will make HTTP calls to a Processor, so this will differ from the typical Data Import flow. MDICS will declare a dependency in its module descriptor on any processor it needs. Processors do not have to conform to a URL standard, but they must be able to accept a Job Profile as input.
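A sketch of the contract a processor could expose behind such an HTTP endpoint; the interface name and signature are not defined by this design and are shown only to illustrate the division of responsibility:

    import java.util.List;

    /**
     * Illustrative contract for domain-specific job profile validation.
     * MDICS checks the general structure of the profile, then calls an endpoint
     * backed by something like this in each processor it depends on.
     */
    public interface JobProfileValidator {

      /**
       * Returns validation errors for the parts of the job profile that belong to this
       * processor's domain (e.g. "Create Instance" for inventory); an empty list means valid.
       */
      List<String> validate(String jobProfileJson);
    }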
Multiple Results From Match Profile
With the bulk of processing occurring at the Processor, multiple results from match profiles can be implemented easily. One important restriction, at least at the beginning, is that matches and sub-matches must remain within a single FOLIO domain, for example matching on instances and then sub-matching on holdings.
Another difference is that matching on MARC records will also occur at processors, not at SRS. This will allow matching on multiple source records and then sub-matching on instances; further sub-matches in the Inventory domain can follow.
Match profiles will have a configurable option to “Allow Multiple” matches. This ensures flexibility and backwards compatibility with job profiles that existed prior to this change. Multiple matches will not be allowed by default.
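A sketch of how a processor's match handling might honour the flag; the method shape and the use of plain record ids are illustrative assumptions:

    import java.util.List;

    public class MatchResultPolicy {

      /**
       * Applies the "Allow Multiple" option to raw match results.
       * With the flag off (the default), more than one hit is treated as an error,
       * preserving the behaviour of job profiles created before this change.
       */
      public static List<String> apply(boolean allowMultiple, List<String> matchedRecordIds) {
        if (matchedRecordIds.size() <= 1 || allowMultiple) {
          return matchedRecordIds;
        }
        throw new IllegalStateException(
            "Match returned " + matchedRecordIds.size() + " records but multiple matches are not allowed");
      }
    }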