Design of XML importing APIs in mod-inventory-update

Mod-inventory-update (MIU) is a module for updating FOLIO’s inventory storage from external source files that can be either XML documents, which are collections of records of an arbitrary XML structure, or JSON files which must follow a predefined schema specific to mod-inventory-update.

Mod-inventory-update performs so-called “upserts” of inventory storage, determining on the fly whether inventory records should be created or updated. All upserts of instances, holdings records, items, etc. are based on mutually agreed-upon, unique identifiers. The external “source of truth” would know the unique and persistent IDs of its instances, holdings and items (or rather its equivalent data structures) in its source system, and these IDs are in turn stored in the respective HRID properties of the records in inventory.
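The create-or-update decision can be illustrated with a minimal Python sketch. It is not MIU's implementation: the in-memory dictionary stands in for inventory storage, the function name `upsert` and the sample HRIDs are hypothetical, and the sketch simply overwrites an existing record, whereas how MIU reconciles existing properties is not covered here.

```python
from typing import Dict


def upsert(storage: Dict[str, dict], record: dict) -> str:
    """Create or update a record keyed by its HRID; report which one happened."""
    hrid = record["hrid"]
    # The HRID carries the source system's persistent ID, so its presence
    # in storage is what decides between create and update.
    outcome = "updated" if hrid in storage else "created"
    storage[hrid] = dict(record)  # assumption: plain overwrite on update
    return outcome


# Hypothetical stand-in for inventory storage, keyed by HRID.
instances: Dict[str, dict] = {}
first = upsert(instances, {"hrid": "src-0001", "title": "First title"})
second = upsert(instances, {"hrid": "src-0001", "title": "Corrected title"})
```

Sending the same HRID twice yields one create followed by one update, leaving a single record in storage.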

The JSON-based APIs have been around for a while, serving the fairly narrow purpose of acting as a facade for inventory storage that relieves an external client of the details of inventory CRUD and provides batch updating, metrics and error feedback to the client. They consist of just a couple of relevant APIs.

This document describes the addition of a more numerous set of APIs that play different roles in support of importing XML records to inventory. The XML import API includes

  • APIs for configuring the imports, in particular the transformation of XML records of an arbitrary format into structures that can be pushed as JSON records to inventory

  • APIs for controlling the execution of XML imports

  • APIs for monitoring the progress of imports and checking for errors and troubleshooting

  • An API for actually uploading (importing) the XML records

The introduction of the XML importing APIs does not mean that the existing JSON APIs will change in any way. Internally, the new XML APIs will themselves use the JSON “facade” in the interaction with inventory. So both JSON input and XML input will go through the “upsert engine” in MIU.

This diagram illustrates the architecture: both JSON and XML go through the “upsert engine” in the lower left, but the XML files go into a queue and the records then go through a longer process of transformations before the upserting stage. The XML import API is therefore asynchronous, meaning that there will be no immediate response to an import request other than confirmation that the file has been uploaded. The JSON APIs are synchronous, returning a real-time response with the outcome of the update request.

Figure: Overview of MIU’s APIs

Channels

The importing component uses the term channel for the collection of configurations and processes that supports importing an ordered collection of XML files into inventory storage as instances, holdings records and items.

The channel can be said to consist of

  • a basic channel object, a configuration record with

    • the channel’s name and

    • some indicators defining if the channel is enabled and actively consuming source files as they arrive, and

    • a reference to the transformation pipeline that the channel should apply to the source files

  • a dedicated process – a Vert.x verticle to be concrete – handling the processing of the source files

  • a file system based queuing mechanism, using Vert.x file system

  • the transformation pipeline, which consists of

    • a set of one or more custom-written transformation style sheets, organized in an ordered list of so-called transformation steps

    • a static XML-to-JSON converter that converts inventory record set XML to inventory record set JSON

  • a component that batches the converted records for optimal use of inventory storage’s batch updating APIs

  • a reporting component that logs

    • when a job starts and ends

    • files processed

    • record counts

    • performance metrics

    • records that failed to update

    • other major events
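The batching component in the list above can be sketched in a few lines of Python. This is an illustration only: the function name `batches`, the batch size of 3 and the sample record sets are assumptions, and the batch size MIU actually uses is not stated here.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batches(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most `size` records, preserving order."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return  # input exhausted
        yield batch


# Seven hypothetical converted record sets, batched three at a time.
record_sets = [{"instance": {"hrid": f"src-{n:04d}"}} for n in range(1, 8)]
batch_sizes = [len(batch) for batch in batches(record_sets, 3)]
```

Because the function consumes an iterator, records can be batched as they come out of the transformation without holding the whole file's records in memory at once.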

More than one channel can use the same transformation pipeline in order to perform large import jobs by running multiple channels in parallel if feasible.

Channels are currently hard-coded to accept source XML files of up to 100 MB each. The limit could conceivably be set higher or made configurable, but a web service requires considerably more than 100 MB to receive a 100 MB file (depending on the method of uploading that the client chooses).

The file upload process and the channel processing are distinct processes that are not tightly chained. The upload process ends when the source file is saved in the file system queue, which happens quickly. The channel is a separate process that “listens” for incoming files in its dedicated file queue, and that is where the real work happens. The outcome of that process can be followed through the monitoring APIs.
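The decoupling of upload and processing can be sketched as a directory-backed queue. This is a simplified Python illustration, not the Vert.x implementation: the class `FileQueue`, the function `drain`, the numeric file name prefix used to preserve arrival order, and the directory name are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


class FileQueue:
    """A directory used as a first-in, first-out queue of uploaded files."""

    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self.directory.mkdir(parents=True, exist_ok=True)
        self._counter = 0

    def enqueue(self, name: str, content: bytes) -> Path:
        """Upload side: save the file and return; no processing happens here."""
        self._counter += 1
        # Assumption: a zero-padded sequence number keeps files in arrival order.
        path = self.directory / f"{self._counter:08d}-{name}"
        path.write_bytes(content)
        return path

    def next_file(self) -> Optional[Path]:
        """Channel side: the oldest file still waiting, or None if the queue is empty."""
        files = sorted(p for p in self.directory.iterdir() if p.is_file())
        return files[0] if files else None


def drain(queue: FileQueue) -> List[str]:
    """Stand-in for the channel process: consume files until the queue is empty."""
    processed = []
    while (path := queue.next_file()) is not None:
        processed.append(path.read_bytes().decode("utf-8"))
        path.unlink()  # the queue is transient; files are deleted after use
    return processed


with tempfile.TemporaryDirectory() as tmp:
    queue = FileQueue(Path(tmp) / "channel-a")
    queue.enqueue("b.xml", b"<records>1</records>")
    queue.enqueue("a.xml", b"<records>2</records>")
    processed = drain(queue)
    remaining = queue.next_file()
```

The upload call returns as soon as the file is written, while `drain` plays the role of the channel that picks files up in the order they arrived.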

The module’s README has additional information about the APIs that handle channels.

XML transformation

The heart of the import process is a set of one or more custom-written style sheets that transform arbitrary metadata records into a structure that is compatible with the schema of inventory storage. The transformation pipeline can be pieced together from multiple style sheets that can be added, viewed, reordered or deleted through MIU’s import admin APIs.

Once the XML transformation is completed the module will convert the XML to JSON, gather the records in batches, and push them to inventory storage through the upsert engine.

This means that there is an internal contract for the format of the XML that must come out of the XSLT transformation. The XML must be of a structure that can be generically converted to JSON in compliance with the JSON schema for MIU, the so-called “inventory record set”.

The high level structure of an inventory record set is a mandatory instance object, a separate, optional array of holdings records, each with an optional array of items, and then separate objects for instance relations and processing instructions. The module README has operational details about the various elements of the inventory record set and the way MIU will handle them.

  • Inventory instance with hrid and other properties

  • Array of inventory holdings records (optional)

    • Holdings record with hrid and other properties, and

      • Array of inventory items (optional)

        • Item with hrid and other properties

        • Item

        • ...

    • Holdings record

      • Items

        • Item

        • Item

        • ...

    • ...

  • Instance relations (optional)

  • Processing instructions (optional)

This is, in other words, the structure that the XSLT transformations must build. In addition, style sheet authors must take into account the differences in how arrays are expressed in JSON and XML respectively. Without an XML schema, there is no way of telling whether an element is potentially an array of one or just a single element (even if a plural ‘s’ in the element name would be a hint). But with JSON we have to know. MIU thus has a mandatory internal convention of annotating arrays as <arr> with embedded elements named <i>, as in <subjects><arr><i>topic 1</i><i>topic 2</i></arr></subjects>. This would become “subjects”: [“topic 1”, “topic 2”] when converted to JSON. The convention of <arr> with <i>s is the same for the array of holdings records of an instance, and for the array of a holdings record’s items.
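The <arr>/<i> convention makes a generic conversion straightforward. The following Python sketch shows one way such a converter could work; it is not MIU's converter. The function name `to_json_value`, the root element name `record`, and the element names `holdingsRecords`, `items` and `title` are assumptions made for the example, and every leaf value is emitted as a string.

```python
import xml.etree.ElementTree as ET
from typing import Any


def to_json_value(element: ET.Element) -> Any:
    """Convert an element to a string, a list (<arr> of <i>) or an object."""
    children = list(element)
    if not children:
        # Leaf element: its text becomes a JSON string.
        return (element.text or "").strip()
    if len(children) == 1 and children[0].tag == "arr":
        # <arr> marks an array; each <i> child is one array entry.
        return [to_json_value(item) for item in children[0] if item.tag == "i"]
    # Any other element with children becomes a JSON object.
    return {child.tag: to_json_value(child) for child in children}


# Hypothetical output of the XSLT steps for one inventory record set.
SAMPLE = """
<record>
  <instance>
    <hrid>src-0001</hrid>
    <title>Example title</title>
    <subjects><arr><i>topic 1</i><i>topic 2</i></arr></subjects>
  </instance>
  <holdingsRecords>
    <arr>
      <i>
        <hrid>src-0001-h1</hrid>
        <items><arr><i><hrid>src-0001-h1-i1</hrid></i></arr></items>
      </i>
    </arr>
  </holdingsRecords>
</record>
"""

record_set = to_json_value(ET.fromstring(SAMPLE.strip()))
```

Note that the single item under the holdings record still converts to a one-element list, which is exactly the ambiguity the convention is there to resolve.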

The import admin database

Importing to inventory is set up as one or more channels, each with an associated transformation pipeline, which in turn consists of one or more transformation steps.

When a file is uploaded to an idle channel, it will trigger the creation of a status record in the import job table. The import job record holds the status of the import job, like start time, state and eventually finish time (or interrupt time). The job is considered finished when there are no more uploaded files in the queue to process. Events and metrics are logged to an import log table as the job progresses, and in case of an update error affecting the processing of instances, holdings or items, a detailed error description for each affected inventory record set will be written to a failed records table. Logged records can be purged automatically at configurable intervals.
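The job lifecycle described above can be modeled in a small Python sketch. It is an in-memory illustration under stated assumptions, not the module's database schema: the class names, the status values `RUNNING` and `DONE`, the log messages and the shape of the failed-record entries are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Deque, List, Optional


@dataclass
class ImportJob:
    channel: str
    started: datetime
    status: str = "RUNNING"  # hypothetical status value
    finished: Optional[datetime] = None


@dataclass
class Channel:
    name: str
    queue: Deque[str] = field(default_factory=deque)
    current_job: Optional[ImportJob] = None
    jobs: List[ImportJob] = field(default_factory=list)  # import job table
    log: List[str] = field(default_factory=list)  # import log table
    failed_records: List[dict] = field(default_factory=list)  # failed records table

    def upload(self, file_name: str) -> None:
        if self.current_job is None:
            # Upload to an idle channel: this is what creates a new job record.
            self.current_job = ImportJob(self.name, datetime.now(timezone.utc))
            self.jobs.append(self.current_job)
            self.log.append("job started")
        self.queue.append(file_name)

    def process_next(self, failed_hrid: Optional[str] = None) -> None:
        file_name = self.queue.popleft()
        self.log.append(f"processed {file_name}")
        if failed_hrid is not None:
            self.failed_records.append({"hrid": failed_hrid, "file": file_name})
        if not self.queue and self.current_job is not None:
            # No more uploaded files in the queue: the job is finished.
            self.current_job.status = "DONE"
            self.current_job.finished = datetime.now(timezone.utc)
            self.log.append("job finished")
            self.current_job = None


channel = Channel("channel-a")
channel.upload("part-1.xml")
channel.upload("part-2.xml")  # joins the running job
channel.process_next()
channel.process_next(failed_hrid="src-0042")  # queue empties, job finishes
channel.upload("part-3.xml")  # channel was idle again, so a second job starts
```

Two uploads in quick succession share one job, while an upload arriving after the queue has emptied starts a new one.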

Figure: import-admin-db.jpg

Notes on the import queue

When an XML file is uploaded to MIU for importing to inventory storage, the file is put in a queue. The queue is implemented using the Vert.x file system, meaning that the files are temporarily written to a predefined directory structure in the deployment directory of the module.

This provides a quick, simple and transparent queuing mechanism with a minimal memory footprint, but alternative implementations are of course possible, for example storing the uploaded files in the database.

The file system was chosen for its simplicity and low memory requirements. The file system queue is meant to be transient: it is just a queue that the source files pass through until they are imported and can be deleted again. Cleaning up files and directories after use is quick and easy. Retaining the directories between re-deployments of the module is possible, for example in order to pick up an import process that was interrupted by a module shutdown, but it is not a system for persisting or managing files in any way.

It cannot be ruled out that the file system implementation would cause issues in a clustered environment compared to a database implementation. Also, if a requirement came up to have the files persisted for tracking or similar purposes, then some form of reliable storage would be needed instead.