ERM/Licenses/Re:Share Tool Selection Rationale

The K-Int team is working with GBV and other FOLIO subgroups on the development of ERM, License and Resource Sharing functionality. As a team, we have a long history of developing feature-rich applications in this area. FOLIO provides default tooling, which includes the RAML Module Builder (RMB). In setting out to develop the ERM modules, the team looked at the options and decided that the RMB approach was not a good fit for the functional requirements of the ERM module. This question comes up a lot with developers, who ask "Why are you not an RMB app?" This page is an attempt to explain some of the rationale behind that decision, and to map out how we see the future.

RMB takes as its main approach the JSONB document-oriented storage model of Postgres. Essentially, it asks Postgres to act as a document database like MongoDB, where the MongoDB "Collection" construct is mapped to a Postgres table. This approach excels in schemaless operation and where very large datasets need to be sharded over multiple nodes. It is particularly strong in insert-time performance and in convenience for Create and Retrieve operations where the "document" unit is very stable. It can suffer badly, however, when the data is later put to unforeseen uses. Developers need to be extremely careful when designing their document structures for consumption by the document-oriented approach. Although it is absolutely possible to navigate the document tree and join from within nested structures, some nested constructs cannot be easily navigated and joined. This is particularly the case when selecting specific items in a list by ordinal position, especially where the selection is also filtered on one or more sub-properties.

What we have with ERM, Licenses and Resource Sharing is by its nature highly relational data. A "TitleInstance" can appear in many different packages as an item of content and can be made available on many different platforms (Nature on nature.com, Nature as supplied through myarchivalservice.com, etc). Nature as a "Work" may also be made available in several different formats and via numerous publishers over time. Looked at from a domain modelling perspective, the ERM application's data is viewed in several different ways depending upon the operation the end user needs. Sometimes a user will say "Please show me all the options for acquiring Nature", expecting to see a list of packages and individually purchasable titles. Other times, a user may say "I currently have package X, but am approaching the end of my deal period. I'd like to compare the content of Package X with Package Y and see what overlap of titles there is". These two queries both have title data at their heart, but require radically different database access paths. Efficiently providing this level of flexibility in query operations is a hard problem for document-oriented structures.
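The two user questions above can be sketched against a small relational schema. This is a minimal illustration only: the table names, column names and sample rows are hypothetical and are not the actual ERM data model, and SQLite stands in for Postgres so that the example is self-contained.

```python
import sqlite3

# Hypothetical, heavily simplified schema (NOT the real ERM model).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE title_instance (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE package (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE package_content_item (
    package_id INTEGER NOT NULL REFERENCES package(id),
    title_id   INTEGER NOT NULL REFERENCES title_instance(id),
    platform   TEXT NOT NULL,
    PRIMARY KEY (package_id, title_id, platform)
);
""")
# Made-up sample data for illustration.
conn.executemany("INSERT INTO title_instance VALUES (?, ?)",
                 [(1, "Nature"), (2, "Example Journal A"), (3, "Example Journal B")])
conn.executemany("INSERT INTO package VALUES (?, ?)",
                 [(10, "Package X"), (20, "Package Y")])
conn.executemany("INSERT INTO package_content_item VALUES (?, ?, ?)", [
    (10, 1, "nature.com"),
    (10, 2, "platform-a.example"),
    (20, 1, "myarchivalservice.com"),
    (20, 3, "platform-b.example"),
])


def packages_offering(title_name):
    """'Show me all the options for acquiring <title>': start from the title."""
    return conn.execute("""
        SELECT p.name, pci.platform
        FROM title_instance t
        JOIN package_content_item pci ON pci.title_id = t.id
        JOIN package p ON p.id = pci.package_id
        WHERE t.name = ?
        ORDER BY p.name, pci.platform
    """, (title_name,)).fetchall()


def title_overlap(package_a, package_b):
    """'Compare Package X with Package Y': start from the two packages."""
    rows = conn.execute("""
        SELECT DISTINCT t.name
        FROM title_instance t
        JOIN package_content_item a ON a.title_id = t.id
        JOIN package pa ON pa.id = a.package_id AND pa.name = ?
        JOIN package_content_item b ON b.title_id = t.id
        JOIN package pb ON pb.id = b.package_id AND pb.name = ?
        ORDER BY t.name
    """, (package_a, package_b)).fetchall()
    return [row[0] for row in rows]


print(packages_offering("Nature"))
print(title_overlap("Package X", "Package Y"))
```

Both functions read the same three tables but enter them from different sides, which is the property the relational layout is chosen for here.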

Our experience with ERM systems leads us to believe that there is a high degree of uncertainty about what retrieval (Complex Search) operations will be needed over time. Adopting a relational schema gives us the best chance of responding flexibly to unforeseen query requirements in a performant way.

This flexibility, however, comes at a cost. It is not possible to just "stuff" a "blob" of JSON into a table and say "I've inserted a package". Whereas in the document approach a high degree of duplication would appear at the package level, in the relational approach each entity must be pre-coordinated: either matched to an existing tuple, or created as a new one. Fortunately, this level of pre-coordination is entirely in line with the requirements and expectations of our user community, who almost invariably want title references in particular to be coordinated.
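The "match an existing tuple, or create a new one" step can be sketched as follows. This is an illustration under stated assumptions, not the ERM ingest code: the schema, the function names and the identifiers are hypothetical, and matching on a single identifier column is a deliberate simplification of real title matching.

```python
import sqlite3

# Hypothetical schema; real title matching uses richer rules than one identifier.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE title_instance (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    identifier TEXT NOT NULL UNIQUE,
    name       TEXT NOT NULL
);
CREATE TABLE package_content_item (
    package_name TEXT NOT NULL,
    title_id     INTEGER NOT NULL REFERENCES title_instance(id),
    PRIMARY KEY (package_name, title_id)
);
""")


def find_or_create_title(identifier, name):
    """Return the id of the matching title, inserting a new row only if none exists."""
    row = conn.execute(
        "SELECT id FROM title_instance WHERE identifier = ?", (identifier,)
    ).fetchone()
    if row is not None:
        return row[0]
    cursor = conn.execute(
        "INSERT INTO title_instance (identifier, name) VALUES (?, ?)",
        (identifier, name),
    )
    return cursor.lastrowid


def ingest_package(package_doc):
    """Walk an incoming package document and link it to coordinated titles."""
    for item in package_doc["titles"]:
        title_id = find_or_create_title(item["identifier"], item["name"])
        conn.execute(
            "INSERT OR IGNORE INTO package_content_item VALUES (?, ?)",
            (package_doc["name"], title_id),
        )
    conn.commit()


# Two packages that both contain the same title (made-up identifiers).
ingest_package({"name": "Package X",
                "titles": [{"identifier": "id-0001", "name": "Nature"}]})
ingest_package({"name": "Package Y",
                "titles": [{"identifier": "id-0001", "name": "Nature"},
                           {"identifier": "id-0002", "name": "Example Journal"}]})

title_count = conn.execute("SELECT COUNT(*) FROM title_instance").fetchone()[0]
link_count = conn.execute("SELECT COUNT(*) FROM package_content_item").fetchone()[0]
print(title_count, link_count)
```

The shared title is stored once and linked from both packages, rather than being duplicated inside each package document.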

A major feature of document-oriented databases is their ability to act in a schemaless mode, where fields can be dynamically added on demand. RMB modules remove this capability, however, by demanding that instances rigidly conform to a JSON Schema and rejecting any fields that do not appear in that schema. This seems to undermine one of the fundamental strengths of the approach.

In order to mitigate the "Object-Relational Impedance Mismatch", the ERM team chose a framework which is based on Hibernate but which provides extremely sophisticated data mapping functionality. Grails provides functions to automatically traverse POSTed JSON in CRUD operations and, at each level of the hierarchy, invoke data mappers which translate between a particular serialisation and the more traditional domain model. This gives us the best of both worlds: a flexible and expressive REST API that can provide documents tailored specifically for the operation being requested, and a normalised internal storage format which is amenable to unforeseen queries and is able to adapt and react to changing retrieval requirements.

Furthermore, because the storage model is private, we are able to hide implementation detail from the interface. This contrasts sharply with RMB, where the internal storage format is essentially dictated by the public interface definition. In ERM, our interfaces (in and out) always represent one possible serialization of the internal model. This gives us flexibility and the ability to evolve new interfaces independently of our internal storage format. It will also allow us to support GraphQL directly against the module. This will be useful because any given request will only need to locate and serialize the data actually requested, rather than handle objects as immutable blobs. This property is essential when dealing with packages of 1 million titles, even if the list only contains identifiers that link to some other object.
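The idea that an interface is "one possible serialization of the internal model" can be sketched in a few lines. This is a conceptual illustration in plain Python, not Grails or GraphQL code; the classes, function names and field names are all hypothetical.

```python
from dataclasses import dataclass, field


# Hypothetical internal model, private to the module.
@dataclass
class Title:
    id: int
    name: str


@dataclass
class Package:
    id: int
    name: str
    titles: list = field(default_factory=list)


def serialize_summary(pkg):
    """One public view: no title list, so it stays cheap for very large packages."""
    return {"id": pkg.id, "name": pkg.name, "titleCount": len(pkg.titles)}


def serialize_full(pkg):
    """Another public view of the same internal object, with titles expanded."""
    return {
        "id": pkg.id,
        "name": pkg.name,
        "titles": [{"id": t.id, "name": t.name} for t in pkg.titles],
    }


def serialize_fields(pkg, requested):
    """GraphQL-like selection: compute only the fields the caller asked for."""
    resolvers = {
        "id": lambda: pkg.id,
        "name": lambda: pkg.name,
        "titleCount": lambda: len(pkg.titles),
        "titles": lambda: [{"id": t.id, "name": t.name} for t in pkg.titles],
    }
    return {name: resolvers[name]() for name in requested}


pkg = Package(10, "Package X", [Title(1, "Nature"), Title(2, "Example Journal")])
print(serialize_summary(pkg))
print(serialize_fields(pkg, ["name", "titleCount"]))
```

Because the resolvers are only invoked for requested fields, a caller asking for `name` and `titleCount` never triggers serialization of the title list. In a real module the same selection could also limit which rows are fetched from the database.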

K-Int has used a number of pre-existing tools and libraries to support this work. In particular, the KIWT library is used in its Community distribution to provide instant REST endpoints. Over in Re:Share this library is used to automatically generate JSON Schema for domain objects, and is combined with system-level RAML output. So instead of having the RAML external to the app and used to specify its internal semantics, the app can be asked to describe its own interface in RAML, and its own schema objects in JSON Schema. See https://github.com/openlibraryenvironment/mod-rs.