Improved Handling of Reference Data During Upgrades

Summary

This document presents the final recommendations of the “Reference Data and Upgrades” working group, which was commissioned by the Folio Technical Council.

Motivation

By design, Folio offers a great deal of flexibility and the ability to highly customize system-supplied default data sets. This can be a problem if two sets of changes to a common original data set are incompatible, yet need to be reconciled. This happens frequently during the Folio upgrade process where customizations are in conflict with intended changes being introduced by the most recent release version.

There are two parties involved in these cases. Applying an upgrade to a working Folio installation is an operational concern and falls to technical staff: devops or sysops. Creating and maintaining customizations to Folio is the responsibility of subject matter experts who are users of Folio: librarians. The goal of these proposed changes is to streamline the upgrade process so that each party can carry out its responsibilities independently, without being blocked by steps that the other must perform.

Detailed Explanation/Design

The fundamental issue is that the same set of data is used to hold system-defined reference data, user created custom reference data, and customization to existing reference data. A system upgrade can only know how to apply changes to a pristine Folio installation. Similarly, customizations can only be made against a known set of default reference data; they cannot anticipate future changes to default reference data that will be introduced by future upgrades.

This recommendation proposes two fundamental changes.

The separation of current reference data into two separate sets: Default Data and Operational Data.
The sequencing of the upgrade process into two distinct steps: the default system upgrade, followed by customization upgrades.

Two Separate Data Sets

It is recommended that reference data be split into two sets.

Default Data. This represents the set of reference data that is distributed with each Folio release. As such it represents a pristine Folio installation - no customizations. The ability to upgrade from the prior set of Default Data to the new set of Default Data can be completely automated.

Operational Data. This is the set of reference data used at runtime by a Folio installation. In a pristine Folio installation the Operational Data is merely an identical copy of the Default Data. Any and all customizations are applied to Operational Data.
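
As a minimal sketch of this split, assume reference data can be modeled as a mapping from entry identifier to record. The identifiers and the names `default_data` and `operational_data` are illustrative only and are not taken from Folio.

```python
import copy

# Hypothetical Default Data as distributed with a release: identifier -> record.
default_data = {
    "mt-book": {"name": "Book"},
    "mt-dvd": {"name": "DVD"},
}

# In a pristine installation, Operational Data is an identical copy.
operational_data = copy.deepcopy(default_data)

# Customizations are applied to Operational Data only.
operational_data["mt-dvd"]["name"] = "DVD (local)"    # customize an existing entry
operational_data["local-1"] = {"name": "Blu-Ray"}     # add a purely local entry
```

Because the copy is deep, the Default Data stays untouched and remains available as the baseline for the next upgrade.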

Two Step Upgrade Process

Under this proposal, upgrading an active Folio installation is a two step process.

Upgrading Default Data. This is the necessary first step in an upgrade and is the concern of the operational team. Here, the Default Data of the system is upgraded to reflect the new state of Default Data as specified by the new release. As part of the process, any potentially conflicting customizations - in the Operational Data - are identified, and a report is generated for use in the second step below. Furthermore, any identified conflicting customizations are preserved but yield in favor of the new Default Data changes. The goal is to produce a functional upgraded Folio installation. This step can be fully automated and should be provided as part of the distribution.

Reconciling Operational Data. This is the second step, whereby any conflicting customizations that were put aside in favor of Default Data can be addressed. This is the concern of the subject matter experts and they now have the functional upgraded Folio installation from the previous step with which to reconcile the conflicting customizations. This could involve manual evaluation and changes to achieve reconciliation.

Observations:

If the Folio installation does not contain customizations, or if the customizations do not conflict, the upgrade can be completed after the first step and be fully automated.
Reconciliation of conflicting customizations within the Operational Data may take the form of:
- Deprecating the customization in favor of the new Default Data
- Reapplying the customization onto the new Default Data
- Merging the customizations with Default Data
- Correcting any references to previously used customizations.

Upgrade Process Overall Workflow

The following diagram shows the overall workflow for handling reference data during an upgrade. Points to note:

The upper track represents the responsibilities of system operators. The completion of this track results in a working system that can be handed over for further review.
The lower track represents the responsibilities of the subject matter experts, who start from a working system where conflicting customizations have been mitigated (though not lost) and can be reviewed. The completion of this track results in a working system with all customizations reconciled.
The review of conflicting customizations can be greatly assisted by the reports generated in the upper track (shown in orange). These provide a list of customizations for manual review within Folio, but can also serve as the basis for creating scripts for data correction.
The workflow supposes that this process is applied to a freshly created staging system which is a clone of the production system. The expectation is that these steps could then be re-applied to the production system. To this end, the operations in the highlighted steps (purple) could be recorded for later playback.
There is the opportunity to create additional external toolsets to assist in this process. These could include comparison tools to better identify and resolve conflicting customizations, as well as batch tools to apply data corrections (e.g. to remove deprecated customizations).
Additional external tooling could also be developed to manage the upgrade process. It could record decisions that were made during an upgrade so they could be remembered for subsequent upgrades.

Reconciling Conflicting Customizations

The following flowcharts illustrate the logical paths that would be implemented in the purple boxes shown in the above diagram which relate to handling of conflicting customizations.

As part of the upgrade process, some changes to Default Data may cause conflicts with previously defined customizations as found in the Operational Data. In order to allow the system operators to hand over a working Folio installation to subject matter experts for further review, priority is given to Default Data changes. Changes would include additions, modifications or deletions. Some customizations do not cause conflicts and can automatically remain in place in the Operational Data set. However, as changes from the new Default Data are applied to Operational Data, the conflicting customizations must be moved aside to avoid the conflict. They will be reviewed and resolved later.

The following truth table illustrates the rules to be applied for reconciling what is a two-way merge: two sets of changes, one in Operational Data and one in the new Default Data, made against the common old Default Data. The first three rows allow for automation. The last row, where both Operational Data and the new Default Data reflect changes, represents the situation where a review is required. There are additional considerations for handling that case depending on the nature of each change: deletion, addition or modification.

Old Default	Old Operational	New Default	New Operational
base	unchanged	unchanged	unchanged
base	unchanged	changed	apply
base	changed	unchanged	keep
base	changed	changed	review
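
The truth table can be expressed as a small decision function. This is only a sketch: the function name `merge_action` and the returned strings are assumptions made for illustration, and each argument stands for the value of one entry in the respective data set.

```python
def merge_action(old_default, old_operational, new_default):
    """Return the action for one entry, row by row as in the truth table."""
    customized = old_operational != old_default    # changed locally
    default_changed = new_default != old_default   # changed by the release
    if customized and default_changed:
        return "review"      # both sides changed: needs a decision
    if default_changed:
        return "apply"       # only the release changed it
    if customized:
        return "keep"        # only the local site changed it
    return "unchanged"


actions = [
    merge_action("Book", "Book", "Book"),
    merge_action("Book", "Book", "Books"),
    merge_action("Book", "Buch", "Book"),
    merge_action("Book", "Buch", "Books"),
]
```

Only the last combination yields an action that cannot be automated.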

A two pass process is necessary to identify all conflicting customizations. The first pass is to account for any customizations that might be unique to Operational Data - i.e. not found in either the old or new Default Data. The second pass iterates through the changes to Default Data and determines if they create any previously unidentified conflicts due to customizations.

First Pass

Look for Operational Data entries that are not in Default Data and mark them appropriately. This must be done in a way that avoids violating any uniqueness constraints. For example, an entry could be marked by relabeling it with a suffix.

For each entry in the Operational Data:
   If entry identifier is in Default Data: 
      do nothing # these are not the droids you’re looking for
   Else:
      If entry identifier is in Old Default Data: # it was removed in current Default Data
         Add to report;
         Mark as deprecated; # will need to be reviewed
      Else: # it is a completely new custom entry
         Add to report;
         Mark as custom;  # will need to be reviewed
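
The first pass above might look as follows in Python. This is a sketch under assumptions: each data set is a mapping from entry identifier to value, the report is a mapping from identifier to mark, and the name `first_pass` is hypothetical.

```python
def first_pass(operational, new_default, old_default):
    """Report operational entries that are absent from the new Default Data."""
    report = {}
    for entry_id in operational:
        if entry_id in new_default:
            continue  # nothing to do in this pass
        if entry_id in old_default:
            report[entry_id] = "deprecated"  # removed in the new Default Data
        else:
            report[entry_id] = "custom"      # purely local entry
    return report


operational = {"a": "Book", "b": "Microfiche", "c": "Blu-Ray"}
old_default = {"a": "Book", "b": "Microfiche"}
new_default = {"a": "Book"}
report = first_pass(operational, new_default, old_default)
```

Here `b` is reported as deprecated and `c` as custom, and both would be handed to the subject matter experts for review.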

Second Pass

Examine each default data entry and compare to operational data

For each entry in the default data:
   If entry identifier is found in operational data:
      If operational data matches default data: #not customized 
         do nothing;
      Else: # it was customized
         Add to report;
         Mark as modified; # it will need to be reviewed
   Else: # missing in operational data
      If entry identifier is in old default data: # it was explicitly removed
         Add to report; # it will need to be removed
      Else:  # it is brand new
         Do nothing special; # safe addition of default data entry
         Add (or restore) to operational data
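
A corresponding sketch of the second pass, under the same assumptions (mappings from identifier to value, a report mapping identifier to mark, and the hypothetical name `second_pass`). It reads the final "Add (or restore)" step as applying only to brand new entries, which is one interpretation of the pseudocode.

```python
def second_pass(operational, new_default, old_default):
    """Compare each new default entry with Operational Data."""
    report = {}
    for entry_id, default_value in new_default.items():
        if entry_id in operational:
            if operational[entry_id] != default_value:
                report[entry_id] = "modified"      # differs from the new default
        elif entry_id in old_default:
            report[entry_id] = "removed"           # deleted locally, do not restore
        else:
            operational[entry_id] = default_value  # brand new default entry
    return report


operational = {"a": "Book", "b": "DVD (local)"}
old_default = {"a": "Book", "b": "DVD", "c": "Map"}
new_default = {"a": "Book", "b": "DVD", "c": "Map", "d": "Blu-Ray"}
report = second_pass(operational, new_default, old_default)
```

Note that comparing against the new Default Data alone cannot tell a local customization from a change made by the release. A fuller version would also consult the old Default Data, as in the truth table, so that unchanged entries can simply take the new default value.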

Data Representation

It is recommended that the two data sets, Default and Operational, be represented separately in the storage system. In order to allow a smooth and coordinated upgrade process, all modules involved must support the mechanisms defined here. Given that each module may be developed independently and with different architectural patterns, it is important to minimize any prescriptive rules imposed on them. Therefore, the simplest solution is to require two distinct storages (i.e. tables) which are otherwise identical. The Operational Data is used at runtime by the module, while the Default Data set exists only to support the upgrade process.

The two storage approach simplifies the complexity required by the runtime code. At runtime, the modules will only be aware of a single set of data values (the Operational Data). No additional logic is required in the code to distinguish or otherwise manage two data sets.

This also provides a clean separation of concerns. The runtime system only considers the Operational Data set, while the Default Data sets are only involved during system upgrades. After an upgrade has been completed, the new Default Data set becomes the old Default Data set for the next upgrade.
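
The two-storage layout can be sketched with two structurally identical tables. The example uses Python's built-in `sqlite3` module purely so that it is self-contained. The table names, columns and values are hypothetical and do not describe any existing Folio module schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two otherwise identical tables (hypothetical names).
for table in ("loan_type_default", "loan_type_operational"):
    conn.execute(
        f"CREATE TABLE {table} (id TEXT PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
    )

# Default Data as shipped with the release.
conn.executemany(
    "INSERT INTO loan_type_default VALUES (?, ?)",
    [("lt-1", "Can circulate"), ("lt-2", "Reading room")],
)

# Pristine installation: Operational Data is an identical copy.
conn.execute("INSERT INTO loan_type_operational SELECT * FROM loan_type_default")

# A customization touches the operational table only.
conn.execute("UPDATE loan_type_operational SET name = 'ausleihbar' WHERE id = 'lt-1'")

# Runtime code queries the operational table and never the default table.
runtime_names = [
    row[0] for row in conn.execute("SELECT name FROM loan_type_operational ORDER BY id")
]
default_names = [
    row[0] for row in conn.execute("SELECT name FROM loan_type_default ORDER BY id")
]
```

The runtime query needs no flags or joins to tell the two sets apart, which is the simplification described above.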

Use Cases

The following use cases indicate the real world situations where problems currently exist and how they would be resolved by this proposal…

Incomplete Reference Data Sets

Consider that the set of material types available in a release may be insufficient. Imagine that ‘Blu-Ray’ is missing. A library tenant may then add a custom value for ‘Blu-Ray’ and furthermore assign it to resources in Inventory. But in the subsequent release Folio addresses the issue and provides its own ‘Blu-Ray’ material type. The two values are now in conflict.

With this proposal, the custom ‘Blu-Ray’ would be set aside by changing its conflicting value, for example to ‘Blu-Ray-custom’. The new release’s version of ‘Blu-Ray’ can then be applied from the Default Data set into the Operational Data set. The two versions can now coexist, the automated upgrade is complete, and the system is operational so that the subject matter expert can reconcile the conflicting customization.

The Folio installation will now show both ‘Blu-Ray’ and ‘Blu-Ray-custom’ as available material types. Inventory items which have previously been assigned the custom version now show ‘Blu-Ray-custom’ as the material type. A search for ‘Blu-Ray’ material types will not find those.
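
The set-aside step in this use case could be sketched as below. The function name `set_aside_conflicts`, the record layout and the identifiers are assumptions for illustration. The `-custom` suffix follows the example above.

```python
def set_aside_conflicts(operational, new_default, suffix="-custom"):
    """Relabel custom entries whose name collides with an incoming default entry,
    then add the new default entries. Returns the identifiers that were relabeled."""
    incoming = {record["name"]: entry_id for entry_id, record in new_default.items()}
    renamed = []
    for entry_id, record in operational.items():
        owner = incoming.get(record["name"])
        if owner is not None and owner != entry_id:
            record["name"] += suffix   # move the custom label aside
            renamed.append(entry_id)   # keep for the review report
    for entry_id, record in new_default.items():
        operational.setdefault(entry_id, dict(record))
    return renamed


operational = {"local-1": {"name": "Blu-Ray"}}
new_default = {"mt-bluray": {"name": "Blu-Ray"}}
renamed = set_aside_conflicts(operational, new_default)
names = sorted(record["name"] for record in operational.values())
```

Items that reference `local-1` keep their assignment, which is why they now display the relabeled value until the SME reconciles them.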

The SME now has a decision to make. Options include:

Remove the ‘Blu-Ray’ material type introduced by the upgrade and restore the label on the custom material type. This decision maintains the installation as divergent from standard Folio going forward, including the need to manage this customization.
Update the material type assignments for resources to reference the new ‘Blu-Ray’ value and then remove the custom material type. This decision aligns the installation with Folio and reduces the customization effort going forward.

Deleted or Suppressed Default Data Fields

Consider that the Folio distribution might include a predefined, default, set of Loan Types. An individual library tenant may not find this entire set of loan types to be relevant. Perhaps the institution is a small public library with limited lending capabilities. In this case they may want to suppress or remove the presentation of loan types that are not relevant to them.

This proposal solves this problem by allowing the undesirable Loan Types to be deleted from the Operational Data. Since the runtime of Folio only considers the Operational Data set, the undesired values will not be shown in the user experience.

However, the complete set of Loan Types remains in the Default Data set. It will then be possible at the next upgrade to recognize that this is the case of a default data suppression.
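
Recognizing a suppression then amounts to a set difference between the two data sets. A minimal sketch, assuming both are mappings keyed by entry identifier, with the hypothetical function name `suppressed_defaults`:

```python
def suppressed_defaults(default_data, operational_data):
    """Return identifiers present in Default Data but deleted from Operational Data."""
    return sorted(set(default_data) - set(operational_data))


default_loan_types = {"lt-1": "Can circulate", "lt-2": "Course reserves", "lt-3": "Reading room"}
operational_loan_types = {"lt-1": "Can circulate"}
suppressed = suppressed_defaults(default_loan_types, operational_loan_types)
```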

Customizing Field Labels

Field labels may be customized for various reasons, including institutional conventions or local languages. Consider that a German Folio site wants to make a local change to labels: replacing "can circulate" with "ausleihbar" and not lose that change during an upgrade.

In the scope of this proposal the label change is merely considered a customization of a label. There is no special consideration to the motivation for the label customization - in this case for localization purposes.

Risks and Drawbacks

Upgrading Production Folio Systems

Since any system upgrade contains some amount of risk, it is generally considered a best practice to first attempt a system upgrade against a staging system. This is also true for Folio upgrades.

A clone of the production Folio system can be created, upon which the two step upgrade process can be applied. By capturing any corrective operations in the second step (reconciling operational data) on the staging system, it may be possible to pro-actively apply some or all of them to the production system before attempting to apply the final upgrade there. This can help reduce or perhaps eliminate any conflicting customizations and thus reduce the risk during the production upgrade.
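
Capturing corrective operations for later playback could be as simple as keeping an ordered log. The sketch below is only illustrative: the operation format `(kind, identifier, value)` and the function name `apply_operation` are assumptions, and it shows recording and playback only, not the upgrade itself.

```python
import copy


def apply_operation(data, operation):
    """Apply one recorded corrective operation to a reference data mapping."""
    kind, entry_id, value = operation
    if kind == "set":
        data[entry_id] = value
    elif kind == "delete":
        data.pop(entry_id, None)
    else:
        raise ValueError(f"unknown operation kind: {kind}")


production = {"local-1": "Blu-Ray-custom", "mt-bluray": "Blu-Ray"}
staging = copy.deepcopy(production)  # staging starts as a clone of production

# Corrections made on staging are recorded as they are applied.
recorded = []
for operation in [("delete", "local-1", None)]:
    apply_operation(staging, operation)
    recorded.append(operation)

# Later, the same operations are played back against production.
for operation in recorded:
    apply_operation(production, operation)
```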

Rationale and Alternatives

An alternative implementation that was considered is to represent both Default and Operational data in the same record. Each record would have an identifier for the property, fields for the default values, fields for the current values in effect, a flag to suppress an unwanted entry from the UI and a flag to denote purely local additions to the controlled vocabulary. Additional flags would be possible as conveniences, for example to help manage review of new entries. A full description can be found in A Model for Controlled Vocabulary.

While this approach has its merits, it was not selected because of the following considerations.

There would be more complexity in the API: either the API would return all values and the consuming client would need to know to use only the active values for label and description, or the main runtime API would only provide the active values and there would be a separate API or mechanism to perform updates to the controlled vocabulary.

There is also the position that this imposes a prescriptive data model for module development. This could also be seen as providing consistency in managing controlled vocabularies that we use within FOLIO.

Regardless of the model presented through the API, there is also the consideration of storage. If a controlled vocabulary were to be stored on the back end in a relational table, the UUIDs and/or codes of the controlled vocabulary would be available to satisfy foreign key constraints, should other data in the module also move towards relational storage.

Unresolved Questions

The following details are not specifically addressed by this proposal and will require further consideration.

Internationalization / Localization. Reference data customization has been used to solve the issue of overriding English-only labels that may be found there. This is a workaround to the larger problem of being able to localize user displayed strings defined on the backend and specifically in reference data.
Uniqueness constraints for non-UUID values. In some cases primary or even secondary keys for reference data values are defined by user-editable strings, which is not a robust practice. As a workaround, database-level constraints may be imposed to ensure uniqueness so that such strings do not collide. Furthermore, there are business logic constraints such as URIs, which can be sharable identifiers. In practice these can become obstacles to automated upgrades during the operator-led phase or when merging tenants. However, SMEs do expect such constraints in order to prevent duplicate or ambiguous codes in the UI.
External Tooling. It is noted that external tooling may be defined and implemented to help manage the state of customization choices made across updates. It is expected that feedback to this document from sysops and SMEs will help drive the requirements for such tooling.

Additional Notes

Significant discussion and Q&A in the MM SIG notes and recording for 24 Sept 2021
