Check Out Performance

Executive Summary

Check out performance is unacceptable (a check out is expected to take less than 1 second). Analysis demonstrates that the current response time is dominated by the many HTTP requests (approx. 30) made to other modules.

Multiple options are presented to reduce this overall response time:

  • Improve the performance of individual downstream requests
  • Make downstream requests concurrently
  • Combine multiple downstream requests for related records into a single request
  • Combine the business logic and storage modules together
  • Use derived data to make decisions

These options are limited to changes that support the current domain model, processes and user experience. There are other options that may be worth exploring which involve reviewing those decisions.

Many of these options are likely to produce only a modest improvement, which is unlikely to be sufficient to meet expectations.

Some of these options may require changes to constraints that were in place when much of the circulation development was done, e.g. relaxing the separation of business logic and storage modules, or permitting the use of derived data.

Recommendations are provided at the bottom of the document.

Introduction

This document aims to summarise the outcome of the recent performance testing conducted by the PTF team and provide some suggestions as to how we might improve the performance of checking out an item under load.

Context

History

Development of modules within the circulation domain started early in FOLIO's overall development, meaning they are some of the oldest modules and integrate heavily with other older modules.

Historical Constraints

FOLIO began with some constraints that were applied when these modules were developed.

I've picked out a few that could be relevant to how we got to the current design:

  • Business logic must use the most current state for decisions (this is what SMEs have told me in past conversations and is supported in technical documentation)
  • Business logic and storage are split in two separate modules (in order to support independent substitution)
  • All integration between modules is done via HTTP APIs (proxied via Okapi)
  • All data is stored within PostgreSQL
  • A record oriented design with a single system of record for each record type (business logic / storage separation notwithstanding)

As these weren't explicitly documented at the time, it is difficult to know if they have changed over time.

There are counter-examples which suggest that they may have changed. I don't know if these constitute a change in policy or specific (approved or not) exceptions.

Some of the options presented below contradict these constraints and would need them to change to be tolerable and coherent within FOLIO's architecture.

Expectations

A check out must complete within 1 second (from the documented expectations at /wiki/spaces/DQA/pages/2658550). It is stated that this includes the time for the staff member to scan the item barcode.

For the purposes of this analysis I shall assume the following (neither of which are likely true in practice):

  • None of this time is taken up by the human scanning the barcode (and interacting with the UI)
  • None of this time is taken up by the FOLIO UI (in practice, the UI has to fetch item information to potentially ask the staff member some questions)

The performance requirements do not provide any guidance on the conditions (load parameters or resource configuration) under which this expectation should hold.

Thus for the purposes of this analysis, the expectation is that:

the check out API must respond within 1 second under load from 8 concurrent requests (with no tolerance for outliers that exceed this limit)

Solution Constraints

Beyond the general constraints on architectural decisions (listed above), I've imposed the following constraints to limit the option space:

  • No changes to the current user experience
  • No changes to the current circulation (or associated) domain models
  • No changes to the check out process
  • No changes to the client interface of the circulation APIs
    • excludes using hypermedia controls to defer fetching some data
  • Only existing infrastructure can be used (I'm including Kafka in this, even though it isn't official yet)
    • excludes options which include changing the storage technology (e.g. to MongoDB) or the integration mechanism (e.g. to gRPC)

This means that a set of potentially interesting options is not presented here. These include (but are not limited to):

  • Changing the circulation and domain models to have separate definitions of an item
  • Changing the check out process to consider batches of items being checked out to the same patron
  • Changing the check out user experience to allow multiple items to be submitted for check out prior to feedback on any being given

Acknowledgements to Vince Bareau for providing some of these suggestions.

Analysis

Limitations of Analysis

Thanks to the PTF team we have overall performance data for (an approximation of) the whole check out process.

However we only have a single detailed sample (with less data and no load) of the downstream requests from the check out API. That sample is not representative of the range of response times likely present in a whole performance test run using more realistic load parameters.

Thus, this analysis has to assume that the sample is representative whilst also interpreting it skeptically (because it is likely far more optimistic than the heavier load scenarios).

We also do not know:

  • why the response times of the constituent parts do not equate to the overall response time
  • what amount of time Okapi takes to process requests / responses
  • what amount of time mod-circulation takes to use this information to make decisions e.g. to apply the circulation rules

These factors make it challenging to draw reliable and specific conclusions about the requests involved, so most of the analysis will be broad and general.

What takes up the time?

Step | Time Taken
Generating a downstream token (assumed to be once per incoming request) | 133 ms (99 + 6 + 16 + 12)
Checking request token (for each downstream request) | 12 ms (average)
Downstream request | 50 ms (average)

There are 27 downstream requests triggered by mod-circulation during the sample check out.

Once we deduct the initial overhead (133 ms), that leaves us with an approximate budget of 32 ms per request (867 ms / 27).

At the moment, the average request in our low load sample takes 62 ms (including proxying overhead). This is almost double the available budget, and we can expect the situation to be worse under load.

Whilst there are some outliers that push up this average (and even those are likely lower than the under load numbers), I think this indicates the degree of challenge we have with the current approach.

What could we do?

Broadly speaking, there are three things that can be done to improve the response time of a check out API request:

  • Reduce the amount of time each request takes
  • Make downstream requests concurrently
  • Reduce the quantity of downstream requests made

These ideas will be the framing for the proposal part of this document.


Options

Improve the performance of individual downstream requests

Characteristics

  • Scope for improvement is limited as many of these requests are individually relatively fast
  • Improvements are brittle and can be easily undone by changes to downstream modules (and it may take a while to become aware of degradation)
  • Limited by the constraints of the downstream modules (e.g. the data is currently stored as JSONB)
  • May involve changes in multiple modules
  • Retains the same number of downstream requests
  • Retains the same overhead from Okapi proxying

Make downstream requests concurrently

For example, once the item is received, the locations, loan types and material types can be fetched concurrently.
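
As an illustration, here is a minimal sketch of issuing those three fetches at once using Java's CompletableFuture. The Okapi base URL, tenant and token handling below are simplified assumptions for the sketch, not mod-circulation's actual client code:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.concurrent.CompletableFuture;

    public class ConcurrentFetchSketch {
      private static final HttpClient CLIENT = HttpClient.newHttpClient();

      // Hypothetical helper: a GET via Okapi. The base URL, tenant and token
      // handling are assumptions for this sketch.
      private static CompletableFuture<String> fetch(String path, String token) {
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://okapi:9130" + path))
            .header("X-Okapi-Tenant", "diku")
            .header("X-Okapi-Token", token)
            .GET()
            .build();
        return CLIENT.sendAsync(request, HttpResponse.BodyHandlers.ofString())
            .thenApply(HttpResponse::body);
      }

      public static CompletableFuture<Void> fetchItemRelatedRecords(
          String locationId, String loanTypeId, String materialTypeId, String token) {
        // None of these three depends on another, so start them all immediately
        CompletableFuture<String> location = fetch("/locations/" + locationId, token);
        CompletableFuture<String> loanType = fetch("/loan-types/" + loanTypeId, token);
        CompletableFuture<String> materialType =
            fetch("/material-types/" + materialTypeId, token);

        // Waiting on all of them together means the elapsed time is roughly
        // the slowest single response rather than the sum of all three
        return CompletableFuture.allOf(location, loanType, materialType);
      }
    }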

Characteristics

  • Only involves changes to mod-circulation
  • Increases the complexity of the code in mod-circulation
  • Not all requests can be made concurrently (some are based upon prior requests or decisions that cannot be made up front)
  • Is likely limited by how well other modules and the database can handle concurrent requests
  • Retains the same overall load on the system as before (although it may be compressed in time)
  • Retains the same number of downstream requests
  • Retains the same overhead from Okapi proxying

Combine multiple downstream requests for related records into a single request

Introduces context-specific APIs that are intended for a particular consumer's needs. At most, this can only be applied to requests made to the same module.

It may not make sense to combine all of the record types from a single module. For example, does it make sense to have an API that fetches existing open loans and loan policies together?

We are already introducing a new API in mod-inventory-storage in this manner to improve the pre-checks made by the check out UI.
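
To make this concrete, below is a sketch of the combined response such a context-specific API might return. The endpoint path and field names are invented for illustration and do not describe the actual API being introduced:

    import java.util.Map;

    // Hypothetical combined read model for check out. One GET (for example
    // /inventory-storage/check-out-item-context?barcode={itemBarcode}, an
    // invented path) could return all of these records in a single response.
    public record CheckOutItemContext(
        Map<String, Object> item,
        Map<String, Object> holdingsRecord,
        Map<String, Object> instance,
        Map<String, Object> effectiveLocation, // with library, campus, institution
        Map<String, Object> materialType,
        Map<String, Object> loanType) {
    }

A single request of this shape (with its single token check) could replace many of the separate mod-inventory-storage fetches listed in the appendix, at the cost of the coupling described below.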

Characteristics

  • Reduces the number of individual downstream requests (and hence the Okapi proxying overhead)
  • Requires at least one downstream request per destination module
  • Requires at least one database query per downstream module
  • Might reduce the response time of the combined downstream request (compared to the combination of the individual requests it replaces)
  • Might reduce the load on downstream modules (depending upon how the combined request is handled, it is possible the load increases)
  • Reduction in downstream requests is limited to the number of record types within a single module
  • Increases the number of APIs to maintain (what I call the surface area of the module)
  • Increases the coupling between modules (by introducing the client's context into the other module)
  • Increases the coupling between the record types involved (e.g. it's harder to move record types to other modules when they are included in APIs together, changes to them ripple across APIs)

Use derived data to make decisions

This involves using data that would be considered eventually consistent, meaning that at the time of use it might be out of date with respect to the source of that data (and at some point afterwards it will become consistent again).

In order for this option to be acceptable, a variety of stakeholders within the community would need to accept some tolerance for decisions being made with inconsistent information. When I've talked with folks about this previously, they have been uncomfortable with doing this (see above).

This option also has the most potential design variations. It might involve in-memory caching or persistent storage of the copied data and it may involve using messaging infrastructure or periodic requests.
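
To make the in-memory variation concrete, here is a minimal sketch of an expire-after-write cache for a rarely changing reference type (e.g. material types). The class, the TTL policy and the loader are assumptions for illustration, not existing mod-circulation code:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    // Minimal expire-after-write cache. Entries older than the TTL are
    // refetched, which bounds how out of date a decision can be.
    public class ExpiringCache<K, V> {
      private record Entry<T>(T value, Instant loadedAt) {}

      private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
      private final Duration timeToLive;
      private final Function<K, V> loader; // e.g. a GET to the source module

      public ExpiringCache(Duration timeToLive, Function<K, V> loader) {
        this.timeToLive = timeToLive;
        this.loader = loader;
      }

      public V get(K key) {
        Entry<V> cached = entries.get(key);
        if (cached != null
            && cached.loadedAt().plus(timeToLive).isAfter(Instant.now())) {
          return cached.value(); // fresh enough: no downstream request at all
        }
        V value = loader.apply(key); // missing or expired: one downstream request
        entries.put(key, new Entry<>(value, Instant.now()));
        return value;
      }
    }

With, say, a five minute TTL for material types, a change to a material type could be ignored by check out decisions for up to five minutes; that is exactly the tolerance stakeholders would need to accept.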

Characteristics

  • Increases the potential for inconsistent data (that is out of date) to be used for decisions
  • State changes still require a downstream request (and the requisite proxying overhead)
  • Is contrary to constraints that may still be present in FOLIO
  • Requires fewer (or, depending upon the variation chosen, no) downstream requests for fetching data during the check out process
  • Introduces the complexity of either caching, or message processing and persistent storage, into mod-circulation
  • May introduce a dependency for mod-circulation on a database
  • May introduce a dependency for mod-circulation on messages produced by other modules
  • May introduce the need for a system user for mod-circulation (to populate / update the cache)

Variations

The characteristics of this approach vary based upon some of the design decisions we make. A couple of the significant ones are outlined below.

This is only a very high level comparison of the characteristics; there are lots of alternative designs in both of these categories that lead to different characteristics.

Where is the data kept?


 | Memory | PostgreSQL
Volatility | Lost when the module instance is terminated | Retained even if module instances are terminated
Locality | Local copies for each module instance | Shared between module instances
Access Control | Needs to be controlled with code within the module | Can be controlled using mechanisms provided by the database server
Responsiveness | Likely faster if a cached value is present, likely slower if not | Dependent upon network and database load
Record Type Suitability | Better suited to smaller sets that change rarely, e.g. reference types | Can be used for any kind of record type
Infrastructure needs | None | Requires a database for mod-circulation

How is the copied data updated?


 | Periodic HTTP requests | Messages consumed from Kafka
Freshness | Dependent upon the frequency of the periodic refresh; likely to lead to data being inconsistent for longer than with messaging | Dependent upon message processing latency
Access requirements | Needs a system user or module permissions granted to a timer endpoint | Needs access to Kafka topics for every record type (assuming record snapshot based messages as used with mod-search)
Initial population / manual state refresh | Requires requests to fetch all records for all cached record types | Either requires reprocessing of a persistent topic (not currently allowed by FOLIO standards) or a custom process (similar to the mod-search re-index process)
Load on other modules during synchronisation | Could be significant; dependent upon the number of record types and quantity of records | Potentially none with persistent topics (not currently allowed by FOLIO standards)
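
For the Kafka column, here is a minimal sketch of a consumer keeping a local copy of one record type up to date. The topic name and the record-snapshot message shape are assumptions modelled on the mod-search approach; mod-circulation has no such consumer today:

    import java.time.Duration;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.concurrent.ConcurrentHashMap;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class LoanPolicyCopyConsumer {
      // Local derived copy: record id -> latest JSON snapshot of that record
      private final Map<String, String> loanPolicies = new ConcurrentHashMap<>();

      public void run() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "mod-circulation-derived-data");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
          // The topic name is invented for this sketch
          consumer.subscribe(List.of("folio.diku.loan-policy"));
          while (true) {
            for (ConsumerRecord<String, String> message :
                consumer.poll(Duration.ofMillis(500))) {
              // With snapshot-style messages the latest message per key is the
              // current state, so freshness depends only on processing latency
              loanPolicies.put(message.key(), message.value());
            }
          }
        }
      }
    }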

Combine the business logic and storage modules together

Characteristics

  • Removes all downstream requests for record types within the circulation domain, e.g. loans, requests, loan policies etc (including state changes, e.g. creating a loan or fulfilling a request)
  • Removes the distinction between business logic and storage representations of those record types
  • Allows for state changes within the circulation domain to be done within a database transaction (see the sketch after this list)
  • Is contrary to constraints that may still be present in FOLIO
  • Storage modules have been used to work around cyclic dependency constraints in Okapi; removing them might involve changing other modules to avoid this in other ways
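
As a sketch of the transactional benefit, assuming plain JDBC and invented table and column names (a combined module would more likely reuse the storage module's existing schema):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class CheckOutTransactionSketch {
      // With business logic and storage combined, creating the loan and
      // changing the item status can succeed or fail together, instead of
      // being two separate HTTP calls that can leave records inconsistent.
      public void checkOut(String itemId, String loanJson) throws SQLException {
        try (Connection connection =
            DriverManager.getConnection("jdbc:postgresql://db:5432/folio")) {
          connection.setAutoCommit(false);
          try (PreparedStatement createLoan = connection.prepareStatement(
                  "INSERT INTO loan (jsonb) VALUES (?::jsonb)");
               PreparedStatement updateItem = connection.prepareStatement(
                  "UPDATE item SET jsonb = jsonb_set(jsonb, '{status,name}',"
                      + " '\"Checked out\"') WHERE id = ?::uuid")) {
            createLoan.setString(1, loanJson);
            createLoan.executeUpdate();
            updateItem.setString(1, itemId);
            updateItem.executeUpdate();
            connection.commit(); // both changes become visible atomically
          } catch (SQLException e) {
            connection.rollback(); // neither change is applied
            throw e;
          }
        }
      }
    }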

Recommendations

Before reading this section, please take some time to consider the options presented above for yourself (alongside any others you may think of), in order to reduce the potential for my recommendations to sway your opinion.

There are lots of unknowns with all of these options. It is difficult to predict how long they will take or how much improvement will be achieved. Please keep that in mind when considering these recommendations.

Earliest reasonable improvement

Combining multiple downstream requests into a single request is likely to provide some improvement. This work is familiar to developers and can be achieved without contradicting broader architectural concerns (beyond the coupling considerations).

As there is already ongoing development work that explores this, we can use that to gauge the effectiveness of this approach before committing to a direction for continued work.

Most significant improvement

Copying data into circulation has the most potential for improvement as it removes the need for many of the downstream requests entirely.

However, this work requires:

  • adoption of techniques (e.g. synchronising copied data, messaging, caching) and technologies (e.g. Kafka) unfamiliar to most developers in FOLIO 
  • agreement from many stakeholders (e.g. SMEs, TC) that it is acceptable to use potentially inconsistent or out of date information for making decisions

Suggested Plan

  • Continue with work to improve the slowest response time HTTP requests in the analysis:
    • Fetching automated patron blocks (the Vega team have already done some work to improve this)
    • Fetching manual patron blocks (Holly has raised an issue for this)
    • Fetching an item by barcode (the Core Platform team have already done work to improve this)
    • Updating an item (Julian has raised an issue for this)
  • Work with the RA SIG and POs to decide whether some tolerance for potentially inconsistent / out of date information may be used during check out
  • Work with the Technical Council and Tech Leads on designs for the use of potentially inconsistent data, e.g. caching or persistent derived data
  • Begin to introduce derived data during check out
    • Starting with introducing an expiry cache for a single record type that is populated during a check out (along the lines of the cache sketched in the derived data section), as this approach likely has a shorter lead time to benefits at the trade-off of less consistent performance
    • Iterating through the remaining record types in a priority order based upon feedback from the RA SIG and the relative response times for requests (this likely isn't appropriate for large data sets like instances or items)
    • Investigate adopting Kafka for updating the cache (as this could reduce the delay before the copy becomes consistent)
    • Investigate adopting persistent derived data rather than caching (as this removes the need for per-module-instance caching, can spread rather than duplicate the effort of achieving consistency, and allows for the use of more efficient derived structures)

Appendices

Definitions

Phrase | Definition
Downstream request | A request made by a module (via Okapi) in order to fulfil the original incoming request, e.g. mod-circulation makes a request to mod-users to fetch patron information
Response time | The time taken from the client making the request to receiving a response
Derived data | "A dataset that is created from some other data through a repeatable process. Usually used to speed up a particular kind of read access to the data. Indexes, caches, and materialized views are examples of derived data" ([1], pg. 554)
Eventual consistency | Derived data can be outdated with respect to its source(s). It is intended that this inconsistency is temporary; however this is deliberately vague and there is no limit on how far behind a copy might get, or how long it will take to become consistent again ([1], pg. 162). The term originates from database replication and is often used in event sourcing or event driven architectures.

Requests made during a typical check out

The first 4 rows of the table describe the initial requests made by Okapi in reaction to the incoming request (to check out). I believe there are circumstances where these requests are made again, however that is omitted from this analysis.

Intent | Endpoint | Destination Module | Sample Response Time (ms) | Sample Response Time of Token Check (ms)
Initial request | | | 99 |
Fetch user (making the request) | GET /users/{id} | mod-users | 6 |
Fetch permissions | GET /perms/users?query=userId=={id} | mod-permissions | 16 |
Generate downstream token | | | 12 |
Fetch user (patron) by barcode | GET /users?query=barcode=={userBarcode} | mod-users | 138 | 6
Fetch manual blocks | GET /manualblocks?query=userId=={userId} | mod-feesfines | 133 | 7
Fetch automated blocks | GET /automated-patron-blocks/{userId} | mod-patron-blocks | 546* | 27
Fetch item by barcode | GET /item-storage/items?query=barcode=={itemBarcode} | mod-inventory-storage | 163** | 10
Fetch holdings | GET /holdings-storage/holdings/{id} | mod-inventory-storage | 57 | 9
Fetch instance | GET /instance-storage/instances/{id} | mod-inventory-storage | 22 | 7
Fetch location | GET /locations/{id} | mod-inventory-storage | 9 | 13
Fetch library | GET /location/units/libraries/{id} | mod-inventory-storage | 10 | 7
Fetch campus | GET /location/units/campuses/{id} | mod-inventory-storage | 10 | 7
Fetch institution | GET /location/units/institutions/{id} | mod-inventory-storage | 11 | 7
Fetch service point | GET /service-points/{id} | mod-inventory-storage | 9 | 8
Fetch material type | GET /material-types/{id} | mod-inventory-storage | 8 | 7
Fetch loan type | GET /loan-types/{id} | mod-inventory-storage | 22 | 8
Fetch existing loans | GET /loan-storage/loans?query=status.name=="Open" and itemId=={itemId} | mod-circulation-storage | 9 | 17
Fetch requests | GET /request-storage/requests?query=itemId=={itemId} and status==("Open - Not yet filled" or "Open - Awaiting pickup" or "Open - In transit" or "Open - Awaiting delivery") sortBy position/sort.ascending | mod-circulation-storage | 10 | 9
Fetch circulation rules | GET /circulation/rules | mod-circulation-storage | 18 | 18
Fetch loan policy | GET /loan-policy-storage/loan-policies/{id} | mod-circulation-storage | 10 | 8
Fetch tenant locale | GET /configurations/entries?query=module=="ORG" and configName=="localeSettings" | mod-configuration | 16 | 10
Fetch overdue fines policies | GET /overdue-fines-policies/{id} | mod-feesfines | 19 | 8
Fetch lost item fees policies | GET /lost-item-fees-policies/{id} | mod-feesfines | 11 | 10
Fetch opening days | GET /calendar/periods/7068e104-aa14-4f30-a8bf-71f71cc15e07/calculateopening?requestedDate={{dueDate}} | mod-calendar | 12 | 8
Fetch user (patron) groups | GET /groups?query=id=={groupId} | mod-users | 17 | 7
Update item status | PUT /item-storage/items/{id} | mod-inventory-storage | 194 | 13
Create loan | POST /loan-storage/loan | mod-circulation-storage | 16 | 8
Update patron action session (?) | POST /patron-action-session-storage/patron-action-sessions | mod-circulation-storage | 10 | 7
Fetch user | GET /users/{id} | mod-users | 6 | 15
Fetch patron notice policy | GET /patron-notice-policy-storage/patron-notice-policies/1a821238-0cd9-48d9-a71a-057d33df0154 | mod-circulation-storage | 6 | 7



References

[1] Martin Kleppmann: Designing Data-Intensive Applications. O'Reilly, 2017. ISBN: 978-1-449-37332-0