Introduction
The goal of this document is to summarise the outcome of the recent performance testing conducted by the PTF team and provide some suggestions as to how we might improve the situation.
Context
History
Development of modules within the circulation domain started early in FOLIO's overall development, meaning they are some of the oldest modules and integrate heavily with other older modules.
Historical Constraints
When FOLIO started, a number of constraints applied to how these modules were developed.
I've picked out a few that could be relevant to how we arrived at the current design:
- Business logic must use the most current state for decisions
- Business logic and storage are split into two separate modules (in order to support independent substitution)
- All integration between modules is done via HTTP APIs (proxied via Okapi)
- All data is stored within PostgreSQL
- A record-oriented design with a single system of record for each record type (business logic / storage separation notwithstanding)
Some of these have changed since this early development, e.g. the introduction of Kafka for integration.
Some may need to change for the options below to be tolerable and coherent within FOLIO.
Expectations
According to the documented performance requirements (/wiki/spaces/DQA/pages/2658550), a check out must complete within 1 second. This is stated to include the time for a person to scan the barcode (presumably of the item).
For the purposes of this analysis I shall assume the following (neither of which is likely to be true in practice):
- None of this time is taken up by the human scanning the barcode (and interacting with the UI)
- None of this time is taken up by the FOLIO UI (in practice, the UI has to fetch item information to potentially ask the staff member some questions)
The performance requirements do not provide any guidance on what load (e.g. how many concurrent check outs) this must hold under.
Thus, for the purposes of this analysis, the expectation is that the check out API must respond within 1 second under a load of 8 concurrent requests.
Solution Constraints
Beyond the general constraints on architectural decisions, the following apply:
- No changes to the circulation API (the interface must remain the same)
- Only existing infrastructure can be used (I'm including Kafka in this, even though it isn't official yet)
Analysis
Limitations of Analysis
Whilst we have overall performance data for (a mimic of) the whole check out process, we only have a single detailed sample of the downstream requests from the check out API. Also, the response times of the constituent parts do not tell the whole story and cannot be summed to give the overall response time.
However, this analysis has to assume that the sample is representative whilst also interpreting it skeptically.
We also have no data on how long Okapi or mod-circulation takes to do the rest of its work (handling these requests and handling the check out, respectively).
This makes it challenging to draw reliable conclusions about those requests, meaning that most of the analysis will be broad and general.
What takes up the time?
Step | Time Taken |
---|---|
Initial overhead: the incoming request, fetching the requesting user and their permissions, and generating a downstream token (assumed to happen once per incoming request) | 133 ms (99 + 6 + 16 + 12) |
Checking the request token (for each downstream request) | 12 ms (average) |
Downstream request (each) | 50 ms (average) |
Once we deduct the initial overhead (133 ms), 867 ms remain to meet the 1 second expectation. Spread across roughly 30 downstream requests, that is a budget of about 29 ms per request (867 ms / 30), including the checking of the token.
At the moment, the average request takes 62 ms (50 ms for the request plus 12 ms for the token check), which is more than double the budget available. Whilst there are some outliers that push up this number (e.g. fetching automated blocks took 546 ms in the sample), I think this indicates the scale of the challenge with the current approach.
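The budget arithmetic above can be checked with a few lines of Python. The figures come from the table above and the appendix; the count of 30 requests is the one used in the text, and the final figure (how many requests at the current average would fit in the remaining time) is simply derived from the same numbers.

```python
# Figures from the single PTF sample (all times in milliseconds)
TARGET_MS = 1000
INITIAL_OVERHEAD_MS = 99 + 6 + 16 + 12  # initial request, user, permissions, token
REQUEST_COUNT = 30                      # downstream requests, as used in the text
AVG_REQUEST_MS = 50
AVG_TOKEN_CHECK_MS = 12

remaining_ms = TARGET_MS - INITIAL_OVERHEAD_MS               # 867
budget_per_request_ms = remaining_ms / REQUEST_COUNT          # 28.9
actual_per_request_ms = AVG_REQUEST_MS + AVG_TOKEN_CHECK_MS  # 62

# How many requests at today's average would fit into the remaining time
affordable_requests = remaining_ms // actual_per_request_ms   # 13

print(f"budget per request: {budget_per_request_ms:.1f} ms, "
      f"actual: {actual_per_request_ms} ms "
      f"({actual_per_request_ms / budget_per_request_ms:.1f}x the budget)")
print(f"requests that fit at the current average: {affordable_requests}")
```

At the current average, only around 13 of the downstream requests would fit within the second.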
What could we do?
Broadly speaking, there are three things that can be done to improve the response time of a check out API request:
- Reduce the amount of time each request takes
- Make downstream requests concurrently
- Reduce the quantity of downstream requests made
These ideas will be the framing for the proposal part of this document.
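As a rough illustration of how each lever acts, the sketch below uses a deliberately naive model: every downstream request takes the same time, and requests run in sequential "waves" of up to `concurrency` at once. The function and the model are assumptions for illustration only; as noted under Limitations of Analysis, the real response time cannot be obtained by summing its parts, and dependencies between requests limit how much can actually run concurrently.

```python
import math


def estimated_response_ms(overhead_ms: int, request_count: int,
                          per_request_ms: int, concurrency: int = 1) -> int:
    """Naive model: fixed overhead plus sequential waves of identical requests."""
    waves = math.ceil(request_count / concurrency)
    return overhead_ms + waves * per_request_ms


# 133 ms initial overhead, 62 ms per request (including token check), 30 requests
baseline = estimated_response_ms(133, 30, 62)                          # 1993
faster_requests = estimated_response_ms(133, 30, 31)                   # halve the time per request
with_concurrency = estimated_response_ms(133, 30, 62, concurrency=2)   # two requests at a time
fewer_requests = estimated_response_ms(133, 15, 62)                    # halve the number of requests

print(baseline, faster_requests, with_concurrency, fewer_requests)
```

Under this model, halving any single lever on its own still gives 1063 ms, which hints that more than one of the approaches may be needed.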
Options
Improve the performance of individual downstream requests
Analyse and try to improve the performance of each downstream request
Characteristics
- Improvements can be undone by changes to downstream modules
- Limited by the constraints of the downstream modules (e.g. the data is currently stored as JSONB)
- Retains the same number of downstream requests
- Retains the same overhead from Okapi proxying
Make downstream requests concurrently
Make some of the downstream requests in mod-circulation concurrently
Characteristics
- Retains the same number of downstream requests
- Retains the same overhead from Okapi proxying
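A minimal sketch of the idea using Python's asyncio, assuming a hypothetical `fetch` function that stands in for a downstream request (the latency is simulated with a sleep, and mod-circulation itself is not written in Python). Once the item is known, records such as its holdings, location, material type and loan type do not depend on each other, so they can be requested together. Each request still goes through Okapi, so the number of requests and the proxying overhead are unchanged.

```python
import asyncio
import time

# Hypothetical stand-in for a downstream request via Okapi; latency is simulated
SIMULATED_LATENCY_S = 0.05


async def fetch(path: str) -> dict:
    await asyncio.sleep(SIMULATED_LATENCY_S)
    return {"path": path}


# Requests that only depend on the (already fetched) item, not on each other
PATHS_AFTER_ITEM = [
    "/holdings-storage/holdings/{id}",
    "/locations/{id}",
    "/material-types/{id}",
    "/loan-types/{id}",
]


async def fetch_sequentially(paths: list) -> list:
    # Each request waits for the previous one to finish
    return [await fetch(path) for path in paths]


async def fetch_concurrently(paths: list) -> list:
    # gather() runs the coroutines concurrently and returns results in input order
    return await asyncio.gather(*(fetch(path) for path in paths))


async def compare() -> tuple:
    start = time.perf_counter()
    sequential = await fetch_sequentially(PATHS_AFTER_ITEM)
    sequential_s = time.perf_counter() - start

    start = time.perf_counter()
    concurrent = await fetch_concurrently(PATHS_AFTER_ITEM)
    concurrent_s = time.perf_counter() - start
    return sequential, sequential_s, concurrent, concurrent_s


sequential, sequential_s, concurrent, concurrent_s = asyncio.run(compare())
print(f"sequential: {sequential_s * 1000:.0f} ms, concurrent: {concurrent_s * 1000:.0f} ms")
```

With four simulated requests of 50 ms each, the sequential version takes roughly 200 ms and the concurrent one roughly 50 ms (exact timings vary).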
Combine downstream requests for related records into a single request
Introduce context-specific APIs, each intended for a specific use (e.g. fetching the records needed for a check out).
It may not make sense to combine all of the record types from a single module. For example,
Characteristics
- Reduces the number of individual requests (and hence the Okapi overhead)
- Requires at least one downstream request per destination module
- Requires at least one database query per downstream module
- Might reduce the load on downstream modules
- Reduction in downstream requests is limited by the number of record types held within a single module
- Increases the number of APIs to maintain
- Increases the coupling between modules (by introducing the client's context into the other module)
- Increases the coupling between the record types involved (e.g. it's harder to move record types, changes to them ripple across APIs)
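A sketch of the difference, using a hypothetical in-memory stand-in for mod-inventory-storage that counts the requests it receives. `get_check_out_context` is an invented context-specific API, not an existing endpoint, and the record field names are illustrative. It resolves the item, holdings, instance and material type inside the module, so the client makes one request instead of four.

```python
from dataclasses import dataclass


@dataclass
class InventoryModule:
    """Hypothetical in-memory stand-in for mod-inventory-storage."""
    records: dict
    requests_received: int = 0

    def get(self, record_type: str, record_id: str) -> dict:
        # Models one record-oriented request, e.g. GET /holdings-storage/holdings/{id}
        self.requests_received += 1
        return self.records[record_type][record_id]

    def get_check_out_context(self, item_id: str) -> dict:
        # Hypothetical context-specific API: related records are resolved
        # inside the module, so the client makes a single request
        self.requests_received += 1
        item = self.records["items"][item_id]
        holdings = self.records["holdings"][item["holdingsRecordId"]]
        return {
            "item": item,
            "holdings": holdings,
            "instance": self.records["instances"][holdings["instanceId"]],
            "materialType": self.records["material-types"][item["materialTypeId"]],
        }


def fetch_separately(module: InventoryModule, item_id: str) -> dict:
    # The current approach: one request per record type
    item = module.get("items", item_id)
    holdings = module.get("holdings", item["holdingsRecordId"])
    return {
        "item": item,
        "holdings": holdings,
        "instance": module.get("instances", holdings["instanceId"]),
        "materialType": module.get("material-types", item["materialTypeId"]),
    }


RECORDS = {
    "items": {"i1": {"id": "i1", "holdingsRecordId": "h1", "materialTypeId": "m1"}},
    "holdings": {"h1": {"id": "h1", "instanceId": "in1"}},
    "instances": {"in1": {"id": "in1", "title": "An example title"}},
    "material-types": {"m1": {"id": "m1", "name": "book"}},
}

separate_module = InventoryModule(RECORDS)
combined_module = InventoryModule(RECORDS)
separate = fetch_separately(separate_module, "i1")
combined = combined_module.get_check_out_context("i1")
print(separate_module.requests_received, combined_module.requests_received)
```

The cost shows up in the sketch too: `get_check_out_context` encodes knowledge of the check out use case inside the inventory module, which is the coupling described above.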
Copy data into circulation
Consume messages produced (via Kafka) by other modules to build views of the data needed to perform a check out.
The biggest challenge with this option is the community's tolerance for using potentially stale data to make decisions.
Characteristics
- Increases the potential for stale data to be used for decisions
- Requires mod-circulation to depend on a database of its own
- Introduces a dependency on messages produced by other modules
- Requires no downstream requests for fetching data
- State changes still require a downstream request (and the requisite overhead)
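A minimal sketch of a view built from consumed messages, showing how stale data can arise. `ItemChanged` is a hypothetical message shape (not an actual FOLIO event schema), and the Kafka consumer itself is left out: the sketch calls `apply` directly for each message it would have received.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ItemChanged:
    """Hypothetical message shape; not an actual FOLIO event schema."""
    item_id: str
    new: Optional[dict]  # None means the item was deleted


class ItemView:
    """mod-circulation's local copy of items, built only from consumed messages."""

    def __init__(self) -> None:
        self._items: Dict[str, dict] = {}

    def apply(self, message: ItemChanged) -> None:
        # Called once per consumed message, in the order they were produced
        if message.new is None:
            self._items.pop(message.item_id, None)
        else:
            self._items[message.item_id] = message.new

    def get(self, item_id: str) -> Optional[dict]:
        # No downstream request: only as fresh as the last message applied
        return self._items.get(item_id)


view = ItemView()
view.apply(ItemChanged("i1", {"id": "i1", "status": {"name": "Available"}}))

# The item is withdrawn in inventory, but that message has not been consumed yet
pending = ItemChanged("i1", {"id": "i1", "status": {"name": "Withdrawn"}})
status_before = view.get("i1")["status"]["name"]  # stale: still "Available"

view.apply(pending)
status_after = view.get("i1")["status"]["name"]

view.apply(ItemChanged("i1", None))
after_delete = view.get("i1")
```

A check out deciding on `status_before` would treat a withdrawn item as available; how long that window lasts depends on how far the consumer is behind the producer.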
Variations
Store the copied data in mod-circulation-storage
Rather than introducing a database in mod-circulation, use the database that is already used by mod-circulation-storage.
Downstream requests will be needed from mod-circulation to mod-circulation-storage to access the views.
Cache the copied data in each instance of mod-circulation
Rather than introducing a database in mod-circulation, use a volatile cache within each instance of mod-circulation and use downstream requests to populate the cache.
Downstream requests are still needed from mod-circulation to populate the cache. Response times may be less stable when the cache needs to be repopulated.
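A minimal sketch of a per-instance read-through cache with a time-to-live (TTL), as one possible way of implementing this variation. The TTL policy, the class and the loader are all assumptions, and caching loan types is only an example of rarely changing reference data. The clock is injectable so the behaviour can be shown without waiting. A miss (or an expired entry) pays the full cost of the downstream request, which is where the less stable response times come from.

```python
import time
from typing import Callable, Dict, Tuple


class TtlCache:
    """Volatile, per-instance read-through cache with a fixed time-to-live."""

    def __init__(self, load: Callable[[str], dict], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load          # makes the downstream request on a miss
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl_s:
            return entry[1]        # hit: no downstream request
        value = self._load(key)    # miss or expired: pays the full request cost
        self._entries[key] = (now, value)
        return value


# Example: a fake clock and a hypothetical loader that records each downstream request
downstream_requests = []
fake_now = [0.0]


def load_loan_type(loan_type_id: str) -> dict:
    downstream_requests.append(f"/loan-types/{loan_type_id}")
    return {"id": loan_type_id, "name": "Can circulate"}


cache = TtlCache(load_loan_type, ttl_s=60, clock=lambda: fake_now[0])
cache.get("lt1")   # miss: downstream request
cache.get("lt1")   # hit
fake_now[0] = 61.0
cache.get("lt1")   # expired: downstream request again
print(len(downstream_requests))
```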
Combine the business logic and storage modules together
Characteristics
- Storage modules have been used to work around Okapi's constraints on cyclic dependencies; removing them might involve
Appendices
Definitions
Phrase | Definition |
---|---|
Downstream request | A request made by a module (via Okapi) in order to fulfil the original incoming request e.g. mod-circulation makes a request to mod-users to fetch patron information |
Requests made during a typical check out
The first 4 rows of the table describe the initial requests made by Okapi in reaction to the incoming (check out) request. I believe there are circumstances where these requests are made again; however, that is omitted from this analysis.
Intent | Endpoint | Destination Module | Sample Response Time (ms) | Sample Response Time of Token Check (ms) |
---|---|---|---|---|
Initial request | | | 99 | |
Fetch user (making the request) | GET /users/{id} | mod-users | 6 | |
Fetch permissions | GET /perms/users?query=userId=={id} | mod-permissions | 16 | |
Generate downstream token | | | 12 | |
Fetch user (patron) by barcode | GET /users?query=barcode=={barcode} | mod-users | 13 | 86 |
Fetch manual blocks | GET /manualblocks?query=userId=={userId} | mod-feesfines | 133 | 7 |
Fetch automated blocks | GET /automated-patron-blocks/{userId} | mod-patron-blocks | 546* | 27 |
Fetch item by barcode | GET /item-storage/items?query=barcode=={barcode} | mod-inventory-storage | 163 | 10 |
Fetch holdings | GET /holdings-storage/holdings/{id} | mod-inventory-storage | 57 | 9 |
Fetch instance | GET /instance-storage/instances/{id} | mod-inventory-storage | 22 | 7 |
Fetch location | GET /locations/{id} | mod-inventory-storage | 9 | 13 |
Fetch library | GET /location-units/libraries/{id} | mod-inventory-storage | 10 | 7 |
Fetch campus | GET /location-units/campuses/{id} | mod-inventory-storage | 10 | 7 |
Fetch institution | GET /location-units/institutions/{id} | mod-inventory-storage | 11 | 7 |
Fetch service point | GET /service-points/{id} | mod-inventory-storage | 9 | 8 |
Fetch material type | GET /material-types/{id} | mod-inventory-storage | 8 | 7 |
Fetch loan type | GET /loan-types/{id} | mod-inventory-storage | 22 | 8 |
Fetch existing loans | GET /loan-storage/loans?query=status.name=="Open" and itemId=={itemId} | mod-circulation-storage | 9 | 17 |
Fetch requests | GET /request-storage/requests?query=itemId=={itemId} and status==("Open - Not yet filled" or "Open - Awaiting pickup" or "Open - In transit" or "Open - Awaiting delivery") sortBy position/sort.ascending | mod-circulation-storage | 10 | 9 |
Fetch circulation rules | GET /circulation/rules | mod-circulation-storage | 18 | 18 |
Fetch loan policy | GET /loan-policy-storage/loan-policies/{id} | mod-circulation-storage | 10 | 8 |
Fetch tenant locale | GET /configurations/entries?query=module=="ORG" and configName=="localeSettings" | mod-configuration | 16 | 10 |
Fetch overdue fines policies | GET /overdue-fines-policies/{id} | mod-feesfines | 19 | 8 |
Fetch lost item fees policies | GET /lost-item-fees-policies/{id} | mod-feesfines | 11 | 10 |
Fetch opening days | GET /calendar/periods/7068e104-aa14-4f30-a8bf-71f71cc15e07/calculateopening?requestedDate={dueDate} | mod-calendar | 12 | 8 |
Fetch user (patron) groups | GET /groups?query=id=={groupId} | mod-users | 17 | 7 |
Update item status | PUT /item-storage/items/{id} | mod-inventory-storage | 194 | 13 |
Create loan | POST /loan-storage/loans | mod-circulation-storage | 16 | 8 |
Update patron action session | POST /patron-action-session-storage/patron-action-sessions | mod-circulation-storage | 10 | 7 |
Fetch user | GET /users/{id} | mod-users | 6 | 15 |
Fetch patron notice policy | GET /patron-notice-policy-storage/patron-notice-policies/1a821238-0cd9-48d9-a71a-057d33df0154 | mod-circulation-storage | 6 | 7 |
* The Vega team have already done some work to improve this
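The averages in the "What takes up the time?" table can be recomputed from the sample above with Python's `statistics` module; the results agree with the figures used in the analysis once truncated to whole milliseconds. The mean is the relevant figure for the budget, since for requests made one after another the total is the count multiplied by the mean. The median is shown alongside it because it illustrates the effect of the outliers mentioned in the analysis.

```python
from statistics import mean, median

# (intent, sample response time ms, sample token check ms) for the downstream
# requests in the table above, excluding Okapi's initial requests
SAMPLE = [
    ("Fetch user (patron) by barcode", 13, 86),
    ("Fetch manual blocks", 133, 7),
    ("Fetch automated blocks", 546, 27),
    ("Fetch item by barcode", 163, 10),
    ("Fetch holdings", 57, 9),
    ("Fetch instance", 22, 7),
    ("Fetch location", 9, 13),
    ("Fetch library", 10, 7),
    ("Fetch campus", 10, 7),
    ("Fetch institution", 11, 7),
    ("Fetch service point", 9, 8),
    ("Fetch material type", 8, 7),
    ("Fetch loan type", 22, 8),
    ("Fetch existing loans", 9, 17),
    ("Fetch requests", 10, 9),
    ("Fetch circulation rules", 18, 18),
    ("Fetch loan policy", 10, 8),
    ("Fetch tenant locale", 16, 10),
    ("Fetch overdue fines policies", 19, 8),
    ("Fetch lost item fees policies", 11, 10),
    ("Fetch opening days", 12, 8),
    ("Fetch user (patron) groups", 17, 7),
    ("Update item status", 194, 13),
    ("Create loan", 16, 8),
    ("Update patron action session", 10, 7),
    ("Fetch user", 6, 15),
    ("Fetch patron notice policy", 6, 7),
]

responses = [response for _, response, _ in SAMPLE]
token_checks = [token for _, _, token in SAMPLE]

mean_response, median_response = mean(responses), median(responses)
mean_token, median_token = mean(token_checks), median(token_checks)

print(f"downstream requests: {len(SAMPLE)}")
print(f"response time: mean {mean_response:.1f} ms, median {median_response} ms")
print(f"token check:   mean {mean_token:.1f} ms, median {median_token} ms")
```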