Performance (UXPROD-746)

[UXPROD-3317] Improve checkout performance by caching data Created: 21/Sep/21  Updated: 04/Jan/22  Resolved: 09/Nov/21

Status: Closed
Project: UX Product
Components: None
Affects versions: None
Fix versions: None
Parent: Performance

Type: New Feature Priority: P1
Reporter: Holly Mistlebauer Assignee: Holly Mistlebauer
Resolution: Won't Do Votes: 0
Labels: NFR, performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original estimate: Not Specified

Attachments: PNG File screenshot-1.png    
Issue links:
Defines
is defined by CIRC-1301 Cache loan type for checkout Closed
is defined by CIRC-1302 Cache material type for checkout Closed
is defined by CIRC-1303 Cache location for checkout Closed
is defined by CIRC-1304 Cache service point for checkout Closed
is defined by CIRC-1305 Cache institution for checkout Closed
is defined by CIRC-1306 Cache campus for checkout Closed
is defined by CIRC-1307 Cache library for checkout Closed
is defined by CIRC-1308 Cache instance for checkout Closed
is defined by CIRC-1309 Cache holdings for checkout Closed
is defined by CIRC-1310 Cache item by barcode for checkout Closed
is defined by CIRC-1311 Cache loan policy for checkout Closed
is defined by CIRC-1312 Cache circulation rules for checkout Closed
is defined by CIRC-1313 Cache tenant locale for checkout Closed
is defined by CIRC-1314 Cache lost item fees policies for che... Closed
is defined by CIRC-1315 Cache overdue fines policies for chec... Closed
is defined by CIRC-1316 Cache user (patron) groups for checkout Closed
is defined by CIRC-1317 Cache user for checkout Closed
Epic Link: Performance
Front End Estimate: Out of scope
Front-End Confidence factor: Low
Back End Estimate: Jumbo: > 45 days
PO Rank: 0

 Description   

Overview:
Cornell reports that check-in and checkout times range from 1 second to 5 seconds (and sometimes up to 11 seconds). Missouri State reports the same range. We need to improve the processing time for check-ins and checkouts.

Marc Johnson has created a proposal at https://folio-org.atlassian.net/wiki/x/DgJU. After reviewing the proposal, the Capacity Planning Team has determined that we should proceed with the caching approach. Marc is in the process of creating a document outlining the technical aspects for the devs.

Steps:

  1. Ask the assigned team to come up with an approach within 2 weeks.
  2. Try out this caching approach on one record type, choosing the one with the biggest impact.
  3. After we are satisfied with the process, implement caching for as many other record types as we are able to during the release. We need to prioritize the remaining record types so that the heavy hitters are addressed first.
  4. Have the PTF team analyze the results of the caching work completed.
  5. Discuss the impact of caching with the Resource Access SIG (e.g. if a cached record is more than X minutes old, refresh it). We are waiting until we know the impact of caching on the response time so that we are able to present the process with as much information as possible.
  6. Determine next steps based on new PTF team analysis.

The "is defined by" stories for this feature should be worked on in the order proposed in Marc Johnson's comment of 29/Sep/21 below.

Recommended approach to take:

  • cache expiration of 5 seconds for all record types
  • maximum cache size of 1000 records (this is pure speculation, as we don't know what impact the caching will have on memory usage)


 Comments   
Comment by Holly Mistlebauer [ 21/Sep/21 ]

Marc Johnson: Which record types should we cache first? I would like to create tickets for at least 5.

Comment by Marc Johnson [ 21/Sep/21 ]

Holly Mistlebauer

Which record types should we cache first? I would like to create tickets for at least 5.

I don't think it would be appropriate for me to make that decision.

My preference would be for the RA SIG or relevant POs to decide. Cheryl Malmborg do you have a preference?

Given the conversation in Capacity Planning, maybe Khalilah Gambrell or Hkaplanian have some thoughts on which should be chosen?

Comment by Hkaplanian [ 21/Sep/21 ]

The path forward here is to take the list of requests that the circ apps make that take the most time and, from a technical perspective, look at which database lookups would give us the most time savings. We already have that data at the call level (I believe). In many ways this is a POC, and I assume multiple questions will arise both during and after, where the SIG might be able to help us clarify, but at this stage we have the data needed to make that decision. We already have the list I refer to. Marc, it's up to you at this stage.

 

Comment by Holly Mistlebauer [ 24/Sep/21 ]

Marc Johnson: I have looked at the response time of each "Intent" using the data available at https://folio-org.atlassian.net/wiki/x/DgJU.

The "Intents" with the highest response times are...

  • Fetch automated blocks (546 ms)
  • Fetch item barcode (163 ms)
  • Update item status (194 ms)
  • Fetch manual patron blocks (133 ms)

I am assuming that we can only cache the "Fetch" "Intents", so "Update item status" is out. Two of the "Fetches" have already had an improvement made ("Fetch automated blocks" and "Fetch item barcode") and one has a separate issue for the Vega team to address ("Fetch manual patron blocks"). Marc Johnson: Should we include any of these 4 in the group of 5 we start with?

If not, I am thinking we could do these 5...

  • Fetch holdings (57 ms)
  • Fetch instance (22 ms)
  • Fetch overdue fine policy (19 ms)
  • Fetch circulation rules (18 ms)
  • Fetch lost item fee policy (11 ms)

Thoughts?

cc: Hkaplanian; Khalilah Gambrell

Comment by Hkaplanian [ 24/Sep/21 ]

Marc can correct me, but I think a good set of criteria could be:

  1. Which of the items listed are run for each and every checkout?
  2. Of those, which are relatively small tables that can be loaded into memory easily? Ex: The barcode file is probably too large and unique for every scan. Circ rules can be reused with each and every scan, don't change often, and could be a good candidate. Same for fine policies.
  3. But, Marc Johnson, looking at the small time periods it takes, do any of those pay? Are they also used when looking up automated blocks, which could give us better payback? Is there any real savings to be had here?

Comment by Hkaplanian [ 28/Sep/21 ]

Marc, looking at this list:

  • Fetch holdings (57 ms)
  • Fetch instance (22 ms)
  • Fetch overdue fine policy (19 ms)
  • Fetch circulation rules (18 ms)
  • Fetch lost item fee policy (11 ms)

During a loan, are these called for the same data multiple times? Once per item? 2x per item? Is fetching the lost item fee policy worth it if in total it only saves 11 ms? Just wondering...

Comment by Marc Johnson [ 28/Sep/21 ]

Hkaplanian

During a loan, are these called for the same data multiple times? Once per item? 2x per item?

They should only be requested once per check out (which is only for a single item).

Is fetching the lost item fee policy worth it if in total it only saves 11 ms? Just wondering...

looking at the small time periods it takes, do any of those pay? Are they also used when looking up automated blocks which could give us better payback? Is there any real savings to be had here?

Not really, not on their own. It's also worth remembering that this sample is likely misleading (it's all we've got to go on).

I think this is where evaluating if this approach is improving the performance is going to be challenging.

The current performance is due to the cumulative effect of many requests. That likely means we will only get a fairly small (maybe no real) improvement from eliminating any one request.

As we don't know how each of these requests degrade under load (we only have a sample under no load and the overall check out API response times), it is challenging to know which ones put the system under pressure and which ones become more significant constraints under load.

What this all means is that we aren't likely to know how well we've done until multiple record types have been done and a full load test has been conducted. This makes getting timely feedback on whether we've chosen the right approach and record types challenging.

Comment by Marc Johnson [ 28/Sep/21 ]

Hkaplanian

Which of the items listed are run for each and every checkout?

Most of them will be fetched for every request; these are the ones I'm confident of:

  • item
  • holdings record
  • instance
  • location
  • library
  • campus
  • institution
  • material type
  • loans
  • requests
  • user
  • user group
  • patron blocks (both manual and automatic)
  • service point
  • loan policy
  • circulation rules (although there is already caching here)
  • tenant locale

Loans, requests and items are poor candidates for caching (though not for other forms of derived data; this is why my preference was for persistent derived data), as I imagine check outs for the same item in a short time frame are rare. I don't know what the impact of title level requests will be on this area.

I cannot answer that authoritatively without much more analysis of all of the code paths in the system.

Of those, which are relatively small tables that can be loaded into memory easily. Circ rules can be reused with each and every scan and don't change often and could be a good candidate. Same for fine policies.

Can you help me understand why you are asking this question?

The approach that we've chosen (on-demand, partial caching) means that we likely won't be loading the entire set of records for any record type into memory; when records are loaded, it will be one at a time.

Ex: Barcode file is probably too large and unique for every scan.

I believe the unique barcode changes have been aborted for 2021 R2 and maybe 2021 R3 due to some organisations not being ready for this change.
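
The on-demand, partial caching described above can be sketched as a get-or-fetch wrapper: only the record that is actually requested gets loaded and cached, nothing is preloaded. This is a minimal illustrative sketch, not FOLIO code; the class and names are invented.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of on-demand, partial caching: records are
// fetched and cached one at a time, on first use, rather than
// loading whole tables into memory up front.
class OnDemandCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> fetch; // stand-in for a storage lookup

    OnDemandCache(Function<K, V> fetch) {
        this.fetch = fetch;
    }

    V get(K id) {
        // Loads (and caches) only the requested record, on a miss.
        return cache.computeIfAbsent(id, fetch);
    }
}
```

A second request for the same id within the cache's lifetime is served from memory without another storage call.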

Comment by Marc Johnson [ 29/Sep/21 ]

Hkaplanian Holly Mistlebauer

I think I've answered the current questions asked and provided my thoughts on implementation. Please let me know if you need anything else at the moment.

I am assuming that we can only cache the "Fetch" "Intents", so "Update item status" is out.

Yes, state changes cannot be avoided (at least not with the current process design).

Two of the "Fetches" have already had an improvement made ("Fetch automated blocks" and "Fetch item barcode") and one has a separate issue for the Vega team to address ("Fetch manual patron blocks"). Marc Johnson: Should we include any of these 4 in the group of 5 we start with?

I think we should exclude, for the moment, any of the operations to which we have decided to dedicate separate work, in order to understand the impact of those improvements separately (ish, depending upon the frequency of performance testing) from the caching changes.

My Proposed Ordering

Given that we aren't going to work with the RA SIG or other stakeholders to understand the tolerances they might accept, and that the response times for most of the record fetches are of a similar magnitude, I think it makes sense to start with the ones that (I think are) less likely to change and/or where the impact of changes will likely be low.

The policies are in a slightly strange place in this list; I've put them a little higher than the potential negative impact might suggest, because we might want to get some of that feedback sooner rather than later.

Here is my proposed ordering (my reasoning in brackets):

  • tenant locale (singular, is common to all check outs, should change very rarely)
  • loan type (probably small set, likely common to some check outs, low impact if inconsistent)
  • patron group (probably small set, likely common to some check outs, low impact if inconsistent)
  • material type (unsure of set size, likely common to some check outs, low impact if inconsistent)
  • location (unsure of set size, likely common to some check outs, low impact if inconsistent)
  • service point (unsure of set size, likely common to some check outs, low impact if inconsistent)
  • loan policy (small set, likely common to many check outs, possible high impact if inconsistent)
  • lost item policy (small set, likely common to many check outs, possible high impact if inconsistent)
  • overdue fine policy (small set, likely common to many check outs, possible high impact if inconsistent)
  • institution (unsure of set size, likely common to some check outs, low impact if inconsistent)
  • campus (unsure of set size, likely common to some check outs, low impact if inconsistent)
  • library (unsure of set size, likely common to some check outs, low impact if inconsistent)
  • circulation rules (already cached, we may want to replace this with a similar cache to what we implement in other places)
  • instance (large set, unlikely to be common to many check outs)
  • holdings record (large set, unlikely to be common to many check outs)
  • user (large set, unlikely to be common to many check outs)
  • item (large set, not common to any successful check outs within the time frame)

Cache Policies

I suggest we start with

  • a cache expiration of 5 seconds for all record types
  • a maximum cache size of 1000 records (this is pure speculation, as we don't know what impact the caching will have on memory usage)

Both of these should be runtime configurable so the PTF team (and other operational folks) can tweak them.
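
A runtime-configurable version of these two policies (TTL and maximum size) might look like the following. This is a hedged sketch: the property names (`checkout.cache.ttl.seconds`, `checkout.cache.max.size`) and the class are invented for illustration, and the eviction here is deliberately crude (a real implementation would use an LRU cache such as Caffeine).

```java
import java.time.Duration;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch: a per-record-type cache whose expiration and
// maximum size can be supplied at runtime, so operational folks can
// tweak them without a code change.
class ConfigurableCache<K, V> {
    private record Entry<V>(V value, long loadedAtMillis) {}

    private final Map<K, Entry<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> fetch;
    private final long ttlMillis;
    private final int maxSize;

    ConfigurableCache(Function<K, V> fetch, Duration ttl, int maxSize) {
        this.fetch = fetch;
        this.ttlMillis = ttl.toMillis();
        this.maxSize = maxSize;
    }

    // Reads the suggested defaults (5 s, 1000 records) from system
    // properties; the property names are assumptions, not a FOLIO convention.
    static <K, V> ConfigurableCache<K, V> fromConfig(Function<K, V> fetch) {
        long ttlSeconds = Long.getLong("checkout.cache.ttl.seconds", 5);
        int maxSize = Integer.getInteger("checkout.cache.max.size", 1000);
        return new ConfigurableCache<>(fetch, Duration.ofSeconds(ttlSeconds), maxSize);
    }

    V get(K id) {
        long now = System.currentTimeMillis();
        Entry<V> entry = cache.get(id);
        if (entry == null || now - entry.loadedAtMillis() >= ttlMillis) {
            if (cache.size() >= maxSize) {
                evictOne(); // keep the cache bounded before inserting
            }
            entry = new Entry<>(fetch.apply(id), now);
            cache.put(id, entry);
        }
        return entry.value();
    }

    private void evictOne() {
        // Crude bound: drop an arbitrary entry; a real cache would use LRU.
        Iterator<K> it = cache.keySet().iterator();
        if (it.hasNext()) {
            it.next();
            it.remove();
        }
    }
}
```

Because both knobs are plain configuration values, the PTF team could raise or lower them between test runs without redeploying.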

Comment by Holly Mistlebauer [ 29/Sep/21 ]

Khalilah Gambrell and Hkaplanian: Hi! I have created the stories for this feature. Should I assign this to Vega? Thanks...
cc: Marc Johnson

Comment by Julian Ladisch [ 05/Oct/21 ]

If optimistic locking is enabled for a table, getting an outdated record from the cache and using it for a PUT will result in an optimistic locking failure; the failure will persist on reload (because the stale record is served from the cache) until the cache expiration time has been reached.

This can be avoided by invalidating the cache for that record when doing a PUT.
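
The invalidation Julian describes can be sketched as a write-through update that drops the cached copy after a successful PUT, so the next read reloads the current version instead of repeatedly tripping the optimistic locking check until the TTL expires. The class and names below are invented for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Hypothetical sketch: invalidate the cached record on PUT so that a
// stale copy cannot keep causing optimistic locking failures.
class InvalidatingCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> fetch;  // stand-in for a GET to storage
    private final BiConsumer<K, V> put;  // stand-in for a PUT to storage

    InvalidatingCache(Function<K, V> fetch, BiConsumer<K, V> put) {
        this.fetch = fetch;
        this.put = put;
    }

    V get(K id) {
        return cache.computeIfAbsent(id, fetch);
    }

    void update(K id, V newValue) {
        put.accept(id, newValue); // write through to storage first
        cache.remove(id);         // then invalidate, so the next get() reloads
    }
}
```

After `update()`, the next `get()` for that id fetches the freshly written record rather than the expired-but-still-cached one.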

Comment by Holly Mistlebauer [ 09/Nov/21 ]

It was decided that caching data would not give us the level of performance improvement we need.

Generated at Fri Feb 09 00:31:04 UTC 2024 using Jira 1001.0.0-SNAPSHOT#100246-sha1:7a5c50119eb0633d306e14180817ddef5e80c75d.