Check-in items API Report (Goldenrod)

PERF-90

Overview

In this testing effort the performance of POST /circulation/check-in-by-barcode running in the Goldenrod release (PERF-90) was tested with 1, 5, 8, and 20 virtual users for 30 minutes.
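For context, each virtual user repeatedly issues a request along the lines of the sketch below. This is not the actual load-test script; the Okapi URL, tenant, token, and barcode are placeholders, and the body fields follow the check-in-by-barcode interface (itemBarcode, servicePointId, checkInDate).

    import requests                              # simple HTTP client, used for illustration
    from datetime import datetime, timezone

    OKAPI_URL = "https://okapi.example.org"      # placeholder Okapi gateway
    HEADERS = {
        "X-Okapi-Tenant": "diku",                # placeholder tenant
        "X-Okapi-Token": "<token from POST /authn/login>",
        "Content-Type": "application/json",
    }

    def check_in(item_barcode, service_point_id):
        """Perform one check-in and return the response time in milliseconds."""
        body = {
            "itemBarcode": item_barcode,
            "servicePointId": service_point_id,
            "checkInDate": datetime.now(timezone.utc).isoformat(),
        }
        resp = requests.post(f"{OKAPI_URL}/circulation/check-in-by-barcode",
                             json=body, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        return resp.elapsed.total_seconds() * 1000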


Backend:

  • mod-inventory-storage-19.2.4
  • mod-inventory-16.0.1
  • mod-circulation-19.0.9
  • mod-circulation-storage-12.0.1
  • okapi-3.1.1
  • mod-authtoken-2.5.1
  • mod-permissions-5.11.2

Frontend:

  • folio_inventory-4.0.3
  • folio_circulation-3.0.1
  • Item Check-in (folio_checkin-3.0.1)

Environment:

  • 61 back-end modules deployed in 110 ECS services
  • 3 okapi ECS services
  • 8 m5.large  EC2 instances
  • 2 db.r5.xlarge AWS RDS instances (1 reader, 1 writer)
  • INFO Okapi logging level

High Level Summary

  1. POST /circulation/check-in-by-barcode takes 460ms-500ms on average for up to 8 users, and is much slower for 20 users.
  2. Compared to Fameflower, this Goldenrod version is about 11%-20% slower, though a couple of tests needed to be redone to confirm the results.
  3. There were no slow queries or individually slow API calls, but the total time of the many API calls adds up.
  4. Calls to mod-authtoken account for at least 20% of the overall time.
  5. mod-circulation appears to have memory leaks.

Test Runs and Results

Test | Virtual Users | Duration | OKAPI log level | Profiled | Ramp up (secs) | Average (ms) | 75th %tile (ms)
1    | 1             | 30 min   | INFO            | No       | 1              | 459          | 549
2    | 5             | 30 min   | INFO            | No       | 50             | 460          | 483
3    | 8             | 30 min   | INFO            | No       | 80             | 502          | 529
4    | 20            | 30 min   | INFO            | No       | 200            | 975          | 1093
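The Average and 75th percentile columns are aggregates over the individual response-time samples collected during each 30-minute run. As a quick illustration of how those two numbers relate (the sample values below are made up, not test data):

    from statistics import mean, quantiles

    # Hypothetical response-time samples (ms) from one run
    samples = [421, 447, 459, 473, 502, 515, 549, 601]

    average_ms = mean(samples)
    p75_ms = quantiles(samples, n=4)[2]   # third quartile = 75th percentile

    print(f"average {average_ms:.0f} ms, 75th percentile {p75_ms:.0f} ms")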

Running this test on Fameflower gives the following comparison:

Virtual Users | FF Average (ms) | GR Average (ms) | Delta % | FF 75th %tile (ms) | GR 75th %tile (ms) | Delta %
1             | 414             | 459             | 10.87   | 461                | 549                | 19.09
5             | 435             | 460             | 5.75    | 458                | 483                | 5.46
8             | 505             | 502             | -0.59   | 531                | 529                | -0.38
20            | 851             | 975             | 14.57   | 914                | 1093               | 19.58
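The Delta % columns here (and in the later comparison table) are the relative slowdown of Goldenrod (GR) over Fameflower (FF), that is (GR - FF) / FF expressed as a percentage. For example, for the 1-user row:

    def delta_pct(ff_ms, gr_ms):
        """Relative slowdown of GR vs. FF, in percent."""
        return (gr_ms - ff_ms) / ff_ms * 100

    print(round(delta_pct(414, 459), 2))   # 10.87 -> 1-user average delta
    print(round(delta_pct(461, 549), 2))   # 19.09 -> 1-user 75th percentile delta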

It looks like Goldenrod is slower than Fameflower by about 10%-19%. However, the results were not consistent across user levels (the deltas for 5 and 8 users are much smaller than those for 1 and 20 users), so a new set of tests was run against both the Goldenrod and Fameflower builds:

Test | Virtual Users | Duration | OKAPI log level | Profiled | Ramp up (secs) | Average (ms) | 75th %tile (ms)
5    | 1             | 30 min   | WARNING         | No       | 1              | 463          | 559
6    | 5             | 30 min   | WARNING         | No       | 50             | 466          | 486
7    | 8             | 30 min   | WARNING         | No       | 80             | 523          | 550
8    | 20            | 30 min   | WARNING         | No       | 200            | 988          | 1072

When comparing to a new set of Fameflower response times:

Virtual Users | FF Average (ms) | GR Average (ms) | Delta % | FF 75th %tile (ms) | GR 75th %tile (ms) | Delta %
1             | 410             | 463             | 12.93   | 455                | 559                | 22.86
5             | 420             | 466             | 10.95   | 442                | 486                | 9.95
8             | 471             | 523             | 11.04   | 499                | 550                | 10.22
20            | 851             | 988             | 16.10   | 909                | 1072               | 17.93

A couple of things to note about the methodology of the second round of test runs:

  1. mod-circulation and mod-feesfines were both restarted before each test run because there are indications that these two modules have memory leaks. Over time these leaks affect performance and the modules take longer to respond, so it was important to have them in the same state before each test run for consistent response times.
  2. The Okapi log level on both builds, FF and GR, was confirmed to be the same. This time it was set to WARNING, but the specific level is not relevant here as long as both builds use the same one.

Here we see a more consistent and tighter band of deltas between the test runs. Based on the second set of results, and also considering the first set, we can more confidently conclude that the Goldenrod version is about 10%-17% slower than Fameflower.

JVM Profiling

Database

Database monitoring shows that CPU usage increases with the number of users: about 3-5% for 1 user, and about 60% for the high case of 20 users.

Virtual Users | CPU %
1             | 8%
5             | 34%
8             | 50%
20            | 65%
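A sketch of how these database CPU figures could be collected, assuming they come from CloudWatch's standard AWS/RDS CPUUtilization metric (the region, instance identifier, and time window below are placeholders):

    from datetime import datetime, timedelta, timezone
    import boto3   # assumes AWS credentials are already configured

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")   # placeholder region

    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=30)       # one 30-minute test window

    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "folio-goldenrod-writer"}],  # placeholder
        StartTime=start, EndTime=end, Period=60,
        Statistics=["Average", "Maximum"],
    )

    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], round(point["Average"], 1), round(point["Maximum"], 1))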

Database Slow queries 

There were no slow queries for this API call; slow queries are defined here as those taking longer than 20ms.


CPU Utilization

The following table shows the CPU utilization percentage of the ECS services running Okapi and the relevant modules.


Module                  | 1 user: Avg % (Range %) | 5 users: Avg % (Range %) | 8 users: Avg % (Range %) | 20 users: Avg % (Range %)
Okapi                   | 47 (45-48)              | 167 (156-177)            | 228 (218-238)            | 240 (154-282)
mod-circulation         | 146 (146-152)           | 155 (145-155)            | 135 (103-159)            | 237 (166-800)
mod-circulation-storage | 12 (10.31-14.77)        | 30 (27-41)               | 40 (39-61)               | 48 (47-52)
mod-inventory           | 3.0 (3.03-3.26)         | 3.28 (3.23-3.49)         | 3.16 (3.0-3.31)          | 3.11 (3.10-3.22)
mod-inventory-storage   | 24 (17.83-26.11)        | 87 (81-90)               | 123 (114-152)            | 142 (111-159)

mod-circulation seems to perform erratically at times and should be the focus of a longevity test.

The graph below shows mod-circulation CPU usage. The written numbers indicate the number of virtual users that participated in each test, and the arrows point to when the tests were running. Observations: CPU usage increased over time during each test run. Immediately after each run the CPU utilization came down, but not back to its starting point, which indicates some kind of leak. For the 20-user run there was a huge spike about 5 minutes into the test; at that point the container was running out of memory (see the memory graph in the next section), so things became unstable.

Memory

Memory for most of the relevant modules, except for mod-circulation, was stable throughout the runs; aside from an occasional spike, consumption was consistent over each 30-minute run.


Module                  | 1 user Average | 5 users Average | 8 users Average | 20 users Average
Okapi                   | 67%            | 67%             | 67%             | 67%
mod-circulation         | 128%           | 120%            | 103%            | 140%; 47%-97%
mod-circulation-storage | 45.7%          | 45.7%           | 45.67%          | 45.7%
mod-inventory           | 55%            | 55%             | 55%             | 55%
mod-inventory-storage   | 56%            | 56%             | 56%             | 56%

Memory was stable for Okapi and all relevant modules except for mod-circulation, which may have a memory leak. Testing proceeded in this sequence: 8 users, 5 users, 1 user, 20 users, and mod-circulation's average memory utilization climbed from 103% to 140% over those four tests. mod-circulation crashed and was restarted during the 20-user run, so its memory consumption range for that run is lower than for the other runs, but it started out at 140%.

mod-circulation memory usage, showing when the tests were run. The written numbers indicate the number of virtual users that participated in each test. The graph shows an increase in memory usage during each test run that was not released afterwards. For the 20-user test run, memory started out at a very high level and peaked at 140%, which is when the container crashed and restarted, as seen by the vertical drop-off.

Heap dumps taken for consecutive test runs (not the runs shown in the diagrams above, but from a later date) show an increase in the number of HTTP client objects.

Here is the leak suspects report for a test run, generated by Eclipse's Memory Analyzer tool.

Here is the leak suspects report for the next consecutive run:

Note the increase in the number of HttpClientImpl objects from one run (10,88,480) to the next (19,666,000). The number of connection manager objects increased as well, from 4,001,128 to 7,280,328.

Giraffe Diagrams

Giraffe was used to graph the sequences of API calls triggered by invoking the POST /circulation/check-in-by-barcode API. The graphs are PNG images in the attached check-in.zip file. The Giraffe diagrams don't show any slow calls, but there are two prevalent observations:

  • There are lots and lots of API calls. Even if each call is fast individually, the total time adds up to hundreds of milliseconds.
    • There are individual round trips to mod-inventory-storage, for example, to get various pieces of data (campuses, material types, service points, etc.).
  • For each API call there is an accompanying call to mod-authtoken. These calls average 2ms-5ms, with spikes of more than 40ms. In total the calls to mod-authtoken add up to between 139ms (if outlier times are replaced by average times) and 347ms (actual measured time); see the sketch after this list.
  • There were no specific slow API calls.
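A sketch of how the mod-authtoken contribution can be totaled from one trace, including the "outliers replaced by average" variant mentioned above (the span data and field layout are illustrative, not the actual Giraffe output format):

    from statistics import mean

    # Illustrative (module, duration_ms) pairs extracted from one check-in trace
    spans = [
        ("mod-authtoken", 3), ("mod-circulation", 55), ("mod-authtoken", 4),
        ("mod-inventory-storage", 12), ("mod-authtoken", 42), ("mod-circulation-storage", 18),
    ]

    auth_times = [ms for module, ms in spans if module == "mod-authtoken"]
    avg = mean(auth_times)

    actual_total = sum(auth_times)
    # Replace spikes (e.g. > 40 ms) with the average to estimate a "typical" total
    capped_total = sum(ms if ms <= 40 else avg for ms in auth_times)

    print(f"mod-authtoken: actual {actual_total} ms, outliers capped {capped_total:.1f} ms")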

Recommended Improvements 

    • mod-circulation needs to be checked for memory leaks (CIRC-863).
    • mod-authtoken call patterns need to be revisited.
    • mod-authtoken performance optimizations (MODAT-80).
    • A better and more efficient way to interact with the storage modules needs to be considered, to minimize the number of round trips (see the sketch below).
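To illustrate the last recommendation, one option is to batch the many single-record lookups against the storage modules into CQL "or" queries, so several records come back in one round trip. This is a hedged sketch, reusing the placeholder Okapi settings from the Overview; /material-types is one of the mod-inventory-storage reference endpoints, but the collection field name and the right batching points would need to be confirmed against the actual module code:

    import requests

    OKAPI_URL = "https://okapi.example.org"   # placeholders, as in the Overview sketch
    HEADERS = {"X-Okapi-Tenant": "diku", "X-Okapi-Token": "<token>"}

    def get_material_types(ids):
        """Fetch several material types in one round trip using a CQL 'or' query
        instead of issuing one GET per id (error handling omitted)."""
        query = "id==(" + " or ".join(ids) + ")"
        resp = requests.get(f"{OKAPI_URL}/material-types",
                            params={"query": query, "limit": len(ids)},
                            headers=HEADERS, timeout=30)
        resp.raise_for_status()
        return resp.json().get("mtypes", [])  # collection field name assumed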

Appendix

Heap dumps are available, but they are huge so they are not attached here.