Check-in items API Report (Goldenrod)
- PERF-90
Overview
In this testing effort the performance of POST /circulation/check-in-by-barcode running in the Goldenrod release (PERF-90) is tested with 1, 5, 8, and 20 virtual users for 30 minutes.
Backend:
- mod-inventory-storage-19.2.4
- mod-inventory-16.0.1
- mod-circulation-19.0.9
- mod-circulation-storage-12.0.1
- okapi-3.1.1
- mod-authtoken-2.5.1
- mod-permissions-5.11.2
Frontend:
- folio_inventory-4.0.3
- folio_circulation-3.0.1
- Item Check-in (folio_checkin-3.0.1)
Environment:
- 61 back-end modules deployed in 110 ECS services
- 3 okapi ECS services
- 8 m5.large EC2 instances
- 2 db.r5.xlarge AWS RDS instances (1 reader, 1 writer)
- INFO Okapi logging level
High Level Summary
POST /circulation/check-in-by-barcode takes on average 460ms-500ms for up to 8 users and is much slower for 20 users.
- Compared to Fameflower, this Goldenrod version is about 11%-20% slower, but a couple of tests need to be redone to confirm the results.
- There were no slow queries or individually slow API calls, but the total time of all the API calls adds up to the observed slowness.
- Calls to mod-authtoken account for at least 20% of the overall time.
- mod-circulation appears to have a memory leak.
Test Runs and Results
Test | Virtual Users | Duration | OKAPI log level | Profiled | Ramp up (secs) | Average (ms) | 75th %tile (ms) |
---|---|---|---|---|---|---|---|
1. | 1 | 30 min | INFO | No | 1 | 459 | 549 |
2. | 5 | 30 min | INFO | No | 50 | 460 | 483 |
3. | 8 | 30 min | INFO | No | 80 | 502 | 529 |
4. | 20 | 30 min | INFO | No | 200 | 975 | 1093 |
Running the same tests on Fameflower gives the following comparison:
Virtual Users | FF Average (ms) | GR Average (ms) | Delta % | FF 75th %tile (ms) | GR 75th %tile (ms) | Delta % |
---|---|---|---|---|---|---|
1 | 414 | 459 | 10.87 | 461 | 549 | 19.09 |
5 | 435 | 460 | 5.75 | 458 | 483 | 5.46 |
8 | 505 | 502 | -0.59 | 531 | 529 | -0.38 |
20 | 851 | 978 | 14.57 | 914 | 1093 | 19.58 |
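For reference, the Delta % columns in this table and the following ones are the relative change from Fameflower to Goldenrod. A minimal sketch of the calculation (a hypothetical helper, not part of the test harness):

```java
// Relative change from Fameflower (FF) to Goldenrod (GR), in percent.
static double deltaPercent(double ffMs, double grMs) {
    return (grMs - ffMs) / ffMs * 100.0;
}

// deltaPercent(414, 459) ≈ 10.87 and deltaPercent(461, 549) ≈ 19.09,
// matching the 1-user row above.
```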
It looks like Goldenrod is slower than Fameflower by about 10%-19%. However, the results for the 5- and 8-user runs were inconsistent with those for the 1- and 20-user runs, so a new set of tests was run against both the Goldenrod and Fameflower builds:
Test | Virtual Users | Duration | OKAPI log level | Profiled | Ramp up (secs) | Average (ms) | 75th %tile (ms) |
---|---|---|---|---|---|---|---|
5. | 1 | 30 min | WARNING | No | 1 | 463 | 559 |
6. | 5 | 30 min | WARNING | No | 50 | 466 | 486 |
7. | 8 | 30 min | WARNING | No | 80 | 523 | 550 |
8. | 20 | 30 min | WARNING | No | 200 | 988 | 1072 |
Comparing against a new set of Fameflower response times:
Virtual Users | FF Average (ms) | GR Average (ms) | Delta % | FF 75th %tile (ms) | GR 75th %tile (ms) | Delta % |
---|---|---|---|---|---|---|
1 | 410 | 463 | 12.93 | 455 | 559 | 22.86 |
5 | 420 | 466 | 10.95 | 442 | 486 | 9.95 |
8 | 471 | 523 | 11.04 | 499 | 550 | 10.22 |
20 | 851 | 988 | 16.10 | 909 | 1072 | 17.93 |
A couple of things to note about the methodology of the second round of test runs:
- mod-circulation and mod-feesfines were restarted before each test run because there are indications that these two modules have memory leaks. Over time these leaks degrade performance and the modules take longer to respond, so it was important to have them in the same state before each test run for consistent timings.
- The OKAPI log level on both builds, FF and GR, was confirmed to be the same. This time it was set to the WARNING level, but the specific level is not relevant to this discussion as long as both builds use the same one.
Here we see a tighter and more consistent band of delta variability between the test runs. With the second set of results, and also considering the first set, we can more confidently conclude that the Goldenrod version is about 10%-17% slower than Fameflower.
JVM Profiling
- This API was not profiled here because profiling for it was already incorporated in the Check in items workflow Test Report (Fameflower).
Database
The database shows that CPU usage increases with the number of users, from only a few percent for 1 user up to about 60-65% for the high case of 20 users.
Virtual Users | CPU Usage (%) |
---|---|
1 | 8 |
5 | 34 |
8 | 50 |
20 | 65 |
Database Slow queries
There were no slow queries for this API. Slow queries are defined as those taking longer than 20ms.
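For context, slow statements can be checked directly on the RDS instance, for example via the pg_stat_statements extension. The sketch below is only an illustration of applying the 20ms threshold; it assumes pg_stat_statements is enabled and that the PostgreSQL version exposes the mean_time column, and it is not the exact tooling used for this report.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Illustrative only: list statements whose mean execution time exceeds 20 ms,
// assuming the pg_stat_statements extension is enabled (mean_time is the
// PostgreSQL 10-12 column name).
public class SlowQueryCheck {
    public static void main(String[] args) throws Exception {
        String url = System.getenv("DB_URL"); // e.g. jdbc:postgresql://<rds-host>/<db>
        try (Connection conn = DriverManager.getConnection(
                 url, System.getenv("DB_USER"), System.getenv("DB_PASS"));
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT query, calls, mean_time FROM pg_stat_statements"
               + " WHERE mean_time > 20 ORDER BY mean_time DESC LIMIT 20");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                System.out.printf("%8.1f ms x %6d  %s%n",
                    rs.getDouble("mean_time"), rs.getLong("calls"), rs.getString("query"));
            }
        }
    }
}
```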
CPU Utilization
The following shows the CPU utilization percentage of the ECS services running Okapi and the relevant modules.
Module | 1 user Avg % | 1 user Range % | 5 users Avg % | 5 users Range % | 8 users Avg % | 8 users Range % | 20 users Avg % | 20 users Range % |
---|---|---|---|---|---|---|---|---|
Okapi | 47 | 45-48 | 167 | 156-177 | 228 | 218-238 | 240 | 154-282 |
mod-circulation | 146 | 146-152 | 155 | 145-155 | 135 | 103-159 | 237 | 166-800 |
mod-circulation-storage | 12 | 10.31-14.77 | 30 | 27-41 | 40 | 39-61 | 48 | 47-52 |
mod-inventory | 3.0 | 3.03-3.26 | 3.28 | 3.23-3.49 | 3.16 | 3.0-3.31 | 3.11 | 3.10-3.22 |
mod-inventory-storage | 24 | 17.83-26.11 | 87 | 81-90 | 123 | 114-152 | 142 | 111-159 |
mod-circulation seems to perform erratically at times and should be the focus of a longevity test.
The graph below shows mod-circulation CPU usage. The written numbers indicate the number of virtual users that participated in each test, and the arrow points to when the tests were running. Observations: CPU usage increased over time for each test run performed. Immediately after each test run the CPU utilization came down, but not back to its starting point, which indicates some kind of leak. For the 20-user run there was a huge spike about 5 minutes into the test; at this time the container was running out of memory (see the memory graph in the next section), so things became unstable.
Memory
Memory for most of the relevant modules, except for mod-circulation, was stable throughout the runs, with only an occasional spike; over a 30-minute run they were consistent.
Module | 1 user Avg | 5 users Avg | 8 users Avg | 20 users Avg |
---|---|---|---|---|
Okapi | 67% | 67% | 67% | 67% |
mod-circulation | 128% | 120% | 103% | 140%; 47%-97% |
mod-circulation-storage | 45.7% | 45.7% | 45.67% | 45.7% |
mod-inventory | 55% | 55% | 55% | 55% |
mod-inventory-storage | 56% | 56% | 56% | 56% |
Memory was stable for OKAPI and all relevant modules except for mod-circulation, which may have a memory leak. Testing proceeded in this sequence: 8 users, 5 users, 1 user, 20 users, and mod-circulation's average memory utilization climbed from 103% to 140% over the 4 tests. mod-circulation crashed and was restarted during the 20-user run, so its memory consumption range is lower than in the other runs, but it started out at 140%.
The graph below shows mod-circulation memory usage while the tests were run. The written numbers indicate the number of virtual users that participated in each test. The graph shows an increase in memory usage during each test run that was not released afterwards. For the 20-user test run, memory started out at a very high level and peaked at 140%, at which point the container crashed and restarted, as seen by the vertical drop-off.
Heap dumps taken for consecutive test runs (not specifically the ones in the diagrams above, but at a later date) show an increase in the number of HTTP client objects.
Here is the leak suspects report for a test run, generated by Eclipse's Memory Analyzer tool.
Here is the leak suspects report for the next consecutive run.
Note the increase in the number of HttpClientImpl objects from one run (10,88,480) to the next (19,666,000). Connection manager objects increased as well, from 4,001,128 to 7,280,328.
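This growth pattern is typically what one sees when HTTP client instances are created per request and never closed. The sketch below is a hypothetical illustration of that anti-pattern in Vert.x, not code taken from mod-circulation:

```java
import io.vertx.core.AbstractVerticle;
import io.vertx.core.http.HttpClient;
import io.vertx.core.http.HttpClientOptions;

// Hypothetical illustration of how HttpClientImpl instances (and their connection
// managers) can accumulate: a fresh client per request that is never closed.
public class LeakyVerticle extends AbstractVerticle {

    @Override
    public void start() {
        vertx.createHttpServer()
            .requestHandler(req -> {
                // Leak: one HttpClientImpl retained per request, never close()d.
                HttpClient perRequestClient = vertx.createHttpClient(new HttpClientOptions());
                // ... use perRequestClient to call another module, then forget about it ...
                req.response().end();
            })
            .listen(8080);
    }
}
```

The usual remedy is a single shared client created in start() and closed in stop(), so the number of client objects stays constant regardless of load.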
Giraffe Diagrams
Giraffe was used to graph the API call sequences resulting from invoking the POST /circulation/check-in-by-barcode API. The graphs are PNG images in the attached check-in.zip file. The Giraffe diagrams don't show slow calls, but there are a few prevalent observations:
- There are lots and lots of API calls. Even if the individual API calls are fast, the total time adds up to hundreds of milliseconds.
- There are individual trips to mod-inventory-storage, for example, to get various pieces of data (campuses, material types, service points, etc.).
- For each API call there is an accompanying call to mod-authtoken. This call averages 2ms-5ms, though there are spikes of more than 40ms. In total the calls to mod-authtoken add up to between 139ms (if outlier times are replaced by average times) and 347ms (actual measured time).
- There were no specific slow API calls.
Recommended Improvements
- mod-circulation needs to be checked for memory leaks (CIRC-863).
- mod-authtoken call patterns need to be revisited
- mod-authtoken performance optimizations (MODAT-80)
- A better, more efficient way to interact with modules needs to be considered to minimize trips to the storage modules (one illustrative direction is sketched below).
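As one purely illustrative direction for the last point (not an agreed-upon design), rarely-changing reference data such as material types, service points, and campuses could be cached in the business module so that each check-in does not round-trip to mod-inventory-storage for the same records. A minimal sketch; the class name and 5-minute TTL are assumptions:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Illustrative only: a tiny TTL cache for rarely-changing reference records
// (material types, service points, campuses) so that repeated check-ins do not
// each make a trip to mod-inventory-storage for the same data.
public class ReferenceDataCache<T> {

    private static final Duration TTL = Duration.ofMinutes(5); // assumed refresh window

    private static final class Entry<T> {
        final T value;
        final Instant loadedAt;
        Entry(T value, Instant loadedAt) { this.value = value; this.loadedAt = loadedAt; }
    }

    private final ConcurrentHashMap<String, Entry<T>> cache = new ConcurrentHashMap<>();

    // Returns the cached value for key, invoking loader (the actual storage call)
    // only when the entry is missing or older than the TTL.
    public T get(String key, Supplier<T> loader) {
        Entry<T> entry = cache.get(key);
        if (entry == null || entry.loadedAt.plus(TTL).isBefore(Instant.now())) {
            entry = new Entry<>(loader.get(), Instant.now());
            cache.put(key, entry);
        }
        return entry.value;
    }
}
```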
Appendix
- Details/Raw test runs data PERF-90.xlsx
- check-in-giraffe-diagrams.zip
- mod-circulation-gr1_Leak_Suspects.zip
- mod-circulation-gr3_Leak_Suspects.zip