Check-in items API Report (Goldenrod)
- PERF-90
Overview
In this testing effort the performance of POST /circulation/check-in-by-barcode running in the Goldenrod release (PERF-90) is tested with 1, 5, 8, and 20 virtual users for 30 minutes.
Backend:
- mod-inventory-storage-19.2.4
- mod-inventory-16.0.1
- mod-circulation-19.0.9
- mod-circulation-storage-12.0.1
- okapi-3.1.1
- mod-authtoken-2.5.1
- mod-permissions-5.11.2
Frontend:
- folio_inventory-4.0.3
- folio_circulation-3.0.1
- Item Check-in (folio_checkin-3.0.1)
Environment:
- 61 back-end modules deployed in 110 ECS services
- 3 okapi ECS services
- 8 m5.large EC2 instances
- 2 db.r5.xlarge AWS RDS instances (1 reader, 1 writer)
- INFO Okapi logging level
High Level Summary
POST /circulation/check-in-by-barcode takes on average 460ms-500ms for up to 8 users and is much slower for 20 users.
- Compared to Fameflower, this Goldenrod version is about 11%-20% slower, but a couple of tests need to be redone to confirm the results.
- There were no slow queries or individually slow API calls, but the total time of all the API calls adds up to the observed slowness.
- Calls to mod-authtoken account for at least 20% of the overall time.
- mod-circulation appears to have a memory leak.
Test Runs and Results
Test | Virtual Users | Duration | OKAPI log level | Profiled | Ramp up (secs) | Average (ms) | 75th %tile (ms) |
---|---|---|---|---|---|---|---|
1. | 1 | 30 min | INFO | No | 1 | 459 | 549 |
2. | 5 | 30 min | INFO | No | 50 | 460 | 483 |
3. | 8 | 30 min | INFO | No | 80 | 502 | 529 |
4. | 20 | 30 min | INFO | No | 200 | 975 | 1093 |
Running the same tests on Fameflower gives the following comparison:
Virtual Users | FF Average (ms) | GR Average (ms) | Delta % | FF 75th %tile (ms) | GR 75th %tile (ms) | Delta % |
---|---|---|---|---|---|---|
1 | 414 | 459 | 10.87 | 461 | 549 | 19.09 |
5 | 435 | 460 | 5.75 | 458 | 483 | 5.46 |
8 | 505 | 502 | -0.59 | 531 | 529 | -0.38 |
20 | 851 | 978 | 14.57 | 914 | 1093 | 19.58 |
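For reference, the Delta % columns in this table and the following ones are the relative change from Fameflower to Goldenrod. A minimal sketch of the calculation (a hypothetical helper, not part of the test harness):

```java
// Relative change from Fameflower (FF) to Goldenrod (GR), in percent.
static double deltaPercent(double ffMs, double grMs) {
    return (grMs - ffMs) / ffMs * 100.0;
}

// deltaPercent(414, 459) ≈ 10.87 and deltaPercent(461, 549) ≈ 19.09,
// matching the 1-user row above.
```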
It looks like Goldenrod is slower than Fameflower by about 10%-19%. However, the results for the 5- and 8-user runs were inconsistent with those for the 1- and 20-user runs, so a new set of tests was run against both the Goldenrod and Fameflower builds:
Test | Virtual Users | Duration | OKAPI log level | Profiled | Ramp up (secs) | Average (ms) | 75th %tile (ms) |
---|---|---|---|---|---|---|---|
5. | 1 | 30 min | WARNING | No | 1 | 463 | 559 |
6. | 5 | 30 min | WARNING | No | 50 | 466 | 486 |
7. | 8 | 30 min | WARNING | No | 80 | 523 | 550 |
8. | 20 | 30 min | WARNING | No | 200 | 988 | 1072 |
Comparing against a new set of Fameflower response times:
Virtual Users | FF Average (ms) | GR Average (ms) | Delta % | FF 75th %tile (ms) | GR 75th %tile (ms) | Delta % |
---|---|---|---|---|---|---|
1 | 410 | 463 | 12.93 | 455 | 559 | 22.86 |
5 | 420 | 466 | 10.95 | 442 | 486 | 9.95 |
8 | 471 | 523 | 11.04 | 499 | 550 | 10.22 |
20 | 851 | 988 | 16.10 | 909 | 1072 | 17.93 |
A couple of things to note about the methodology of the second round of test runs:
- mod-circulation and mod-feesfines were restarted before each test run because there are indications that these two modules have memory leaks. Over time these leaks degrade performance and the modules take longer to respond, so it was important to have them in the same state before each test run for consistent timings.
- The OKAPI log level on both builds, FF and GR, was confirmed to be the same. This time it was set to the WARNING level, but the specific level is not relevant to this discussion as long as both builds use the same one.
Here we see a tighter and more consistent band of delta variability between the test runs. With the second set of results, and also considering the first set, we can more confidently conclude that the Goldenrod version is about 10%-17% slower than Fameflower.
JVM Profiling
- This API was not profiled here because profiling for it was already incorporated in the Check in items workflow Test Report (Fameflower).
Database
The database shows that CPU usage increases with the number of users, from only a few percent for 1 user up to about 60-65% for the high case of 20 users.
Virtual Users | CPU Usage (%) |
---|---|
1 | 8 |
5 | 34 |
8 | 50 |
20 | 65 |
Database Slow queries
There were no slow queries for this API. Slow queries are defined as those taking longer than 20ms.
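For context, slow statements can be checked directly on the RDS instance, for example via the pg_stat_statements extension. The sketch below is only an illustration of applying the 20ms threshold; it assumes pg_stat_statements is enabled and that the PostgreSQL version exposes the mean_time column, and it is not the exact tooling used for this report.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Illustrative only: list statements whose mean execution time exceeds 20 ms,
// assuming the pg_stat_statements extension is enabled (mean_time is the
// PostgreSQL 10-12 column name).
public class SlowQueryCheck {
    public static void main(String[] args) throws Exception {
        String url = System.getenv("DB_URL"); // e.g. jdbc:postgresql://<rds-host>/<db>
        try (Connection conn = DriverManager.getConnection(
                 url, System.getenv("DB_USER"), System.getenv("DB_PASS"));
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT query, calls, mean_time FROM pg_stat_statements"
               + " WHERE mean_time > 20 ORDER BY mean_time DESC LIMIT 20");
             ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                System.out.printf("%8.1f ms x %6d  %s%n",
                    rs.getDouble("mean_time"), rs.getLong("calls"), rs.getString("query"));
            }
        }
    }
}
```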
CPU Utilization
The following shows the CPU utilization percentage of the ECS services running Okapi and the relevant modules.
Module | 1 user Avg % | 1 user Range % | 5 users Avg % | 5 users Range % | 8 users Avg % | 8 users Range % | 20 users Avg % | 20 users Range % |
---|---|---|---|---|---|---|---|---|
Okapi | 47 | 45-48 | 167 | 156-177 | 228 | 218-238 | 240 | 154-282 |
mod-circulation | 146 | 146-152 | 155 | 145-155 | 135 | 103-159 | 237 | 166-800 |
mod-circulation-storage | 12 | 10.31-14.77 | 30 | 27-41 | 40 | 39-61 | 48 | 47-52 |
mod-inventory | 3.0 | 3.03-3.26 | 3.28 | 3.23-3.49 | 3.16 | 3.0-3.31 | 3.11 | 3.10-3.22 |
mod-inventory-storage | 24 | 17.83-26.11 | 87 | 81-90 | 123 | 114-152 | 142 | 111-159 |
mod-circulation seems to perform erratically at times and should be the focus of a longevity test.
The graph below shows mod-circulation CPU usage. The written numbers indicate the number of virtual users that participated in each test, and the arrow points to when the tests were running. Observations: CPU usage increased over time for each test run performed. Immediately after each test run the CPU utilization came down, but not back to its starting point, which indicates some kind of leak. For the 20-user run there was a huge spike about 5 minutes into the test; at this time the container was running out of memory (see the memory graph in the next section), so things became unstable.
Memory
Memory for most of the relevant modules, except for mod-circulation, was stable throughout the runs, with only an occasional spike; over a 30-minute run they were consistent.
Module | 1 user Avg | 5 users Avg | 8 users Avg | 20 users Avg |
---|---|---|---|---|
Okapi | 67% | 67% | 67% | 67% |
mod-circulation | 128% | 120% | 103% | 140%; 47%-97% |
mod-circulation-storage | 45.7% | 45.7% | 45.67% | 45.7% |
mod-inventory | 55% | 55% | 55% | 55% |
mod-inventory-storage | 56% | 56% | 56% | 56% |
Memory was stable for OKAPI and all relevant modules except for mod-circulation, which may have a memory leak. Testing proceeded in this sequence: 8 users, 5 users, 1 user, 20 users, and mod-circulation's average memory utilization climbed from 103% to 140% over the 4 tests. mod-circulation crashed and was restarted during the 20-user run, so its memory consumption range is lower than in the other runs, but it started out at 140%.
The graph below shows mod-circulation memory usage while the tests were run. The written numbers indicate the number of virtual users that participated in each test. The graph shows an increase in memory usage during each test run that was not released afterwards. For the 20-user test run, memory started out at a very high level and peaked at 140%, at which point the container crashed and restarted, as seen by the vertical drop-off.
Heap dumps taken for consecutive test runs (not specifically the ones in the diagrams above, but at a later date) show an increase in the number of HTTP client objects.
Here is the leak suspects report for a test run, generated by Eclipse's Memory Analyzer tool.
Here is the leak suspects report for the next consecutive run.
Note the increase in the number of HttpClientImpl objects from one run (10,88,480) to the next (19,666,000). Connection manager objects increased as well, from 4,001,128 to 7,280,328.
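This growth pattern is typically what one sees when HTTP client instances are created per request and never closed. The sketch below is a hypothetical illustration of that anti-pattern in Vert.x, not code taken from mod-circulation:

```java
import io.vertx.core.AbstractVerticle;
import io.vertx.core.http.HttpClient;
import io.vertx.core.http.HttpClientOptions;

// Hypothetical illustration of how HttpClientImpl instances (and their connection
// managers) can accumulate: a fresh client per request that is never closed.
public class LeakyVerticle extends AbstractVerticle {

    @Override
    public void start() {
        vertx.createHttpServer()
            .requestHandler(req -> {
                // Leak: one HttpClientImpl retained per request, never close()d.
                HttpClient perRequestClient = vertx.createHttpClient(new HttpClientOptions());
                // ... use perRequestClient to call another module, then forget about it ...
                req.response().end();
            })
            .listen(8080);
    }
}
```

The usual remedy is a single shared client created in start() and closed in stop(), so the number of client objects stays constant regardless of load.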
Giraffe Diagrams
Giraffe was used to graph the API call sequences resulting from invoking the POST /circulation/check-in-by-barcode API. The graphs are PNG images in the attached check-in.zip file. The Giraffe diagrams don't show slow calls, but there are a few prevalent observations:
- There are lots and lots of API calls. Even if the individual API calls are fast, the total time adds up to hundreds of milliseconds.
- There are individual trips to mod-inventory-storage, for example, to get various pieces of data (campuses, material types, service points, etc.).
- For each API call there is an accompanying call to mod-authtoken. This call averages 2ms-5ms, though there are spikes of more than 40ms. In total the calls to mod-authtoken add up to between 139ms (if outlier times are replaced by average times) and 347ms (actual measured time).
- There were no specific slow API calls.
Recommended Improvements
- mod-circulation needs to be checked for memory leaks (CIRC-863).
- mod-authtoken call patterns need to be revisited
- mod-authtoken performance optimizations (MODAT-80)
- A better, more efficient way to interact with modules needs to be considered to minimize trips to the storage modules (one illustrative direction is sketched below).
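As one purely illustrative direction for the last point (not an agreed-upon design), rarely-changing reference data such as material types, service points, and campuses could be cached in the business module so that each check-in does not round-trip to mod-inventory-storage for the same records. A minimal sketch; the class name and 5-minute TTL are assumptions:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Illustrative only: a tiny TTL cache for rarely-changing reference records
// (material types, service points, campuses) so that repeated check-ins do not
// each make a trip to mod-inventory-storage for the same data.
public class ReferenceDataCache<T> {

    private static final Duration TTL = Duration.ofMinutes(5); // assumed refresh window

    private static final class Entry<T> {
        final T value;
        final Instant loadedAt;
        Entry(T value, Instant loadedAt) { this.value = value; this.loadedAt = loadedAt; }
    }

    private final ConcurrentHashMap<String, Entry<T>> cache = new ConcurrentHashMap<>();

    // Returns the cached value for key, invoking loader (the actual storage call)
    // only when the entry is missing or older than the TTL.
    public T get(String key, Supplier<T> loader) {
        Entry<T> entry = cache.get(key);
        if (entry == null || entry.loadedAt.plus(TTL).isBefore(Instant.now())) {
            entry = new Entry<>(loader.get(), Instant.now());
            cache.put(key, entry);
        }
        return entry.value;
    }
}
```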
Appendix
- Details/Raw test runs data PERF-90.xlsx
- check-in-giraffe-diagrams.zip
- mod-circulation-gr1_Leak_Suspects.zip
- mod-circulation-gr3_Leak_Suspects.zip