Overview
This is a report for a series of Check-in/Check-out test runs against the Juniper release, compared with the Iris release as the baseline.
Backend:
- mod-circulation-22.0.0
- mod-circulation-storage-13.0.0
- mod-inventory-17.0.1
- mod-inventory-storage-21.0.0
- mod-authtoken-2.8.0
- mod-pubsub-2.3.1
- mod-patron-blocks-1.3.0
- mod-feesfines-16.1.0
- okapi-4.8.2
Frontend:
- folio_circulation-5.1.0
- Item Check-in (folio_checkin-5.1.0)
- Item Check-out (folio_checkout-6.1.0)
Environment:
- 72 back-end modules deployed in 136 ECS services
- 3 okapi ECS services
- 12 m5.large EC2 instances
- 2 db.r5.xlarge AWS RDS instances
- INFO logging level
High Level Summary
- GET automated-patron-blocks response time gets worse even with 1 user (46% slower) and progressively worse with 5 users (53%), 8 users (78%), and 20 users (259%) (MODPATBLK-94).
- The database quickly runs out of memory, about 2GB per 2 hours in an 8-user test, possibly due to mod-patron-blocks queries (RMB-863).
- Check-in time improved compared to Iris.
- See below for the API calls that drive high CPU utilization.
- mod-inventory-storage appears to have a slow, gradual memory leak (MODINVSTOR-751).
Test Runs
Test | Virtual Users | Duration
---|---|---
1 | 1 | 30 mins
2 | 5 | 30 mins
3 | 8 | 30 mins
4 | 20 | 30 mins
5 | 1 | 30 mins
6 | 5 | 30 mins
7 | 8 | 30 mins
8 | 20 | 30 mins
9 | 1 | 30 mins
10 | 5 | 30 mins
11 | 8 | 30 mins
12 | 20 | 30 mins
13 | 20 | 1 day
Results
Response Times (Average of all tests listed above, in seconds)
Users | Average Check-in | Average Check-out | 50th %tile Check-in | 50th %tile Check-out | 75th %tile Check-in | 75th %tile Check-out | 95th %tile Check-in | 95th %tile Check-out
---|---|---|---|---|---|---|---|---
1 user | 0.94 | 2.58 | 0.85 | 2.40 | 1.11 | 2.75 | 1.42 | 3.83
5 users | 1.00 | 2.98 | 0.91 | 2.72 | 1.13 | 3.36 | 1.68 | 4.90
8 users | 1.43 | 3.93 | 1.29 | 3.62 | 1.67 | 4.55 | 2.54 | 6.69
20 users | 4.75 | 13.35 | 4.63 | 12.53 | 5.95 | 16.50 | 8.21 | 23.67
The following table lists the slow APIs whose 75th-percentile response times exceed 100 ms, comparing the Juniper build against Iris. GET automated-patron-blocks regressed at every load level, topping out at 259% slower with 20 users; GET inventory/items and POST checkout-by-barcode also regressed sharply at 20 users, while the other listed APIs improved.
Average Response Time in Seconds. Note: IR = Iris build, JP = Juniper build.
API | 1 user IR (75th %tile) | 1 user JP (75th %tile) | 5 users IR (75th %tile) | 5 users JP (75th %tile) | 8 users IR (75th %tile) | 8 users JP (75th %tile) | 20 users IR (75th %tile) | 20 users JP (75th %tile) |
---|---|---|---|---|---|---|---|---|
POST checkout-by-barcode | 1.08 | 1.08 | 1.43 | 1.37 | 1.80 | 1.74 | 3.59 | 5.62 |
POST checkin-by-barcode | 0.68 | 0.58 | 0.98 | 0.63 | 1.41 | 0.93 | 2.89 | 1.83 |
GET automated-patron-blocks | 0.28 | 0.41 | 0.40 | 0.61 | 0.50 | 0.89 | 1.06 | 3.81 |
GET circulation/loans | 0.36 | 0.35 | 0.60 | 0.52 | 0.78 | 0.62 | 1.57 | 1.14 |
GET inventory/items | 0.35 | 0.24 | 0.44 | 0.29 | 0.55 | 0.42 | 1.13 | 3.97 |
- POST checkout-by-barcode response time IMPROVES up to 8 users; beyond 8 users (i.e., at 20 users) it is much worse.
- POST checkin-by-barcode response time IMPROVES throughout.
- GET automated-patron-blocks response time gets worse even with 1 user (46%) and progressively worse with 5 users (53%), 8 users (78%), and 20 users (259%).
- GET circulation/loans response time IMPROVES throughout.
- GET inventory/items response time IMPROVES up to 8 users; beyond 8 users (i.e., at 20 users) it is much worse.
Response times in seconds (IR = Iris, JP = Juniper; CI = Check-in, CO = Check-out):
Users | Avg CI IR | Avg CI JP | Avg CO IR | Avg CO JP | 50th CI IR | 50th CI JP | 50th CO IR | 50th CO JP | 75th CI IR | 75th CI JP | 75th CO IR | 75th CO JP | 95th CI IR | 95th CI JP | 95th CO IR | 95th CO JP
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 user | 1.16 | 0.93 | 2.56 | 2.58 | 1.02 | 0.85 | 2.35 | 2.40 | 1.17 | 1.02 | 2.64 | 2.76 | 1.83 | 1.42 | 3.78 | 3.83
5 users | 1.49 | 1.00 | 3.42 | 2.98 | 1.27 | 0.91 | 2.84 | 2.72 | 1.65 | 1.13 | 3.55 | 3.35 | 2.91 | 1.69 | 5.89 | 4.90
8 users | 1.91 | 1.43 | 3.90 | 3.93 | 1.63 | 1.29 | 3.44 | 3.62 | 2.28 | 1.68 | 4.48 | 4.55 | 3.85 | 2.54 | 7.01 | 6.69
20 users | 3.73 | 4.75 | 7.36 | 13.33 | 3.34 | 4.63 | 6.79 | 12.53 | 4.60 | 5.95 | 8.88 | 16.50 | 7.10 | 8.20 | 13.06 | 23.68
- Check-in's response time improves up to 8 users by as much as 25%.
- Check-out's response time is about the same as in Iris up to 8 users.
- With 20 users, check-in's response time is 27% worse and check-out's response time jumps to 81% worse, up from 1% worse at 8 users.
- More tests were conducted to see the overall trend; response times of the 14- and 25-user tests also confirm the steep rise in check-out response time seen in the 20-user test.
- The response time graph shows a sharp rise in check-out response time (in seconds) when there are more than 8 users (i.e., 14, 20, 25 users).
- The Delta % vs. Iris graph shows the difference in response time compared to Iris. Both check-in and check-out hold up well against Iris until after 8 users, when the difference climbs to nearly 20% for check-in and 100% for check-out at 20 users.
Longevity test
A 24-hour longevity test was performed with 20 concurrent users. Key observations:
- Catastrophic failure after 14.5 hours. All API calls ceased to respond.
- The catastrophic failure can be explained by the database struggling to keep up with the 20-user load: the DB CPU utilization rate was off the chart and the DB's freeable memory was exhausted after 2+ hours of testing. For more on the impact on the database, see the DB analysis below. Consequently, as the DB lagged behind, the modules started to time out and instabilities such as stale/dead connections began to appear.
As in Iris, Okapi started out using a high level of CPU but gradually used less. Meanwhile, mod-inventory-storage slowly used more CPU as the test progressed. Other modules' CPU utilization was nominal.
Perhaps mod-inventory-storage's increased CPU usage is also due to an increase in memory usage, which points to a memory leak. A heap dump shows a leak of io.vertx.core.impl.DuplicatedContext objects (see MODINVSTOR-751). Other modules did not exhibit similar signs of leakage.
Modules CPUs and Memory
The graph below compares the CPU utilization of the modules involved in the check-in/check-out workflow across 6 short 30-minute tests.
Notable observations:
- Okapi's CPU utilization is rather high starting with 5 and 8 users. With 5 concurrent users Okapi already uses nearly 200% CPU; with 8 users, almost 300%.
- mod-patron-blocks is the second worst offender, consuming a high level of CPU.
- Other modules are nominal in their CPU utilization.
The graph below shows the memory usage of the check-in/check-out modules in the short 30-minute tests. No memory leaks were observed in these short tests.
Database
Juniper seems to use database memory quite aggressively in the 8-user tests and certainly in the 20-user tests. Below is the freeable-memory graph comparing the 1-, 5-, 8-, and 20-user 30-minute tests. During each test run there is a steep drop followed by a gradual decline in freeable memory; when the test is over, memory returns to the pre-test level of 7GB. Of note: in 30 minutes, even a 5-user test consumes more than 1GB of memory, an 8-user test nearly 2GB, and a 20-user test 2GB.
Meanwhile, the DB's CPU utilization for this set of tests is as follows:
With 5 concurrent users the DB's CPU utilization already averages 50%; with 8 users it is nearly 80%; and with 20 users it is at 100%, clearly needing more CPU resources than are available to it.
Following the trends above, the 20 users 24-hour longevity test comes out like this:
In this graph the database started out with about 7GB of freeable memory, and in less than 3 hours all of it was used. The DB tried to recover with spikes for about 4 hours, then crashed and restarted itself, with the memory level resetting to 7GB. This time it took less than 2 hours for the memory to be exhausted. The DB restarted once again, and about 2 hours later came the final crash. The 24-hour test effectively ran for about 15 hours.
The DB's CPU utilization rate was pegged at 100% for most of the test duration, only coming down when the DB crashed.
The database logs show the following queries being notoriously slow:
duration: 2159.699 ms statement: SELECT jsonb FROM fs09000000_mod_patron_blocks.item_checked_out_event WHERE (jsonb->>'userId') = 'a2a25161-e99f-433f-be37-2d08168ab589' LIMIT 10000
duration: 2154.077 ms statement: EXPLAIN ANALYZE SELECT jsonb FROM fs09000000_mod_patron_blocks.item_checked_out_event WHERE (jsonb->>'userId') = '1d4adbf2-4a0c-4777-b495-f29d45bd1711' LIMIT 10000
duration: 2527.494 ms statement: SELECT fs09000000_mod_patron_blocks.count_estimate('SELECT jsonb FROM fs09000000_mod_patron_blocks.item_checked_out_event WHERE (jsonb->>''userId'') = ''776665e5-5705-40b4-b8d6-4a5634342620''')
This SQL statement is slow, and even slower when wrapped by the count_estimate() function. The statement takes anywhere from 200ms to over 2s to execute, depending on the load and the availability of CPU resources. In the past, when count_estimate() was slow, similarly aggressive memory consumption was observed. The database logs show lots of the above count_estimate() calls, which could explain the high memory consumption this time.
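For reference, both forms can be reproduced and timed directly against the database; the sketch below assumes the fs09000000_mod_patron_blocks tenant schema and the RMB-generated count_estimate(text) function exactly as they appear in the log excerpt (the userId value is just a placeholder).

```sql
-- Raw event query from the log, with plan and timing details.
EXPLAIN (ANALYZE, BUFFERS)
SELECT jsonb
  FROM fs09000000_mod_patron_blocks.item_checked_out_event
 WHERE (jsonb->>'userId') = 'a2a25161-e99f-433f-be37-2d08168ab589'
 LIMIT 10000;

-- The same query wrapped by count_estimate(), as it appears in the logs;
-- comparing the two timings shows the overhead added by the wrapper.
SELECT fs09000000_mod_patron_blocks.count_estimate(
  'SELECT jsonb FROM fs09000000_mod_patron_blocks.item_checked_out_event '
  || 'WHERE (jsonb->>''userId'') = ''a2a25161-e99f-433f-be37-2d08168ab589''');
```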
Analysis:
The query on item_checked_out_event with userId == "1d4adbf2-4a0c-4777-b495-f29d45bd1711" LIMIT 10000
already has a correct b-tree index (schema) since September 2020 (MODPATBLK-48 = PR 37). The long execution time of 2154 ms is caused by the high limit of 10000 records to fetch.
To calculate the user summary, the code fetches all events for a user, sorts them by time, and creates a summary from them (see the rebuild source code).
If there were 5000 loans during the last 20 years for the given userId, all 5000 item_checked_out_event records are loaded and processed.
One possible way to improve the performance: for each user, store the timestamp of the latest event that was processed at the last user summary calculation. The next time the user summary is rebuilt, only fetch new events (events after that timestamp). To make this fast, create a combined index on userId + metadata.createdDate and use an SQL query of the form jsonb->>'userId' = $1 AND (jsonb->'metadata'->>'createdDate')::timestamp > $2, as sketched below.
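A minimal SQL sketch of this suggestion, assuming the jsonb layout visible in the log excerpts above. The index name is illustrative, and the ::timestamp cast is replaced by a plain text comparison, on the assumption that metadata.createdDate is stored in ISO-8601 form (so text ordering matches time ordering and the index expression stays immutable).

```sql
-- Illustrative only: combined expression index on userId + metadata.createdDate
-- (assumes the jsonb shape from the log excerpts; not the actual mod-patron-blocks migration).
CREATE INDEX IF NOT EXISTS item_checked_out_event_userid_createddate_idx
  ON fs09000000_mod_patron_blocks.item_checked_out_event
  ((jsonb->>'userId'), (jsonb->'metadata'->>'createdDate'));

-- Incremental rebuild query: fetch only events newer than the last one already
-- folded into the stored user summary.
--   $1 = userId
--   $2 = ISO-8601 createdDate of the latest event processed at the previous rebuild
SELECT jsonb
  FROM fs09000000_mod_patron_blocks.item_checked_out_event
 WHERE (jsonb->>'userId') = $1
   AND (jsonb->'metadata'->>'createdDate') > $2
 ORDER BY (jsonb->'metadata'->>'createdDate')
 LIMIT 10000;
```

With such an index in place, a rebuild would touch only the handful of events added since the previous summary calculation instead of every event the user has ever generated.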
Other slow queries:
duration: 606.358 ms statement: SELECT fs09000000_mod_inventory_storage.count_estimate('SELECT jsonb,id FROM fs09000000_mod_inventory_storage.item WHERE lower(f_unaccent(item.jsonb->>''barcode'')) LIKE lower(f_unaccent(''13137881''))')
Searching for an item by barcode == "13137881"
already has a correct b-tree index (schema.json) since March 2019 (MODINVSTOR-269). If it takes 606 ms, this is probably caused by database server overload. We may get a slight speedup by changing the b-tree index into a unique b-tree index (MODINVSTOR-523), as sketched below.
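A minimal sketch of that change, using the expression from the logged query. The index name is illustrative; the authoritative definition belongs in mod-inventory-storage's schema.json, and f_unaccent is assumed to be the unaccent wrapper function RMB provides in the tenant schema.

```sql
-- Illustrative unique variant of the barcode expression index (MODINVSTOR-523).
-- The expression mirrors the WHERE clause in the logged query above; per the
-- ticket, making it UNIQUE may give a slight speedup and also enforces that
-- no two items share a barcode.
CREATE UNIQUE INDEX IF NOT EXISTS item_barcode_idx_unique
  ON fs09000000_mod_inventory_storage.item
  (lower(f_unaccent(jsonb->>'barcode')));
```

Creating the unique variant would fail if duplicate barcodes already exist, so it doubles as a data-quality check.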
Miscellaneous
- Raw test data: